ComfyUI Extension: ComfyUI-HiggsAudio_Wrapper
A comprehensive ComfyUI wrapper for HiggsAudio v2, enabling high-quality text-to-speech generation with advanced voice cloning capabilities. Supports multiple voice presets and custom reference audio for voice cloning. Requires transformers==4.45.2 for compatibility.
Features
- High-Quality Audio Generation: Leverages the powerful HiggsAudio v2 3B parameter model
- Voice Cloning: Clone voices using reference audio or built-in voice presets
- Multiple Voice Presets: Includes pre-configured voices (belinda, en_woman, en_man, etc.)
- Flexible Audio Prioritization: Control whether to use voice presets or custom reference audio
- Customizable System Prompts: Fine-tune audio generation with scene descriptions and style control
- GPU Acceleration: Supports CUDA for faster generation
- ComfyUI Integration: Seamless integration with ComfyUI workflows
Installation
Prerequisites
- Python 3.8+
- ComfyUI
- CUDA-compatible GPU (recommended)
ComfyUI Installation
- Clone this repository into your ComfyUI custom_nodes directory:
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-HiggsAudio_Wrapper.git
- Install the dependencies from inside the cloned directory:
cd ComfyUI-HiggsAudio_Wrapper
pip install -r requirements.txt
- Restart ComfyUI.
- The nodes will appear under the "Higgs Audio" category.
Usage
Basic Workflow
The wrapper provides several nodes that can be chained together:
- Load Higgs Audio Model - Loads the generation model
- Load Higgs Audio Tokenizer - Loads the audio tokenizer
- Load Higgs Audio System Prompt - Configures generation style
- Load Higgs Audio Prompt - Sets the text to convert to speech
- Higgs Audio Generator - Performs the actual audio generation
Voice Cloning Options
Using Voice Presets
The wrapper includes several built-in voice presets:
- belinda - Female voice
- en_woman - English female voice
- en_man - English male voice
- mabel - Alternative female voice
- vex - Character voice
- chadwick - Male voice
- broom_salesman - Character voice
- zh_man_sichuan - Chinese male voice (Sichuan dialect)
- voice_clone - Use custom reference audio (30 seconds)
Using Custom Reference Audio
- Set the voice preset to voice_clone
- Connect reference audio to the reference_audio input
- Optionally provide reference text that describes the audio
Audio Priority Settings
Control which audio source takes precedence:
- auto (default) - Uses voice preset if selected, otherwise reference audio
- preset_dropdown - Always prioritizes dropdown selection over reference audio
- reference_input - Always prioritizes reference audio over dropdown
- force_preset - Forces use of preset, ignoring reference audio completely
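The priority rules above can be sketched as a small decision function. This is only an illustration of the documented descriptions, not the wrapper's actual code: the helper name `resolve_voice_source` and its return values are hypothetical, and it assumes that a preset counts as "selected" whenever the dropdown is not set to voice_clone. Under that reading, auto and preset_dropdown behave the same; the real node may distinguish them.

```python
from typing import Optional

CLONE_PRESET = "voice_clone"  # dropdown entry meaning "use reference audio"


def resolve_voice_source(voice_preset: str,
                         has_reference_audio: bool,
                         audio_priority: str = "auto") -> Optional[str]:
    """Return "preset", "reference", or None when no source is usable.

    Hypothetical helper that mirrors the documented descriptions only.
    """
    # Assumption: any dropdown value other than voice_clone is a real preset.
    preset_selected = voice_preset != CLONE_PRESET

    if audio_priority == "force_preset":
        # Reference audio is ignored completely.
        return "preset" if preset_selected else None
    if audio_priority == "reference_input":
        # Reference audio wins whenever it is connected.
        if has_reference_audio:
            return "reference"
        return "preset" if preset_selected else None
    if audio_priority in ("auto", "preset_dropdown"):
        # A selected preset wins; otherwise fall back to reference audio.
        if preset_selected:
            return "preset"
        return "reference" if has_reference_audio else None
    raise ValueError(f"unknown audio_priority: {audio_priority!r}")
```

For example, with a preset chosen and reference audio connected, `reference_input` picks the reference audio while `auto` keeps the preset.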
Configuration
What Actually Affects Audio Quality
Important: System prompts and scene descriptions have minimal effect on HiggsAudio output. Focus on these factors that actually work:
Voice Quality Control
- Reference Audio: High-quality voice samples (24kHz+) with clear articulation
- Voice Presets: Different presets have distinct characteristics - test to find the best fit
- Reference Text: Clear, well-punctuated text that matches the reference audio
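Before connecting a reference clip, it can help to check its sample rate and length. The sketch below uses only Python's standard-library `wave` module, so it handles uncompressed PCM WAV files only; the helper name `inspect_reference_wav` and the returned keys are illustrative and not part of the wrapper.

```python
import wave

MIN_SAMPLE_RATE = 24_000  # the "24kHz+" guidance above


def inspect_reference_wav(path: str) -> dict:
    """Read sample rate, channel count and duration from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    duration = frames / rate if rate else 0.0
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": duration,
        "meets_min_rate": rate >= MIN_SAMPLE_RATE,
    }
```

A clip that reports `meets_min_rate` as False can still be tried, but resampling a low-rate recording will not restore detail that was never captured.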
System Prompt (Minimal Impact)
Keep system prompts simple since complex scene descriptions are largely ignored:
Generate audio following instruction.
Generation Parameters
- max_new_tokens (128-4096): Controls audio length and pacing
- temperature (0.0-2.0): Controls voice consistency (0.8 = more stable, 1.2 = more varied)
- top_p (0.1-1.0): Affects pronunciation variation (0.9-0.95 recommended)
- top_k (-1 to 100): Fine-tunes voice characteristics (50 = default)
- device: auto/cuda/cpu (auto = recommended)
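As a rough sketch, the documented ranges can be checked before a generation run. The `GenerationParams` class below is hypothetical and not part of the wrapper; the ranges come from the list above, while the `max_new_tokens` default is an assumption chosen only for illustration.

```python
from dataclasses import dataclass

# (min, max) ranges as listed above
RANGES = {
    "max_new_tokens": (128, 4096),
    "temperature": (0.0, 2.0),
    "top_p": (0.1, 1.0),
    "top_k": (-1, 100),
}


@dataclass
class GenerationParams:
    max_new_tokens: int = 1024  # assumed default, not taken from the source
    temperature: float = 0.8    # the "more stable" value mentioned above
    top_p: float = 0.95         # inside the recommended 0.9-0.95 band
    top_k: int = 50             # documented default

    def validate(self) -> None:
        """Raise ValueError if any value falls outside its documented range."""
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside [{low}, {high}]")
```

Calling `GenerationParams(temperature=1.2).validate()` passes, while a temperature of 2.5 raises a `ValueError`.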
File Structure
ComfyUI-HiggsAudio_Wrapper/
├── __init__.py # Node registration
├── nodes.py # Main node implementations
├── requirements.txt # Python dependencies
├── voice_examples/ # Voice preset files
│ ├── config.json # Voice preset configuration
│ ├── en_woman.wav # Female English voice
│ ├── en_man.wav # Male English voice
│ └── ... # Other voice presets
└── boson_multimodal/ # HiggsAudio engine
    └── ...
Realistic Expectations
What HiggsAudio Does Well
- Voice Cloning: Excellent at replicating voice characteristics from reference audio
- Speech Quality: Generates natural-sounding speech with good pronunciation
- Multiple Voices: Built-in voice presets for different character types
- Consistency: Maintains voice characteristics across longer text
Current Limitations
- Scene Control: System prompts for acoustic environments (reverb, background sounds) have minimal effect
- Emotional Control: Limited ability to control emotional expression through text prompts
- Background Audio: Cannot generate environmental sounds or music
- Real-time: Requires processing time, not suitable for real-time applications
Best Use Cases
- Voice-over generation with consistent character voices
- Audiobook narration with cloned voices
- Character voices for games or animations
- Text-to-speech with specific voice characteristics
For acoustic effects like reverb or background sounds, consider post-processing with audio editing software.
Troubleshooting
Common Issues
Poor Audio Quality
- Use higher quality reference audio (24kHz+ recommended)
- Try different voice presets to find the best match
- Adjust temperature (0.8 for stability, 1.2 for variation)
- Ensure reference text matches the reference audio content
"audio_base64 is None" Error
- Ensure reference audio is properly formatted
- Check that voice preset files exist in voice_examples/
- Verify audio file is not corrupted
Inconsistent Voice Output
- Lower the temperature parameter (try 0.8)
- Use higher quality reference audio
- Ensure reference audio has consistent background noise levels
CUDA Out of Memory
- Reduce max_new_tokens
- Use device: cpu instead of auto/cuda
- Close other GPU-intensive applications
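How the auto device setting resolves is not documented here. One plausible reading, sketched below, is "use CUDA when PyTorch reports it as available, otherwise fall back to CPU". The helper name `pick_device` is hypothetical; `torch.cuda.is_available()` is the standard PyTorch check.

```python
def pick_device(requested: str = "auto") -> str:
    """Resolve "auto" to "cuda" when a CUDA GPU is usable, else "cpu".

    Assumed behaviour for illustration; the wrapper may resolve it differently.
    """
    if requested not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unknown device: {requested!r}")
    if requested == "cpu":
        return "cpu"
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
        cuda_ok = torch.cuda.is_available()
    except ImportError:
        cuda_ok = False
    if requested == "cuda" and not cuda_ok:
        raise RuntimeError("CUDA requested but not available")
    return "cuda" if cuda_ok else "cpu"
```

Forcing `cpu` avoids GPU memory errors at the cost of much slower generation.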
Model Loading Issues
- Ensure stable internet connection for model download
- Check available disk space (models are several GB)
- Verify transformers version compatibility
Performance Tips
- First Run: Model downloading may take time
- GPU Memory: 8GB+ VRAM recommended for optimal performance
- Caching: Models are cached after first load for faster subsequent runs
- Voice Quality: Use high-quality reference audio for best results
- Parameter Tuning: Lower temperature (0.8) for consistent voice, higher (1.2) for variation
- Text Formatting: Use proper punctuation for natural speech rhythm
API Reference
HiggsAudio Node Inputs
Required
- MODEL_PATH: Path to HiggsAudio model
- AUDIO_TOKENIZER_PATH: Path to audio tokenizer
- system_prompt: System prompt for generation control
- prompt: Text to convert to speech
- max_new_tokens: Maximum tokens to generate
- temperature: Sampling temperature
- top_p: Nucleus sampling parameter
- top_k: Top-k sampling parameter
- device: Computation device
Optional
- voice_preset: Voice preset selection
- reference_audio: Custom reference audio
- reference_text: Text corresponding to reference audio
- audio_priority: Audio source prioritization
Output
- output: Generated audio in ComfyUI format
- used_voice_info: Information about which voice source was used
Requirements
See requirements.txt for the complete list:
- torch==2.5.1
- torchaudio==2.5.1
- transformers>=4.45.1,<4.47.0
- librosa
- And others...
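To verify the transformers pin listed above, a minimal check might look like the following. The function names are illustrative. The parser only compares the numeric release, so pre-release suffixes such as "rc1" are ignored; a full PEP 440 comparison would need the `packaging` library instead.

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Extract the leading numeric release, e.g. "4.45.2" -> (4, 45, 2)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def transformers_version_ok(version: str) -> bool:
    """True when the version satisfies >=4.45.1,<4.47.0."""
    return (4, 45, 1) <= parse_version(version) < (4, 47, 0)


def installed_transformers_ok():
    """Check the installed package; returns None if it is not installed."""
    try:
        return transformers_version_ok(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return None
```

Running `installed_transformers_ok()` inside the same Python environment that ComfyUI uses shows whether the installed version falls in the supported range.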
Third-Party Licenses
The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily from xcodec. Please see the LICENSE in that directory for complete attribution and licensing information.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Support
For issues and questions:
- Open an issue on GitHub
- Check existing issues for solutions
- Provide detailed error messages and system information
Acknowledgments
- HiggsAudio team for the underlying model
- ComfyUI community for the framework
- Contributors and testers
Note: This wrapper requires significant computational resources. A CUDA-compatible GPU with 8GB+ VRAM is recommended for optimal performance.