ComfyUI Extension: ComfyUI-HiggsAudio_Wrapper
A comprehensive ComfyUI wrapper for HiggsAudio v2, enabling high-quality text-to-speech generation with advanced voice cloning capabilities. Supports multiple voice presets and custom reference audio for voice cloning. Requires transformers==4.45.2 for compatibility.
<img width="2619" height="1468" alt="image" src="https://github.com/user-attachments/assets/7cfd3e77-3481-43cc-a821-fc28837fca29" />
Features
- High-Quality Audio Generation: Leverages the powerful HiggsAudio v2 3B parameter model
- Voice Cloning: Clone voices using reference audio or built-in voice presets
- Multiple Voice Presets: Includes pre-configured voices (belinda, en_woman, en_man, etc.)
- Flexible Audio Prioritization: Control whether to use voice presets or custom reference audio
- Customizable System Prompts: Fine-tune audio generation with scene descriptions and style control
- GPU Acceleration: Supports CUDA for faster generation
- ComfyUI Integration: Seamless integration with ComfyUI workflows
Installation
Prerequisites
- Python 3.8+
- ComfyUI
- CUDA-compatible GPU (recommended)
ComfyUI Installation
- Clone this repository into your ComfyUI custom_nodes directory:
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-HiggsAudio_Wrapper.git
- Install dependencies from inside the cloned folder:
cd ComfyUI-HiggsAudio_Wrapper
pip install -r requirements.txt
- Restart ComfyUI
- The nodes will appear under the "Higgs Audio" category
Usage
Basic Workflow
The wrapper provides several nodes that can be chained together:
- Load Higgs Audio Model - Loads the generation model
- Load Higgs Audio Tokenizer - Loads the audio tokenizer
- Load Higgs Audio System Prompt - Configures generation style
- Load Higgs Audio Prompt - Sets the text to convert to speech
- Higgs Audio Generator - Performs the actual audio generation
Voice Cloning Options
Using Voice Presets
The wrapper includes several built-in voice presets:
- belinda - Female voice
- en_woman - English female voice
- en_man - English male voice
- mabel - Alternative female voice
- vex - Character voice
- chadwick - Male voice
- broom_salesman - Character voice
- zh_man_sichuan - Chinese male voice (Sichuan dialect)
- voice_clone - Use custom reference audio (30 sec)
Using Custom Reference Audio
- Set voice preset to voice_clone
- Connect reference audio to the reference_audio input
- Optionally provide reference text that matches what is spoken in the reference audio
Audio Priority Settings
Control which audio source takes precedence:
- auto (default) - Uses voice preset if selected, otherwise reference audio
- preset_dropdown - Always prioritizes dropdown selection over reference audio
- reference_input - Always prioritizes reference audio over dropdown
- force_preset - Forces use of preset, ignoring reference audio completely
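The priority rules above can be sketched as a small decision function. This is an illustration only, not the wrapper's actual code: the name resolve_voice_source, its arguments, and its return strings are hypothetical, and it assumes that choosing voice_clone in the dropdown means "no preset selected". The README does not say how auto differs from preset_dropdown, so the sketch treats them alike.

```python
def resolve_voice_source(voice_preset, has_reference_audio, audio_priority="auto"):
    """Return "preset", "reference", or "none" (hypothetical labels).

    Assumption: the dropdown value "voice_clone" means no preset is selected.
    """
    preset_selected = voice_preset != "voice_clone"

    if audio_priority == "force_preset":
        # Reference audio is ignored completely.
        return "preset" if preset_selected else "none"

    if audio_priority == "reference_input":
        # Reference audio wins whenever it is connected.
        if has_reference_audio:
            return "reference"
        return "preset" if preset_selected else "none"

    if audio_priority in ("auto", "preset_dropdown"):
        # The dropdown wins; fall back to reference audio otherwise.
        if preset_selected:
            return "preset"
        return "reference" if has_reference_audio else "none"

    raise ValueError(f"unknown audio_priority: {audio_priority!r}")
```

For example, resolve_voice_source("belinda", True, "reference_input") picks the reference audio even though a preset is selected, while the same call with "force_preset" picks the preset.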
Configuration
What Actually Affects Audio Quality
Important: System prompts and scene descriptions have minimal effect on HiggsAudio output. Focus on these factors that actually work:
Voice Quality Control
- Reference Audio: High-quality voice samples (24kHz+) with clear articulation
- Voice Presets: Different presets have distinct characteristics - test to find the best fit
- Reference Text: Clear, well-punctuated text that matches the reference audio
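Before connecting a clip, it can help to check its sample rate and length. The sketch below uses only Python's standard wave module, so it reads uncompressed PCM WAV files only. The helper names are hypothetical, the 24 kHz threshold comes from the recommendation above, and treating 30 seconds as an upper bound is an assumption based on the voice_clone preset description.

```python
import os
import tempfile
import wave


def inspect_reference_wav(path, min_rate=24000, max_seconds=30.0):
    """Return (sample_rate, duration_seconds, warnings) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)
    warnings = []
    if rate < min_rate:
        warnings.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if duration > max_seconds:  # assumed limit, not confirmed by the README
        warnings.append(f"clip is {duration:.1f} s, longer than {max_seconds:.0f} s")
    return rate, duration, warnings


def write_silence(path, rate, seconds):
    """Write a mono 16-bit silent WAV file, used here only for the demo."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo: a 1-second clip at 16 kHz triggers the sample-rate warning.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "reference.wav")
    write_silence(demo_path, rate=16000, seconds=1.0)
    demo_rate, demo_seconds, demo_warnings = inspect_reference_wav(demo_path)
```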
System Prompt (Minimal Impact)
Keep system prompts simple since complex scene descriptions are largely ignored:
Generate audio following instruction.
Generation Parameters
- max_new_tokens (128-4096): Controls audio length and pacing
- temperature (0.0-2.0): Controls voice consistency (0.8 = more stable, 1.2 = more varied)
- top_p (0.1-1.0): Affects pronunciation variation (0.9-0.95 recommended)
- top_k (-1 to 100): Fine-tunes voice characteristics (50 = default)
- device: auto/cuda/cpu (auto = recommended)
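To make temperature, top_k, and top_p less abstract, here is a generic sketch of how these three settings are commonly applied to a model's next-token scores. It is not HiggsAudio's sampler: filter_probs is a hypothetical name, and reading top_k = -1 as "top-k disabled" is an assumption based on the range listed above.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Return the renormalized distribution left after the three filters."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]

    # Temperature rescales the scores: below 1 sharpens, above 1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:  # assumption: -1 (or 0) disables the top-k cut
        order = order[:top_k]

    # top_p keeps the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]
```

With scores [2.0, 1.0, 0.0], lowering temperature from 1.0 to 0.5 moves more probability onto the first token, which is the general mechanism by which a lower temperature makes sampled output less varied.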
File Structure
ComfyUI-HiggsAudio_Wrapper/
├── __init__.py # Node registration
├── nodes.py # Main node implementations
├── requirements.txt # Python dependencies
├── voice_examples/ # Voice preset files
│ ├── config.json # Voice preset configuration
│ ├── en_woman.wav # Female English voice
│ ├── en_man.wav # Male English voice
│ └── ... # Other voice presets
└── boson_multimodal/ # HiggsAudio engine
    └── ...
Realistic Expectations
What HiggsAudio Does Well
- Voice Cloning: Excellent at replicating voice characteristics from reference audio
- Speech Quality: Generates natural-sounding speech with good pronunciation
- Multiple Voices: Built-in voice presets for different character types
- Consistency: Maintains voice characteristics across longer text
Current Limitations
- Scene Control: System prompts for acoustic environments (reverb, background sounds) have minimal effect
- Emotional Control: Limited ability to control emotional expression through text prompts
- Background Audio: Cannot generate environmental sounds or music
- Real-time: Requires processing time, not suitable for real-time applications
Best Use Cases
- Voice-over generation with consistent character voices
- Audiobook narration with cloned voices
- Character voices for games or animations
- Text-to-speech with specific voice characteristics
For acoustic effects like reverb or background sounds, consider post-processing with audio editing software.
Troubleshooting
Common Issues
Poor Audio Quality
- Use higher quality reference audio (24kHz+ recommended)
- Try different voice presets to find the best match
- Adjust temperature (0.8 for stability, 1.2 for variation)
- Ensure reference text matches the reference audio content
"audio_base64 is None" Error
- Ensure reference audio is properly formatted
- Check that voice preset files exist in voice_examples/
- Verify audio file is not corrupted
Inconsistent Voice Output
- Lower the temperature parameter (try 0.8)
- Use higher quality reference audio
- Ensure reference audio has consistent background noise levels
CUDA Out of Memory
- Reduce max_new_tokens
- Use device: cpu instead of auto/cuda
- Close other GPU-intensive applications
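The device setting can be understood through a sketch like the one below. It is an assumption about how an auto option is typically resolved, not the wrapper's code: resolve_device and its cuda_available argument are hypothetical, and only torch.cuda.is_available() is a real PyTorch call.

```python
def resolve_device(choice="auto", cuda_available=None):
    """Map "auto", "cuda", or "cpu" to the device that would be used."""
    if choice not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unknown device choice: {choice!r}")
    if choice == "cpu":
        return "cpu"

    if cuda_available is None:
        # Ask PyTorch only when the caller did not supply the answer.
        try:
            import torch
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False

    if choice == "cuda" and not cuda_available:
        raise RuntimeError("CUDA was requested but is not available")
    return "cuda" if cuda_available else "cpu"
```

Passing cuda_available explicitly makes the function easy to try on a machine without a GPU, for example resolve_device("auto", cuda_available=False) returns "cpu".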
Model Loading Issues
- Ensure stable internet connection for model download
- Check available disk space (models are several GB)
- Verify transformers version compatibility
Performance Tips
- First Run: Model downloading may take time
- GPU Memory: 8GB+ VRAM recommended for optimal performance
- Caching: Models are cached after first load for faster subsequent runs
- Voice Quality: Use high-quality reference audio for best results
- Parameter Tuning: Lower temperature (0.8) for consistent voice, higher (1.2) for variation
- Text Formatting: Use proper punctuation for natural speech rhythm
API Reference
HiggsAudio Node Inputs
Required
- MODEL_PATH: Path to HiggsAudio model
- AUDIO_TOKENIZER_PATH: Path to audio tokenizer
- system_prompt: System prompt for generation control
- prompt: Text to convert to speech
- max_new_tokens: Maximum tokens to generate
- temperature: Sampling temperature
- top_p: Nucleus sampling parameter
- top_k: Top-k sampling parameter
- device: Computation device
Optional
- voice_preset: Voice preset selection
- reference_audio: Custom reference audio
- reference_text: Text corresponding to reference audio
- audio_priority: Audio source prioritization
Output
- output: Generated audio in ComfyUI format
- used_voice_info: Information about which voice source was used
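For readers unfamiliar with how ComfyUI nodes declare inputs and outputs, the sketch below maps the lists above onto the usual custom-node layout (an INPUT_TYPES classmethod plus RETURN_TYPES, RETURN_NAMES, FUNCTION, and CATEGORY). The class name, the socket type strings, and the return types are assumptions made for illustration. Only the parameter names, ranges, preset names, and the top_k default come from this README, so the real definitions in nodes.py may differ.

```python
class HiggsAudioSketch:
    """Hypothetical node mirroring the documented inputs; not the real class."""

    @classmethod
    def INPUT_TYPES(cls):
        presets = ["voice_clone", "belinda", "en_woman", "en_man", "mabel",
                   "vex", "chadwick", "broom_salesman", "zh_man_sichuan"]
        priorities = ["auto", "preset_dropdown", "reference_input", "force_preset"]
        return {
            "required": {
                # Socket type names below are assumed, not taken from nodes.py.
                "MODEL_PATH": ("MODEL_PATH",),
                "AUDIO_TOKENIZER_PATH": ("AUDIO_TOKENIZER_PATH",),
                "system_prompt": ("STRING", {"forceInput": True}),
                "prompt": ("STRING", {"forceInput": True}),
                "max_new_tokens": ("INT", {"min": 128, "max": 4096}),
                "temperature": ("FLOAT", {"min": 0.0, "max": 2.0}),
                "top_p": ("FLOAT", {"min": 0.1, "max": 1.0}),
                "top_k": ("INT", {"default": 50, "min": -1, "max": 100}),
                "device": (["auto", "cuda", "cpu"],),
            },
            "optional": {
                "voice_preset": (presets,),
                "reference_audio": ("AUDIO",),
                "reference_text": ("STRING", {"multiline": True}),
                "audio_priority": (priorities,),
            },
        }

    RETURN_TYPES = ("AUDIO", "STRING")  # assumed types for the two outputs
    RETURN_NAMES = ("output", "used_voice_info")
    FUNCTION = "generate"
    CATEGORY = "Higgs Audio"

    def generate(self, **inputs):
        # Placeholder: the real node would run HiggsAudio generation here.
        raise NotImplementedError("illustrative sketch only")
```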
Requirements
See requirements.txt for complete list:
- torch==2.5.1
- torchaudio==2.5.1
- transformers>=4.45.1,<4.47.0
- librosa
- And others...
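Since the transformers range is the pin most likely to cause model loading problems, a quick check of the installed version can save time. The sketch below uses only the standard library. The helper names are hypothetical, and the version parsing is deliberately simplified: it compares the first three numeric components and ignores pre-release suffixes.

```python
from importlib import metadata


def parse_version(text):
    """Turn "4.45.2" (or "4.46.0.dev0") into a comparable tuple like (4, 45, 2)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def transformers_in_range(version, low="4.45.1", high="4.47.0"):
    """True when low <= version < high, matching the range listed above."""
    return parse_version(low) <= parse_version(version) < parse_version(high)


def installed_transformers_ok():
    """Check the installed package; False if transformers is not installed."""
    try:
        return transformers_in_range(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False
```

Run installed_transformers_ok() in the same Python environment that ComfyUI uses, since a different environment may have a different transformers version.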
Third-Party Licenses
The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily from xcodec. Please see the LICENSE in that directory for complete attribution and licensing information.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Support
For issues and questions:
- Open an issue on GitHub
- Check existing issues for solutions
- Provide detailed error messages and system information
Acknowledgments
- HiggsAudio team for the underlying model
- ComfyUI community for the framework
- Contributors and testers
Note: This wrapper requires significant computational resources. A CUDA-compatible GPU with 8GB+ VRAM is recommended for optimal performance.