ComfyUI Extension: ComfyUI-HiggsAudio_Wrapper

Authored by ShmuelRonen

Created 21 days ago

Updated 21 days ago

16 stars

A comprehensive ComfyUI wrapper for HiggsAudio v2, enabling high-quality text-to-speech generation with advanced voice cloning capabilities. Supports multiple voice presets and custom reference audio for voice cloning. Requires transformers==4.45.2 for compatibility.

Features

High-Quality Audio Generation: Leverages the powerful HiggsAudio v2 3B parameter model
Voice Cloning: Clone voices using reference audio or built-in voice presets
Multiple Voice Presets: Includes pre-configured voices (belinda, en_woman, en_man, etc.)
Flexible Audio Prioritization: Control whether to use voice presets or custom reference audio
Customizable System Prompts: Fine-tune audio generation with scene descriptions and style control
GPU Acceleration: Supports CUDA for faster generation
ComfyUI Integration: Seamless integration with ComfyUI workflows

Installation

Prerequisites

Python 3.9+ (the pinned torch==2.5.1 does not ship builds for Python 3.8)
ComfyUI
CUDA-compatible GPU (recommended)

ComfyUI Installation

Clone this repository into your ComfyUI custom_nodes directory:

cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-HiggsAudio_Wrapper.git

Install Dependencies

cd ComfyUI-HiggsAudio_Wrapper
pip install -r requirements.txt

Restart ComfyUI
The nodes will appear under the "Higgs Audio" category

Usage

Basic Workflow

The wrapper provides several nodes that can be chained together:

Load Higgs Audio Model - Loads the generation model
Load Higgs Audio Tokenizer - Loads the audio tokenizer
Load Higgs Audio System Prompt - Configures generation style
Load Higgs Audio Prompt - Sets the text to convert to speech
Higgs Audio Generator - Performs the actual audio generation

Voice Cloning Options

Using Voice Presets

The wrapper includes several built-in voice presets:

belinda - Female voice
en_woman - English female voice
en_man - English male voice
mabel - Alternative female voice
vex - Character voice
chadwick - Male voice
broom_salesman - Character voice
zh_man_sichuan - Chinese male voice (Sichuan dialect)
voice_clone - Uses your own reference audio (a clip of about 30 seconds) instead of a built-in preset

Using Custom Reference Audio

Set voice preset to voice_clone
Connect reference audio to the reference_audio input
Optionally provide reference text that describes the audio

Audio Priority Settings

Control which audio source takes precedence:

auto (default) - Uses voice preset if selected, otherwise reference audio
preset_dropdown - Always prioritizes dropdown selection over reference audio
reference_input - Always prioritizes reference audio over dropdown
force_preset - Forces use of preset, ignoring reference audio completely
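
The four modes above can be pictured as a small decision function. This is a minimal sketch based only on the descriptions in this README, not the wrapper's actual code: the helper name resolve_voice_source, the "none" return value, and the fallback order when the preferred source is missing are all assumptions.

```python
VOICE_CLONE = "voice_clone"  # dropdown entry that means "no built-in preset"


def resolve_voice_source(voice_preset, has_reference_audio, audio_priority="auto"):
    """Return "preset", "reference", or "none" for the given settings.

    Hypothetical helper that mirrors the documented priority modes.
    """
    preset_selected = voice_preset != VOICE_CLONE

    if audio_priority == "force_preset":
        # Reference audio is ignored completely in this mode.
        return "preset" if preset_selected else "none"

    if audio_priority == "reference_input":
        order = [("reference", has_reference_audio), ("preset", preset_selected)]
    elif audio_priority in ("auto", "preset_dropdown"):
        order = [("preset", preset_selected), ("reference", has_reference_audio)]
    else:
        raise ValueError(f"unknown audio_priority: {audio_priority!r}")

    # Assumed fallback: take the first source in priority order that is available.
    for source, available in order:
        if available:
            return source
    return "none"
```

Under these assumptions auto and preset_dropdown resolve identically; the real node may distinguish them in ways this README does not describe.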

Configuration

What Actually Affects Audio Quality

Important: System prompts and scene descriptions have minimal effect on HiggsAudio output. Focus on these factors that actually work:

Voice Quality Control

Reference Audio: High-quality voice samples (24kHz+) with clear articulation
Voice Presets: Different presets have distinct characteristics - test to find the best fit
Reference Text: Clear, well-punctuated text that matches the reference audio
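
Before using a clip as reference audio, it can help to confirm its sample rate and length. The sketch below uses only Python's standard wave module, so it handles uncompressed PCM WAV files only. The function name inspect_wav and the returned dictionary keys are illustrative, and the 24,000 Hz threshold is simply the guideline quoted above.

```python
import wave

MIN_SAMPLE_RATE = 24_000  # the 24kHz guideline from this README


def inspect_wav(path):
    """Return basic properties of a PCM WAV file and whether it meets 24 kHz."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_sec": frames / sample_rate if sample_rate else 0.0,
        "meets_guideline": sample_rate >= MIN_SAMPLE_RATE,
    }
```

A sample rate alone says nothing about noise or articulation, so this is a first filter rather than a quality measure.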

System Prompt (Minimal Impact)

Keep system prompts simple since complex scene descriptions are largely ignored:

Generate audio following instruction.

Generation Parameters

max_new_tokens (128-4096): Controls audio length and pacing
temperature (0.0-2.0): Controls voice consistency (0.8 = more stable, 1.2 = more varied)
top_p (0.1-1.0): Affects pronunciation variation (0.9-0.95 recommended)
top_k (-1 to 100): Fine-tunes voice characteristics (50 = default)
device: auto/cuda/cpu (auto = recommended)
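
To make temperature, top_p, and top_k less abstract, here is a generic pure-Python sketch of how these three settings are commonly applied when sampling one token from a model's logits. It is not taken from HiggsAudio or this wrapper. The function name filter_and_sample is hypothetical, and treating a non-positive top_k as "no limit" is an assumption about what the -1 lower bound means.

```python
import math
import random


def filter_and_sample(logits, temperature=1.0, top_k=50, top_p=0.95, rng=random):
    """Sample one token index from raw logits using temperature, top-k and top-p."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always pick the best token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens (assumed: k <= 0 means no limit).
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # Draw from the surviving tokens in proportion to their probabilities.
    indices = [i for _, i in kept]
    return rng.choices(indices, weights=[p for p, _ in kept], k=1)[0]
```

Lower temperature and tighter top_p/top_k all shrink the set of plausible next tokens, which is why they tend to trade variation for stability.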

File Structure

ComfyUI-HiggsAudio_Wrapper/
├── __init__.py                 # Node registration
├── nodes.py                    # Main node implementations
├── requirements.txt            # Python dependencies
├── voice_examples/             # Voice preset files
│   ├── config.json            # Voice preset configuration
│   ├── en_woman.wav           # Female English voice
│   ├── en_man.wav             # Male English voice
│   └── ...                    # Other voice presets
└── boson_multimodal/          # HiggsAudio engine
    └── ...

Realistic Expectations

What HiggsAudio Does Well

Voice Cloning: Excellent at replicating voice characteristics from reference audio
Speech Quality: Generates natural-sounding speech with good pronunciation
Multiple Voices: Built-in voice presets for different character types
Consistency: Maintains voice characteristics across longer text

Current Limitations

Scene Control: System prompts for acoustic environments (reverb, background sounds) have minimal effect
Emotional Control: Limited ability to control emotional expression through text prompts
Background Audio: Cannot generate environmental sounds or music
Real-time: Requires processing time, not suitable for real-time applications

Best Use Cases

Voice-over generation with consistent character voices
Audiobook narration with cloned voices
Character voices for games or animations
Text-to-speech with specific voice characteristics

For acoustic effects like reverb or background sounds, consider post-processing with audio editing software.

Troubleshooting

Common Issues

Poor Audio Quality

Use higher quality reference audio (24kHz+ recommended)
Try different voice presets to find the best match
Adjust temperature (0.8 for stability, 1.2 for variation)
Ensure reference text matches the reference audio content

"audio_base64 is None" Error

Ensure reference audio is properly formatted
Check that voice preset files exist in voice_examples/
Verify audio file is not corrupted
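
A quick way to check for missing preset files is to compare the preset names against the contents of voice_examples/. The sketch below assumes each preset is stored as <name>.wav, as en_woman.wav and en_man.wav in the file structure suggest; other presets may be named differently, and the helper name missing_preset_files is hypothetical.

```python
from pathlib import Path

# Built-in preset names listed in this README (voice_clone has no file of its own).
PRESETS = [
    "belinda", "en_woman", "en_man", "mabel",
    "vex", "chadwick", "broom_salesman", "zh_man_sichuan",
]


def missing_preset_files(voice_dir, preset_names=PRESETS):
    """Return the preset names that have no matching .wav file in voice_dir.

    Assumes the <name>.wav naming convention, which this README only
    shows for two presets.
    """
    voice_dir = Path(voice_dir)
    return [name for name in preset_names if not (voice_dir / f"{name}.wav").is_file()]
```

Calling missing_preset_files with the path to your voice_examples directory returns an empty list when every expected file is present.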

Inconsistent Voice Output

Lower the temperature parameter (try 0.8)
Use higher quality reference audio
Ensure reference audio has consistent background noise levels

CUDA Out of Memory

Reduce max_new_tokens
Use device: cpu instead of auto/cuda
Close other GPU-intensive applications
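
The device choice mentioned above can be thought of as follows. This is a sketch of one plausible meaning of "auto" (use CUDA when PyTorch reports it as available, otherwise fall back to CPU); the README does not spell out the wrapper's actual rule, and resolve_device is a hypothetical helper.

```python
def resolve_device(requested="auto"):
    """Map "auto", "cuda", or "cpu" to a concrete device string."""
    if requested not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "cpu":
        return "cpu"

    try:
        import torch  # imported lazily so the CPU path works without PyTorch
        cuda_ok = torch.cuda.is_available()
    except ImportError:
        cuda_ok = False

    if requested == "cuda" and not cuda_ok:
        raise RuntimeError("CUDA was requested but is not available")
    # Assumed behaviour for "auto": prefer the GPU, fall back to the CPU.
    return "cuda" if cuda_ok else "cpu"
```

Selecting cpu explicitly avoids GPU memory errors at the cost of much slower generation.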

Model Loading Issues

Ensure stable internet connection for model download
Check available disk space (models are several GB)
Verify that the installed transformers version falls within the range pinned in requirements.txt (>=4.45.1,<4.47.0)

Performance Tips

First Run: Model downloading may take time
GPU Memory: 8GB+ VRAM recommended for optimal performance
Caching: Models are cached after first load for faster subsequent runs
Voice Quality: Use high-quality reference audio for best results
Parameter Tuning: Lower temperature (0.8) for consistent voice, higher (1.2) for variation
Text Formatting: Use proper punctuation for natural speech rhythm

API Reference

HiggsAudio Node Inputs

Required

MODEL_PATH: Path to HiggsAudio model
AUDIO_TOKENIZER_PATH: Path to audio tokenizer
system_prompt: System prompt for generation control
prompt: Text to convert to speech
max_new_tokens: Maximum tokens to generate
temperature: Sampling temperature
top_p: Nucleus sampling parameter
top_k: Top-k sampling parameter
device: Computation device

Optional

voice_preset: Voice preset selection
reference_audio: Custom reference audio
reference_text: Text corresponding to reference audio
audio_priority: Audio source prioritization

Output

output: Generated audio in ComfyUI format
used_voice_info: Information about which voice source was used
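
For readers unfamiliar with how ComfyUI nodes declare their sockets, the inputs and outputs listed above could be expressed roughly as below. This is an illustrative sketch following ComfyUI's general node conventions, not the class from nodes.py. The class name, the custom socket type names, the "AUDIO" and "STRING" output types, and the defaults for max_new_tokens, temperature, and top_p are assumptions; only the top_k default of 50 and the value ranges come from this README.

```python
class HiggsAudioSketch:
    """Illustrative ComfyUI node declaration; not the wrapper's actual class."""

    CATEGORY = "Higgs Audio"
    FUNCTION = "generate"
    RETURN_TYPES = ("AUDIO", "STRING")            # assumed socket types
    RETURN_NAMES = ("output", "used_voice_info")

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "MODEL_PATH": ("MODEL_PATH",),                      # assumed custom type
                "AUDIO_TOKENIZER_PATH": ("AUDIO_TOKENIZER_PATH",),  # assumed custom type
                "system_prompt": ("STRING", {
                    "multiline": True,
                    "default": "Generate audio following instruction.",
                }),
                "prompt": ("STRING", {"multiline": True}),
                # Ranges follow the Generation Parameters section; defaults
                # other than top_k are placeholders.
                "max_new_tokens": ("INT", {"default": 1024, "min": 128, "max": 4096}),
                "temperature": ("FLOAT", {"default": 0.8, "min": 0.0, "max": 2.0, "step": 0.05}),
                "top_p": ("FLOAT", {"default": 0.95, "min": 0.1, "max": 1.0, "step": 0.05}),
                "top_k": ("INT", {"default": 50, "min": -1, "max": 100}),
                "device": (["auto", "cuda", "cpu"],),
            },
            "optional": {
                "voice_preset": ([
                    "belinda", "en_woman", "en_man", "mabel", "vex",
                    "chadwick", "broom_salesman", "zh_man_sichuan", "voice_clone",
                ],),
                "reference_audio": ("AUDIO",),
                "reference_text": ("STRING", {"multiline": True, "default": ""}),
                "audio_priority": (
                    ["auto", "preset_dropdown", "reference_input", "force_preset"],
                ),
            },
        }

    def generate(self, **inputs):
        # The real node runs the model here; the sketch only shows the interface.
        raise NotImplementedError("declaration sketch only")
```

In ComfyUI, a list of strings as the first tuple element renders as a dropdown, which matches how voice_preset, device, and audio_priority are described in this README.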

Requirements

See requirements.txt for the complete list:

torch==2.5.1
torchaudio==2.5.1
transformers>=4.45.1,<4.47.0
librosa
And others...
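
If model loading fails, one thing to rule out is a transformers version outside the range listed above. The following sketch checks the installed version using only the standard library. The helper names are hypothetical, and the comparison looks only at leading numeric components, so pre-release suffixes such as "dev0" or "rc1" are ignored rather than ordered properly.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.45.2' (or '2.5.1+cu121') into a tuple of leading integers."""
    core = text.split("+", 1)[0]  # drop local build tags such as '+cu121'
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(version, lower, upper):
    """True when lower <= version < upper, comparing numeric components only."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


def installed_transformers_ok():
    """Check the installed transformers against the >=4.45.1,<4.47.0 range."""
    try:
        return in_range(metadata.version("transformers"), "4.45.1", "4.47.0")
    except metadata.PackageNotFoundError:
        return False
```

For anything beyond a quick sanity check, pip itself remains the authority on whether the pins in requirements.txt are satisfied.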

Third-Party Licenses

The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily from xcodec. Please see the LICENSE in that directory for complete attribution and licensing information.

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

Support

For issues and questions:

Open an issue on GitHub
Check existing issues for solutions
Provide detailed error messages and system information

Acknowledgments

HiggsAudio team for the underlying model
ComfyUI community for the framework
Contributors and testers

Note: This wrapper requires significant computational resources. A CUDA-compatible GPU with 8GB+ VRAM is recommended for optimal performance.