ComfyUI Extension: ComfyUI Zonos TTS Node

Authored by BahaC

9 stars

A ComfyUI custom node that brings Zonos Text-to-Speech capabilities to your workflows, featuring high-quality speech synthesis and voice cloning.

    ComfyUI Zonos TTS Node

    Features

    • 🎯 High-quality text-to-speech synthesis
    • 🗣️ Voice cloning from reference audio
    • 💾 Local model caching for faster loading
    • 🎚️ Advanced parameter control for speech generation
    • 🌍 Support for English, Japanese, and many other languages
    • ⚡ Multiple model architectures (Transformer/Hybrid)

    Installation

    1. Clone this repository into your ComfyUI custom nodes directory:
    cd ComfyUI/custom_nodes/
    git clone https://github.com/BahaC/ComfyUI-ZonosTTS.git
    
    2. Install the requirements:
    cd ComfyUI-ZonosTTS
    pip install -r requirements.txt
    

    Node Usage

    Zonos Text to Speech

    The node provides a simple interface for text-to-speech conversion with advanced options:

    Inputs

    • text: Input text to synthesize (String)
    • language: Language code selection (en-us, ja-jp)
    • model_name: Choice of model architecture:
      • Zyphra/Zonos-v0.1-transformer: Faster, lighter model
      • Zyphra/Zonos-v0.1-hybrid: Higher quality (requires additional dependencies)
    • audio_file: Reference audio for voice cloning (optional)
    • cfg_scale: Classifier-free guidance scale controlling generation quality (1.0 - 10.0)

    Output

    • audio_path: Path to the generated WAV file
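
The inputs and output listed above map naturally onto ComfyUI's node declaration convention (`INPUT_TYPES`, `RETURN_TYPES`, `FUNCTION`, `CATEGORY`). The following is a minimal sketch of such a declaration, not the extension's actual source. The class name `ZonosTTSSketch`, the method name `synthesize`, the `cfg_scale` default of 2.0, the category string, and the choice to pass `audio_file` as a string path are all assumptions made for illustration.

```python
class ZonosTTSSketch:
    """Illustrative node skeleton mirroring the documented inputs and output."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "text": ("STRING", {"multiline": True, "default": ""}),
                # A list as the first tuple element renders as a dropdown.
                "language": (["en-us", "ja-jp"],),
                "model_name": (
                    ["Zyphra/Zonos-v0.1-transformer", "Zyphra/Zonos-v0.1-hybrid"],
                ),
                # Range taken from the README; the default is an assumption.
                "cfg_scale": (
                    "FLOAT",
                    {"default": 2.0, "min": 1.0, "max": 10.0, "step": 0.1},
                ),
            },
            "optional": {
                # Assumption: the reference audio is passed as a file path.
                "audio_file": ("STRING", {"default": ""}),
            },
        }

    RETURN_TYPES = ("STRING",)
    RETURN_NAMES = ("audio_path",)
    FUNCTION = "synthesize"
    CATEGORY = "audio"

    def synthesize(self, text, language, model_name, cfg_scale, audio_file=None):
        # The real node would run Zonos here and return (path_to_wav,).
        raise NotImplementedError("sketch only")


# ComfyUI discovers nodes through this module-level mapping.
NODE_CLASS_MAPPINGS = {"ZonosTTSSketch": ZonosTTSSketch}
```

Keeping `audio_file` under `"optional"` reflects that voice cloning is opt-in: the node can run with text alone.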

    Model Management

    Models are automatically downloaded and cached in:

    /workspace/ComfyUI/models/TTS/Zonos/
    

    The node implements smart model caching:

    • First run: Downloads and caches the model
    • Subsequent runs: Uses cached model for faster loading
    • Automatic model switching when changing architectures
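
The caching behaviour described above can be pictured as a cache that holds one loaded model at a time and reloads only when the requested name changes. This is a rough sketch under that assumption, not the node's actual implementation. `SingleModelCache` and the `loader` callable are hypothetical names, and the loader stands in for whatever download-and-load step the node performs.

```python
from typing import Any, Callable, Optional


class SingleModelCache:
    """Keep at most one loaded model in memory, keyed by its name."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._name: Optional[str] = None
        self._model: Any = None

    def get(self, model_name: str) -> Any:
        if model_name != self._name:
            # Drop the old reference first so its memory can be reclaimed
            # before the replacement model is loaded.
            self._model = None
            self._name = None
            self._model = self._loader(model_name)
            self._name = model_name
        return self._model
```

With this shape, repeated runs with the same `model_name` reuse the loaded model, and switching between the transformer and hybrid variants triggers exactly one reload.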

    Example Workflows

    Basic Text to Speech

    [Text Input] -> [Zonos TTS] -> [Audio Output]
    

    Voice Cloning

    [Text Input] + [Reference Audio File] -> [Zonos TTS] -> [Audio Output]
    

    Configuration

    Audio Output

    Generated audio files are saved with unique timestamps:

    output/zonos_YYYYMMDD-HHMMSS_UUID.wav
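
A filename in this pattern could be built as in the sketch below. The helper name `make_output_path` is hypothetical, and using the full 32-character `uuid4` hex string for the `UUID` part is an assumption; the node may shorten or format it differently.

```python
import uuid
from datetime import datetime
from pathlib import Path
from typing import Optional


def make_output_path(output_dir: str = "output",
                     now: Optional[datetime] = None) -> Path:
    """Return output/zonos_YYYYMMDD-HHMMSS_UUID.wav for the given time."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    # Assumption: the UUID part is a random uuid4 in hex form.
    unique = uuid.uuid4().hex
    return Path(output_dir) / f"zonos_{stamp}_{unique}.wav"
```

The timestamp keeps files sortable by creation time, while the random suffix avoids collisions when two files are generated within the same second.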
    

    Model Settings

    • Transformer Model
      • Faster inference
      • Lower resource requirements
      • Good for most use cases
    • Hybrid Model
      • Higher quality output
      • Requires additional dependencies
      • More resource intensive

    Requirements

    • Python >= 3.10
    • torch >= 2.0.0
    • torchaudio >= 2.0.0
    • safetensors >= 0.3.0
    • huggingface_hub >= 0.16.0
    • Additional dependencies in requirements.txt

    Troubleshooting

    Common Issues

    1. Model Download Fails
      • Check your internet connection
      • Ensure you have sufficient disk space
      • Try manually downloading to the models directory
    2. Voice Cloning Issues
      • Ensure reference audio is clean and contains only speech
      • Use WAV format for reference audio
      • Keep reference audio under 30 seconds
    3. CUDA Out of Memory
      • Try using the transformer model instead of hybrid
      • Reduce batch size or audio length
      • Free up GPU memory from other applications
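
For the voice cloning advice above, the length of a reference file can be checked before it is passed to the node. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files and raises `wave.Error` for other encodings. The function names and the `MAX_REFERENCE_SECONDS` constant are hypothetical, and the node itself is not known to perform this check.

```python
import wave

# Limit taken from the troubleshooting advice above.
MAX_REFERENCE_SECONDS = 30.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
    return frames / float(rate)


def is_reference_short_enough(path: str,
                              limit: float = MAX_REFERENCE_SECONDS) -> bool:
    """True if the reference clip is shorter than the limit."""
    return wav_duration_seconds(path) < limit
```

A clip that fails this check can be trimmed in any audio editor before it is used as a reference.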

    Credits

    License

    This project is licensed under the terms of the LICENSE file included in the repository.