ComfyUI Extension: TTS Audio Suite

Authored by diodiogod

23 stars

TTS Audio Suite - Universal multi-engine TTS extension for ComfyUI with unified architecture supporting ChatterBox, F5-TTS, and future engines like RVC. Features modular engine adapters, character voice management, comprehensive SRT subtitle support, and advanced audio processing capabilities.

    README

    <a id="readme-top"></a>

    TTS Audio Suite v4.5.25

    Universal multi-engine TTS extension for ComfyUI - evolved from the original ChatterBox Voice project.

    <div align="center"> <img src="images/AllNodesShowcase.png" alt="TTS Audio Suite Nodes Showcase" /> </div>

    A comprehensive ComfyUI extension providing unified Text-to-Speech and Voice Conversion capabilities through multiple engines including ChatterboxTTS, F5-TTS, Higgs Audio 2, and RVC (Retrieval-based Voice Conversion), with a modular architecture designed for extensibility and future engine integrations.

    🚀 Project Evolution Timeline

    🎭 ChatterBox Voice Era                       🌟 Multi-Engine Era
    |                                            |
    v1.0 ───────────► v1.1 ────────► v2.0 ──────────► v3.0 ─────────┐
    Jun 25            Jun 25         Jun 25           Jul 25        │
    │                 │              │                │             │
    Foundation        SRT            Modular          F5-TTS +      │
    ChatterBox        Subtitles      Structure        Audio         │
    Voice Cloning     Timing Node    Refactor         Analyzer      │
                                                                    ▼
    v3.4 ◄──────────────── v3.2 ◄──────────────── v3.1 ◄────────────┘
    Jul 25                 Jul 25                 Jul 25
    │                      │                      │
    Language               Pause                  Character
    Switching              Tags                   Switching
    [German:Bob]           [pause:1s]             [Alice]
    │
    │         ⚙️ TTS Audio Suite Era
    ▼         |
    v4.0 ──────────► v4.3 ──────────► v4.4 ──────────► v4.5
    Aug 25           Aug 25           Aug 25           Aug 25
    │                │                │                │
    ⚠️BREAKING        RVC +            Silent           Higgs Audio 2
    Project          Voice            Speech           3rd Engine
    Renamed          Conversion       Analyzer         Voice Cloning
    TTS Audio Suite  + Streaming
    
    <details> <summary><h2>📋 Table of Contents</h2></summary> </details>

    🎥 Demo Videos

    <div align="center"> <a href="https://youtu.be/aHz1mQ2bvEY"> <img src="https://img.youtube.com/vi/aHz1mQ2bvEY/maxresdefault.jpg" width="400" alt="ChatterBox SRT Voice v3.2 - F5-TTS Integration & Features Overview"> </a> <br> <strong><a href="https://youtu.be/aHz1mQ2bvEY">▶️ v3.2 Features Overview (20min) - F5-TTS Integration, Speech Editor & More!</a></strong> </div> <br> <div align="center"> <a href="https://youtu.be/VyOawMrCB1g?si=7BubljRhsudGqG3s"> <img src="https://img.youtube.com/vi/VyOawMrCB1g/maxresdefault.jpg" width="400" alt="ChatterBox SRT Voice Demo"> </a> <br> <strong><a href="https://youtu.be/VyOawMrCB1g?si=7BubljRhsudGqG3s">▶️ Original Demo - SRT Timing & Basic Features</a></strong> </div> <details> <summary><h3>📜 Original ShmuelRonen ChatterBox TTS Nodes</h3></summary> <div align="center"> <img src="https://github.com/user-attachments/assets/4197818c-8093-4da4-abd5-577943ac902c" width="45%" alt="ChatterBox TTS Nodes" /> <img src="https://github.com/user-attachments/assets/701c219b-12ff-4567-b414-e58560594ffe" width="45%" alt="ChatterBox Voice Capture" /> </div>
    • Voice Recording: Smart silence detection for voice capture
    • Enhanced Chunking: Intelligent text splitting with multiple combination methods
    • Unlimited Text Length: No character limits with smart processing

    Original creator: ShmuelRonen

    </details> <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    Features

    • 🎤 Multi-Engine TTS - ChatterBox TTS, F5-TTS, and Higgs Audio 2 with voice cloning, reference audio synthesis, and production-grade quality
    • 🎙️ Higgs Audio 2 Voice Cloning - State-of-the-art voice cloning with 30+ second reference audio and multi-speaker conversation support
    • 🔄 Voice Conversion - ChatterBox VC with iterative refinement + RVC real-time conversion using .pth character models
    • 🎙️ Voice Capture & Recording - Smart silence detection and voice input recording
    • 🎭 Character & Language Switching - Multi-character TTS with [CharacterName] tags, alias system, and [language:character] syntax for seamless model switching
    • 🌍 Multi-language Support - ChatterBox (English, German, Norwegian) + F5-TTS (English, German, Spanish, French, Japanese, Hindi, and more)
    • 😤 Emotion Control - Unique exaggeration parameter for expressive speech
    • 📝 Enhanced Chunking - Intelligent text splitting for long content with multiple combination methods
    • 🎵 Advanced Audio Processing - Optional FFmpeg support for premium audio quality with graceful fallback
    • 🤐 Vocal/Noise Removal - AI-powered vocal separation, noise reduction, and echo removal with GPU acceleration → 📖 Complete Guide
    • 🌊 Audio Wave Analyzer - Interactive waveform visualization and precise timing extraction for F5-TTS workflows → 📖 Complete Guide
    • 🗣️ Silent Speech Analyzer - Video analysis with experimental viseme detection, mouth movement tracking, and base SRT timing generation from silent video using MediaPipe
    • ⚙️ Parallel Processing - Configurable worker-based processing via batch_size parameter (Note: sequential processing with batch_size=0 remains optimal for performance)
    <div align="right"><a href="#-table-of-contents">Back to top</a></div> <details> <summary><h2>🆕 What's New in my Project?</h2></summary> <details> <summary><h3>📺 SRT Timing and TTS Node</h3></summary> <img title="" src="images/srt.png" alt="SRT Node Screenshot" width="500" data-align="center">

    The "ChatterBox SRT Voice TTS" node generates speech from SRT (SubRip Subtitle) content and aligns the generated audio with the subtitle timings.

    Key SRT Features:

    • SRT-Style Processing: Generates TTS from SRT-formatted input, aligning audio with subtitle timings
    • smart_natural Timing Mode: Intelligent shifting logic that prevents overlaps and ensures natural speech flow
    • Adjusted_SRT Output: Provides actual timings for generated audio for accurate post-processing
    • Segment-Level Caching: Only regenerates modified segments, significantly speeding up workflows

    For comprehensive technical information, refer to the SRT_IMPLEMENTATION.md file.
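
    To illustrate the kind of input the SRT node consumes, here is a minimal sketch of reading SRT content into timed cues. It is not the extension's parser; the names `Cue`, `to_seconds`, and `parse_srt` are hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass

# HH:MM:SS,mmm as used by the SubRip format
TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:00:04,500 to seconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[Cue]:
    """Split SRT content into cues; blank lines separate cues."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # this sketch simply skips malformed blocks
        start, end = lines[1].split("-->")
        cues.append(
            Cue(int(lines[0]), to_seconds(start), to_seconds(end), "\n".join(lines[2:]))
        )
    return cues
```

    Each cue's `end - start` is the time window the generated audio has to fit, which is the quantity a timing mode would work with.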

    </details> <details> <summary><h3>🆕 F5-TTS Integration and 🆕 Audio Analyzer</h3></summary> <img title="" src="images/waveanalgif.gif" alt="Audio Wave gif" width="500" data-align="center">
    • F5-TTS Voice Synthesis: High-quality voice cloning with reference audio + text
    • Audio Wave Analyzer: Interactive waveform visualization for precise timing extraction
    • Multi-language Support: English, German, Spanish, French, Japanese models
    • Speech Editing Workflows: Advanced F5-TTS editing capabilities
    </details> <details> <summary><h3>🗣️ Silent Speech Analyzer</h3></summary>

    NEW in v4.4.0: Video analysis and mouth movement detection for silent video processing!

    • Mouth Movement Analysis: Real-time detection of mouth shapes and movements from video
    • Experimental Viseme Classification: Approximate detection of vowels (A, E, I, O, U) and consonants (B, F, M, etc.) - results are experimental approximations, not precise
    • 3-Level Analysis System:
      • Frame-level mouth movement detection
      • Syllable grouping with temporal analysis
      • Word prediction using CMU Pronouncing Dictionary (135K+ words)
    • Base SRT Generation: Creates timing-focused SRT files with start/end speech timing as foundation for user editing
    • MediaPipe Integration: Production-ready analysis using Google's MediaPipe framework
    • Visual Feedback: Preview videos with overlaid detection results
    • Automatic Phonetic Placeholders: Word predictions provide phonetically-sensible placeholders, but phrases require user editing for meaningful content
    • TTS Integration: SRT output designed for use with TTS SRT nodes after manual content editing

    Perfect for:

    • Creating base timing templates from silent video footage
    • Animation and VFX reference timing
    • Foundation for manual subtitle creation

    Important Notes:

    • OpenSeeFace provider is experimental and not recommended for production use - MediaPipe is the stable solution
    • Viseme detection is an experimental approximation - expect to manually edit both timing and content
    • Generated text placeholders are phonetic suggestions, not meaningful sentences
    </details> <details> <summary><h3>🎙️ Higgs Audio 2 Voice Cloning</h3></summary>

    NEW in v4.5.0: State-of-the-art voice cloning technology with advanced neural voice replication!

    • High-Quality Voice Cloning: Clone any voice from 30+ second reference audio with exceptional fidelity
    • Multi-Speaker Conversations: Native support for character switching within conversations
    • Real-Time Processing: Generate speech in cloned voices with minimal latency
    • Universal Integration: Works seamlessly with existing TTS Text and TTS SRT nodes

    Key Capabilities:

    • Voice Cloning from Reference Audio: Upload any 30+ second audio file for voice replication
    • Multi-Language Support: English (tested), with potential support for Chinese, Korean, German, and Spanish (based on model training data)
    • Character Switching: Use [CharacterName] syntax for multi-speaker dialogues
    • Advanced Generation Control: Fine-tune temperature, top-p, top-k, and token limits
    • Smart Chunking: Automatic handling of unlimited text length with seamless audio combination
    • Intelligent Caching: Instant regeneration of previously processed content

    Technical Features:

    • Modular Architecture: Clean integration with unified TTS system
    • Automatic Model Management: Downloads and organizes models in ComfyUI/models/TTS/HiggsAudio/ structure
    • Progress Tracking: Real-time generation feedback with tqdm progress bars
    • Voice Reference Discovery: Flexible voice file management system

    Quick Start:

    1. Add Higgs Audio Engine node to configure voice cloning parameters
    2. Connect to TTS Text or TTS SRT node for generation
    3. Specify reference audio file or use voice discovery system
    4. Generate high-quality cloned speech with automatic optimization

    Perfect for:

    • Voice acting and character dialogue creation
    • Audiobook narration with consistent voice characteristics
    • Multi-speaker content with distinct voice personalities
    • Professional voice replication for content creation
    </details> <details> <summary><h3>🎭 Character & Narrator Switching</h3></summary>

    NEW in v3.1.0: Seamless character switching for both F5TTS and ChatterBox engines!

    • Multi-Character Support: Use [CharacterName] tags to switch between different voices
    • Voice Folder Integration: Organized character voice management system
    • 🏷️ Character Aliases: User-friendly alias system - use [Alice] instead of [female_01] with #character_alias_map.txt
    • Robust Fallback: Graceful handling when characters not found (no errors!)
    • Universal Compatibility: Works with both F5TTS and ChatterBox TTS engines
    • SRT Integration: Character switching within subtitle timing
    • Backward Compatible: Existing workflows work unchanged

    📖 Complete Character Switching Guide

    Example usage:

    Hello! This is the narrator speaking.
    [Alice] Hi there! I'm Alice, nice to meet you.
    [Bob] And I'm Bob! Great to meet you both.
    Back to the narrator for the conclusion.
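
    As a rough illustration of how text like the example above could be split into per-character segments, here is a minimal sketch. It is not the extension's implementation: the function name `split_by_character` is hypothetical, and the rule that every untagged line falls back to the narrator is an assumption taken from the example.

```python
import re

# [Name] tags only; tags containing a colon (language or pause tags) are ignored here
TAG_RE = re.compile(r"\[([^\[\]:]+)\]")


def split_by_character(text: str, default: str = "narrator") -> list[tuple[str, str]]:
    """Return (character, text) pairs; each line starts in the default voice."""
    segments = []
    for line in text.splitlines():
        current = default
        pos = 0
        for match in TAG_RE.finditer(line):
            chunk = line[pos:match.start()].strip()
            if chunk:
                segments.append((current, chunk))
            current = match.group(1).strip()  # switch voice at the tag
            pos = match.end()
        tail = line[pos:].strip()
        if tail:
            segments.append((current, tail))
    return segments
```

    Applied to the four-line example, this yields segments for narrator, Alice, Bob, and narrator, in that order.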
    
    </details> <details> <summary><h3>🌍 Language Switching with Bracket Syntax</h3></summary>

    NEW in v3.4.0: Seamless language switching using simple bracket notation!

    • Language Code Syntax: Use [language:character] tags to switch languages and models automatically
    • Smart Model Loading: Automatically loads correct language models (F5-DE, F5-FR, German, Norwegian, etc.)
    • Flexible Aliases (v3.4.3): Support for [German:Alice], [Brazil:Bob], [USA:], [Portugal:] - no need to remember language codes!
    • Standard Format: Also supports traditional [fr:Alice], [de:Bob], or [es:] (language only) patterns
    • Character Integration: Combines perfectly with character switching and alias system
    • Performance Optimized: Language groups processed efficiently to minimize model switching
    • Alias Support: Language defaults work with character alias system

    Supported Languages:

    • F5-TTS: English (en), German (de), Spanish (es), French (fr), Italian (it), Japanese (jp), Thai (th), Portuguese (pt), Hindi (hi)
    • ChatterBox: English (en), German (de), Norwegian (no/nb/nn)

    Example usage:

    Hello! This is English text with the default model.
    [de:Alice] Hallo! Ich spreche Deutsch mit Alice's Stimme.
    [fr:] Bonjour! Je parle français avec la voix du narrateur.
    [es:Bob] ¡Hola! Soy Bob hablando en español.
    Back to English with the original model.
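
    The bracket syntax above can be sketched as a small parser. This is only an illustration under stated assumptions: `parse_language_tag` is a hypothetical name, the alias table is a made-up subset (the extension ships its own mapping), and falling back to English and the narrator for untagged lines is inferred from the example.

```python
import re

LANG_TAG_RE = re.compile(r"^\[([A-Za-z-]+):([^\]]*)\]\s*(.*)$")

# Hypothetical alias table for illustration only
LANGUAGE_ALIASES = {"german": "de", "french": "fr", "spanish": "es", "usa": "en"}

# Prefixes that use the same [name:value] shape but are not languages
NON_LANGUAGE_PREFIXES = {"pause", "wait", "stop"}


def parse_language_tag(line, default_lang="en", default_character="narrator"):
    """Return (language, character, text) for one line of input."""
    stripped = line.strip()
    match = LANG_TAG_RE.match(stripped)
    if match is None or match.group(1).lower() in NON_LANGUAGE_PREFIXES:
        return default_lang, default_character, stripped
    prefix, character, text = match.groups()
    language = LANGUAGE_ALIASES.get(prefix.lower(), prefix.lower())
    # An empty character part, as in [fr:], keeps the default voice
    return language, character.strip() or default_character, text
```

    For instance, `parse_language_tag("[fr:] Bonjour!")` keeps the narrator voice while switching the language to `fr`.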
    

    Advanced SRT Integration:

    1
    00:00:01,000 --> 00:00:04,000
    Hello! Welcome to our multilingual show.
    
    2
    00:00:04,500 --> 00:00:08,000
    [de:female_01] Willkommen zu unserer mehrsprachigen Show!
    
    3
    00:00:08,500 --> 00:00:12,000
    [fr:] Bienvenue à notre émission multilingue!
    
    </details> <details> <summary><h3>🔄 Iterative Voice Conversion</h3></summary>

    NEW: Progressive voice refinement with intelligent caching for instant experimentation!

    • Refinement Passes: Multiple conversion iterations (1-30, recommended 1-5)
    • Smart Caching: Results cached up to 5 iterations - change from 5→3→4 passes instantly
    • Progressive Quality: Each pass refines output to sound more like target voice
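
    The caching behaviour described above can be pictured with a small sketch. It is an assumption about the general idea, not the extension's code: `refine` and `convert` are hypothetical names, the cache is a plain dict keyed by pass number for a single input, and the 5-iteration limit is left out for brevity.

```python
from typing import Callable, Dict


def refine(audio, passes: int, convert: Callable, cache: Dict[int, object]):
    """Run `convert` repeatedly, reusing any pass already stored in `cache`."""
    if passes < 1:
        raise ValueError("passes must be at least 1")
    # Resume from the deepest cached pass that does not exceed the request.
    done = max((n for n in cache if n <= passes), default=0)
    current = cache[done] if done else audio
    for n in range(done + 1, passes + 1):
        current = convert(current)
        cache[n] = current  # keep every intermediate pass for later requests
    return current
```

    After a 5-pass run, asking for 3 or 4 passes is a dictionary lookup, which is why switching 5→3→4 needs no recomputation.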
    </details> <details> <summary><h3>🎵 RVC Voice Conversion Integration</h3></summary>

    NEW in v4.1.0: Professional-grade RVC (Retrieval-based Voice Conversion) with .pth character models!

    • RVC Character Models: Load .pth voice models with 🎭 Load RVC Character Model node
    • Unified Voice Changer: Full RVC integration in the Voice Changer node
    • Iterative Refinement: 1-30 passes with smart caching (like ChatterBox)
    • Enhanced Quality: Automatic .index file loading for improved voice similarity
    • Auto-Download: Required models download from official sources automatically
    • Cache Intelligence: Skip recomputation - change 5→3→4 passes instantly
    • Neural Network Quality: High-quality voice conversion using trained RVC models

    📖 See RVC Models Setup for detailed installation guide

    How it works:

    1. Load your .pth RVC model with 🎭 Load RVC Character Model
    2. Connect to 🔄 Voice Changer, select "RVC" engine
    3. Process with iterative refinement for progressive quality improvement
    4. Results cached for instant experimentation with different pass counts
    </details> <details> <summary><h3>⏸️ Pause Tags System</h3></summary>

    NEW: Intelligent pause insertion for natural speech timing control!

    • Smart Pause Syntax: Use pause tags anywhere in your text with multiple aliases
    • Flexible Duration Formats:
      • Seconds: [pause:1.5], [wait:2s], [stop:3]
      • Milliseconds: [pause:500ms], [wait:1200ms], [stop:800ms]
      • Supported aliases: pause, wait, stop (all work identically)
    • Character Integration: Pause tags work seamlessly with character switching
    • Intelligent Caching: Changing pause durations won't regenerate unchanged text segments
    • Universal Support: Works across all TTS nodes (ChatterBox, F5-TTS, SRT)
    • Automatic Processing: No additional parameters needed - just add tags to your text

    Example usage:

    Welcome to our show! [pause:1s] Today we'll discuss exciting topics.
    [Alice] I'm really excited! [wait:500ms] This will be great.
    [stop:2] Let's get started with the main content.
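
    The duration formats listed above can be parsed with a single regular expression. The following is a minimal sketch, not the extension's parser; `split_pauses` is a hypothetical name, and a bare number is treated as seconds, as the format list states.

```python
import re

# [pause:1.5], [wait:2s], [stop:800ms] - the three aliases behave identically
PAUSE_RE = re.compile(r"\[(?:pause|wait|stop):(\d+(?:\.\d+)?)(ms|s)?\]")


def split_pauses(text: str) -> list[tuple[str, object]]:
    """Return ("text", str) and ("pause", seconds) items in document order."""
    items = []
    pos = 0
    for match in PAUSE_RE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            items.append(("text", chunk))
        value = float(match.group(1))
        unit = match.group(2) or "s"  # no unit means seconds
        items.append(("pause", value / 1000 if unit == "ms" else value))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        items.append(("text", tail))
    return items
```

    Because the text chunks are separate items from the pauses, a cache keyed on chunk text would be unaffected when only a pause duration changes.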
    
    </details> <details> <summary><h3>🌍 Multi-language ChatterBox Support</h3></summary>

    NEW in v3.3.0: ChatterBox TTS and SRT nodes now support multiple languages with automatic model management!

    Supported Languages:

    • 🇺🇸 English: Original ResembleAI model (default)
    • 🇩🇪 German: High-quality German ChatterBox model (stlohrey/chatterbox_de)
    • 🇳🇴 Norwegian: Norwegian ChatterBox model (akhbar/chatterbox-tts-norwegian)

    Key Features:

    • Language Dropdown: Simple language selection in all ChatterBox nodes
    • Auto-Download: Models download automatically on first use (~1GB per language)
    • Local Priority: Prefers locally installed models over downloads for offline use
    • Safetensors Support: Modern format support for newer language models
    • Seamless Integration: Works with existing workflows - just select your language

    Usage: Select language from dropdown → First generation downloads model → Subsequent generations use cached model

    </details> <details> <summary><h3>⚙️ Universal Streaming Architecture</h3></summary>

    NEW in v4.3.0: Complete architectural overhaul implementing universal streaming system with parallel processing capabilities!

    Key Features:

    • Universal Streaming Infrastructure: Unified processing system eliminating engine-specific code complexity
    • Parallel Processing: Configurable worker-based processing via batch_size parameter
    • Thread-Safe Design: Stateless wrapper architecture eliminates shared state corruption
    • Future-Proof: New engines require only adapter implementation

    Performance Notes:

    • Sequential Recommended: Use batch_size=0 for optimal performance (sequential processing)
    • Parallel Available: batch_size > 1 enables parallel workers, but this is typically slower due to GPU inference characteristics
    • Memory Efficiency: Improved model sharing prevents memory exhaustion when switching modes
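
    To make the batch_size semantics above concrete, here is a minimal sketch of a sequential path and a worker-based path. It is not the extension's streaming code: `process_chunks` and `synthesize` are hypothetical names, and treating batch_size=1 the same as sequential is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunks(chunks, synthesize, batch_size=0):
    """Synthesize chunks in order; batch_size sets the worker count."""
    if batch_size <= 1:
        # Sequential path (the recommended batch_size=0 lands here).
        return [synthesize(chunk) for chunk in chunks]
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        # map() yields results in input order even if workers finish out of order.
        return list(pool.map(synthesize, chunks))
```

    With a single GPU, the workers mostly queue behind one another for the same device, which is consistent with the note that parallel mode is typically slower.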

    → 📖 Read Technical Details

    </details> </details>

    🚀 Quick Start

    Option 1: ComfyUI Manager (Recommended) ✨

    One-click installation with intelligent dependency management:

    1. Use ComfyUI Manager to install "TTS Audio Suite"
    2. That's it! ComfyUI Manager automatically runs our install.py script which handles:
      • ✅ Python 3.13 compatibility (MediaPipe → OpenSeeFace fallback)
      • ✅ Dependency conflicts (NumPy, librosa, etc.)
      • ✅ All bundled engines (ChatterBox, F5-TTS, Higgs Audio)
      • ✅ RVC voice conversion dependencies
      • ✅ Intelligent conflict resolution with --no-deps handling

    Python 3.13 Support:

    • 🟢 All TTS engines: ChatterBox, F5-TTS, Higgs Audio ✅ Working
    • 🟢 RVC voice conversion: ✅ Working
    • 🟢 OpenSeeFace mouth movement: ✅ Working (experimental)
    • 🔴 MediaPipe mouth movement: ❌ Incompatible (use OpenSeeFace)

    Option 2: Manual Installation

    Same intelligent installer, manual setup:

    1. Clone the repository

      cd ComfyUI/custom_nodes
      git clone https://github.com/diodiogod/TTS-Audio-Suite.git
      cd TTS-Audio-Suite
      
    2. Run the intelligent installer:

      ComfyUI Portable:

      # Windows:
      ..\..\..\python_embeded\python.exe install.py
      
      # Linux/Mac:
      ../../../python_embeded/python.exe install.py
      

      ComfyUI with venv/conda:

      # First activate your ComfyUI environment, then:
      python install.py
      

      The installer automatically handles all dependency conflicts and Python version compatibility.

    3. Download models manually (optional; they auto-download on first run)

      • Download from HuggingFace ChatterBox
      • Place in ComfyUI/models/TTS/chatterbox/English/ (recommended) or ComfyUI/models/chatterbox/ (legacy)
    4. Try a Workflow

    5. Restart ComfyUI and look for 🎤 TTS Audio Suite nodes

    🧪 Python 3.13 Users: Installation is fully supported! The system automatically uses OpenSeeFace for mouth movement analysis when MediaPipe is unavailable.

    Need F5-TTS? Also download F5-TTS models to ComfyUI/models/TTS/F5-TTS/ (recommended) or ComfyUI/models/F5-TTS/ (legacy) from the links in the detailed installation below.

    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    Installation

    <details> <summary>📋 Detailed Installation Guide (Click to expand if you're having dependency issues)</summary>

    This section provides a detailed guide for installing TTS Audio Suite, covering different ComfyUI installation methods.

    Prerequisites

    • ComfyUI installation (Portable, Direct with venv, or through Manager)

    • Python 3.12 or higher

    • System libraries (Linux only):

      # Ubuntu/Debian - Required for audio processing
      sudo apt-get install portaudio19-dev libsamplerate0-dev
      
      # Fedora/RHEL
      sudo dnf install portaudio-devel libsamplerate-devel
      

      📋 Why needed? libsamplerate0-dev provides audio resampling libraries for packages like resampy and soxr. portaudio19-dev enables voice recording features.

    • macOS dependencies:

      brew install portaudio
      
    • Windows: No additional system dependencies needed (libraries come pre-compiled)

    Installation Methods

    1. Portable Installation

    For portable installations, follow these steps:

    1. Clone the repository into the ComfyUI/custom_nodes folder:

      cd ComfyUI/custom_nodes
      git clone https://github.com/diodiogod/TTS-Audio-Suite.git
      
    2. Navigate to the cloned directory:

      cd TTS-Audio-Suite
      
    3. Install the required dependencies. Important: Use the python.exe executable located in your ComfyUI portable installation with environment isolation flags.

      ../../../python_embeded/python.exe -m pip install -r requirements.txt --no-user
      

      Why the --no-user flag?

      • Prevents installing to your system Python's user directory, which can cause import conflicts
      • Ensures packages install only to the portable environment for proper isolation

    2. Direct Installation with venv

    If you have a direct installation with a virtual environment (venv), follow these steps:

    1. Clone the repository into the ComfyUI/custom_nodes folder:

      cd ComfyUI/custom_nodes
      git clone https://github.com/diodiogod/TTS-Audio-Suite.git
      
    2. Activate your ComfyUI virtual environment. This is crucial to ensure dependencies are installed in the correct environment. The method to activate the venv may vary depending on your setup. Here's a common example:

      cd ComfyUI
      . ./venv/bin/activate
      

      or on Windows:

      ComfyUI\venv\Scripts\activate
      
    3. Navigate to the cloned directory:

      cd custom_nodes/TTS-Audio-Suite
      
    4. Install the required dependencies using pip:

      pip install -r requirements.txt
      

    3. Installation through the ComfyUI Manager

    1. Install the ComfyUI Manager if you haven't already.

    2. Use the Manager to install the "TTS Audio Suite" node.

    3. The manager might handle dependencies automatically, but it's still recommended to verify the installation. Navigate to the node's directory:

      cd ComfyUI/custom_nodes/TTS-Audio-Suite
      
    4. Activate your ComfyUI virtual environment (see instructions in "Direct Installation with venv").

    5. If you encounter issues, manually install the dependencies:

      pip install -r requirements.txt
      

    Troubleshooting Dependency Issues

    System Dependencies (Linux)

    Our install script automatically detects missing system libraries and will display helpful error messages like:

    [!] Missing system dependencies detected!
    ============================================================
    SYSTEM DEPENDENCIES REQUIRED
    ============================================================
    • libsamplerate0-dev (for audio resampling)
    • portaudio19-dev (for voice recording)
    
    Please install with:
    # Ubuntu/Debian:
    sudo apt-get install libsamplerate0-dev portaudio19-dev
    
    # Fedora/RHEL:
    sudo dnf install libsamplerate-devel portaudio-devel
    ============================================================
    Then run this install script again.
    

    Python Environment Issues

    A common problem is installing dependencies in the wrong Python environment. Always ensure you are installing dependencies within your ComfyUI's Python environment.

    • Verify your Python environment: After activating your venv or navigating to your portable ComfyUI installation, check the Python executable being used:

      which python        # Linux/macOS
      where python        # Windows (Command Prompt)
      

      This should point to the Python executable within your ComfyUI installation (e.g., ComfyUI/python_embeded/python.exe or ComfyUI/venv/bin/python).

    • If s3tokenizer fails to install: This dependency can be problematic. Try upgrading your pip and setuptools:

      python -m pip install --upgrade pip setuptools wheel
      

      Then, try installing the requirements again.

    • If you cloned the node manually (without the Manager): Make sure you install the requirements.txt file.
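
    The environment check described above can also be done from inside Python, which works the same way on every platform. This is a small sketch using only the standard library; `describe_environment` and `is_inside` are hypothetical helper names.

```python
import sys
from pathlib import Path


def describe_environment() -> dict:
    """Report which interpreter and environment are active."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # In a venv, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
    }


def is_inside(executable: str, comfy_root: str) -> bool:
    """True if the interpreter path lies under the given ComfyUI folder."""
    root = Path(comfy_root).resolve()
    return root in Path(executable).resolve().parents


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

    Run it with the same interpreter you use for `pip install`; if `executable` does not sit under your ComfyUI folder, the dependencies are going to the wrong environment.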

    Updating the Node

    To update the node to the latest version:

    1. Navigate to the node's directory:

      cd ComfyUI/custom_nodes/TTS-Audio-Suite
      
    2. Pull the latest changes from the repository:

      git pull
      
    3. Reinstall the dependencies (in case they have been updated):

      pip install -r requirements.txt
      
    </details>

    1. Clone Repository

    cd ComfyUI/custom_nodes
    git clone https://github.com/diodiogod/TTS-Audio-Suite.git
    

    2. Install Dependencies

    Some dependencies, particularly s3tokenizer, can occasionally cause installation issues on certain Python setups (e.g., Python 3.10, sometimes used by tools like Stability Matrix).

    To minimize potential problems, it's highly recommended to first ensure your core packaging tools are up-to-date in your ComfyUI's virtual environment:

    python -m pip install --upgrade pip setuptools wheel
    

    After running the command above, install the node's specific requirements:

    pip install -r requirements.txt
    

    3. Optional: Install FFmpeg for Enhanced Audio Processing

    ChatterBox Voice now supports FFmpeg for high-quality audio stretching. While not required, it's recommended for the best audio quality:

    Windows:

    winget install FFmpeg
    # or with Chocolatey
    choco install ffmpeg
    

    macOS:

    brew install ffmpeg
    

    Linux:

    # Ubuntu/Debian
    sudo apt-get install ffmpeg
    
    # Fedora
    sudo dnf install ffmpeg
    

    If FFmpeg is not available, ChatterBox will automatically fall back to using the built-in phase vocoder method for audio stretching - your workflows will continue to work without interruption.
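
    A fallback of this kind usually starts with checking whether the `ffmpeg` binary is on the PATH. The sketch below shows that check with the standard library; it is not the extension's code, and the function name and the backend labels `"ffmpeg"` and `"phase_vocoder"` are hypothetical.

```python
import shutil


def pick_stretch_backend(which=shutil.which) -> str:
    """Return "ffmpeg" when the binary is on PATH, else "phase_vocoder"."""
    # shutil.which returns the full path to the executable, or None if absent.
    return "ffmpeg" if which("ffmpeg") else "phase_vocoder"
```

    The `which` parameter exists only so the lookup can be swapped out in tests; calling `pick_stretch_backend()` with no arguments checks the real PATH.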

    4. Download Models

    Download the ChatterboxTTS models and place them in the new organized structure:

    ComfyUI/models/TTS/chatterbox/    ← Recommended (new structure)
    

    Or use the legacy location (still supported):

    ComfyUI/models/chatterbox/        ← Legacy (still works)
    

    Required files:

    • conds.pt (105 KB)
    • s3gen.pt (~1 GB)
    • t3_cfg.pt (~1 GB)
    • tokenizer.json (25 KB)
    • ve.pt (5.5 MB)

    Download from: https://huggingface.co/ResembleAI/chatterbox/tree/main
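
    After a manual download, a quick check that all five files listed above are in place can save a failed first run. A minimal sketch, with `missing_model_files` as a hypothetical helper name:

```python
from pathlib import Path

# File names taken from the "Required files" list above
REQUIRED_FILES = ("conds.pt", "s3gen.pt", "t3_cfg.pt", "tokenizer.json", "ve.pt")


def missing_model_files(model_dir) -> list[str]:
    """List the required ChatterBox files absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

    For example, `missing_model_files("ComfyUI/models/TTS/chatterbox")` returns an empty list when the folder is complete.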

    4.1. Multilanguage ChatterBox Models (Optional)

    NEW in v3.3.0: ChatterBox now supports multiple languages! Models will auto-download on first use, or you can manually install them for offline use.

    For manual installation, create language-specific folders in the organized structure:

    ComfyUI/models/TTS/chatterbox/    ← Recommended structure
    ├── English/          # Optional - for explicit English organization
    │   ├── conds.pt
    │   ├── s3gen.pt
    │   ├── t3_cfg.pt
    │   ├── tokenizer.json
    │   └── ve.pt
    ├── German/           # German language models
    │   ├── conds.safetensors
    │   ├── s3gen.safetensors
    │   ├── t3_cfg.safetensors
    │   ├── tokenizer.json
    │   └── ve.safetensors
    └── Norwegian/        # Norwegian language models
        ├── conds.safetensors
        ├── s3gen.safetensors
        ├── t3_cfg.safetensors
        ├── tokenizer.json
        └── ve.safetensors
    

    Note: Legacy location ComfyUI/models/chatterbox/ still works for backward compatibility.

    Available ChatterBox Language Models:

    | Language  | HuggingFace Repository          | Format       | Auto-Download |
    | --------- | ------------------------------- | ------------ | ------------- |
    | English   | ResembleAI/chatterbox           | .pt          | ✅            |
    | German    | stlohrey/chatterbox_de          | .safetensors | ✅            |
    | Norwegian | akhbar/chatterbox-tts-norwegian | .safetensors | ✅            |

    Usage: Simply select your desired language from the dropdown in ChatterBox TTS or SRT nodes. First generation will auto-download the model (~1GB per language).

    5. F5-TTS Models (Optional)

    For F5-TTS voice synthesis capabilities, download F5-TTS models and place them in the organized structure:

    ComfyUI/models/TTS/F5-TTS/       ← Recommended (new structure)
    

    Or use the legacy location (still supported):

    ComfyUI/models/F5-TTS/           ← Legacy (still works)
    

    Available F5-TTS Models:

    | Model          | Language         | Download    | Size   |
    | -------------- | ---------------- | ----------- | ------ |
    | F5TTS_Base     | English          | HuggingFace | ~1.2GB |
    | F5TTS_v1_Base  | English (v1)     | HuggingFace | ~1.2GB |
    | E2TTS_Base     | English (E2-TTS) | HuggingFace | ~1.2GB |
    | F5-DE          | German           | HuggingFace | ~1.2GB |
    | F5-ES          | Spanish          | HuggingFace | ~1.2GB |
    | F5-FR          | French           | HuggingFace | ~1.2GB |
    | F5-JP          | Japanese         | HuggingFace | ~1.2GB |
    | F5-Hindi-Small | Hindi            | HuggingFace | ~632MB |

    Vocoder (Optional but Recommended):

    ComfyUI/models/TTS/F5-TTS/vocos/     ← Recommended
    ├── config.yaml
    └── pytorch_model.bin
    

    Legacy location also supported: ComfyUI/models/F5-TTS/vocos/

    Download from: Vocos Mel-24kHz

    Complete Folder Structure (Recommended):

    ComfyUI/models/TTS/F5-TTS/
    ├── F5TTS_Base/
    │   ├── model_1200000.safetensors    ← Main model file
    │   └── vocab.txt                    ← Vocabulary file
    ├── vocos/                           ← For offline vocoder
    │   ├── config.yaml
    │   └── pytorch_model.bin
    └── F5TTS_v1_Base/
        ├── model_1250000.safetensors
        └── vocab.txt
    

    Required Files for Each Model:

    • model_XXXXXX.safetensors - The main model weights
    • vocab.txt - Vocabulary/tokenizer file (download from same HuggingFace repo)

    Note: F5-TTS uses internal config files, no config.yaml needed. Vocos vocoder doesn't need vocab.txt.

    Note: F5-TTS models and vocoder will auto-download from HuggingFace if not found locally. The first generation may take longer while downloading (~1.2GB per model).

    6. F5-TTS Voice References Setup

    For easy voice reference management, create a dedicated voices folder:

    ComfyUI/models/voices/
    ├── character1.wav
    ├── character1.reference.txt ← Contains: "Hello, I am character one speaking clearly."
    ├── character1.txt          ← Contains: "BBC Radio sample, licensed under CC3..."
    ├── narrator.wav
    ├── narrator.txt            ← Contains: "This is the narrator voice for storytelling."
    ├── my_voice.wav
    └── my_voice.txt            ← Contains: "This is my personal voice sample."
    

    Voice Reference Requirements:

    • Audio files: WAV format, 5-30 seconds, clean speech, 24kHz recommended
    • Text files: Exact transcription of what's spoken in the audio file
    • Naming: filename.wav + filename.reference.txt (preferred) or filename.txt (fallback)
    • Character Names: Character name = audio filename (without extension). Subfolders supported for organization.
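
    The naming rules above can be illustrated with a short discovery routine. It is a sketch under assumptions: `find_transcript` and `discover_voices` are hypothetical helpers, and skipping audio files that have no transcript is a choice made here, not documented behavior of the suite.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def find_transcript(wav_path: Path) -> Optional[Path]:
    """Prefer <name>.reference.txt, fall back to <name>.txt."""
    for suffix in (".reference.txt", ".txt"):
        candidate = wav_path.with_name(wav_path.stem + suffix)
        if candidate.is_file():
            return candidate
    return None


def discover_voices(voices_dir) -> Dict[str, Tuple[Path, Path]]:
    """Map character name -> (audio file, transcript file), searching subfolders too."""
    voices: Dict[str, Tuple[Path, Path]] = {}
    for wav in sorted(Path(voices_dir).rglob("*.wav")):
        transcript = find_transcript(wav)
        if transcript is not None:  # assumption: voices without a transcript are skipped
            voices[wav.stem] = (wav, transcript)
    return voices
```

    With the example folder above, `character1` resolves to `character1.reference.txt` rather than `character1.txt`, because the `.reference.txt` name is tried first.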

    ⚠️ F5-TTS Best Practices: Follow these guidelines to avoid inference failures

    <details> <summary><strong>📋 F5-TTS Inference Guidelines</strong></summary>

    To avoid possible inference failures, make sure you follow these F5-TTS optimization guidelines:

    1. Reference Audio Duration: Use reference audio shorter than 12s and leave some silence (e.g. 1s) at the end. Otherwise the audio may be truncated in the middle of a word, leading to suboptimal generation.

    2. Letter Case Handling: Uppercase letters are uttered letter by letter (this works best in a form like K.F.C.), while lowercase letters are read as common words.

    3. Pause Control: Add spaces (" ") or punctuation (e.g. "," or ".") to introduce explicit pauses.

    4. Punctuation Spacing: If English punctuation marks the end of a sentence, make sure there is a space " " after it. Otherwise the text is not treated as a separate sentence chunk.

    5. Number Processing: Convert numbers to Chinese characters if you want them read in Chinese; otherwise they will be read in English.

    These guidelines help ensure optimal F5-TTS generation quality and prevent common audio artifacts.

    </details>
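
    The first guideline (reference audio under 12s) is easy to check before a run. The sketch below uses only Python's standard `wave` module; the function names are hypothetical and the check covers plain PCM WAV files only.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Length of a PCM WAV file in seconds, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_is_short_enough(path: str, max_seconds: float = 12.0) -> bool:
    """True when the clip is under the 12s guideline from the list above."""
    return wav_duration_seconds(path) < max_seconds
```

    A clip that fails this check can be trimmed in any audio editor, keeping roughly one second of silence at the end as the guideline suggests.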

    7. Higgs Audio 2 Models (Optional - NEW in v4.5.0+)

    For state-of-the-art voice cloning capabilities, Higgs Audio 2 models are automatically downloaded to the organized structure:

    ComfyUI/models/TTS/HiggsAudio/        ← Recommended (new structure)
    ├── higgs-audio-v2-3B/                ← Main model directory
    │   ├── generation/                   ← Generation model files
    │   │   ├── config.json
    │   │   ├── model.safetensors.index.json
    │   │   ├── model-00001-of-00003.safetensors (~3GB)
    │   │   ├── model-00002-of-00003.safetensors (~3GB)
    │   │   ├── model-00003-of-00003.safetensors (~3GB)
    │   │   ├── generation_config.json
    │   │   ├── tokenizer.json
    │   │   ├── tokenizer_config.json
    │   │   └── special_tokens_map.json
    │   └── tokenizer/                    ← Audio tokenizer files
    │       ├── config.json
    │       └── model.pth (~200MB)
    └── voices/                           ← Voice reference files
        ├── character1.wav                ← 30+ second reference audio
        ├── character1.txt                ← Exact transcription
        ├── narrator.wav
        └── narrator.txt
    

    Available Higgs Audio Models (Auto-Download):

    | Model | Type | Source | Size | Auto-Download |
    | ----------------- | ------------- | ----------------------------------------- | ------ | ------------- |
    | higgs-audio-v2-3B | Voice Cloning | bosonai/higgs-audio-v2-generation-3B-base | ~9GB | ✅ |
    | Audio Tokenizer | Tokenization | bosonai/higgs-audio-v2-tokenizer | ~200MB | ✅ |

    Voice Reference Requirements:

    • Audio files: WAV format, 30+ seconds, clean speech, single speaker
    • Text files: Exact transcription of the reference audio
    • Naming: filename.wav + filename.txt (transcription)
    • Quality: Clear, noise-free audio for best voice cloning results

    How Higgs Audio Auto-Download Works:

    1. Select Model: Choose "higgs-audio-v2-3B" in Higgs Audio Engine node
    2. Auto-Download: Both generation model (~9GB) and tokenizer (~200MB) download automatically
    3. Voice References: Place reference audio and transcriptions in voices/ folder
    4. Local Cache: Once downloaded, models are used from local cache for fast loading

    Manual Installation (Optional):

    To pre-download models for offline use:

    # Download generation model files to:
    # ComfyUI/models/TTS/HiggsAudio/higgs-audio-v2-3B/generation/
    
    # Download tokenizer files to:  
    # ComfyUI/models/TTS/HiggsAudio/higgs-audio-v2-3B/tokenizer/
    

    Usage: Simply use the ⚙️ Higgs Audio 2 Engine node → Select model → All required files download automatically!
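
    For the optional manual installation described above, one possible approach is to fetch both repositories with `snapshot_download` from the third-party `huggingface_hub` package. This is a sketch, not the suite's own downloader: the helper names are hypothetical, and it assumes that a plain snapshot of each repository matches the `generation/` and `tokenizer/` layout shown earlier.

```python
from pathlib import Path

# Repository IDs come from the table above; sub-folders follow the layout shown earlier.
HIGGS_DOWNLOADS = {
    "bosonai/higgs-audio-v2-generation-3B-base": "generation",
    "bosonai/higgs-audio-v2-tokenizer": "tokenizer",
}


def plan_higgs_downloads(comfy_root):
    """Return (repo_id, local_dir) pairs for the two Higgs Audio 2 components."""
    base = Path(comfy_root) / "models" / "TTS" / "HiggsAudio" / "higgs-audio-v2-3B"
    return [(repo_id, base / sub) for repo_id, sub in HIGGS_DOWNLOADS.items()]


def download_higgs(comfy_root):
    """Fetch both repositories (about 9GB in total, requires network access)."""
    from huggingface_hub import snapshot_download  # imported lazily; third-party package

    for repo_id, local_dir in plan_higgs_downloads(comfy_root):
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

    `plan_higgs_downloads("ComfyUI")` can be inspected first to confirm the target folders before `download_higgs("ComfyUI")` starts the transfer.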

    8. RVC Models (Optional - NEW in v4.0.0+)

    For Real-time Voice Conversion capabilities, RVC models are automatically downloaded to the organized structure:

    ComfyUI/models/TTS/RVC/          ← Recommended (new structure)
    ├── Claire.pth                   ← Character voice models
    ├── Sayano.pth
    ├── Mae_v2.pth
    ├── Fuji.pth
    ├── Monika.pth
    ├── content-vec-best.safetensors ← Base models (auto-download)
    ├── rmvpe.pt
    ├── hubert/                      ← HuBERT models (auto-organized)
    │   ├── hubert-base-rvc.safetensors
    │   ├── hubert-soft-japanese.safetensors
    │   └── hubert-soft-korean.safetensors
    └── .index/                      ← Index files for better similarity
        ├── added_IVF1063_Flat_nprobe_1_Sayano_v2.index
        ├── added_IVF985_Flat_nprobe_1_Fuji_v2.index
        ├── Monika_v2_40k.index
        └── Sayano_v2_40k.index
    

    Note: Legacy location ComfyUI/models/RVC/ still works for backward compatibility.

    Available RVC Character Models (Auto-Download):

    | Model | Type | Source | Auto-Download |
    | ---------- | --------- | ------------------- | ------------- |
    | Claire.pth | Character | SayanoAI RVC-Studio | ✅ |
    | Sayano.pth | Character | SayanoAI RVC-Studio | ✅ |
    | Mae_v2.pth | Character | SayanoAI RVC-Studio | ✅ |
    | Fuji.pth | Character | SayanoAI RVC-Studio | ✅ |
    | Monika.pth | Character | SayanoAI RVC-Studio | ✅ |

    Required Base Models (Auto-Download):

    | Model | Purpose | Source | Size |
    | ---------------------------- | ---------------- | --------------------------- | ------ |
    | content-vec-best.safetensors | Voice features | lengyue233/content-vec-best | ~300MB |
    | rmvpe.pt | Pitch extraction | lj1995/VoiceConversionWebUI | ~55MB |

    How RVC Auto-Download Works:

    1. Select Character Model: Choose from available models in 🎭 Load RVC Character Model node
    2. Auto-Download: Models download automatically when first selected (with auto_download=True)
    3. Base Models: Required base models download automatically when RVC engine first runs
    4. Index Files: Optional FAISS index files download for improved voice similarity
    5. Local Cache: Once downloaded, models are used from local cache for fast loading

    UVR Models for Vocal Separation (Auto-Download):

    Additional models for the 🤐 Noise or Vocal Removal node download to ComfyUI/models/TTS/UVR/ (recommended) or ComfyUI/models/UVR/ (legacy) as needed.

    Usage: Simply use the 🎭 Load RVC Character Model node → Select a character → Connect to Voice Changer node. All required models download automatically!

    9. Restart ComfyUI

    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    Enhanced Features

    📝 Intelligent Text Chunking (NEW!)

    Long text support with smart processing:

    • Character-based limits (100-1000 chars per chunk)
    • Sentence boundary preservation - won't cut mid-sentence
    • Multiple combination methods:
      • auto - Smart selection based on text length
      • concatenate - Simple joining
      • silence_padding - Add configurable silence between chunks
      • crossfade - Smooth audio blending
    • Comma-based splitting for very long sentences
    • Backward compatible - works with existing workflows

    Chunking Controls (all optional):

    • enable_chunking - Enable/disable smart chunking (default: True)
    • max_chars_per_chunk - Chunk size limit (default: 400)
    • chunk_combination_method - How to join audio (default: auto)
    • silence_between_chunks_ms - Silence duration (default: 100ms)
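
    To make the three combination methods concrete, here is a minimal sketch that joins chunks represented as plain lists of float samples. It is an illustration only: the real nodes work on audio tensors, the function signatures are hypothetical, and the linear ramp in `crossfade` is an assumed blending curve.

```python
from typing import List

Audio = List[float]  # mono samples; a stand-in for the tensors a real node would use


def concatenate(chunks: List[Audio]) -> Audio:
    """Join chunks back to back."""
    return [sample for chunk in chunks for sample in chunk]


def silence_padding(chunks: List[Audio], sample_rate: int, silence_ms: int = 100) -> Audio:
    """Join chunks with silence_ms of zeros between neighbours."""
    gap = [0.0] * (sample_rate * silence_ms // 1000)
    out: Audio = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            out.extend(gap)
        out.extend(chunk)
    return out


def crossfade(chunks: List[Audio], overlap: int) -> Audio:
    """Blend the tail of each chunk into the head of the next with a linear ramp."""
    out: Audio = []
    for chunk in chunks:
        n = min(overlap, len(out), len(chunk))
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming chunk, rising toward 1
            idx = len(out) - n + i
            out[idx] = out[idx] * (1.0 - w) + chunk[i] * w
        out.extend(chunk[n:])
    return out
```

    Note that `crossfade` shortens the result by the overlap at each join, while `silence_padding` lengthens it by the gap.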

    Auto-selection logic:

    • Text > 1000 chars → silence_padding (natural pauses)
    • Text > 500 chars → crossfade (smooth blending)
    • Text < 500 chars → concatenate (simple joining)
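
    The auto-selection rules above translate directly into a few comparisons. The function name is hypothetical, and treating a text of exactly 500 characters as `concatenate` is an assumption, since the list leaves that case open.

```python
def select_combination_method(text_length: int) -> str:
    """Pick a chunk combination method from the total text length."""
    if text_length > 1000:
        return "silence_padding"  # natural pauses for long texts
    if text_length > 500:
        return "crossfade"        # smooth blending for medium texts
    return "concatenate"          # simple joining; also used for exactly 500 (assumption)
```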

    📦 Smart Model Loading

    Priority-based model detection:

    1. Bundled models in node folder (self-contained)
    2. ComfyUI models in standard location
    3. HuggingFace download with authentication
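
    The priority order above amounts to "first local folder that has files wins, otherwise download". A minimal sketch of that idea follows; `resolve_model_source` and the source labels are hypothetical, and treating any non-empty folder as a valid model is a simplification.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def resolve_model_source(local_candidates: Iterable[Tuple[str, Path]]) -> Tuple[str, Optional[Path]]:
    """Return the label and folder of the first non-empty local candidate.

    Candidates are checked in the order given, so pass the bundled folder
    before the ComfyUI models folder. When none qualifies, the caller is
    expected to fall back to a HuggingFace download.
    """
    for label, folder in local_candidates:
        folder = Path(folder)
        if folder.is_dir() and any(folder.iterdir()):
            return label, folder
    return "huggingface", None
```

    A caller would pass something like `[("bundled", node_dir / "models" / "chatterbox"), ("comfyui", comfy_models_dir / "chatterbox")]`, where both paths are placeholders for the real locations.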

    Console output shows source:

    📦 Using BUNDLED ChatterBox (self-contained)
    📦 Loading from bundled models: ./models/chatterbox
    ✅ ChatterboxTTS model loaded from bundled!
    
    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    Usage

    Voice Recording

    1. Add "🎤 ChatterBox Voice Capture" node
    2. Select your microphone from the dropdown
    3. Adjust recording settings:
      • Silence Threshold: How quiet to consider "silence" (0.001-0.1)
      • Silence Duration: How long to wait before stopping (0.5-5.0 seconds)
      • Sample Rate: Audio quality (8000-96000 Hz, default 44100)
    4. Change the Trigger value to start a new recording
    5. Connect output to TTS (for voice cloning) or VC nodes
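
    How Silence Threshold and Silence Duration interact can be shown with a small simulation over blocks of samples. This is a sketch under assumptions: the node's actual detection method is not documented here, so an RMS level per block is assumed, and the function names are hypothetical.

```python
import math
from typing import Iterable, List, Sequence


def rms(block: Sequence[float]) -> float:
    """Root-mean-square level of one block of samples in the range -1..1."""
    if not block:
        return 0.0
    return math.sqrt(sum(s * s for s in block) / len(block))


def record_until_silence(
    blocks: Iterable[Sequence[float]],
    sample_rate: int,
    silence_threshold: float = 0.01,
    silence_duration: float = 2.0,
) -> List[float]:
    """Collect blocks until the level stays below the threshold for silence_duration seconds."""
    recorded: List[float] = []
    quiet_samples = 0
    needed = int(silence_duration * sample_rate)
    for block in blocks:
        recorded.extend(block)
        if rms(block) < silence_threshold:
            quiet_samples += len(block)
            if quiet_samples >= needed:
                break  # enough continuous silence: stop recording
        else:
            quiet_samples = 0  # speech resumed, restart the silence timer
    return recorded
```

    A higher threshold makes more of the signal count as silence, and a longer duration delays the stop, which is why noisy rooms call for raising both.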

    Enhanced Text-to-Speech

    1. Add "🎤 ChatterBox Voice TTS" node
    2. Enter your text (any length - automatic chunking)
    3. Optionally connect reference audio for voice cloning
    4. Adjust TTS settings:
      • Exaggeration: Emotion intensity (0.25-2.0)
      • Temperature: Randomness (0.05-5.0)
      • CFG Weight: Guidance strength (0.0-1.0)

    F5-TTS Voice Synthesis

    1. Add "🎤 F5-TTS Voice Generation" node
    2. Enter your target text (any length - automatic chunking)
    3. Required: Connect reference audio for voice cloning
    4. Required: Enter reference text that matches the reference audio exactly
    <details> <summary>📖 Voice Reference Setup Options</summary>

    Two ways to provide voice references:

    1. Easy Method: Select voice from reference_audio_file dropdown → text auto-detected from companion .txt file
    2. Manual Method: Set reference_audio_file to "none" → connect opt_reference_audio + opt_reference_text inputs
    </details>
    5. Select F5-TTS model:
      • F5TTS_Base: English base model (recommended)
      • F5TTS_v1_Base: English v1 model
      • E2TTS_Base: E2-TTS model
      • F5-DE: German model
      • F5-ES: Spanish model
      • F5-FR: French model
      • F5-JP: Japanese model
    6. Adjust F5-TTS settings:
      • Temperature: Voice variation (0.1-2.0, default: 0.8)
      • Speed: Speech speed (0.5-2.0, default: 1.0)
      • CFG Strength: Guidance strength (0.0-10.0, default: 2.0)
      • NFE Step: Quality vs speed (1-100, default: 32)

    Voice Conversion with Iterative Refinement

    1. Add "🔄 ChatterBox Voice Conversion" node
    2. Connect source audio (voice to convert)
    3. Connect target audio (voice style to copy)
    4. Configure refinement settings:
      • Refinement Passes: Number of conversion iterations (1-30, recommended 1-5)
      • Each pass refines the output to sound more like the target
      • Smart Caching: Results cached up to 5 iterations for instant experimentation

    🧠 Intelligent Caching Examples:

    • Run 3 passes β†’ caches iterations 1, 2, 3
    • Change to 5 passes β†’ resumes from cached 3, runs 4, 5
    • Change to 2 passes β†’ returns cached iteration 2 instantly
    • Change to 4 passes β†’ resumes from cached 3, runs 4
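
    The resume behavior in these examples can be sketched with a small cache keyed by pass number. This is an illustration, not the node's implementation: `RefinementCache` and `convert_once` are hypothetical names, and the sketch assumes a single fixed source/target pair, so the cache is not keyed by audio content.

```python
from typing import Callable, Dict

MAX_CACHED_PASSES = 5  # the text above says results are cached for up to 5 iterations


class RefinementCache:
    """Keeps the output of each refinement pass so later runs can resume."""

    def __init__(self, convert_once: Callable[[object], object]):
        self.convert_once = convert_once      # one conversion pass: audio -> audio
        self.results: Dict[int, object] = {}  # pass number -> audio after that pass

    def run(self, source, passes: int):
        # Resume from the highest cached pass that does not exceed the request.
        start = max((p for p in self.results if p <= passes), default=0)
        audio = self.results[start] if start else source
        for p in range(start + 1, passes + 1):
            audio = self.convert_once(audio)
            if p <= MAX_CACHED_PASSES:
                self.results[p] = audio
        return audio
```

    Because this sketch stores every intermediate pass, a request for fewer passes than an earlier run is always answered from the cache without any new conversion.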

    💡 Pro Tip: Start with 1 pass, then experiment with 2-5 passes to find the sweet spot for your audio. Each iteration can improve voice similarity!

    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    📁 Example Workflows

    Ready-to-use ComfyUI workflows - Download and drag into ComfyUI:

    🆕 Unified Workflows (v4.5+)

    | Workflow | Description | Features | Status | Files |
    | ------------------------ | --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------- | ------- |
    | Unified 📺 TTS SRT | Universal SRT processing with all TTS engines | • ChatterBox/F5-TTS/Higgs Audio 2<br>• Multiple timing modes<br>• Multi-character switching<br>• Overlap SRT support | ✅ New in v4.5 | 📁 JSON |
    | Unified 🔄 Voice Changer | Modern voice conversion with multiple engines | • RVC + ChatterBox VC<br>• Iterative refinement<br>• Real-time conversion | ✅ Updated for v4.3 | 📁 JSON |

    Specific Workflows

    | Workflow | Description | Status | Files |
    | ---------------------- | ------------------------------------------------ | ----------------- | ------- |
    | ChatterBox Integration | General ChatterBox TTS and Voice Conversion | ✅ Compatible | 📁 JSON |
    | F5-TTS Speech Editor | Interactive waveform analysis for F5-TTS editing | ✅ Updated for v4 | 📁 JSON |

    💡 Recommended: Use the new Unified 📺 TTS SRT workflow which showcases all engines and features in one comprehensive workflow. It demonstrates SRT processing, timing modes, multi-character switching, and supports ChatterBox, F5-TTS, and Higgs Audio 2 engines.

    📥 Usage: Download the .json files and drag them directly into your ComfyUI interface. The workflows will automatically load with proper node connections.

    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    Settings Guide

    Enhanced Chunking Settings

    For Long Articles/Books:

    • max_chars_per_chunk=600, chunk_combination_method=silence_padding, silence_between_chunks_ms=200

    For Natural Speech:

    • max_chars_per_chunk=400, chunk_combination_method=auto (default - works well)

    For Fast Processing:

    • max_chars_per_chunk=800, chunk_combination_method=concatenate

    For Smooth Audio:

    • max_chars_per_chunk=300, chunk_combination_method=crossfade

    Voice Recording Settings

    General Recording:

    • silence_threshold=0.01, silence_duration=2.0 (default settings)

    Noisy Environment:

    • Higher silence_threshold (~0.05) to ignore background noise
    • Longer silence_duration (~3.0) to avoid cutting off speech

    Quiet Environment:

    • Lower silence_threshold (~0.005) for sensitive detection
    • Shorter silence_duration (~1.0) for quick stopping

    TTS Settings

    General Use:

    • exaggeration=0.5, cfg_weight=0.5 (default settings work well)

    Expressive Speech:

    • Lower cfg_weight (~0.3) + higher exaggeration (~0.7)
    • Higher exaggeration speeds up speech; lower CFG slows it down
    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    Text Processing Capabilities

    📚 No Hard Text Limits!

    Unlike many TTS systems:

    • OpenAI TTS: 4096 character limit
    • ElevenLabs: 2500 character limit
    • ChatterBox: No documented limits + intelligent chunking

    🧠 Smart Text Splitting

    Sentence Boundary Detection:

    • Splits on .!? with proper spacing
    • Preserves sentence integrity
    • Handles abbreviations and edge cases

    Long Sentence Handling:

    • Splits on commas when sentences are too long
    • Maintains natural speech patterns
    • Falls back to character limits only when necessary
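
    The splitting strategy above (sentences first, commas for oversized sentences, raw character cuts as a last resort) can be sketched as follows. It is a simplified illustration with hypothetical function names: it uses a plain regular expression for sentence boundaries and does not attempt the abbreviation handling mentioned above.

```python
import re
from typing import List


def split_oversized(sentence: str, max_chars: int) -> List[str]:
    """Split one long sentence on commas, then by raw length as a last resort."""
    pieces: List[str] = []
    for part in re.split(r"(?<=,)\s+", sentence):
        while len(part) > max_chars:  # no comma close enough: hard cut
            pieces.append(part[:max_chars])
            part = part[max_chars:]
        if part:
            pieces.append(part)
    return pieces


def chunk_text(text: str, max_chars: int = 400) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    units: List[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if len(sentence) > max_chars:
            units.extend(split_oversized(sentence, max_chars))
        elif sentence:
            units.append(sentence)

    chunks: List[str] = []
    current = ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        if len(candidate) <= max_chars:
            current = candidate  # unit still fits in the current chunk
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

    Every unit is at most `max_chars` long before packing, so no chunk ever exceeds the limit, and a sentence is only broken when it cannot fit in a chunk on its own.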
    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    License

    MIT License - Same as ChatterboxTTS

    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    Credits

    • ResembleAI for ChatterboxTTS
    • ComfyUI team for the amazing framework
    • sounddevice library for audio recording functionality
    • ShmuelRonen for the original ChatterBox Voice TTS node
    • Diogod for the TTS Audio Suite universal multi-engine implementation
    <div align="right"><a href="#-table-of-contents">Back to top</a></div>

    🔗 Links


    Note: The original ChatterBox model includes Resemble AI's Perth watermarking system for responsible AI usage. This ComfyUI integration includes the Perth dependency but has watermarking disabled by default to ensure maximum compatibility. Users can re-enable watermarking by modifying the code if needed, while maintaining the full quality and capabilities of the underlying TTS model.

    <!-- MARKDOWN LINKS & IMAGES --> <!-- https://www.markdownguide.org/basic-syntax/#reference-style-links -->