ComfyUI Extension: ComfyUI-FireRedTTS

Authored by 1038lab


A ComfyUI integration for FireRedTTS‑2, a real-time multi-speaker TTS system enabling high-quality, emotionally expressive dialogue and monologue synthesis. Leveraging a streaming architecture and context-aware prosody modeling, it supports natural speaker turns and stable long-form generation, ideal for interactive chat and podcast applications.

    ComfyUI-FireRedTTS

    Features

    • Dialogue Generation: Multi-speaker conversation audio generation
    • Monologue Generation: Single-speaker narrative audio generation
    • Voice Cloning: Zero-shot voice cloning functionality
    • Multi-language Support: Chinese, English, Japanese, Korean, French, German, Russian
    • Automatic Model Download: Models download automatically on first use
    • Device Adaptive: Automatically selects the best available device (CUDA/MPS/CPU); see the sketch below
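
    The device selection described above typically follows the standard PyTorch pattern. This is a minimal sketch of how such a check is commonly written; the helper name is illustrative, not taken from this extension's code.

    import torch

    def pick_device() -> str:
        # Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU.
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
        return "cpu"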

    Installation

    Method 1: ComfyUI Manager (Recommended)

    1. Open ComfyUI Manager
    2. Search for "ComfyUI-FireRedTTS" in ComfyUI Manager
    3. Click Install

    Method 2: Manual Installation

    1. Clone this repository into your ComfyUI custom_nodes directory:
    cd ComfyUI/custom_nodes
    git clone https://github.com/1038lab/ComfyUI-FireRedTTS.git

    2. Install dependencies:
    cd ComfyUI-FireRedTTS
    pip install -r requirements.txt

    3. Restart ComfyUI

    Model Download

    On first use, the node automatically downloads the FireRedTTS2 model from Hugging Face.

    A progress bar is shown during the download. Once complete, the model is cached and reused on subsequent runs.
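
    Under the hood, this kind of automatic download is usually handled with the huggingface_hub library; the sketch below shows the general pattern. The repository ID and target directory are assumptions for illustration, not values read from this extension's code.

    from huggingface_hub import snapshot_download

    # Assumed repository ID and local directory; check the extension's source
    # or the FireRedTTS2 project page for the actual values.
    model_dir = snapshot_download(
        repo_id="FireRedTeam/FireRedTTS2",
        local_dir="ComfyUI/models/FireRedTTS2",
    )
    print("Model files available at:", model_dir)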

    Nodes

    FireRedTTS2 Dialogue Node

    Generates multi-speaker dialogue audio.

    Inputs:

    • text_list (STRING): Dialogue text with speaker tags ([S1], [S2])
    • temperature (FLOAT): Controls generation randomness (0.1-2.0, default: 0.9)
    • topk (INT): Controls sampling range (1-100, default: 30)
    • S1 (AUDIO, optional): Reference audio for Speaker 1
    • S1_text (STRING, optional): Reference text for Speaker 1
    • S2 (AUDIO, optional): Reference audio for Speaker 2
    • S2_text (STRING, optional): Reference text for Speaker 2

    Outputs:

    • audio (AUDIO): Generated dialogue audio
    • sample_rate (INT): Audio sample rate (24000 Hz)
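
    For reference, these inputs correspond closely to the dialogue API of the upstream FireRedTTS2 library. The sketch below is based on the upstream project's examples; treat the import path, constructor arguments, and method signature as assumptions and verify them against the FireRedTTS2 documentation before relying on them.

    import torchaudio
    from fireredtts2.fireredtts2 import FireRedTTS2  # assumed import path

    model = FireRedTTS2(
        pretrained_dir="./pretrained_models/FireRedTTS2",  # assumed local model path
        gen_type="dialogue",
        device="cuda",
    )

    # text_list mirrors the node's text_list input, split into tagged turns;
    # the prompt lists mirror the optional S1/S2 reference audio and text.
    audio = model.generate_dialogue(
        text_list=["[S1]Hello, what a nice day!", "[S2]Yes, perfect for a walk."],
        prompt_wav_list=["speaker1_ref.wav", "speaker2_ref.wav"],
        prompt_text_list=["This is a voice sample for speaker one",
                          "This is a voice sample for speaker two"],
        temperature=0.9,
        topk=30,
    )
    torchaudio.save("dialogue.wav", audio, 24000)  # FireRedTTS2 outputs 24 kHz audio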

    FireRedTTS2 Monologue Node

    Generates single-speaker monologue audio.

    Inputs:

    • text (STRING): Input text content
    • temperature (FLOAT): Controls generation randomness (0.1-2.0, default: 0.75)
    • topk (INT): Controls sampling range (1-100, default: 20)
    • prompt_wav (STRING, optional): Reference audio file path
    • prompt_text (STRING, optional): Reference text content

    Outputs:

    • audio (AUDIO): Generated monologue audio
    • sample_rate (INT): Audio sample rate (24000 Hz)
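
    In ComfyUI, an AUDIO output is commonly a dictionary holding a waveform tensor and a sample rate. The sketch below shows how such an output could be saved with torchaudio; the dictionary keys follow the convention used by ComfyUI's built-in audio nodes and the helper name is illustrative, so treat both as assumptions.

    import torchaudio

    def save_node_audio(audio: dict, path: str = "output.wav") -> None:
        # Assumed ComfyUI AUDIO layout: {"waveform": Tensor[batch, channels, samples], "sample_rate": int}
        waveform = audio["waveform"][0]     # first item in the batch -> [channels, samples]
        sample_rate = audio["sample_rate"]  # 24000 Hz for FireRedTTS2 output
        torchaudio.save(path, waveform.cpu(), sample_rate)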

    Usage

    Speaker Tag Format

    Use square brackets to mark different speakers in dialogue text:

    [S1]Hello, what a nice day![S2]Yes, perfect for a walk.[S1]Shall we go to the park?[S2]Great idea!
    

    Supported speaker tags:

    • [S1] - Speaker 1
    • [S2] - Speaker 2
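
    To see how tagged text breaks down into per-speaker turns, the small sketch below splits a dialogue string on its speaker tags; it is illustrative only, and the extension's own parser may differ.

    import re

    text = "[S1]Hello, what a nice day![S2]Yes, perfect for a walk.[S1]Shall we go to the park?[S2]Great idea!"

    # Collect (speaker, utterance) pairs: each tag starts a new turn that runs
    # until the next tag or the end of the string.
    turns = re.findall(r"\[(S\d+)\](.*?)(?=\[S\d+\]|$)", text, flags=re.S)
    for speaker, utterance in turns:
        print(speaker, utterance.strip())
    # S1 Hello, what a nice day!
    # S2 Yes, perfect for a walk.
    # S1 Shall we go to the park?
    # S2 Great idea!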

    Voice Cloning Setup

    For voice cloning, provide both audio and text for each speaker:

    Speaker 1 (S1):

    • Connect reference audio to S1 input
    • Enter reference text in S1_text field

    Speaker 2 (S2):

    • Connect reference audio to S2 input
    • Enter reference text in S2_text field

    Examples

    Basic Dialogue Generation

    1. Add "FireRedTTS2 Dialogue" node
    2. Input in text_list:
      [S1]Welcome to our podcast![S2]Today we'll discuss AI development.[S1]That's a fascinating topic indeed.
      
    3. Adjust temperature and topk parameters
    4. Connect audio output to preview or save node

    Voice Cloning Dialogue

    1. Prepare reference audio files for each speaker
    2. Connect Speaker 1 reference audio to S1 input
    3. Enter Speaker 1 reference text in S1_text:
      This is a voice sample for speaker one
      
    4. Connect Speaker 2 reference audio to S2 input
    5. Enter Speaker 2 reference text in S2_text:
      This is a voice sample for speaker two
      

    Monologue Generation

    1. Add "FireRedTTS2 Monologue" node
    2. Input long text content in text field
    3. Optionally provide prompt_wav and prompt_text for voice cloning
    4. Adjust parameters and generate audio

    Parameter Guide

    Temperature

    • Low (0.1-0.5): More stable, consistent speech
    • Medium (0.6-1.0): Balanced stability and naturalness
    • High (1.1-2.0): More variation and expressiveness, may be unstable

    TopK

    • Low (1-20): Conservative sampling, more stable speech
    • Medium (21-50): Balanced choice
    • High (51-100): More diverse sampling, increased variation
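
    To make the effect of these two parameters concrete, the sketch below applies temperature scaling and top-k truncation to a vector of logits, which is how such sampling is typically done per generated token. It is a generic illustration, not the extension's exact code.

    import torch

    def sample_topk(logits: torch.Tensor, temperature: float = 0.9, topk: int = 30) -> int:
        # Lower temperature sharpens the distribution (more stable output);
        # higher temperature flattens it (more variation).
        scaled = logits / temperature
        # Keep only the topk highest-scoring candidates and renormalize.
        values, indices = torch.topk(scaled, k=min(topk, scaled.numel()))
        probs = torch.softmax(values, dim=-1)
        # Draw one candidate from the truncated distribution.
        choice = torch.multinomial(probs, num_samples=1)
        return int(indices[choice])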

    Troubleshooting

    Common Issues

    Q: Model download fails A: Check your network connection and Hugging Face access. Try using a proxy or a mirror site.

    Q: CUDA out of memory A:

    • Reduce input text length
    • Lower batch size
    • Use CPU mode by setting device="cpu" in code
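
    When diagnosing out-of-memory errors, the standard PyTorch calls below report free GPU memory and release cached allocations between runs; they are not specific to this extension.

    import torch

    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # values in bytes
        print(f"GPU memory free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
        torch.cuda.empty_cache()  # release cached, unused allocations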

    Q: Poor audio quality A:

    • Check that the input text format is correct
    • Adjust temperature parameter (recommended 0.7-1.0)
    • Ensure reference audio quality is good (if using voice cloning)

    Q: Speaker tags not working A:

    • Ensure correct tag format: [S1], [S2], etc.
    • Check for extra spaces around tags
    • Confirm text contains corresponding speaker tags

    Q: Node loading fails A:

    • Check dependencies are properly installed
    • Verify ComfyUI version compatibility
    • Check console for error messages

    Performance Optimization

    Memory Optimization:

    • Long texts are automatically split for processing (see the sketch below)
    • Model instances are cached and reused
    • Recommended single text length: under 500 characters
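
    The automatic splitting mentioned above can be pictured as packing sentences into chunks no longer than the recommended length. The sketch below illustrates that idea; it is not necessarily the extension's exact logic.

    import re

    def split_text(text, max_chars=500):
        # Split on sentence-ending punctuation (English or Chinese), then pack
        # consecutive sentences into chunks of at most max_chars characters.
        sentences = re.split(r"(?<=[.!?。！？])\s*", text)
        chunks, current = [], ""
        for sentence in sentences:
            if current and len(current) + len(sentence) > max_chars:
                chunks.append(current)
                current = ""
            current += sentence
        if current:
            chunks.append(current)
        return chunks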

    Speed Optimization:

    • First use requires model download, subsequent uses are faster
    • GPU acceleration significantly improves generation speed
    • Batch processing multiple short texts is more efficient than a single long text

    System Requirements

    Minimum:

    • Python 3.8+
    • 4GB RAM
    • 2GB storage space (for models)

    Recommended:

    • Python 3.9+
    • 8GB+ RAM
    • NVIDIA GPU (4GB+ VRAM)
    • SSD storage

    Support

    If you encounter issues, please check:

    1. Dependencies are fully installed
    2. Models downloaded correctly
    3. Input format meets requirements
    4. System resources are sufficient

    For more technical details, refer to the project source code and FireRedTTS2 official documentation.