A ComfyUI custom node based on ByteDance MegaTTS3, enabling high-quality text-to-speech synthesis with voice cloning capabilities for both Chinese and English.
Clone this repository to ComfyUI's `custom_nodes` directory:
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/1038lab/ComfyUI-MegaTTS.git
```
Install required dependencies:
```bash
cd ComfyUI-MegaTTS
pip install -r requirements.txt
```
The node will automatically download required models on first use, or you can manually download them:
This extension uses modified versions of ByteDance's MegaTTS3 models. While the models are automatically downloaded during first use, you can manually download them from Hugging Face:
The models are organized in the following structure:
```
model_path/TTS/MegaTTS3/
├── diffusion_transformer/
│   ├── config.yaml
│   └── model_only_last.ckpt
├── wavvae/
│   ├── config.yaml
│   └── decoder.ckpt
├── duration_lm/
│   ├── config.yaml
│   └── model_only_last.ckpt
├── aligner_lm/
│   ├── config.yaml
│   └── model_only_last.ckpt
└── g2p/
    ├── config.json
    ├── model.safetensors
    ├── generation_config.json
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── tokenizer.json
    ├── vocab.json
    └── merges.txt
```
Direct Download from Hugging Face: download the files from the `ByteDance/MegaTTS3` repository and place them in `comfyui/models/TTS/MegaTTS3/`.
Using Hugging Face CLI:
```bash
# Install huggingface_hub if you don't have it
pip install huggingface_hub

# Download all models
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ByteDance/MegaTTS3', local_dir='comfyui/models/TTS/MegaTTS3/')"
```
> [!IMPORTANT]
> The WaveVAE encoder is currently not available. For security reasons, ByteDance has not uploaded the WaveVAE encoder, so you can only use pre-extracted latents (`.npy` files) for inference.
> To synthesize speech for a specific speaker, ensure both the corresponding WAV and NPY files are in the same directory.
> Refer to the ByteDance MegaTTS3 repository for details on obtaining the necessary files or submitting your voice samples.
The extension requires a `Voices` folder to store reference voice samples and their extracted features:
```
Voices/
├── sample1.wav   # Reference audio file
├── sample1.npy   # Extracted features from the audio file
├── sample2.wav   # Another reference audio
└── sample2.npy   # Corresponding features
```
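Since inference fails when a reference WAV has no matching latent file, it can help to check the pairing up front. The following is a minimal sketch (the `check_voice_pairs` helper is an illustrative name, not part of the extension's API) that scans a `Voices` folder and reports which samples are usable:

```python
# Sketch: verify that every reference WAV in a Voices folder has a
# matching pre-extracted .npy latent file alongside it.
from pathlib import Path


def check_voice_pairs(voices_dir):
    """Return (usable, missing): stems with both files vs. WAVs lacking a .npy."""
    usable, missing = [], []
    for wav in sorted(Path(voices_dir).glob("*.wav")):
        if wav.with_suffix(".npy").exists():
            usable.append(wav.stem)
        else:
            missing.append(wav.stem)
    return usable, missing
```

A voice listed under `missing` here would need its latents obtained as described in the note above before it can be used for cloning.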
Download pre-extracted samples:
Submit your own voice samples:
Generate NPY files with Voice Maker:
This extension includes a Voice Maker custom node that helps you prepare voice samples:
How to use the Voice Maker:
For best results:
This model offers excellent control over accents and pronunciation:
For preserving the speaker's accent:
For standard pronunciation:
For emotional or expressive speech:
| Use Case | p_w (pronunciation_strength) | t_w (voice_similarity) |
|----------|------------------------------|------------------------|
| Standard TTS | 2.0 | 3.0 |
| Preserve Accent | 1.0-1.5 | 3.0-5.0 |
| Cross-lingual (standard) | 3.0-4.0 | 3.0-5.0 |
| Emotional Speech | 1.5-2.5 | 3.0-5.0 |
| Noisy Reference Audio | 3.0-5.0 | 3.0-5.0 |
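The table's presets can be expressed as a small lookup, useful when driving the node programmatically. This is an illustrative sketch only: the `PRESETS` dict and `pick_params` helper are hypothetical names, and the values take the low end of each recommended range as a starting point.

```python
# Sketch: map the use cases from the table above to starting p_w / t_w values
# (low end of each recommended range); tune upward from there.
PRESETS = {
    "standard":      {"p_w": 2.0, "t_w": 3.0},  # Standard TTS
    "accent":        {"p_w": 1.0, "t_w": 3.0},  # Preserve Accent
    "cross_lingual": {"p_w": 3.0, "t_w": 3.0},  # Cross-lingual (standard)
    "emotional":     {"p_w": 1.5, "t_w": 3.0},  # Emotional Speech
    "noisy_ref":     {"p_w": 3.0, "t_w": 3.0},  # Noisy Reference Audio
}


def pick_params(use_case: str) -> dict:
    """Return a copy of the preset, falling back to standard TTS defaults."""
    return dict(PRESETS.get(use_case, PRESETS["standard"]))
```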
This extension provides three main nodes:
Full-featured TTS node with complete parameter control.
Inputs:
- `input_text` - Text to convert to speech
- `language` - Language selection (en: English, zh: Chinese)
- `generation_quality` - Controls the number of diffusion steps (higher = better quality but slower)
- `pronunciation_strength` (p_w) - Controls pronunciation accuracy (higher values produce more standard pronunciation)
- `voice_similarity` (t_w) - Controls similarity to reference voice (higher values produce speech more similar to reference)
- `reference_voice` - Reference voice file from Voices folder

Outputs:
- `AUDIO` - Generated audio in WAV format
- `LATENT` - Audio latent representation for further processing

Simplified TTS node with default parameters for quick usage.
Inputs:
- `input_text` - Text to convert to speech
- `language` - Language selection (en: English, zh: Chinese)
- `reference_voice` - Reference voice file from Voices folder

Outputs:
- `AUDIO` - Generated audio in WAV format

Utility node to free GPU memory after TTS processing.
| Parameter | Description | Recommended Values |
|-----------|-------------|--------------------|
| generation_quality | Controls the number of diffusion steps. Higher values produce better quality but increase generation time. | Default: 10. Range: 1-50. For quick tests: 1-5, for final output: 15-30. |
| pronunciation_strength (p_w) | Controls how closely the output follows standard pronunciation. | Default: 2.0. Range: 1.0-5.0. For accent preservation: 1.0-1.5, for standard pronunciation: 2.5-4.0. |
| voice_similarity (t_w) | Controls how similar the output is to the reference voice. | Default: 3.0. Range: 1.0-5.0. For more expressive output with preserved voice characteristics: 3.0-5.0. |
Place voice files in the `Voices` folder:
- `voice_name.wav` - Voice sample file (24kHz sample rate recommended, 5-10 seconds of clear speech)
- `voice_name.npy` - Corresponding voice feature file (generated automatically if voice extraction is enabled)

For cloning a voice across languages (e.g., making an English speaker speak Chinese):
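Reference audio recorded at another rate can be converted to the recommended 24 kHz before being dropped into the `Voices` folder. Below is a minimal sketch using plain NumPy linear interpolation for illustration; `resample_to_24k` is a hypothetical helper, and a real pipeline would more likely use librosa or torchaudio resampling for better quality.

```python
# Sketch: resample a mono waveform to 24 kHz via linear interpolation.
import numpy as np


def resample_to_24k(samples: np.ndarray, orig_sr: int, target_sr: int = 24000) -> np.ndarray:
    """Return `samples` resampled from `orig_sr` to `target_sr` (mono, 1-D array)."""
    if orig_sr == target_sr:
        return samples
    duration = len(samples) / orig_sr
    n_out = int(round(duration * target_sr))
    t_in = np.linspace(0.0, duration, num=len(samples), endpoint=False)
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, samples).astype(samples.dtype)
```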
GPL-3.0 License