    ComfyUI-MegaTTS

    (English / 中文)

    A ComfyUI custom node based on ByteDance MegaTTS3, enabling high-quality text-to-speech synthesis with voice cloning capabilities for both Chinese and English.

    Update Logs

    Version 1.0.2

    • Refactored the code and custom nodes for optimized performance and better GPU resource management
    • Added enhanced memory management to help users with low VRAM avoid out-of-memory errors
    • Added i18n support for English and Chinese

    Version 1.0.1

    • Bug fixes

    Features

    • High-Quality Voice Synthesis: Generate natural-sounding speech from text input
    • Voice Cloning: Clone any voice with just a short sample (requires both WAV and NPY files)
    • Bilingual Support: Works with both Chinese and English text, with code-switching capabilities
    • Advanced Parameter Control: Fine-tune generation quality, pronunciation accuracy, and voice similarity
    • Memory Management: Built-in functionality to optimize GPU resource usage
    • Automatic Model Download: Models are downloaded automatically when required

    Installation

    Prerequisites

    • ComfyUI installed and working
    • Python 3.10+ recommended
    • CUDA-compatible GPU with at least 4GB VRAM (8GB+ recommended for higher quality)

    Steps

    1. Clone this repository to ComfyUI's custom_nodes directory:

      cd ComfyUI/custom_nodes
      git clone https://github.com/1038lab/ComfyUI-MegaTTS.git
      
    2. Install required dependencies:

      cd ComfyUI-MegaTTS
      pip install -r requirements.txt
      
    3. The node will automatically download required models on first use, or you can manually download them:

    Models and Manual Download

    This extension uses modified versions of ByteDance's MegaTTS3 models. While the models are automatically downloaded during first use, you can manually download them from Hugging Face:

    Model Structure

    The models are organized in the following structure, where model_path is your ComfyUI models directory:

    model_path/TTS/MegaTTS3/
      ├── diffusion_transformer/
      │   ├── config.yaml
      │   └── model_only_last.ckpt
      ├── wavvae/
      │   ├── config.yaml
      │   └── decoder.ckpt
      ├── duration_lm/
      │   ├── config.yaml
      │   └── model_only_last.ckpt
      ├── aligner_lm/
      │   ├── config.yaml
      │   └── model_only_last.ckpt
      └── g2p/
          ├── config.json
          ├── model.safetensors
          ├── generation_config.json
          ├── tokenizer_config.json
          ├── special_tokens_map.json
          ├── tokenizer.json
          ├── vocab.json
          └── merges.txt
    

    Manual Download Options

    1. Direct Download from Hugging Face:

      • Download the model files from https://huggingface.co/ByteDance/MegaTTS3 and place them in the structure shown above

    2. Using Hugging Face CLI:

      # Install huggingface_hub if you don't have it
      pip install huggingface_hub
      
      # Download all models
      python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ByteDance/MegaTTS3', local_dir='comfyui/models/TTS/MegaTTS3/')"
      
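    After downloading, you can sanity-check the layout against the tree above. The following is a minimal sketch, assuming the default local_dir from the CLI example (adjust base for your install):

      # check_models.py - report any missing MegaTTS3 model files (illustrative helper)
      from pathlib import Path

      base = Path("comfyui/models/TTS/MegaTTS3")  # assumption: path from the CLI example above
      expected = [
          "diffusion_transformer/config.yaml",
          "diffusion_transformer/model_only_last.ckpt",
          "wavvae/config.yaml",
          "wavvae/decoder.ckpt",
          "duration_lm/config.yaml",
          "duration_lm/model_only_last.ckpt",
          "aligner_lm/config.yaml",
          "aligner_lm/model_only_last.ckpt",
          "g2p/config.json",
          "g2p/model.safetensors",
      ]
      missing = [p for p in expected if not (base / p).exists()]
      print("all model files present" if not missing else f"missing: {missing}")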

    Voice Folder and Voice Maker


    [!IMPORTANT]
    The WaveVAE encoder is currently not available.

    For security reasons, ByteDance has not released the WaveVAE encoder.

    You can only use pre-extracted latents (.npy files) for inference.

    To synthesize speech for a specific speaker, ensure both the corresponding WAV and NPY files are in the same directory.

    Refer to the ByteDance MegaTTS3 repository for details on obtaining the necessary files or submitting your own voice samples.

    Voice Folder Structure

    The extension requires a Voices folder to store reference voice samples and their extracted features:

    Voices/
    ├── sample1.wav     # Reference audio file
    ├── sample1.npy     # Extracted features from the audio file
    ├── sample2.wav     # Another reference audio
    └── sample2.npy     # Corresponding features
    
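    Since inference needs both files of each pair, a quick way to spot incomplete pairs is to scan the folder. A minimal sketch, assuming the Voices folder path relative to where you run it:

      # list_voices.py - flag WAV files in Voices/ that lack a matching NPY (illustrative helper)
      from pathlib import Path

      voices = Path("Voices")  # assumption: adjust to this extension's Voices folder
      for wav in sorted(voices.glob("*.wav")):
          npy = wav.with_suffix(".npy")
          status = "ok" if npy.exists() else "missing .npy - cannot be used for cloning"
          print(f"{wav.name}: {status}")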

    Getting Voice Samples and NPY Files

    1. Download pre-extracted samples:

      • Sample voice WAV and NPY files can be found in this Google Drive folder: Voice Samples and NPY Files
      • This folder contains pre-extracted NPY files and their corresponding WAV samples organized in subfolders
    2. Submit your own voice samples:

      • If you want to use your own voice, you can submit samples to this Google Drive folder: Voice Submission Queue
      • Your samples should be clear audio with minimal background noise and within 24 seconds
      • After verification for safety, the ByteDance team will extract and provide NPY files for your samples
    3. Generate NPY files with Voice Maker:

      • Use the Voice Maker node to automatically process your audio and generate NPY files
      • While this method is convenient, the quality may not match officially extracted NPY files
      • Best for quick testing and experimentation with your own voice samples

    Voice Maker Node

    This extension includes a Voice Maker custom node that helps you prepare voice samples:

    • Voice Maker Node Features:
      • Convert any audio file to the required 24kHz WAV format
      • Extract NPY feature files from WAV samples
      • Process and optimize voice samples for better quality
      • Save processed files to the Voices folder automatically

    How to use the Voice Maker:

    1. Add the "Voice Maker" node from the 🧪AILab/🔊Audio category
    2. Connect an audio input or select a file from your computer
    3. Configure processing options (normalization, trimming, etc.)
    4. Run the node to generate a ready-to-use voice sample with its NPY file
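    This is not the node's internal code, but the conversion step above (any format to 24kHz WAV) can be reproduced outside ComfyUI with librosa and soundfile, assuming both packages are installed and the file names are your own:

      # to_24k.py - resample any audio file to 24 kHz mono WAV (stand-in for the node's conversion step)
      import librosa
      import soundfile as sf

      src = "my_voice.m4a"                                # assumption: any format librosa can read
      audio, sr = librosa.load(src, sr=24000, mono=True)  # resamples to 24 kHz on load
      sf.write("Voices/my_voice.wav", audio, sr)          # write the WAV next to its future .npy
      print(f"wrote {len(audio) / sr:.1f}s of audio at {sr} Hz")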

    About WAV and NPY Files

    • WAV files: These are the actual voice samples you want to clone (24kHz recommended)
    • NPY files: These contain extracted features necessary for voice cloning
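    The exact feature layout inside the NPY files is not documented here, but you can confirm a file loads and inspect its shape with NumPy:

      # inspect an extracted feature file (layout is model-specific, so only shape/dtype are shown)
      import numpy as np

      feats = np.load("Voices/sample1.npy")  # path from the folder structure above
      print(feats.shape, feats.dtype)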

    Voice Format Requirements

    For best results:

    • Sample rate: 24kHz (will be automatically converted if different)
    • Audio format: WAV recommended, but MP3, M4A, and other formats are supported
    • Duration: 5-24 seconds of clear speech
    • Quality: Clean recording with minimal background noise
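    A small sketch that checks a sample against these requirements before you drop it into the Voices folder, assuming the soundfile package is installed:

      # validate_sample.py - check sample rate and duration against the guidelines above
      import soundfile as sf

      info = sf.info("Voices/sample1.wav")
      ok_rate = info.samplerate == 24000      # 24 kHz recommended (auto-converted otherwise)
      ok_length = 5 <= info.duration <= 24    # 5-24 seconds of clear speech
      print(f"rate={info.samplerate} Hz (ok={ok_rate}), duration={info.duration:.1f}s (ok={ok_length})")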

    Parameter Tuning

    Controlling Voice Accent

    This model offers excellent control over accents and pronunciation:

    • For preserving the speaker's accent:

      • Set pronunciation_strength (p_w) to a lower value (1.0-1.5)
      • This is useful for cross-lingual TTS where you want to preserve the accent
    • For standard pronunciation:

      • Set pronunciation_strength (p_w) to a higher value (2.5-4.0)
      • This helps produce more standard pronunciation regardless of the source accent
    • For emotional or expressive speech:

      • Increase the voice_similarity (t_w) parameter (2.0-5.0)
      • Keep pronunciation_strength (p_w) at a moderate level (1.5-2.5)

    Recommended Parameter Combinations

    | Use Case | p_w (pronunciation_strength) | t_w (voice_similarity) |
    |----------|------------------------------|------------------------|
    | Standard TTS | 2.0 | 3.0 |
    | Preserve Accent | 1.0-1.5 | 3.0-5.0 |
    | Cross-lingual (standard) | 3.0-4.0 | 3.0-5.0 |
    | Emotional Speech | 1.5-2.5 | 3.0-5.0 |
    | Noisy Reference Audio | 3.0-5.0 | 3.0-5.0 |
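    For convenience, the same combinations as a Python mapping you could keep alongside your workflows. The preset names are illustrative, not part of the node API, and single values are midpoints of the ranges above:

      # parameter presets mirroring the table above (hypothetical labels, midpoint values)
      PRESETS = {
          "standard_tts":     {"p_w": 2.0, "t_w": 3.0},
          "preserve_accent":  {"p_w": 1.25, "t_w": 4.0},  # midpoints of 1.0-1.5 / 3.0-5.0
          "cross_lingual":    {"p_w": 3.5, "t_w": 4.0},   # midpoints of 3.0-4.0 / 3.0-5.0
          "emotional_speech": {"p_w": 2.0, "t_w": 4.0},   # midpoints of 1.5-2.5 / 3.0-5.0
          "noisy_reference":  {"p_w": 4.0, "t_w": 4.0},   # midpoints of 3.0-5.0 / 3.0-5.0
      }
      print(PRESETS["preserve_accent"])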

    Nodes

    This extension provides three main nodes:

    1. Mega TTS (Advanced)

    Full-featured TTS node with complete parameter control.

    Inputs:

    • input_text - Text to convert to speech
    • language - Language selection (en: English, zh: Chinese)
    • generation_quality - Controls the number of diffusion steps (higher = better quality but slower)
    • pronunciation_strength (p_w) - Controls pronunciation accuracy (higher values produce more standard pronunciation)
    • voice_similarity (t_w) - Controls similarity to reference voice (higher values produce speech more similar to reference)
    • reference_voice - Reference voice file from Voices folder

    Outputs:

    • AUDIO - Generated audio in WAV format
    • LATENT - Audio latent representation for further processing

    2. Mega TTS (Simple)

    Simplified TTS node with default parameters for quick usage.

    Inputs:

    • input_text - Text to convert to speech
    • language - Language selection (en: English, zh: Chinese)
    • reference_voice - Reference voice file from Voices folder

    Outputs:

    • AUDIO - Generated audio in WAV format

    3. Mega TTS (Clean Memory)

    Utility node to free GPU memory after TTS processing.
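    The node's internals are not shown here, but in PyTorch terms, freeing cached VRAM after generation typically amounts to something like this sketch:

      # rough equivalent of a "clean memory" step in PyTorch (not the node's actual code)
      import gc
      import torch

      gc.collect()                      # drop unreferenced Python objects first
      if torch.cuda.is_available():
          torch.cuda.empty_cache()      # return cached CUDA memory to the driver
          torch.cuda.ipc_collect()      # release cross-process CUDA handles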

    Parameter Descriptions

    | Parameter | Description | Recommended Values |
    |-----------|-------------|--------------------|
    | generation_quality | Controls the number of diffusion steps. Higher values produce better quality but increase generation time. | Default: 10. Range: 1-50. Quick tests: 1-5; final output: 15-30. |
    | pronunciation_strength (p_w) | Controls how closely the output follows standard pronunciation. | Default: 2.0. Range: 1.0-5.0. Accent preservation: 1.0-1.5; standard pronunciation: 2.5-4.0. |
    | voice_similarity (t_w) | Controls how similar the output is to the reference voice. | Default: 3.0. Range: 1.0-5.0. Expressive output with preserved voice characteristics: 3.0-5.0. |

    Voice Cloning

    Adding Reference Voices

    1. Place your voice WAV files in the Voices folder
    2. Each voice requires two files:
      • voice_name.wav - Voice sample file (24kHz sample rate recommended, 5-24 seconds of clear speech)
      • voice_name.npy - Corresponding voice feature file (generated automatically if voice extraction is enabled)

    How to Clone a Voice

    1. Add your sample WAV file to the Voices folder
    2. The first time you select the voice, the system will extract feature files and save them
    3. Select your voice in the node's "reference_voice" dropdown
    4. Adjust the "voice_similarity" parameter to control the intensity of voice cloning:
      • Lower values (1.0-2.0): More natural but less similar to reference
      • Higher values (3.0-5.0): More similar to reference but potentially less natural

    Advanced Usage

    Cross-Language Voice Cloning

    For cloning a voice across languages (e.g., making an English speaker speak Chinese):

    1. Use a clean voice sample in the original language
    2. Set language to the target language (e.g., "zh" for Chinese)
    3. Increase the pronunciation_strength (p_w) parameter (3.0-4.0) for more standard pronunciation
    4. Set voice_similarity (t_w) parameter higher (3.0-5.0) to maintain voice characteristics

    Handling Accents

    • For preserving accents: Lower pronunciation_strength (p_w) value (1.0-1.5)
    • For standard pronunciation: Higher pronunciation_strength (p_w) value (2.5-4.0)

    Credits

    • ByteDance MegaTTS3 - this extension builds on ByteDance's MegaTTS3 models and research

    License

    GPL-3.0 License

    References

    • ByteDance MegaTTS3 repository: https://github.com/bytedance/MegaTTS3
    • MegaTTS3 models on Hugging Face: https://huggingface.co/ByteDance/MegaTTS3