ComfyUI Extension: ComfyUI_IndexTTS

Authored by billwuhao

Created 3 months ago

Updated 2 months ago

105 stars

IndexTTS Voice Cloning Nodes for ComfyUI. High-quality voice cloning, very fast, supports Chinese and English, and allows custom voice styles.


IndexTTS Voice Cloning Node for ComfyUI

High-quality voice cloning with very fast inference. Supports Chinese and English, as well as custom voice styles.

📣 Updates

[2025-05-30]⚒️: Released v1.2.0. Adds two-person dialogue and speaker preview. pynini can now be installed normally on Windows, so this is no longer a feature-limited TTS build.

IndexTTS 正式发布1.5 版本了，效果666,晕XUAN4是一种GAN3觉,我爱你！,I love you!,“我爱你”的英语是“I love you”,2.5平方电线,共465篇，约315万字,2002年的第一场雪，下在了2003年.

https://github.com/user-attachments/assets/b67891f2-0982-4540-8c3b-1a870305466f

[2025-05-14]⚒️: Added support for IndexTTS v1.5. Download the following models, rename them as shown, and place them in the ComfyUI\models\TTS\Index-TTS path:

https://huggingface.co/IndexTeam/IndexTTS-1.5/blob/main/bigvgan_generator.pth → bigvgan_generator_v1_5.pth
https://huggingface.co/IndexTeam/IndexTTS-1.5/blob/main/bpe.model → bpe_v1_5.model
https://huggingface.co/IndexTeam/IndexTTS-1.5/blob/main/gpt.pth → gpt_v1_5.pth
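The download-and-rename step above could be scripted roughly as follows. This is a minimal sketch, not part of the extension: it assumes the `huggingface_hub` package is installed, and `fetch_v1_5_models` and its `download` parameter are hypothetical names introduced here for illustration.

```python
import shutil
from pathlib import Path

REPO_ID = "IndexTeam/IndexTTS-1.5"

# Upstream file name -> local file name, exactly as listed above.
RENAMES = {
    "bigvgan_generator.pth": "bigvgan_generator_v1_5.pth",
    "bpe.model": "bpe_v1_5.model",
    "gpt.pth": "gpt_v1_5.pth",
}


def fetch_v1_5_models(target_dir, download=None):
    """Download each v1.5 file and copy it into target_dir under its new name.

    `download` is a hypothetical hook: a callable taking a file name and
    returning a local path. By default it uses huggingface_hub.
    """
    if download is None:
        # Imported lazily so the mapping can be inspected without the package.
        from huggingface_hub import hf_hub_download

        def download(filename):
            return hf_hub_download(repo_id=REPO_ID, filename=filename)

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for src_name, dst_name in RENAMES.items():
        cached = Path(download(src_name))
        dst = target / dst_name
        shutil.copyfile(cached, dst)  # copies the file contents, following symlinks
        saved.append(dst)
    return saved
```

Calling `fetch_v1_5_models(r"ComfyUI\models\TTS\Index-TTS")` from the directory that contains ComfyUI would place the three renamed files in the expected folder. Adjust the path to match your installation.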

[2025-05-02]⚒️: DeepSpeed acceleration is now available and requires DeepSpeed to be installed. For Windows, see DeepSpeed Installation. The speedup is not significant.

[2025-04-30]⚒️: Released v1.0.0.

Usage

Key parameters (less important parameters are not covered here):

max_mel_tokens: Limits the length of the generated speech. Increase this value for long texts.
max_text_tokens_per_sentence: Maximum number of tokens per sentence. Smaller values give faster inference but use more memory and may affect quality.
sentences_bucket_max_size: Maximum capacity of a sentence bucket. Larger values give faster inference but use more memory and may affect quality.
fast_inference: Enable fast inference.
custom_cuda_kernel: Enable custom CUDA kernel. The CUDA kernel extension will be built automatically on the first run.
dialogue_audio_s2: Reference audio for the second speaker in a two-person dialogue. When this input is connected, dialogue mode is enabled automatically. In dialogue mode, the input text must use the following format ([S1] marks the first speaker, [S2] marks the second speaker):

[S1] 轻喘像风掠过耳畔， 
[S2] 你靠近时，连呼吸都慢了半拍。
[S1] 指尖在我锁骨上游移， 
[S2] 仿佛试探一扇未曾开启的门。
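To make the dialogue format concrete, here is a minimal sketch of how such tagged text could be split into speaker turns. It only illustrates the [S1]/[S2] convention described above. It is not the node's actual parser, and `split_dialogue` is a hypothetical helper name.

```python
import re

# Matches a speaker tag such as [S1] or [S2] and captures the speaker number.
_TAG = re.compile(r"\[S([12])\]")


def split_dialogue(text):
    """Return a list of (speaker, utterance) tuples from [S1]/[S2]-tagged text."""
    parts = _TAG.split(text)
    # parts[0] is whatever precedes the first tag; it must be empty or whitespace.
    if parts[0].strip():
        raise ValueError("text must start with a [S1] or [S2] tag")
    turns = []
    # After the first element, parts alternates: speaker number, utterance, ...
    for speaker, utterance in zip(parts[1::2], parts[2::2]):
        utterance = utterance.strip()
        if utterance:
            turns.append((f"S{speaker}", utterance))
    return turns
```

Each turn could then be synthesized with the reference audio of its speaker. Text without any speaker tag is rejected, which mirrors the requirement that dialogue input follow the tagged format.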

Loading Audio:

Preview Speaker:

Speaker audio files for all of my TTS nodes will be unified under the ComfyUI\models\TTS\speakers path. These nodes include IndexTTS, CSM, Dia, KokoroTTS, MegaTTS, QuteTTS, SparkTTS, StepAudioTTS, and others.
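As a quick way to see which speaker files are available in that shared folder, a small listing helper could look like this. It is a sketch under assumptions: the set of audio extensions below is a guess rather than something the nodes document, and `list_speakers` is a hypothetical name.

```python
from pathlib import Path

# Assumed audio extensions; the nodes may accept more or fewer formats.
AUDIO_EXTS = {".wav", ".mp3", ".flac"}


def list_speakers(speakers_dir):
    """Return the sorted names of audio files directly inside speakers_dir."""
    root = Path(speakers_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    )
```

For example, `list_speakers(r"ComfyUI\models\TTS\speakers")` returns an empty list when the folder does not exist yet.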

Two-person Dialogue:

Installation

Windows: First, install the following dependencies:

Download the pynini wheel for the corresponding Python version from pynini-windows-wheels.

Example:

D:\AIGC\python\py310\python.exe -m pip install pynini-2.1.6.post1-cp3xx-cp3xx-win_amd64.whl
D:\AIGC\python\py310\python.exe -m pip install importlib_resources
D:\AIGC\python\py310\python.exe -m pip install "WeTextProcessing>=1.0.4" --no-deps
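The cp3xx part of the wheel name must match the Python version that runs ComfyUI. The sketch below derives the expected file name from the running interpreter, following the naming pattern in the example above. `pynini_wheel_name` is a hypothetical helper, and whether a wheel is actually published for your Python version is not guaranteed.

```python
import sys


def pynini_wheel_name(version="2.1.6.post1", py=None):
    """Build the wheel file name for a given (major, minor) Python version.

    `py` defaults to the interpreter running this script.
    """
    major, minor = (py or sys.version_info)[:2]
    tag = f"cp{major}{minor}"
    return f"pynini-{version}-{tag}-{tag}-win_amd64.whl"


if __name__ == "__main__":
    # Run this with the same python.exe that ComfyUI uses.
    print(pynini_wheel_name())
```

For Python 3.10, as in the example paths above, this yields pynini-2.1.6.post1-cp310-cp310-win_amd64.whl.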

Linux, Mac, Windows:

cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_IndexTTS.git
cd ComfyUI_IndexTTS
pip install -r requirements.txt

# If ComfyUI uses the embedded Python (python_embeded), install with it instead:
./python_embeded/python.exe -m pip install -r requirements.txt

Model Download

Models need to be manually downloaded to the ComfyUI\models\TTS\Index-TTS path:

The Index-TTS structure is as follows:

bigvgan_generator.pth
bpe.model
gpt.pth
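After downloading, a short check like the one below can confirm that the expected files are in place. This is a minimal sketch. The file names come from the lists in this README, while `missing_models` and the constant names are hypothetical.

```python
from pathlib import Path

# File names as listed in this README for each model version.
V1_FILES = ("bigvgan_generator.pth", "bpe.model", "gpt.pth")
V1_5_FILES = ("bigvgan_generator_v1_5.pth", "bpe_v1_5.model", "gpt_v1_5.pth")


def missing_models(model_dir, expected=V1_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty result from `missing_models(r"ComfyUI\models\TTS\Index-TTS")` means all three base files were found. Pass `expected=V1_5_FILES` to check the v1.5 set.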

Acknowledgements

index-tts