# ComfyUI Dia TTS Nodes
This is an experimental, work-in-progress node pack that integrates the Nari-Labs Dia-1.6B text-to-speech model into ComfyUI using safetensors weights.
Dia generates dialogue with speaker tags (`[S1]`, `[S2]`) and non-verbal sounds (`(laughs)`, etc.). It also supports audio prompting for voice cloning or style transfer.
It requires a CUDA-enabled GPU.
**Note:** This version is specifically configured for the `nari-labs/Dia-1.6B` model architecture.
## Installation
- Ensure you have a CUDA-enabled GPU and the necessary NVIDIA drivers installed.
- Download the Dia-1.6B model safetensors file from Hugging Face:
  - Model page: https://huggingface.co/nari-labs/Dia-1.6B
  - Direct download URL: https://huggingface.co/nari-labs/Dia-1.6B/resolve/main/model.safetensors?download=true
- Place the downloaded `.safetensors` file into your ComfyUI `diffusion_models` directory (e.g., `ComfyUI/models/diffusion_models/`). You might want to rename the file to `Dia-1.6B.safetensors` for clarity.
- Navigate to your `ComfyUI/custom_nodes/` directory.
- Clone this repository:

  ```
  git clone https://github.com/BobRandomNumber/ComfyUI-DiaTTS.git
  ```

  Alternatively, download the ZIP and extract it into `custom_nodes`.
- Install the required dependencies:
  - Activate ComfyUI's Python environment (e.g., `source ./venv/bin/activate`, or `.\venv\Scripts\activate` on Windows).
  - Navigate to the node directory: `cd ComfyUI/custom_nodes/ComfyUI-DiaTTS`
  - Install requirements: `pip install -r requirements.txt`
- Restart ComfyUI.
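Before launching ComfyUI, you can optionally sanity-check the setup with a small script like the one below. The file name `Dia-1.6B.safetensors` and the directory layout follow the steps above, so adjust the path if yours differs:

```python
# verify_dia_setup.py -- a minimal sanity check, assuming the suggested
# file name and a standard ComfyUI directory layout.
import os
import torch

# Adjust this path to wherever your ComfyUI install actually lives.
model_path = os.path.join("ComfyUI", "models", "diffusion_models", "Dia-1.6B.safetensors")

if not os.path.isfile(model_path):
    print(f"Model file not found: {model_path}")
else:
    size_gb = os.path.getsize(model_path) / 1024**3
    print(f"Found model ({size_gb:.2f} GB)")

# Dia requires a CUDA-enabled GPU, so this should report True.
print("CUDA available:", torch.cuda.is_available())
```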
## Nodes
### Dia 1.6b Loader (`DiaLoader`)

Loads the Dia-1.6B TTS model from a local `.safetensors` file in your `diffusion_models` directory, loads the model weights and the required DAC codec onto the GPU, and caches the loaded model to speed up subsequent runs with the same checkpoint.
Inputs:

- `ckpt_name`: Dropdown list of the `.safetensors` files found in your `diffusion_models` directory. Select the file corresponding to the Dia-1.6B model.
Outputs:

- `dia_model`: A custom `DIA_MODEL` object containing the loaded Dia model instance, ready for the `DiaGenerate` node.
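As background, loading from a `.safetensors` file typically follows the pattern below using the `safetensors` library. This is an illustrative sketch of what a loader like this does, not the node's actual code, and the checkpoint path is an assumption:

```python
# Illustrative sketch of loading safetensors weights onto the GPU.
# This is NOT the node's actual implementation, just the general pattern.
from safetensors.torch import load_file

ckpt_path = "ComfyUI/models/diffusion_models/Dia-1.6B.safetensors"  # assumed path

# load_file returns a dict mapping parameter names to tensors.
state_dict = load_file(ckpt_path, device="cuda")
print(f"Loaded {len(state_dict)} tensors")
# A loader node would then construct the Dia model and call
# model.load_state_dict(state_dict) before moving on to the DAC codec.
```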
### Dia TTS Generate (`DiaGenerate`)

Generates audio using a pre-loaded Dia model provided by the `DiaLoader` node. Displays a progress bar during generation and supports optional audio prompting.
Inputs:

- `dia_model`: The `DIA_MODEL` output from the `DiaLoader` node.
- `text`: The main text transcript to generate audio for. Use `[S1]`, `[S2]` for speaker turns and parentheses for non-verbals like `(laughs)`. If using `audio_prompt`, this input MUST contain the transcript of the audio prompt first, followed by the text you want to generate.
- `max_tokens`: Maximum number of audio tokens to generate (controls length). (Default: 1720; maximum usable: 3072)
- `cfg_scale`: Classifier-free guidance scale. Higher values increase adherence to the text. (Default: 3.0)
- `temperature`: Sampling temperature. Lower values are more deterministic; higher values increase randomness. (Default: 1.3)
- `top_p`: Nucleus sampling probability. Filters the vocabulary to the most likely tokens. (Default: 0.95)
- `cfg_filter_top_k`: Top-k filtering applied during CFG. (Default: 35)
- `speed_factor`: Adjusts the speed of the generated audio (1.0 = original speed). (Default: 0.94)
- `seed`: Random seed for reproducibility.
- `audio_prompt` (optional): An `AUDIO` input (e.g., from a `LoadAudio` node) to condition the generation, enabling voice cloning or style transfer.
Outputs:

- `audio`: The generated audio (`AUDIO` format: `{'waveform': tensor[B, C, T], 'sample_rate': sr}`), ready to be saved or previewed. The sample rate is always 44100 Hz.
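For reference when wiring this output into custom nodes, here is a minimal sketch of what an `AUDIO` dictionary in this format looks like (placeholder values, with shapes following the `[B, C, T]` convention above):

```python
import torch

# One second of silent mono audio in ComfyUI's AUDIO format:
# batch of 1, 1 channel, 44100 samples at 44100 Hz.
audio = {
    "waveform": torch.zeros(1, 1, 44100),  # shape [B, C, T]
    "sample_rate": 44100,                  # Dia always outputs 44100 Hz
}

duration = audio["waveform"].shape[-1] / audio["sample_rate"]
print(f"{duration:.2f} s of audio at {audio['sample_rate']} Hz")
```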
## Usage Examples
### Basic Generation
- Add the `Dia 1.6b Loader` node (`audio/DiaTTS`).
- Select your Dia model file (e.g., `Dia-1.6B.safetensors`) from the `ckpt_name` dropdown.
- Add the `Dia TTS Generate` node (`audio/DiaTTS`).
- Connect the `dia_model` output of the Loader to the `dia_model` input of the Generate node.
- Enter your dialogue script into the `text` input on the Generate node (e.g., `[S1] Hello ComfyUI! [S2] This is Dia speaking. (laughs)`).
- Adjust generation parameters as needed.
- Connect the `audio` output of the Generate node to a `SaveAudio` or `PreviewAudio` node.
- Queue the prompt.
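If you prefer to queue this workflow headlessly, ComfyUI exposes an HTTP endpoint (`POST /prompt`) that accepts the graph in API JSON format. The sketch below mirrors the workflow above; the `DiaLoader` and `DiaGenerate` class and input names are taken from this README, and the `SaveAudio` input names are assumptions based on recent ComfyUI builds, so verify the exact schema by exporting your own workflow with "Save (API Format)" first:

```python
import json
import urllib.request

# A three-node graph mirroring the steps above. Node IDs are arbitrary
# strings; ["1", 0] means "output slot 0 of node 1".
workflow = {
    "1": {
        "class_type": "DiaLoader",
        "inputs": {"ckpt_name": "Dia-1.6B.safetensors"},
    },
    "2": {
        "class_type": "DiaGenerate",
        "inputs": {
            "dia_model": ["1", 0],
            "text": "[S1] Hello ComfyUI! [S2] This is Dia speaking. (laughs)",
            "max_tokens": 1720,
            "cfg_scale": 3.0,
            "temperature": 1.3,
            "top_p": 0.95,
            "cfg_filter_top_k": 35,
            "speed_factor": 0.94,
            "seed": 42,
        },
    },
    "3": {
        "class_type": "SaveAudio",  # input names assumed from recent builds
        "inputs": {"audio": ["2", 0], "filename_prefix": "dia_tts"},
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # default local ComfyUI address
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```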
### Generation with Audio Prompt (Voice Cloning)
- Set up the `DiaLoader` as above.
- Add a `LoadAudio` node and load the `.wav` or `.mp3` file containing the voice you want to clone.
- Add the `Dia TTS Generate` node.
- Connect `dia_model` from the Loader to the Generate node.
- Connect the `AUDIO` output of `LoadAudio` to the `audio_prompt` input of the Generate node.
- Crucially: in the `text` input of the `Dia TTS Generate` node, you must provide the transcript of the audio prompt first, followed by the new text you want generated in that voice (see the helper sketch after this list).
  - Example `text` input: `[S1] This is the exact transcript of the audio file I loaded into LoadAudio. [S2] It has the voice characteristics I want. (clears throat) [S1] Now generate this new sentence using that voice. [S2] This part will be synthesized.`
- Adjust other generation parameters. Note that `cfg_scale`, `temperature`, etc., affect how closely the generation follows the style of the prompt versus the text content.
- Connect the `audio` output to `SaveAudio` or `PreviewAudio`.
- Queue the prompt. The output audio should contain only the synthesized part (the text after the prompt transcript).
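Because the prompt transcript and the new text share the single `text` field, the ordering is easy to get wrong. The hypothetical helper below (not part of this node pack) makes the rule explicit:

```python
def build_cloning_text(prompt_transcript: str, new_text: str) -> str:
    """Compose the `text` input for audio-prompted generation.

    The transcript of the loaded audio MUST come first; Dia aligns it
    with the audio prompt, then synthesizes only `new_text`.
    """
    return f"{prompt_transcript.strip()} {new_text.strip()}"

text_input = build_cloning_text(
    "[S1] This is the exact transcript of the audio file I loaded into LoadAudio.",
    "[S1] Now generate this new sentence using that voice.",
)
print(text_input)
```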
## Features
- Generate dialogue via `[S1]`, `[S2]` tags.
- Generate non-verbal sounds like `(laughs)`, `(coughs)`, etc.
  - Supported tags: `(laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)`. Recognition may vary; see the linting sketch after this list.
- Audio prompting: use an audio file and its transcript to guide voice style/cloning for new text generation.
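Since tag recognition may vary, it can be useful to lint a script against the supported list before queueing. The helper below is a hypothetical convenience built from the tag list above, not part of the node pack:

```python
import re

# The non-verbal tags listed above; anything else in parentheses is
# unlikely to be recognized by Dia.
SUPPORTED_TAGS = {
    "laughs", "clears throat", "sighs", "gasps", "coughs", "singing",
    "sings", "mumbles", "beep", "groans", "sniffs", "claps", "screams",
    "inhales", "exhales", "applause", "burps", "humming", "sneezes",
    "chuckle", "whistles",
}

def unknown_tags(script: str) -> set[str]:
    """Return parenthesized tags in `script` that Dia may not recognize."""
    return {t for t in re.findall(r"\(([^)]+)\)", script) if t not in SUPPORTED_TAGS}

print(unknown_tags("[S1] Hello! (laughs) [S2] Bye. (yawns)"))  # -> {'yawns'}
```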
## Notes
- This node pack requires a CUDA-enabled GPU.
- The `.safetensors` weights file for Dia-1.6B is required.
- The first time you run the node, the `descript-audio-codec` model will be downloaded automatically. Subsequent runs will be faster.
- The `descript-audio-codec` dependency must be installed via `requirements.txt`.
- When using `audio_prompt`, ensure the provided `text` input includes the prompt's transcript first. The model uses this text alignment to understand the audio prompt.