ComfyUI Extension: ComfyUI-DiaTTS
An implementation of Nari-Labs Dia TTS
ComfyUI Dia TTS Nodes
This is an experimental WIP node pack that integrates the Nari-Labs Dia 1.6b text-to-speech model into ComfyUI using safetensors.
Dia allows generating dialogue with speaker tags ([S1], [S2]) and non-verbal sounds ((laughs), etc.). It also supports audio prompting for voice cloning or style transfer.
It requires a CUDA-enabled GPU.
Note: This version is specifically configured for the nari-labs/Dia-1.6B model architecture.

Installation
- Ensure you have a CUDA-enabled GPU and the necessary NVIDIA drivers installed.
- Download the Dia-1.6B model safetensors file from Hugging Face:
- Model Page: https://huggingface.co/nari-labs/Dia-1.6B
- Direct Download URL: https://huggingface.co/nari-labs/Dia-1.6B/resolve/main/model.safetensors?download=true
- Place the downloaded `.safetensors` file into your ComfyUI `diffusion_models` directory (e.g., `ComfyUI/models/diffusion_models/`). You may want to rename the file to `Dia-1.6B.safetensors` for clarity.
- Navigate to your `ComfyUI/custom_nodes/` directory and clone this repository:
  `git clone https://github.com/BobRandomNumber/ComfyUI-DiaTTS.git`
  Alternatively, download the ZIP and extract it into `custom_nodes`.
- Install the required dependencies:
  - Activate ComfyUI's Python environment (e.g., `source ./venv/bin/activate`, or `.\venv\Scripts\activate` on Windows).
  - Navigate to the node directory: `cd ComfyUI/custom_nodes/ComfyUI-DiaTTS`
  - Install requirements: `pip install -r requirements.txt`
- Restart ComfyUI.
Nodes
Dia 1.6b Loader (DiaLoader)
Loads the Dia-1.6B TTS model from a local .safetensors file located in your diffusion_models directory. Loads the model weights and the required DAC codec onto the GPU. Caches the loaded model to speed up subsequent runs with the same checkpoint.
Inputs:
- `ckpt_name`: Dropdown list of `.safetensors` files found within your `diffusion_models` directory. Select the file corresponding to the Dia-1.6B model.
Outputs:
- `dia_model`: A custom `DIA_MODEL` object containing the loaded Dia model instance, ready for the `DiaGenerate` node.
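The checkpoint caching described above can be pictured as a simple keyed cache. This is an illustrative sketch only, not the node's actual code; the names `_model_cache` and `load_dia_model` are hypothetical:

```python
# Illustrative sketch of checkpoint caching (not the node's internal code):
# the loader keeps the loaded model keyed by its checkpoint name, so
# re-running a workflow with the same ckpt_name skips the expensive load.
_model_cache = {}

def load_dia_model(ckpt_name, loader_fn):
    """Return a cached model for ckpt_name, calling loader_fn only on a miss."""
    if ckpt_name not in _model_cache:
        _model_cache[ckpt_name] = loader_fn(ckpt_name)
    return _model_cache[ckpt_name]

# The second call with the same name reuses the first result.
calls = []
m1 = load_dia_model("Dia-1.6B.safetensors", lambda n: calls.append(n) or object())
m2 = load_dia_model("Dia-1.6B.safetensors", lambda n: calls.append(n) or object())
```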
Dia TTS Generate (DiaGenerate)
Generates audio using a pre-loaded Dia model provided by the DiaLoader node. Displays a progress bar during generation. Supports optional audio prompting.
Inputs:
- `dia_model`: The `DIA_MODEL` output from the `DiaLoader` node.
- `text`: The main text transcript to generate audio for. Use `[S1]`, `[S2]` for speaker turns and parentheses for non-verbals like `(laughs)`. If using `audio_prompt`, this input MUST contain the transcript of the audio prompt first, followed by the text you want to generate.
- `max_tokens`: Maximum number of audio tokens to generate (controls length). Default is 1720; the maximum usable value is 3072.
- `cfg_scale`: Classifier-Free Guidance scale. Higher values increase adherence to the text. (Default: 3.0)
- `temperature`: Sampling temperature. Lower values are more deterministic; higher values increase randomness. (Default: 1.3)
- `top_p`: Nucleus sampling probability. Filters the vocabulary to the most likely tokens. (Default: 0.95)
- `cfg_filter_top_k`: Top-K filtering applied during CFG. (Default: 35)
- `speed_factor`: Adjusts the speed of the generated audio (1.0 = original speed). (Default: 0.94)
- `seed`: Random seed for reproducibility.
- `audio_prompt` (optional): An `AUDIO` input (e.g., from a `LoadAudio` node) to condition the generation, enabling voice cloning or style transfer.
Outputs:
- `audio`: The generated audio (`AUDIO` format: `{'waveform': tensor[B, C, T], 'sample_rate': sr}`), ready to be saved or previewed. The sample rate is always 44100 Hz.
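The shape of the `AUDIO` payload can be inspected like any dict. A minimal sketch, with a nested list standing in for the torch tensor ComfyUI actually passes:

```python
# Minimal sketch of the AUDIO payload (ComfyUI passes a torch tensor of
# shape [batch, channels, samples]; a nested list stands in here).
sample_rate = 44100                   # Dia output is always 44100 Hz
samples = [0.0] * (2 * sample_rate)   # 2 seconds of silence, mono
audio = {"waveform": [[samples]], "sample_rate": sample_rate}

batch = len(audio["waveform"])                                # B
channels = len(audio["waveform"][0])                          # C
duration_s = len(audio["waveform"][0][0]) / audio["sample_rate"]
```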
Usage Example
Basic Generation
- Add the `Dia 1.6b Loader` node (audio/DiaTTS).
- Select your Dia model file (e.g., `Dia-1.6B.safetensors`) from the `ckpt_name` dropdown.
- Add the `Dia TTS Generate` node (audio/DiaTTS).
- Connect the `dia_model` output of the Loader to the `dia_model` input of the Generate node.
- Enter your dialogue script into the `text` input on the Generate node (e.g., `[S1] Hello ComfyUI! [S2] This is Dia speaking. (laughs)`).
- Adjust generation parameters as needed.
- Connect the `audio` output of the Generate node to a `SaveAudio` or `PreviewAudio` node.
- Queue the prompt.
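The dialogue script is just a flat string with speaker tags. A small helper can build one; `format_dialogue` is a hypothetical convenience function for illustration, not part of the node pack:

```python
def format_dialogue(lines):
    """Join (speaker, text) pairs into Dia's flat transcript format.
    Speaker tags like [S1]/[S2] mark turns; non-verbals such as (laughs)
    are written inline as part of the text."""
    return " ".join(f"[{spk}] {text}" for spk, text in lines)

script = format_dialogue([
    ("S1", "Hello ComfyUI!"),
    ("S2", "This is Dia speaking. (laughs)"),
])
```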
Generation with Audio Prompt (Voice Cloning)
- Set up the `DiaLoader` as above.
- Add a `LoadAudio` node and load the `.wav` or `.mp3` file containing the voice you want to clone.
- Add the `Dia TTS Generate` node.
- Connect `dia_model` from the Loader to the Generate node.
- Connect the `AUDIO` output of `LoadAudio` to the `audio_prompt` input of the Generate node.
- Crucially: in the `text` input of the `Dia TTS Generate` node, you must provide the transcript of the audio prompt first, followed by the new text you want generated in that voice.
  - Example `text` input: `[S1] This is the exact transcript of the audio file I loaded into LoadAudio. [S2] It has the voice characteristics I want. (clears throat) [S1] Now generate this new sentence using that voice. [S2] This part will be synthesized.`
- Adjust other generation parameters. Note that `cfg_scale`, `temperature`, etc., will affect how closely the generation follows the style of the prompt versus the text content.
- Connect the `audio` output to `SaveAudio` or `PreviewAudio`.
- Queue the prompt. The output audio should contain only the synthesized part (the text after the prompt transcript).
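The text-input rule for audio prompting is plain string concatenation: prompt transcript first, new text after. A sketch with a hypothetical helper name:

```python
def build_prompt_text(prompt_transcript, new_text):
    """Prepend the audio prompt's exact transcript to the text to synthesize.
    Dia aligns the transcript with the audio prompt; only the audio for
    new_text appears in the generated output."""
    return f"{prompt_transcript.strip()} {new_text.strip()}"

text_input = build_prompt_text(
    "[S1] This is the exact transcript of the loaded audio.",
    "[S1] Now generate this new sentence using that voice.",
)
```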
Features
- Generate dialogue via `[S1]`, `[S2]` tags.
- Generate non-verbal sounds like `(laughs)`, `(coughs)`, etc.
  - Supported tags: `(laughs)`, `(clears throat)`, `(sighs)`, `(gasps)`, `(coughs)`, `(singing)`, `(sings)`, `(mumbles)`, `(beep)`, `(groans)`, `(sniffs)`, `(claps)`, `(screams)`, `(inhales)`, `(exhales)`, `(applause)`, `(burps)`, `(humming)`, `(sneezes)`, `(chuckle)`, `(whistles)`. Recognition may vary.
- Audio prompting: use an audio file and its transcript to guide voice style/cloning for new text generation.
Notes
- This node pack requires a CUDA-enabled GPU.
- The `.safetensors` weights file for Dia-1.6B is required.
- The first time you run the node, the `descript-audio-codec` model will be downloaded automatically. Subsequent runs will be faster.
- The dependency `descript-audio-codec` must be installed via `requirements.txt`.
- When using `audio_prompt`, ensure the provided `text` input correctly includes the prompt's transcript first. The model uses this text alignment to understand the audio prompt.
- WARNING: `descript-audio-codec` installs `protobuf >= 3.19.6, != 4.24.0, < 5.0.0` as a subdependency. A user reported that pip installed 3.19.6 (picked up from an old version in the pip cache) and caused errors in other nodes; with a clean cache, pip installs 4.25.8.
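The protobuf constraint above can be checked against whatever version is installed. A pure-Python sketch of the version pin (real tooling would use `packaging.specifiers`; this simple tuple comparison handles only plain `X.Y.Z` versions):

```python
def parse(version):
    """Turn a plain 'X.Y.Z' version string into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def satisfies_protobuf_pin(version):
    """Check a version against '>= 3.19.6, != 4.24.0, < 5.0.0'."""
    v = parse(version)
    return parse("3.19.6") <= v < parse("5.0.0") and v != parse("4.24.0")
```

So `4.25.8` (the clean-cache resolution mentioned above) satisfies the pin, while the excluded `4.24.0` and anything at or above `5.0.0` do not.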