ComfyUI Extension: ComfyUI DMOSpeech2 Node

This is a DMOSpeech2 ComfyUI plugin, designed to be easy to use.


    A ComfyUI custom node implementation of DMOSpeech2 - Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis.

    Features

    • High-quality text-to-speech synthesis with metric optimization
    • Reinforcement learning-based duration prediction
    • Teacher-guided sampling for improved diversity
    • Support for multi-speaker speech generation
    • Zero-shot voice cloning capabilities

    Installation

    1. Navigate to your ComfyUI custom nodes directory:
    cd ComfyUI/custom_nodes/
    
    2. Clone this repository:
    git clone https://github.com/HM-RunningHub/ComfyUI_RH_DMOSpeech2.git
    
    3. Install requirements:
    cd ComfyUI_RH_DMOSpeech2
    pip install -r requirements.txt
    
    4. Restart ComfyUI
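
    The same steps as a single copy-paste block (assuming ComfyUI/ is your ComfyUI checkout in the current directory):

    # steps 1-3 combined; restart ComfyUI afterwards
    cd ComfyUI/custom_nodes/ \
      && git clone https://github.com/HM-RunningHub/ComfyUI_RH_DMOSpeech2.git \
      && cd ComfyUI_RH_DMOSpeech2 \
      && pip install -r requirements.txt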

    Model Download

    You need to download four model components into ComfyUI/models/DMOSpeech2/, organized into the directories listed below (see the Models Directory Structure section for the final layout):

    1. DMOSpeech2 Main Models

    Create the ckpts directory and download the two checkpoints:

    # run from the ComfyUI root directory
    mkdir -p models/DMOSpeech2/ckpts
    cd models/DMOSpeech2/ckpts
    wget https://huggingface.co/yl4579/DMOSpeech2/resolve/main/model_85000.pt
    wget https://huggingface.co/yl4579/DMOSpeech2/resolve/main/model_1500.pt
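
    If wget is unavailable, the same two checkpoints can be fetched with huggingface-cli; this is a sketch assuming the huggingface_hub package is installed and you are still inside the ckpts directory:

    # alternative to the wget commands above
    huggingface-cli download yl4579/DMOSpeech2 model_85000.pt model_1500.pt --local-dir .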
    

    2. Emilia Vocabulary

    Create the Emilia_ZH_EN_pinyin directory and download the vocab file:

    # run from models/DMOSpeech2
    mkdir Emilia_ZH_EN_pinyin
    # vocab.txt lives at
    # https://huggingface.co/spaces/mrfakename/E2-F5-TTS/blob/27cee60c0890d22dab124730a73d5453fc8359a5/data/Emilia_ZH_EN_pinyin/vocab.txt
    # and can be downloaded directly by replacing /blob/ with /resolve/ in that link:
    wget -P Emilia_ZH_EN_pinyin https://huggingface.co/spaces/mrfakename/E2-F5-TTS/resolve/27cee60c0890d22dab124730a73d5453fc8359a5/data/Emilia_ZH_EN_pinyin/vocab.txt
    

    3. Vocos Vocoder

    Download the Vocos mel vocoder model:

    # run from models/DMOSpeech2
    # The vocos-mel-24khz directory needs all files (config.yaml, pytorch_model.bin, etc.) from
    # https://huggingface.co/charactr/vocos-mel-24khz/tree/main
    # e.g. with huggingface-cli (installed with the huggingface_hub package):
    huggingface-cli download charactr/vocos-mel-24khz --local-dir vocos-mel-24khz

    4. Whisper Model

    Download the Whisper large-v3-turbo model:

    # run from models/DMOSpeech2
    # The whisper-large-v3-turbo directory needs all tokenizer and model files from
    # https://huggingface.co/openai/whisper-large-v3-turbo/tree/main
    # e.g. with huggingface-cli:
    huggingface-cli download openai/whisper-large-v3-turbo --local-dir whisper-large-v3-turbo

    Usage

    1. Load the DMOSpeech2 node in ComfyUI
    2. Connect text input and reference audio (for voice cloning)
    3. Configure generation parameters
    4. Generate high-quality speech output
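
    For scripted or batch use, workflows can also be queued through ComfyUI's HTTP API. The sketch below is only an illustration: the DMOSpeech2 node's class name and input names are placeholders (check the real names in ComfyUI's node picker or in this repository's node definitions), and the audio-loading node may differ between ComfyUI versions.

    # Hypothetical example: queue a workflow via ComfyUI's /prompt endpoint.
    # "DMOSpeech2_Generate" and its input names are placeholders, not confirmed node names.
    curl -s -X POST http://127.0.0.1:8188/prompt \
      -H "Content-Type: application/json" \
      -d '{
        "prompt": {
          "1": {"class_type": "LoadAudio",
                "inputs": {"audio": "reference_speaker.wav"}},
          "2": {"class_type": "DMOSpeech2_Generate",
                "inputs": {"text": "Hello, this is a DMOSpeech2 test.",
                           "reference_audio": ["1", 0]}}
        }
      }'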

    Models Directory Structure

    DMOSpeech2/
    ├── ckpts/
    │   ├── model_1500.pt           # Duration predictor
    │   └── model_85000.pt          # Main DMOSpeech2 model
    ├── Emilia_ZH_EN_pinyin/
    │   └── vocab.txt               # Vocabulary file
    ├── vocos-mel-24khz/            # Vocos vocoder
    │   ├── config.yaml
    │   ├── pytorch_model.bin
    │   └── ...
    └── whisper-large-v3-turbo/     # Whisper ASR model
        ├── config.json
        ├── model.safetensors
        └── ...
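
    Before restarting ComfyUI, a quick shell check (assuming the models live under ComfyUI/models/DMOSpeech2 as shown above) can confirm that nothing is missing:

    # run from the ComfyUI root directory; reports any missing model files
    cd models/DMOSpeech2
    for f in ckpts/model_1500.pt ckpts/model_85000.pt \
             Emilia_ZH_EN_pinyin/vocab.txt \
             vocos-mel-24khz/config.yaml vocos-mel-24khz/pytorch_model.bin \
             whisper-large-v3-turbo/config.json whisper-large-v3-turbo/model.safetensors; do
      [ -e "$f" ] && echo "OK      $f" || echo "MISSING $f"
    done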
    
    
    Credits

    • Original DMOSpeech2: [yl4579/DMOSpeech2](https://github.com/yl4579/DMOSpeech2)
    • Based on the F5-TTS architecture
    • Implements GRPO (Group Relative Preference Optimization) for duration prediction
    
    License
    
    MIT License