ComfyUI Extension: ComfyUI Ovi Node
A custom node for ComfyUI that integrates Ovi for synchronized video+audio generation from text or image inputs.
Custom Nodes (0)
README
ComfyUI Ovi Node
A custom node for ComfyUI that integrates Ovi for synchronized video+audio generation from text or image inputs.
✨ Features
- 🎬 Joint Video+Audio Generation: Generate synchronized video and audio content simultaneously
- 📝 Text-to-Video+Audio: Create videos from text prompts with speech and sound effects
- 🖼️ Image-to-Video+Audio: Generate videos from image and text inputs
- ⏱️ 5-Second Videos: 24 FPS, 720×720 area, multiple aspect ratios (9:16, 16:9, 1:1, etc)
- ⚙️ Memory Optimization: FP8 precision + CPU offload for 24GB VRAM GPUs
- 🚀 Flexible Control: Advanced parameter control for quality fine-tuning
🔧 Node List
Core Nodes
- RunningHub Ovi Model Loader: Load and initialize Ovi engine with optimization options
- RunningHub Ovi Text to Video: Generate video+audio from text prompts
- RunningHub Ovi Image to Video: Generate video+audio from image and text inputs
🚀 Quick Installation
Step 1: Install the Node
# Navigate to ComfyUI custom_nodes directory
cd ComfyUI/custom_nodes/
# Clone the repository
git clone https://github.com/HM-RunningHub/ComfyUI_RH_Ovi.git
cd ComfyUI_RH_Ovi
# Install dependencies
pip install -r requirements.txt
# Install Flash Attention
pip install flash_attn --no-build-isolation
Step 2: Download Required Models
# Download models (will download to ComfyUI/models/Ovi by default)
python download_weights.py
# Download fp8 quantized model (for 24GB VRAM mode)
cd ../../models/Ovi
wget -O "model_fp8_e4m3fn.safetensors" \
"https://huggingface.co/rkfg/Ovi-fp8_quantized/resolve/main/model_fp8_e4m3fn.safetensors"
cd ../../custom_nodes/ComfyUI_RH_Ovi
# Final model structure should look like:
# ComfyUI/models/Ovi/
# ├── MMAudio/
# │ └── ext_weights/
# │ ├── best_netG.pt
# │ └── v1-16.pth
# ├── Ovi/
# │ ├── model.safetensors
# │ └── model_fp8_e4m3fn.safetensors
# └── Wan2.2-TI2V-5B/
# ├── google/umt5-xxl/
# ├── models_t5_umt5-xxl-enc-bf16.pth
# └── Wan2.2_VAE.pth
# Restart ComfyUI
📖 Usage
Basic Workflow
[RunningHub Ovi Model Loader] → [RunningHub Ovi Text to Video] → [Save/Preview Video]
<img width="1159" height="583" alt="image" src="https://github.com/user-attachments/assets/1cc83355-6253-413e-84e4-ba7582dccdf6" />
Prompt Format
Ovi uses special tags to control speech and audio:
- Speech:
<S>Your speech content here<E>
- Text will be converted to speech - Audio Description:
<AUDCAP>Audio description here<ENDAUDCAP>
- Describes audio/sound effects
Example Prompt:
<S>Hello world!<E> <AUDCAP>Soft piano music playing<ENDAUDCAP>
Generation Types
Text-to-Video+Audio
- Connect
RunningHub Ovi Model Loader
toRunningHub Ovi Text to Video
- Input text prompt with speech and audio tags
- Set video dimensions, seed, and generation parameters
- Generate synchronized video+audio
Image-to-Video+Audio
- Load an image using ComfyUI's
Load Image
node - Connect image and
ovi_engine
toRunningHub Ovi Image to Video
- Input text prompt with speech and audio tags
- Generate video+audio based on the image
Example Prompts
- Text-to-Video: See example_prompts/gpt_examples_t2v.csv
- Image-to-Video: See example_prompts/gpt_examples_i2v.csv
🛠️ Technical Requirements
- GPU: 24GB+ VRAM (with CPU offload + FP8 optimization)
- 32GB+ VRAM without optimization
- RAM: 32GB+ recommended
- Storage: ~30GB for all models
- Ovi models: ~12GB
- MMAudio: ~2GB
- Wan2.2-TI2V-5B: ~13GB
- FP8 quantized model: ~6GB
- CUDA: Required for optimal performance
⚠️ Important Notes
- Model Paths: Models must be placed in
ComfyUI/models/Ovi/
directory - Default Configuration: Model Loader defaults to CPU offload + FP8 for 24GB VRAM
- Disable both for 32GB+ VRAM (better quality, faster inference)
- FP8 Model: Required for 24GB VRAM mode (slight quality degradation)
- All model files must be downloaded before first use
📄 License
This project is based on the original Ovi project.
🔗 References
⭐ Citation
If you find this project useful, please consider citing the original Ovi paper:
@misc{low2025ovitwinbackbonecrossmodal,
title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation},
author={Chetwin Low and Weimin Wang and Calder Katyal},
year={2025},
eprint={2510.01284},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2510.01284},
}