ComfyUI Extension: ComfyUI Ovi Node

Authored by HM-RunningHub

A custom node for ComfyUI that integrates Ovi for synchronized video+audio generation from text or image inputs.

✨ Features

🎬 Joint Video+Audio Generation: Generate synchronized video and audio content simultaneously
📝 Text-to-Video+Audio: Create videos from text prompts with speech and sound effects
🖼️ Image-to-Video+Audio: Generate videos from image and text inputs
⏱️ 5-Second Videos: 24 FPS, total pixel area equivalent to 720×720, multiple aspect ratios (9:16, 16:9, 1:1, etc.)
⚙️ Memory Optimization: FP8 precision + CPU offload for 24GB VRAM GPUs
🚀 Flexible Control: Advanced parameter control for quality fine-tuning
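The 720×720 figure in the feature list can be read as a pixel budget: width × height stays close to 720 × 720 = 518,400 while the aspect ratio changes. The sketch below works through that arithmetic. The function name `dims_for_aspect` and the snap-to-a-multiple-of-16 rounding are assumptions made for illustration, not the node's actual sizing logic.

```python
import math

def dims_for_aspect(aspect_w, aspect_h, area=720 * 720, multiple=16):
    """Return (width, height) close to `area` pixels for the given aspect ratio.

    `multiple` is an assumed alignment; the real node may snap differently.
    """
    ratio = aspect_w / aspect_h
    height = math.sqrt(area / ratio)  # from width * height = area, width = ratio * height
    width = height * ratio

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)

if __name__ == "__main__":
    for label, (w, h) in {"1:1": (1, 1), "16:9": (16, 9), "9:16": (9, 16)}.items():
        print(label, dims_for_aspect(w, h))
```

Under these assumptions, 16:9 comes out as 960×544 and 1:1 as 720×720, both within about 1% of the 518,400-pixel budget.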

🔧 Node List

Core Nodes

RunningHub Ovi Model Loader: Load and initialize Ovi engine with optimization options
RunningHub Ovi Text to Video: Generate video+audio from text prompts
RunningHub Ovi Image to Video: Generate video+audio from image and text inputs

🚀 Quick Installation

Step 1: Install the Node

# Navigate to ComfyUI custom_nodes directory
cd ComfyUI/custom_nodes/

# Clone the repository
git clone https://github.com/HM-RunningHub/ComfyUI_RH_Ovi.git

cd ComfyUI_RH_Ovi

# Install dependencies
pip install -r requirements.txt

# Install Flash Attention
pip install flash_attn --no-build-isolation

Step 2: Download Required Models

# Download models (will download to ComfyUI/models/Ovi by default)
python download_weights.py

# Download fp8 quantized model (for 24GB VRAM mode) into ComfyUI/models/Ovi/Ovi/
cd ../../models/Ovi/Ovi
wget -O "model_fp8_e4m3fn.safetensors" \
  "https://huggingface.co/rkfg/Ovi-fp8_quantized/resolve/main/model_fp8_e4m3fn.safetensors"
cd ../../../custom_nodes/ComfyUI_RH_Ovi

# Final model structure should look like:
# ComfyUI/models/Ovi/
# ├── MMAudio/
# │   └── ext_weights/
# │       ├── best_netG.pt
# │       └── v1-16.pth
# ├── Ovi/
# │   ├── model.safetensors
# │   └── model_fp8_e4m3fn.safetensors
# └── Wan2.2-TI2V-5B/
#     ├── google/umt5-xxl/
#     ├── models_t5_umt5-xxl-enc-bf16.pth
#     └── Wan2.2_VAE.pth

# Restart ComfyUI
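
Since all model files must be in place before first use, a quick check of the directory tree can save a failed first run. The sketch below compares a models directory against the structure listed above. The helper name `missing_model_paths` is hypothetical, and the default root path assumes the script runs from the directory that contains `ComfyUI/`.

```python
from pathlib import Path

# Relative paths copied from the model structure listed above.
EXPECTED_FILES = [
    "MMAudio/ext_weights/best_netG.pt",
    "MMAudio/ext_weights/v1-16.pth",
    "Ovi/model.safetensors",
    "Ovi/model_fp8_e4m3fn.safetensors",  # only needed for 24GB VRAM mode
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
]
EXPECTED_DIRS = ["Wan2.2-TI2V-5B/google/umt5-xxl"]

def missing_model_paths(root):
    """Return the expected files and directories that are absent under `root`."""
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [p for p in EXPECTED_DIRS if not (root / p).is_dir()]
    return missing

if __name__ == "__main__":
    # Assumed location; adjust to your ComfyUI install.
    missing = missing_model_paths("ComfyUI/models/Ovi")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected model files found.")
```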

📖 Usage

Basic Workflow

[RunningHub Ovi Model Loader] → [RunningHub Ovi Text to Video] → [Save/Preview Video]

Prompt Format

Ovi uses special tags to control speech and audio:

Speech: <S>Your speech content here<E> - Text will be converted to speech
Audio Description: <AUDCAP>Audio description here<ENDAUDCAP> - Describes audio/sound effects

Example Prompt:

<S>Hello world!<E> <AUDCAP>Soft piano music playing<ENDAUDCAP>
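
When prompts are generated programmatically, a small helper keeps the tags consistent. This is a minimal sketch: `build_ovi_prompt` is a hypothetical name, and the optional `scene` argument (free-form visual description placed before the tags) is an assumption about how you might lay out a prompt, not a documented requirement.

```python
def build_ovi_prompt(speech_lines, audio_caption="", scene=""):
    """Assemble a prompt string using the <S>...<E> and <AUDCAP>...<ENDAUDCAP> tags."""
    parts = []
    if scene.strip():
        parts.append(scene.strip())
    # Each spoken line gets its own speech tag pair.
    parts += [f"<S>{line.strip()}<E>" for line in speech_lines if line.strip()]
    if audio_caption.strip():
        parts.append(f"<AUDCAP>{audio_caption.strip()}<ENDAUDCAP>")
    return " ".join(parts)

if __name__ == "__main__":
    # Reproduces the example prompt above.
    print(build_ovi_prompt(["Hello world!"], "Soft piano music playing"))
```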

Generation Types

Text-to-Video+Audio

Connect RunningHub Ovi Model Loader to RunningHub Ovi Text to Video
Input text prompt with speech and audio tags
Set video dimensions, seed, and generation parameters
Generate synchronized video+audio

Image-to-Video+Audio

Load an image using ComfyUI's Load Image node
Connect image and ovi_engine to RunningHub Ovi Image to Video
Input text prompt with speech and audio tags
Generate video+audio based on the image
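
Both generation types take a prompt with speech and audio tags, so it can help to check the tags before queuing a long generation. The sketch below extracts the tagged segments and flags unbalanced tag pairs. `check_prompt` is a hypothetical helper, not part of the node.

```python
import re

SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)
AUDCAP_RE = re.compile(r"<AUDCAP>(.*?)<ENDAUDCAP>", re.DOTALL)
TAG_PAIRS = (("<S>", "<E>"), ("<AUDCAP>", "<ENDAUDCAP>"))

def check_prompt(prompt):
    """Return the speech and audio segments found in `prompt`, plus any tag problems."""
    problems = [
        f"unbalanced {open_tag}...{close_tag}"
        for open_tag, close_tag in TAG_PAIRS
        if prompt.count(open_tag) != prompt.count(close_tag)
    ]
    return {
        "speech": [s.strip() for s in SPEECH_RE.findall(prompt)],
        "audio": [a.strip() for a in AUDCAP_RE.findall(prompt)],
        "problems": problems,
    }

if __name__ == "__main__":
    print(check_prompt("<S>Hello world!<E> <AUDCAP>Soft piano music playing<ENDAUDCAP>"))
```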

Example Prompts

Text-to-Video: See example_prompts/gpt_examples_t2v.csv
Image-to-Video: See example_prompts/gpt_examples_i2v.csv

🛠️ Technical Requirements

GPU: 24GB+ VRAM (with CPU offload + FP8 optimization)
- 32GB+ VRAM without optimization
RAM: 32GB+ recommended
Storage: ~30GB for all models
- Ovi models: ~12GB
- MMAudio: ~2GB
- Wan2.2-TI2V-5B: ~13GB
- FP8 quantized model: ~6GB
CUDA: Required for optimal performance
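
The itemized model sizes above sum to roughly 33GB, slightly more than the ~30GB headline figure, so it is worth confirming free disk space before running the downloads. A minimal sketch using only the standard library follows; the helper name `has_room` and the choice to budget by the itemized sum are assumptions.

```python
import shutil

# Approximate sizes in GB, taken from the storage list above.
MODEL_SIZES_GB = {
    "Ovi models": 12,
    "MMAudio": 2,
    "Wan2.2-TI2V-5B": 13,
    "FP8 quantized model": 6,
}
REQUIRED_GB = sum(MODEL_SIZES_GB.values())

def has_room(free_bytes, required_gb=REQUIRED_GB):
    """True if `free_bytes` covers the approximate download size."""
    return free_bytes >= required_gb * 1024**3

if __name__ == "__main__":
    free = shutil.disk_usage(".").free  # free space on the current drive
    print(f"Need ~{REQUIRED_GB} GB, have {free / 1024**3:.1f} GB:",
          "OK" if has_room(free) else "not enough space")
```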

⚠️ Important Notes

Model Paths: Models must be placed in ComfyUI/models/Ovi/ directory
Default Configuration: Model Loader defaults to CPU offload + FP8 for 24GB VRAM
- Disable both for 32GB+ VRAM (better quality, faster inference)
FP8 Model: Required for 24GB VRAM mode (slight quality degradation)
All model files must be downloaded before first use
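
The notes above amount to a simple decision rule on VRAM, sketched below. The option names `fp8` and `cpu_offload` are hypothetical labels for the two Model Loader toggles, not the node's actual input names, and `vram_gb` is the card's nominal VRAM (for example 24 for a 24GB card).

```python
def suggest_loader_options(vram_gb):
    """Map nominal GPU VRAM (in GB) to the loader settings described above."""
    if vram_gb >= 32:
        # Enough memory to skip both optimizations: better quality, faster inference.
        return {"fp8": False, "cpu_offload": False}
    if vram_gb >= 24:
        # Default configuration: FP8 weights plus CPU offload.
        return {"fp8": True, "cpu_offload": True}
    raise ValueError("Below the documented 24GB VRAM minimum")

if __name__ == "__main__":
    for gb in (24, 32, 48):
        print(gb, suggest_loader_options(gb))
```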

📄 License

This project is based on the original Ovi project.

⭐ Citation

If you find this project useful, please consider citing the original Ovi paper:

@misc{low2025ovitwinbackbonecrossmodal,
      title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation}, 
      author={Chetwin Low and Weimin Wang and Calder Katyal},
      year={2025},
      eprint={2510.01284},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2510.01284}, 
}