# ComfyUI HunyuanVideo-Foley Custom Node
This is a ComfyUI custom node wrapper for the HunyuanVideo-Foley model, which generates realistic audio from video and text descriptions.
## Features
- Text-Video-to-Audio Synthesis: Generate realistic audio that matches your video content
- Flexible Text Prompts: Use optional text descriptions to guide audio generation
- Multiple Samples: Generate up to 6 different audio variations per inference
- Configurable Parameters: Control guidance scale, inference steps, and sampling
- Seed Control: Reproducible results with seed parameter
- Model Caching: Efficient model loading and reuse across generations
- Automatic Model Downloads: Models are automatically downloaded to `ComfyUI/models/foley/` when needed

<img width="2560" height="1440" alt="image" src="https://github.com/user-attachments/assets/cace6b70-0eb7-4eda-a4f5-c21c95559b38" />
## Installation

1. Clone this repository into your ComfyUI `custom_nodes` directory:

   ```bash
   cd ComfyUI/custom_nodes
   git clone https://github.com/if-ai/ComfyUI_HunyuanVideoFoley.git
   ```

2. Install dependencies:

   ```bash
   cd ComfyUI_HunyuanVideoFoley
   pip install -r requirements.txt
   ```

3. Run the installation script (recommended):

   ```bash
   python install.py
   ```

4. Restart ComfyUI to load the new nodes.
## Model Setup

The models can be obtained in two ways:

### Option 1: Automatic Download (Recommended)

- Models will be automatically downloaded to `ComfyUI/models/foley/` when you first run the node
- No manual setup required
- Progress will be shown in the ComfyUI console

### Option 2: Manual Download

- Download models from HuggingFace
- Place models in `ComfyUI/models/foley/` (recommended) or the `./pretrained_models/` directory
- Ensure the config file is at `configs/hunyuanvideo-foley-xxl.yaml`
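If you prefer to script the manual download, a minimal sketch using `huggingface_hub` could look like this (the repo id is an assumption based on the upstream Tencent project; verify it before use):

```python
from huggingface_hub import snapshot_download

# Assumed repo id; adjust if the weights are hosted elsewhere.
snapshot_download(
    repo_id="tencent/HunyuanVideo-Foley",
    local_dir="ComfyUI/models/foley/hunyuanvideo-foley-xxl",
)
```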
## Operation Guide: How to Use the Nodes
This custom node package is designed in a modular way for maximum flexibility and efficiency. Here is the recommended workflow and an explanation of what each node does.
### Recommended Workflow

The most powerful and efficient way to use these nodes is to chain them together in the following order:

`Model Loader` → `Dependencies Loader` → `Torch Compile` → `Generator (Advanced)`
This setup allows you to load the models only once, apply performance optimizations, and then run the generator multiple times without reloading, saving significant time and VRAM.
### Node Details
#### 1. HunyuanVideo-Foley Model Loader (FP8)

This is the starting point. It loads the main (and very large) audio generation model into memory.

- `quantization`: This is the most important setting for saving VRAM.
  - `none`: Loads the model in its original format (highest VRAM usage).
  - `fp8_e5m2` / `fp8_e4m3fn`: These options use FP8 quantization, a technique that stores the model's weights in a much smaller format. This can save several gigabytes of VRAM with minimal impact on audio quality, making it possible to run on GPUs with less memory.
- `cpu_offload`: If `True`, the model is kept in your regular RAM instead of VRAM. This is not the same as the generator's offload setting; use this if you are loading multiple different models in your workflow and need to conserve VRAM.
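To make the saving concrete, here is a minimal sketch of what FP8 weight storage looks like in plain PyTorch 2.1+ (illustrative only; the node's internal quantization path may differ):

```python
import torch

# An FP16 weight matrix vs. the same weights stored as FP8 (1 byte/element).
w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8 = w_fp16.to(torch.float8_e4m3fn)  # or torch.float8_e5m2

print(w_fp16.nelement() * w_fp16.element_size() // 2**20, "MiB")  # 32 MiB
print(w_fp8.nelement() * w_fp8.element_size() // 2**20, "MiB")    # 16 MiB

# Weights are upcast back at compute time, which is why the quality
# impact stays small.
x = torch.randn(1, 4096)
y = x @ w_fp8.to(torch.float32).T
```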
#### 2. HunyuanVideo-Foley Dependencies
This node takes the main model from the loader and then loads all the smaller, auxiliary models required for the process (the VAE, text encoder, and visual feature extractors).
#### 3. HunyuanVideo-Foley Torch Compile

This is an optional but highly recommended performance-enhancing node. It uses `torch.compile` to optimize the model's code for your specific hardware.

- Note: The very first time you run a workflow with this node, it will take a minute or two to perform the compilation. However, every subsequent run will be significantly faster (often by 20-30%).
- `compile_mode`: Controls the trade-off between compilation time and the amount of performance gain.
  - `default`: The best balance. It provides a good speedup with a reasonable initial compile time.
  - `reduce-overhead`: Compiles more slowly but can reduce the overhead of running the model, which might be faster for very small audio generations.
  - `max-autotune`: Takes the longest to compile initially, but tries many different optimizations to find the absolute fastest option for your specific hardware.
- `backend`: An advanced setting that changes the underlying compiler used by PyTorch. For most users, the default `inductor` is the best choice.
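For reference, the node's settings map directly onto the standard `torch.compile` arguments. A minimal sketch with a stand-in function instead of the actual Foley model:

```python
import torch

def forward(x):  # stand-in for the Foley model's forward pass
    return torch.nn.functional.gelu(x) * 2.0

# compile_mode -> mode, backend -> backend ("inductor" is PyTorch's default)
compiled = torch.compile(forward, mode="max-autotune", backend="inductor")

x = torch.randn(8, 128)
compiled(x)  # first call: compiles (slow, one-time cost)
compiled(x)  # later calls reuse the compiled kernels (fast)
```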
#### 4. HunyuanVideo-Foley Generator (Advanced)

This is the main workhorse node where the audio generation happens.

- `video` / `images`: Your visual input. You can provide either a video file or a batch of images from another node.
- `compiled_model`: The input for the model prepared by the upstream nodes.
- `text_prompt` / `negative_prompt`: Your descriptions of the sound you want (and don't want).
- `guidance_scale` / `num_inference_steps` / `seed`: Standard diffusion model controls for creativity vs. prompt adherence, quality vs. speed, and reproducibility.
- `enabled`: A simple switch. If `False`, the node does nothing and passes through an empty/silent output. This is useful for disabling parts of a complex workflow without having to disconnect them.
- `silent_audio`: Controls what happens when the node is disabled or fails. If `True`, it outputs a valid, silent audio clip (see the sketch below), which prevents downstream nodes (like video combiners) from failing. If `False`, it outputs `None`.
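As a concrete illustration of the silent fallback, here is a minimal sketch of building an all-zero clip in the dict layout ComfyUI typically uses for AUDIO values (the helper name and exact layout are assumptions, not the node's actual code):

```python
import torch

def make_silent_audio(duration_s: float, sample_rate: int = 48000):
    """Build an all-zero clip shaped like a typical ComfyUI AUDIO value,
    so downstream nodes (e.g. video combiners) still receive valid input."""
    samples = int(duration_s * sample_rate)
    return {
        "waveform": torch.zeros(1, 1, samples),  # [batch, channels, samples]
        "sample_rate": sample_rate,
    }
```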
### Understanding the Memory Options

The two memory-related checkboxes on the Generator node are crucial for managing your GPU's resources. Here is exactly what they do (a rough sketch of both follows the list):

- `cpu_offload`:
  - What it does: If this is `True`, the node will always move the models to your regular RAM (CPU) after the generation is complete. This is the best option for freeing up VRAM for other nodes in your workflow while still keeping the models ready for the next run without having to reload them from disk.
  - Use this when: You want to run other VRAM-intensive nodes after this one and plan to come back to the Foley generator later.
- `memory_efficient`:
  - What it does: This is a more aggressive option. If `True`, the node will completely unload the models from memory (both VRAM and RAM) after the generation is finished.
  - Important distinction: This process is smart. It will only unload the model if it was loaded by the generator node itself (the simple workflow). If the model was passed in from the `HunyuanVideoFoleyModelLoader` (the advanced workflow), it will not unload it, respecting the fact that you may want to reuse the pre-loaded model for another generation.
  - Use this when: You are finished with audio generation and want to free up as much memory as possible for completely different tasks.
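In plain PyTorch terms, the two options behave roughly like this (a sketch assuming standard PyTorch memory management, not the node's literal code):

```python
import gc
import torch

model = torch.nn.Linear(1024, 1024).cuda()

# cpu_offload: park the weights in system RAM, keep the object alive.
model.to("cpu")
torch.cuda.empty_cache()  # VRAM is freed; the next run just moves it back

# memory_efficient: drop the model entirely (frees VRAM *and* RAM).
del model
gc.collect()
torch.cuda.empty_cache()  # the next run must reload the weights from disk
```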
## Performance Tuning & VRAM Usage

The most memory-intensive part of the process is visual feature extraction. We've implemented batched processing to prevent out-of-memory errors with longer videos or on GPUs with less VRAM. You can control this with two settings on the Generator (Advanced) node:

- `feature_extraction_batch_size`: Determines how many video frames are processed by the feature extractor models at once.
  - Lower values significantly reduce peak VRAM usage at the cost of slightly slower processing.
  - Higher values speed up processing but require more VRAM.
- `enable_profiling`: If you check this box, the node will print detailed performance timings and peak VRAM usage for the feature extraction step to the console. This is highly recommended for finding the optimal batch size for your specific hardware.
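The batching pattern itself is straightforward; here is a hypothetical sketch of chunked feature extraction (the function and argument names are illustrative, not the node's API):

```python
import torch

def extract_features(extractor, frames, batch_size=16):
    """Process frames in fixed-size chunks so peak VRAM scales with
    batch_size rather than with the total number of frames."""
    outputs = []
    for start in range(0, frames.shape[0], batch_size):
        batch = frames[start:start + batch_size].cuda()
        with torch.no_grad():
            outputs.append(extractor(batch).cpu())  # park results in RAM
    return torch.cat(outputs, dim=0)
```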
### Recommended Batch Sizes

These are general starting points. The optimal value can vary based on your exact GPU, driver version, and other running processes.

| VRAM Tier | Video Resolution | Recommended Batch Size | Notes |
| :--- | :--- | :--- | :--- |
| ≤ 8 GB | 480p | 4-8 | Start with 4. If successful, you can try increasing it. |
| | 720p | 2-4 | Start with 2. 720p videos are demanding on low-VRAM cards. |
| 12-16 GB | 480p | 16-32 | The default of 16 should work well. Can be increased for more speed. |
| | 720p | 8-16 | Start with 8 or 16. |
| ≥ 24 GB | 480p | 32-64 | You can safely increase the batch size for maximum performance. |
| | 720p | 16-32 | A batch size of 32 should be easily achievable. |
## Usage

### Node Types

#### 1. HunyuanVideo-Foley Generator

Main node for generating audio from video and text.
Inputs:

- `video`: Video input (VIDEO type)
- `text_prompt`: Text description of desired audio (STRING)
- `guidance_scale`: CFG scale for generation control (1.0-10.0, default: 4.5)
- `num_inference_steps`: Number of denoising steps (10-100, default: 50)
- `sample_nums`: Number of audio samples to generate (1-6, default: 1)
- `seed`: Random seed for reproducibility (INT)
- `model_path`: Path to pretrained models (optional; leave empty for auto-download)
- `enabled`: Enables or disables the entire node. If disabled, it passes through a silent or null audio output without processing. (BOOLEAN, default: True)
- `silent_audio`: Controls the output when the node is disabled or fails. If `True`, it outputs a silent audio clip; if `False`, it outputs `None`. (BOOLEAN, default: True)
Outputs:

- `video_with_audio`: Video with generated audio merged (VIDEO)
- `audio_only`: Generated audio file (AUDIO)
- `status_message`: Generation status and info (STRING)
## ⚠ Important Limitations

### Frame Count & Duration Limits

- Maximum frames: 450 (hard limit)
- Maximum duration: 15 seconds at 30 fps
- Recommended: keep videos ≤ 15 seconds for best results

### FPS Recommendations

- 30 fps: max 15 seconds (450 frames)
- 24 fps: max 18.75 seconds (450 frames)
- 15 fps: max 30 seconds (450 frames)
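All of these rows follow from the same arithmetic: maximum duration = 450 frames ÷ fps. For example:

```python
MAX_FRAMES = 450

for fps in (30, 24, 15):
    print(f"{fps} fps -> max {MAX_FRAMES / fps:.2f} s")
# 30 fps -> max 15.00 s
# 24 fps -> max 18.75 s
# 15 fps -> max 30.00 s
```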
### Long Video Solutions

For videos longer than 15 seconds:

- Reduce FPS: a lower frame rate allows a longer duration within the frame limit
- Segment Processing: split long videos into 15-second segments
- Audio Merging: combine the generated audio segments in post-processing (see the sketch below)
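For the merging step, a minimal `torchaudio` sketch (the file names are hypothetical):

```python
import torch
import torchaudio

# Load the per-segment clips generated from each 15-second video chunk.
segments = [torchaudio.load(f"segment_{i:02d}.wav") for i in range(3)]

waveforms = [wav for wav, _ in segments]
sample_rate = segments[0][1]

# Concatenate along the time axis and write the combined track.
torchaudio.save("merged_audio.wav", torch.cat(waveforms, dim=1), sample_rate)
```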
## Example Workflow

1. Load Video: Use a "Load Video" node to input your video file
2. Add Generator: Add the "HunyuanVideo-Foley Generator" node
3. Connect Video: Connect the video output to the generator's video input
4. Set Prompt: Enter a text description (e.g., "A person walks on frozen ice")
5. Adjust Settings: Configure guidance scale, steps, and sample count as needed
6. Generate: Run the workflow to generate audio
## Model Requirements

The node expects the following model structure:

```
ComfyUI/models/foley/hunyuanvideo-foley-xxl/
├── hunyuanvideo_foley.pth         # Main Foley model
├── vae_128d_48k.pth               # DAC VAE model
└── synchformer_state_dict.pth     # Synchformer model

configs/
└── hunyuanvideo-foley-xxl.yaml    # Configuration file
```
## TODO
- [x] ADD VHS INPUT/OUTPUTS (Thanks to YC)
- [x] NEGATIVE PROMPT (Thanks to YC)
- [x] MODEL OFFLOADING OPS
- [x] TORCH COMPILE
- [ ] QUANTISE MODEL
## Support
If you find this tool useful, please consider supporting my work by:
- Starring this repository on GitHub
- Subscribing to my YouTube channel: Impact Frames
- Following on X: @ImpactFrames
You can also support by reporting issues or suggesting features. Your contributions help me bring updates and improvements to the project.
## License
This custom node is based on the HunyuanVideo-Foley project. Please check the original project's license terms.
## Credits

Based on the HunyuanVideo-Foley project by Tencent. The original paper and code are available at:

- Paper: HunyuanVideo-Foley: Text-Video-to-Audio Synthesis
- Code: https://github.com/tencent/HunyuanVideo-Foley