    ComfyUI HunyuanVideo-Foley Custom Node

    (Screenshot: Hunyuan Video Foley custom node in ComfyUI)

    This custom node integrates the HunyuanVideo-Foley model into ComfyUI to generate Foley audio from video frames and a text prompt.

    Installation

    Step 1: Set Up Your Custom Node

    1. Navigate to the custom_nodes directory:

      cd custom_nodes
      
    2. Clone this project (ComfyUI HunyuanVideo-Foley):

      git clone https://github.com/dasilva333/ComfyUI_HunyuanVideo-Foley.git
      
    3. Navigate into the cloned custom node directory:

      cd ComfyUI_HunyuanVideo-Foley
      

    Step 2: Clone HunyuanVideo-Foley Repository

    1. Clone the HunyuanVideo-Foley repository into your node folder:

      git clone https://huggingface.co/tencent/HunyuanVideo-Foley
      
    2. Navigate into the HunyuanVideo-Foley directory:

      cd HunyuanVideo-Foley
      
    3. Install the required dependencies:

      pip install -r requirements.txt
      

    Step 3: Clone the Pretrained Model

    1. Clone the pretrained model inside the HunyuanVideo-Foley folder:
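
    The exact download command is not given here; one plausible way, assuming the weights are published under tencent/HunyuanVideo-Foley on Hugging Face (the same repository referenced in Step 2), is to fetch them with huggingface_hub. The destination folder below is an assumption; adjust it to wherever the node expects the weights:

      # Hedged sketch: download the pretrained weights with huggingface_hub.
      # The repo id comes from Step 2; local_dir is an assumption.
      from huggingface_hub import snapshot_download

      snapshot_download(
          repo_id="tencent/HunyuanVideo-Foley",
          local_dir="HunyuanVideo-Foley",
      )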


    Usage

    Loading the Custom Node in ComfyUI

    1. Once you’ve set up the repository and placed your models, open ComfyUI.
    2. In the Node Graph interface, look for the Hunyuan Foley (Audio) node under the custom nodes section.
    3. You can connect this node to the rest of your pipeline, providing it with video frames (via Load Image or Load Video nodes) and a text prompt.

    Input Requirements

    • Frames: The input frames can come from Load Image, Load Video, or other nodes that produce frames in the form of a tensor (e.g., VAE Decode).
    • Audio Prompt: The node requires a text prompt (e.g., “suspenseful piano with rising tension”) to generate audio.
    • Frame Rate: Specify the frame rate of the input frames (e.g., 8.0 fps). A sketch of how these inputs map onto a node definition follows below.
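
    As a rough illustration of how these inputs fit a ComfyUI node definition, here is a minimal, hypothetical sketch. The class, function, and parameter names are assumptions for illustration only; the actual node in this repository may name and organize things differently:

      # Hypothetical sketch of a Foley-style audio node's ComfyUI interface.
      # Names and defaults are assumptions, not the repository's actual code.
      class HunyuanFoleyAudioSketch:
          @classmethod
          def INPUT_TYPES(cls):
              return {
                  "required": {
                      "frames": ("IMAGE",),  # frames from Load Image / Load Video
                      "prompt": ("STRING", {"multiline": True}),  # e.g. "suspenseful piano with rising tension"
                      "frame_rate": ("FLOAT", {"default": 8.0, "min": 1.0}),
                  }
              }

          RETURN_TYPES = ("AUDIO",)
          FUNCTION = "generate"
          CATEGORY = "audio"

          def generate(self, frames, prompt, frame_rate):
              # A real implementation would run the HunyuanVideo-Foley pipeline here
              # and return a waveform in ComfyUI's AUDIO format.
              raise NotImplementedError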

    Example Workflow

    You can test the node with a sample workflow JSON. Download the sample JSON file and try it out to see how the node functions in a typical use case.
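
    If you prefer to drive the sample workflow programmatically rather than through the UI, ComfyUI can queue an API-format workflow JSON over its local HTTP endpoint. A minimal sketch, assuming a default local server and a placeholder file name (export your own workflow in API format first):

      # Hedged sketch: queue an API-format workflow against a local ComfyUI server.
      # "sample_workflow_api.json" is a placeholder; the address assumes the default port.
      import json
      import urllib.request

      with open("sample_workflow_api.json", "r", encoding="utf-8") as f:
          workflow = json.load(f)

      req = urllib.request.Request(
          "http://127.0.0.1:8188/prompt",
          data=json.dumps({"prompt": workflow}).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      print(urllib.request.urlopen(req).read().decode("utf-8"))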


    Troubleshooting

    Out of Memory (OOM) Errors

    • Problem: If you get an OOM error while processing, it likely means your GPU does not have enough memory to process the given video frames.

    • Solution:

      • Downscale the input video frames (see the sketch after this list).
      • Reduce the number of inference steps or batch size.
      • Use smaller model sizes if available.
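
    A minimal sketch of the downscaling idea, assuming frames arrive as a standard ComfyUI IMAGE tensor (shape [frames, height, width, channels], float values in 0..1); the function name and scale factor are placeholders:

      # Hedged sketch: shrink an IMAGE batch before it reaches the Foley node
      # to reduce VRAM use. Shape convention assumed: [N, H, W, C] in 0..1.
      import torch
      import torch.nn.functional as F

      def downscale_frames(frames: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
          x = frames.permute(0, 3, 1, 2)  # [N, H, W, C] -> [N, C, H, W]
          x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
          return x.permute(0, 2, 3, 1)    # back to [N, H, W, C]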

    Missing FFmpeg Executables

    • Problem: If you encounter errors related to missing FFmpeg executables, it's likely that the necessary FFmpeg binaries are not present in your python_embedded directory.

    • Solution:

      • Download the FFmpeg essentials build and place the binaries in your python_embedded folder. You can download FFmpeg from the official site: https://ffmpeg.org/download.html.
      • Ensure that the executable files (ffmpeg, ffprobe, etc.) are available in the correct directory; the quick check below can help verify this.
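
    A quick, hedged way to confirm the binaries are reachable from the Python interpreter ComfyUI runs with (folder layout varies between installs):

      # Hedged sketch: check whether ffmpeg/ffprobe are on PATH for this interpreter.
      import shutil

      for tool in ("ffmpeg", "ffprobe"):
          path = shutil.which(tool)
          print(f"{tool}: {path or 'NOT FOUND - add the binary to PATH or the python_embedded folder'}")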

    Silent Audio Output

    • Problem: The DAC model may occasionally return an all-zero (silent) waveform.
    • Solution: Check the guidance_scale and make sure the text prompt gives the model enough to guide generation. Adjust the prompt if necessary to get more varied results; the snippet below shows one way to detect a silent output.
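
    One simple, hedged way to detect a silent result before saving it (the threshold is an arbitrary choice):

      # Hedged sketch: flag an (almost) all-zero waveform so the prompt or
      # guidance_scale can be adjusted and the generation retried.
      import torch

      def is_silent(waveform: torch.Tensor, threshold: float = 1e-4) -> bool:
          return waveform.abs().max().item() < threshold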

    Important Notes

    • VRAM: The model was tested on a GPU with 8 GB of VRAM; with 10 inference steps, expect a processing time of roughly 4 minutes 5 seconds.
    • Device Compatibility: The node requires a CUDA-capable GPU for efficient inference, but it can fall back to CPU if necessary (though performance will be significantly slower); a typical device-selection pattern is sketched below.
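
    The usual CUDA-with-CPU-fallback pattern looks like the following; this is a generic sketch, not necessarily how the node handles devices internally:

      # Hedged sketch: pick CUDA when available, otherwise fall back to CPU.
      import torch

      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      print(f"Running inference on: {device}")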

    Contributing

    Feel free to open issues or contribute to the repository if you have suggestions or improvements. Pull requests are always welcome!


    License

    This project is licensed under the MIT License - see the LICENSE file for details.