ComfyUI Extension: ComfyUI-Hunyuan-Image-3

Authored by bgreene2

A ComfyUI custom node that provides a basic implementation of Hunyuan Image 3.0 using Transformers.

    ComfyUI Hunyuan Image 3.0 custom node

    This is a custom node that allows for basic image generation using Hunyuan Image 3.0.

    Node screenshot

    Features

    • Supports CPU and disk offload to allow generation on consumer setups
      • When using CPU offload, weights are stored in system RAM and transferred to the GPU as needed for processing
      • When using disk offload, weights are split between system RAM and disk and transferred to the GPU as needed for processing

    Installation

    You can find the node on the ComfyUI registry, or you can install it manually.

    Manual installation steps:

    1. Navigate to your custom_nodes folder
    2. Clone this repo: git clone https://github.com/bgreene2/ComfyUI-HunyuanImage3.git
    3. Change to the directory: cd ComfyUI-HunyuanImage3
    4. Assuming the correct Python environment is active, install dependencies: pip install -r requirements.txt
    5. If PyTorch is not already installed, install it: pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
    6. (optional) Add the flag --disable-cuda-malloc to your ComfyUI startup script.
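
    Before moving on, you may want to confirm that the CUDA build of PyTorch is working, since this model is far too large to debug an environment problem mid-generation. A minimal check, assuming a CUDA setup:

    ```python
    # Sanity-check the PyTorch install before loading the model.
    import torch

    print(torch.__version__)          # e.g. 2.7.1+cu128
    print(torch.cuda.is_available())  # should print True on a working CUDA setup
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
        free, total = torch.cuda.mem_get_info()  # free/total VRAM in bytes
        print(f"{free / 2**30:.1f} / {total / 2**30:.1f} GiB VRAM free")
    ```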

    Post-installation steps

    Download the Hunyuan Image 3.0 model weights to a folder on disk (see the model's HuggingFace page). You will point the node's weights_folder parameter at this folder.

    Usage Guide

    The node has the following inputs:

    • prompt - Your text prompt

    The node has the following parameters:

    • seed - The seed for the random number generator
    • control after generate - The standard ComfyUI control attached to the 'seed' parameter, determining how the seed changes after each generation (e.g. fixed, increment, randomize)
    • use_dimensions - Whether to use the user-specified width and height below. If false, the model chooses the image dimensions automatically. Note: the model supports only a few discrete image sizes, so even if you specify a width and height, the actual dimensions used may differ while keeping a similar aspect ratio (see the sketch after this list).
    • width - Image width
    • height - Image height
    • steps - Number of sampling steps. The recommended value is 50.
    • guidance_scale - Classifier-free guidance (CFG) scale. The recommended value is 7.5.
    • attn_implementation - Valid values are sdpa and flash_attention_2
    • moe_impl - Valid values are eager and flashinfer
    • weights_folder - The path to the model's weights on disk.
    • use_offload - Whether to use CPU / disk offload.
    • disk_offload_layers - The number of layers (out of 32) to offload to disk, rather than hold in memory.
    • keep_model_in_memory - If enabled, the model will be kept in memory between generations, instead of being unloaded.
    • device_map_overrides - You can modify the custom device_map using this. Overrides are expressed as key=value pairs, comma-separated. See the Performance Tuning section for more details.
    • moe_drop_tokens - Enables the moe_drop_tokens parameter on model loading. See the Memory Troubleshooting section for more details.
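
    To illustrate the use_dimensions note above: the model supports only a handful of discrete resolutions, so a requested size is effectively snapped to the supported size with the nearest aspect ratio. A minimal sketch of that behaviour (the size list here is illustrative, not the model's actual list):

    ```python
    # Hypothetical snapping of a requested size to a supported one.
    # SUPPORTED_SIZES is an assumption for illustration only.
    SUPPORTED_SIZES = [(1024, 1024), (1280, 768), (768, 1280), (1536, 640)]

    def snap_to_supported(width, height):
        target = width / height
        return min(SUPPORTED_SIZES, key=lambda wh: abs(wh[0] / wh[1] - target))

    print(snap_to_supported(1920, 1080))  # -> (1280, 768), the closest ratio
    ```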

    Basic usage: Connect a String (Multiline) input to the prompt input, and connect the Image output to a Save Image node. An example workflow is provided in the repository.

    Recommended Usage

    This model works best with detailed prompts. See the HuggingFace page for a prompting guide.

    Performance Tuning

    If you can fit the entire model in VRAM, you can run with use_offload set to false. This should give you the highest speed possible.

    If you can't fit the model in VRAM, you can enable offload by setting use_offload to true. In this case, you will be bottlenecked by the transfer of weights from system memory to VRAM for processing on your GPU, so you will want as fast a PCIe connection as possible.

    If you can't fit the model in system RAM, you can enable disk offload by setting disk_offload_layers to a value above 0. In this case, you will additionally be bottlenecked by the transfer of data from disk, so you will want as fast a drive as possible.

    If you are using disk offload, you need to choose the number of layers to offload such that you still have some physical memory left over, so that the system does not use swap. This may require some trial-and-error while monitoring memory usage. On a system with 128GB of RAM, 10 layers is a good starting point.
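
    In Transformers/Accelerate terms, a split like that corresponds to a device_map that assigns each module to a GPU index, "cpu", or "disk". A rough sketch of what offloading 10 of the 32 layers might look like (module names are assumptions, not the node's actual map):

    ```python
    # Hedged sketch of an Accelerate-style device_map: the first 10 decoder
    # layers live on disk, the remaining 22 in system RAM. The node builds
    # its own map internally; names here are illustrative.
    device_map = {f"model.layers.{i}": "disk" for i in range(10)}
    device_map.update({f"model.layers.{i}": "cpu" for i in range(10, 32)})

    # Such a map would be passed to from_pretrained along with an
    # offload_folder for the on-disk weights, e.g.:
    #   model = AutoModelForCausalLM.from_pretrained(
    #       weights_folder, device_map=device_map, offload_folder="offload")
    ```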

    When using disk offload with 10 layers, on a system with PCIe 4.0 and an NVMe drive, an image will take on the order of 1-1.5 hours to generate using 50 steps.

    Tips:

    If you wish to use more of your VRAM, you can set device_map_overrides.

    • To put everything except for the model's layers onto GPU 0, you would set vision_model=0,vision_aligner=0,timestamp_emb=0,patch_embed=0,time_embed=0,final_layer=0,time_embed_2=0,model.wte=0,model.ln_f=0,lm_head=0.
    • To start putting layers onto GPU 0, you would add model.layers.0=0,model.layers.1=0,...
    • You can also spread the load across GPUs, e.g. model.layers.2=1,model.layers.3=1 would put layers 2 and 3 onto GPU 1.
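
    As an illustration of the override syntax, a minimal parser for such a string might look like the following (a hypothetical helper, not the node's actual code):

    ```python
    # Parse "key=value,key=value" overrides into the dict form that a
    # Transformers-style device_map accepts. Numeric values become GPU
    # indices; others ("cpu", "disk") stay as strings.
    def parse_overrides(s):
        overrides = {}
        for pair in s.split(","):
            key, _, value = pair.strip().partition("=")
            overrides[key] = int(value) if value.isdigit() else value
        return overrides

    print(parse_overrides("vision_model=0,model.layers.2=1,model.layers.5=cpu"))
    # {'vision_model': 0, 'model.layers.2': 1, 'model.layers.5': 'cpu'}
    ```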

    Memory Troubleshooting

    If you are getting crashes due to running out of GPU memory, there are some things you could try:

    • Enable moe_drop_tokens.

    This stabilizes memory usage dramatically. It does alter the output image, but quality is not noticeably affected.
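
    For intuition: in mixture-of-experts models, "dropping tokens" generally means enforcing a fixed per-expert capacity and discarding overflow tokens (typically passing them through unchanged) rather than buffering them, which keeps peak activation memory bounded no matter how unbalanced the routing is. A conceptual sketch of that idea, not Hunyuan's actual implementation:

    ```python
    import torch

    # Conceptual sketch of capacity-limited MoE routing ("token dropping").
    # Each expert processes at most `capacity` tokens; overflow tokens skip
    # the expert (identity pass-through), bounding memory per expert.
    def route_with_drop(tokens, expert_ids, experts, capacity):
        output = tokens.clone()  # dropped tokens fall back to identity
        for e, expert in enumerate(experts):
            idx = (expert_ids == e).nonzero(as_tuple=True)[0]
            idx = idx[:capacity]  # keep only the first `capacity` tokens
            output[idx] = expert(tokens[idx])
        return output

    # Toy usage: 8 tokens, 2 experts, capacity of 3 tokens per expert.
    experts = [torch.nn.Linear(4, 4) for _ in range(2)]
    tokens = torch.randn(8, 4)
    expert_ids = torch.randint(0, 2, (8,))
    out = route_with_drop(tokens, expert_ids, experts, capacity=3)
    ```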

    If you don't wish to do this, or if you are still running out of memory, you could try the following:

    1. Set the CUDA Sysmem Fallback Policy to "Prefer Sysmem Fallback" in the NVIDIA Control Panel (Windows only; NVIDIA's Linux drivers do not expose this setting)
    2. Run ComfyUI with --disable-cuda-malloc

    Screenshot: the CUDA Sysmem Fallback option in the NVIDIA Control Panel.