ComfyUI Extension: ComfyUI-Qwen2_5-VL
Qwen2.5-VL in ComfyUI
License: Apache License 2.0 from Qwen2.5-VL
Tongyi Qianwen (Qwen) 2.5 vision-language (VL) model series.
I have reviewed the existing ComfyUI Qwen2.5-VL nodes that are similar to this repo, and none of them seem to provide native video input. Taking video input as a video_path is unfriendly for some workflows, so I wrote this repo.
Supported input types:
- Image
- Multi image
- Video
- Text Only
Main difference from similar repos: native video input support, not a path.
With a very small or heavily quantized model, the output may occasionally be nonsense, but in most cases it is fine. See the following example image:
About multi-image input:
You can connect any number of images to the BatchImageLoaderToLocalFiles node for inference, or you can upload an animated image (i.e. a multi-frame batch of images, as in the example image above).
You can also freely combine image, video, and BatchImage inputs, but the video will then be parsed as an image (its first frame only), which is consistent with the test results on the official website.
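For reference, outside ComfyUI the same idea looks roughly like the sketch below: Qwen2.5-VL can accept a video either as a single path or as an explicit list of frames, which is what native video input maps to. This is a minimal sketch assuming the standard transformers + qwen_vl_utils API from the official model card, not this repo's code; the frame paths are placeholders.

```python
# Sketch only (not this repo's code): passing a video as a list of frames
# instead of a single video_path, using the standard Qwen2.5-VL API.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        # A "video" entry may be a list of frame images rather than a file path.
        # These paths are placeholders for illustration.
        {"type": "video", "video": ["file:///tmp/frame_0.jpg",
                                    "file:///tmp/frame_1.jpg",
                                    "file:///tmp/frame_2.jpg"]},
        {"type": "text", "text": "Describe this video."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```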
Installation:
Clone this repository into the `ComfyUI/custom_nodes` folder.
Install the dependencies in requirements.txt (transformers >= 4.49.0 is required):
pip install -r requirements.txt
or, if you use the portable build, run this in the ComfyUI_windows_portable folder:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-Qwen2_5-VL\requirements.txt
Supports most Qwen2.5-VL models, which the DownloadAndLoadQwen2_5_VLModel node can download automatically to ComfyUI/models/VLM:
"Qwen/Qwen2.5-VL-3B-Instruct",
"Qwen/Qwen2.5-VL-3B-Instruct-AWQ",
"Qwen/Qwen2.5-VL-7B-Instruct",
"Qwen/Qwen2.5-VL-7B-Instruct-AWQ",
"Qwen/Qwen2.5-VL-32B-Instruct",
"Qwen/Qwen2.5-VL-32B-Instruct-AWQ",
"Qwen/Qwen2.5-VL-72B-Instruct",
"Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
Node names:
- Qwen2_5-VL/DownloadAndLoadQwen2_5_VLModel
- Qwen2_5-VL/Qwen2_5_VL_Run
- Qwen2_5-VL/BatchImageLoaderToLocalFiles
| Precision | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| --------- | ------------- | ------------- | -------------- |
| FP32 | 11.5 GB | 26.34 GB | 266.21 GB |
| BF16 | 5.75 GB | 13.17 GB | 133.11 GB |
| INT8 | 2.87 GB | 6.59 GB | 66.5 GB |
| INT4 | 1.44 GB | 3.29 GB | 33.28 GB |
Note: the table above lists the theoretical minimum video memory required for inference with transformers; in practice, actual usage is typically at least 1.2 times higher. For more information, see the linked resource here.
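As a rough back-of-the-envelope check (not part of this repo), weight memory is approximately the parameter count times bytes per parameter, scaled by that ~1.2x practical overhead:

```python
# Rough estimate only: weights ~= n_params * bytes_per_param, real usage ~1.2x higher.
def weight_memory_gib(n_params: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    return n_params * bytes_per_param * overhead / 2**30

# Example: a ~7B-parameter model in BF16 (2 bytes per parameter).
print(f"{weight_memory_gib(7e9, 2):.1f} GiB")  # about 15.6 GiB with the 1.2x overhead
```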
Citation
@article{Qwen2.5-VL,
title={Qwen2.5-VL Technical Report},
author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
journal={arXiv preprint arXiv:2502.13923},
year={2025}
}
@article{Qwen2-VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}