ComfyUI-Qwen-Omni
<p align="center"> <a href="README_CN.md">中文</a>   ｜   English </p>
When Figma meets VSCode, the collision of artistic thinking and engineering logic: this is a romantic declaration from designers to the world of code.
✨ A revolutionary multimodal plugin based on Qwen2.5-Omni ✨
A ComfyUI plugin based on the multimodal large language model Qwen2.5-Omni
ComfyUI-Qwen-Omni is the first ComfyUI plugin to support end-to-end multimodal interaction, enabling seamless joint generation and editing of text, images, and audio. In a single operation, with no intermediate steps, the model simultaneously understands and processes multiple input modalities and generates coherent text descriptions and voice output, providing an unprecedentedly smooth experience for AI creation.
This plugin integrates the Qwen2.5-Omni multimodal large model into ComfyUI. It supports text, image, audio, and video inputs and can generate text and voice outputs, offering a more diverse interactive experience for your AI creations.
Features
- Dual model support: works with both the Qwen2.5-Omni-3B and Qwen2.5-Omni-7B models.
- Multimodal input: Supports text, images, audio, and video as inputs.
- Text generation: Generates coherent text descriptions based on multimodal inputs.
- Speech synthesis: Supports generating natural and fluent voice outputs (male or female voices available).
- Parameterized control: Allows adjustment of generation parameters such as temperature, maximum tokens, and sampling strategy.
- GPU optimization: Supports 4-bit/8-bit quantization to reduce VRAM requirements (see the sketch after this list).
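For reference, 4-bit/8-bit loading in the Hugging Face transformers ecosystem is usually configured through bitsandbytes. The sketch below is illustrative only and uses a small text-only Qwen model as a stand-in; the plugin applies the same idea to Qwen2.5-Omni internally, and its exact loading code may differ.

```python
# Illustrative sketch: typical 4-bit / 8-bit quantized loading with
# transformers + bitsandbytes. A small text-only Qwen model is used as a
# stand-in; the plugin's own loading code for Qwen2.5-Omni may differ.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # use load_in_8bit=True for 8-bit instead
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype used during inference
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",          # stand-in model, not the Omni weights
    quantization_config=bnb_config,
    device_map="auto",                     # place layers on the available GPU(s)
)
```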
Installation
1. Clone the repository into the ComfyUI custom_nodes directory and install the dependencies:
cd ComfyUI/custom_nodes/
git clone https://github.com/SXQBW/ComfyUI-Qwen-Omni.git
cd ComfyUI-Qwen-Omni
pip install -r requirements.txt
2. Download the model files:
On first launch, the selected model (Qwen2.5-Omni-3B or Qwen2.5-Omni-7B) is downloaded automatically, preferring Hugging Face or ModelScope depending on network conditions. Alternatively, you can download it manually and place it in the ComfyUI/models/Qwen/ directory.
Model download links:
<p align="left"> <a href="https://huggingface.co/Qwen/Qwen2.5-Omni-7B">Hugging Face</a>   |   <a href="https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B">ModelScope</a> </p><p align="left"> <a href="https://pan.quark.cn/s/fdc4f7a1a5f2">Quark Netdisk</a>   ·   <a href="https://pan.baidu.com/s/1Ejpi5fvI6_m1t1WSqWom8A?pwd=xvzf">Baidu Netdisk</a> </p>
Additionally, I've uploaded the model files to Quark Netdisk and Baidu Netdisk (hope it helps you).
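If you prefer to fetch the weights manually, a minimal sketch using the huggingface_hub package is shown below. The target directory follows the ComfyUI/models/Qwen/ convention mentioned above; double-check the exact folder name the node expects on first launch.

```python
# Minimal sketch: manually downloading the model weights with huggingface_hub
# (pip install huggingface_hub). The local_dir follows the ComfyUI/models/Qwen/
# convention from the instructions above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-Omni-7B",                    # or Qwen/Qwen2.5-Omni-3B
    local_dir="ComfyUI/models/Qwen/Qwen2.5-Omni-7B",
)
```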
Usage Guide
- Add the "Qwen Omni Combined" node in ComfyUI.
- Configure the parameters:
  - Select the quantization method (4-bit / 8-bit / no quantization).
  - Enter the text prompt.
  - Choose whether to generate voice and which voice type to use.
  - Adjust the generation parameters (temperature, maximum tokens, etc.).
- Optional: Connect image, audio, or video inputs.
- Execute the node to generate the results.
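Workflows that use this node can also be queued programmatically through ComfyUI's HTTP API. The fragment below is only a rough sketch of that pattern: the node's class name and input keys ("QwenOmniCombined", "prompt", "audio_output", and so on) are hypothetical placeholders, not taken from the plugin's source. Export your own workflow in API format from the ComfyUI menu to get the real field names.

```python
# Rough sketch: queueing a workflow via ComfyUI's HTTP API (POST /prompt).
# "QwenOmniCombined" and the input keys below are hypothetical placeholders;
# export your workflow in API format to see the node's real names and values.
import requests

workflow = {
    "1": {
        "class_type": "QwenOmniCombined",        # hypothetical node class name
        "inputs": {
            "prompt": "Describe this scene in two sentences.",
            "quantization": "4-bit",             # hypothetical value format
            "max_tokens": 256,
            "temperature": 0.7,
            "audio_output": "Chelsie (female)",  # hypothetical value format
        },
    },
}

requests.post("http://127.0.0.1:8188/prompt", json={"prompt": workflow})
```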
Parameter Explanation
| Parameter | Description |
|--------------------|----------------------------------------------------------------------|
| max_tokens | Controls the maximum length of the generated text (in tokens). Roughly, 100 tokens correspond to about 50-100 Chinese characters or 67-100 English words. |
| temperature | Controls generation diversity: lower values produce more structured content, higher values more random content. |
| top_p | Nucleus sampling threshold controlling the vocabulary selection range: values closer to 1 retain more candidate words, smaller values produce more conservative content. |
| repetition_penalty | Controls repetitive content: >1 suppresses repetition, <1 encourages repeated emphasis. |
| quantization | Model quantization options: 4-bit (VRAM friendly), 8-bit (balanced accuracy), or no quantization (highest accuracy). |
| audio_output | Voice output options: no voice generation, female voice (Chelsie), or male voice (Ethan). |
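These knobs are the standard text-generation sampling parameters. Purely as an illustration, the snippet below shows how such settings typically map onto a Hugging Face transformers generate() call; a small text-only Qwen model is used as a stand-in so the example stays lightweight, and this is not the plugin's actual code path.

```python
# Illustration only: how the node's generation parameters map onto a standard
# transformers generate() call. A small text-only Qwen model is used as a
# stand-in; the plugin loads Qwen2.5-Omni through its own code path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"    # lightweight stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Describe a sunset over the sea.", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,        # max_tokens: upper bound on generated length
    do_sample=True,            # sampling must be on for temperature/top_p to apply
    temperature=0.7,           # lower = more structured, higher = more random
    top_p=0.9,                 # nucleus sampling threshold
    repetition_penalty=1.1,    # >1 suppresses repetition, <1 encourages it
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```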
I've added tooltips to the node interface; hover your mouse over a parameter to see its explanation.
Function Examples
Usage interface examples in ComfyUI
Video Content Analysis
Example: What's the content in the video?
Supports generating natural and fluent voice output. Click to watch: Demo Video
Omni Input
Example: Craft an imaginative story that blends sounds, moving images, and still visuals into a unified plot.
Image Description Generation
Example: Just tell me the answers to the questions in the picture directly.
Acknowledgments
Heartfelt thanks to the following teams and projects for their support and contributions to the development of ComfyUI-Qwen-Omni. Please give their projects a ⭐️!
- Qwen Team (Alibaba Group): Thanks to the developers of the Qwen-Omni series models, especially for the open-source contribution of the Qwen2.5-Omni-7B model. Their groundbreaking work in the field of multimodal large models provides strong underlying support for this plugin.
- Doubao Team (ByteDance) and Hunyuan Team (Tencent): During plugin development, Doubao AI provided important assistance with code debugging, documentation generation, and problem troubleshooting, greatly improving development efficiency.
- ComfyUI Community: The flexible node-based architecture of ComfyUI provides an ideal ecosystem for plugin development.
From Pixels to Python: A Designer's Odyssey
Two weeks ago, my toolkit was dominated by Adobe CC and Figma files.
As a battle-hardened full-stack designer (PM/UX/UI triple threat) with a decade of experience, I thought my ultimate challenge was convincing clients to abandon requests for "vibrant dark mode with rainbow highlights". That is, until 3 AM on That Fateful Night™, when my 127th iteration of the API documentation redesign hit a wall and the nuclear option emerged:
"Why shouldn't designers write their own damn code?"
Thus this project was forged from:
- A/B testing in my veins (art school PTSD edition)
- A Frankenstein's Python rig (yes, even pip install was trial by fire)
- UX obsession that makes Apple designers blush (though only 30% implemented... for now)
Current Skill Frontier
- Design system ninja still battling async IO demons
- Interactive prototype guru who sweats at recursive functions
- Architecture Picasso with <500 lines of real code
Why Your Star Matters
Each ⭐️ becomes:
- A lighthouse guiding designer-to-coder transitions
- A digital whip pushing through coding roadblocks
- The ultimate nod to boundary-breakers (way cooler than Dribbble likes!)
"Every commit is my declaration of independence from the design-only world"
– That designer clumsily typing in VSCode
Your star today ✨
Not just approval, but the cosmic collision of design thinking and code logic. When aesthetic obsession meets geek spirit, this might be GitHub's most romantic chemistry experiment.
Star This Cross-Disciplinary Revolution
Contributions
Contributions of code, bug reports, and suggestions are welcome! Please submit a Pull Request or create an Issue. Contributions are welcome in the following forms:
- Proposals for new features
- Bug reports (please include reproduction steps and logs)
- Functional improvements
- Example workflows
If you have other questions or suggestions, email [email protected]