ComfyUI_Simple_Qwen3-VL-gguf
Simple Qwen3-VL gguf LLM model loader for Comfy-UI.
Why is this version needed?
This version was created to meet my requirements:
- The loader must support GGUF (GGUF models run faster than Transformers models).
- It must support the Qwen3-VL multimodal model.
- After running, the node must be completely cleared from memory, leaving no garbage behind. This is important: very resource-intensive processes come next and need ALL the memory. (Yes, you have to reload the model each time, but this is faster overall, especially on good hardware with fast memory and disks.)
- No pre-downloaded models stored in some unknown location. You can use any models you already have: just download them by any convenient method (via a browser, or even on a flash drive from a friend) and simply specify their path on disk. For me, this is the most convenient approach.
- The node needs to run fast; ~10 seconds is acceptable for me. For now, only a GGUF model can provide this. There's also SDNQ, but I haven't been able to get it running yet.
What's the problem:
Qwen3-VL support hasn't been added to the standard llama-cpp-python library (the one installed via `pip install llama-cpp-python`), so that route didn't work for me.
Workaround (until support is added):
- Download this using Git:
- https://github.com/JamePeng/llama-cpp-python
- Download this using Git:
- https://github.com/ggml-org/llama.cpp
- Place the second project, `llama.cpp\`, into the `llama-cpp-python\vendor\` folder.
- Go to the `llama-cpp-python` folder and run: `set CMAKE_ARGS="-DGGML_CUDA=on"` followed by `path_to_comfyui\python_embeded\python -m pip install -e .` (if you have embedded Python, which is the usual case). A quick check that the build picked up CUDA support follows this list.
- Wait for the package to build from source. (You can also find ready-made WHL packages for your configuration.)
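To confirm that the build actually picked up CUDA, a minimal sanity check can be run with the embedded Python. This is just a sketch and assumes the fork keeps the standard `llama_cpp` module layout:

```python
# quick sanity check of the llama-cpp-python build (run with the embedded Python)
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# True only if the library was compiled with GPU offload support (e.g. CUDA)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```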
What's next:
- Use ComfyUI Manager, or clone this project with Git into the folder `path_to_comfyui\ComfyUI\custom_nodes`.
- Restart ComfyUI and check in the console that the custom nodes load without errors.
- Reload the frontend (F5).
- The following nodes now appear:
- `Qwen-VL Vision Language Model` - the main node for working with the LLM
- `Master Prompt Loader` - loads system prompt and user prompt presets
- `Master Prompt Loader (advanced)` - loads system prompt and user prompt presets; contains a number of other options that are still under development
- `Model Preset Loader (Advanced)` - more convenient model selection from a JSON file

<img width="1810" height="625" alt="+++" src="https://github.com/user-attachments/assets/b7a8605b-0f95-4751-8db1-76c043ff3309" />
Parameters (update):
- `image`, `image2`, `image3`: IMAGE - the images to analyze; you can use up to 3. For example, you can instruct Qwen to combine all the images into one scene, and it will do so. You can also pass no images and use the model simply as a text LLM.
- `system_prompt`: STRING, default: "You are a highly accurate vision-language assistant. Provide detailed, precise, and well-structured image descriptions." - role + rules + format.
- `user_prompt`: STRING, default: "Describe this image" - the specific task + input data + variable wishes.
- `model_path`: STRING, default: `H:\Qwen3VL-8B-Instruct-Q8_0.gguf` - the path to the model.
- `mmproj_path`: STRING, default: `H:\mmproj-Qwen3VL-8B-Instruct-F16.gguf` - the path to the mmproj model; it is required and is usually located in the same place as the model.
- `output_max_tokens`: INT, default: 2048, min: 64, max: 4096 - the maximum number of output tokens. A smaller number saves memory but may result in a truncated response.
- `image_max_tokens`: INT, default: 4096, min: 1024, max: 1024000 - the maximum number of tokens allocated to the image. A smaller number saves memory, but an image requires a lot of tokens, so don't set it too low.
- `ctx`: INT, default: 8192, min: 0, max: 1024000 - context size. A smaller number saves memory. Rule: `image_max_tokens + text_max_tokens + output_max_tokens <= ctx` (see the sketch after this list).
- `n_batch`: INT, default: 512, min: 64, max: 1024000 - the number of tokens processed simultaneously. A smaller number saves memory; setting `n_batch = ctx` will speed things up. Rule: `n_batch <= ctx`.
- `gpu_layers`: INT, default: -1, min: -1, max: 100 - lets you offload some layers to the CPU. If there is not enough VRAM you can use the CPU, but this significantly slows things down. -1 means all layers on the GPU, 0 means all layers on the CPU.
- `temperature`: FLOAT, default: 0.7, min: 0.0, max: 1.0
- `seed`: INT, default: 42
- `unload_all_models`: BOOLEAN, default: false - if True, clears memory before starting (code from ComfyUI-Unload-Model).
- `top_p`: FLOAT, default: 0.92, min: 0.0, max: 1.0
- `repeat_penalty`: FLOAT, default: 1.2, min: 1.0, max: 2.0
- `top_k`: INT, default: 0, min: 0, max: 32768 - 0 is recommended for Qwen-VL, 40 for llava.
- `pool_size`: INT, default: 4194304, min: 1048576, max: 10485760 - if the ggml memory pool is not large enough, increase it.
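The context rule above can be sanity-checked before running the node. A minimal illustration of the two inequalities (the variable names mirror the node parameters; `text_tokens` is a rough, assumed estimate of the prompt length, and this is not the node's internal code):

```python
# Rough token-budget check for the ctx / n_batch rules above.
output_max_tokens = 2048   # node default
image_max_tokens = 4096    # node default
text_tokens = 600          # rough estimate of system prompt + user prompt
n_batch = 512
ctx = 8192

budget = image_max_tokens + text_tokens + output_max_tokens
assert budget <= ctx, f"ctx too small: need at least {budget}, have {ctx}"
assert n_batch <= ctx, "n_batch must not exceed ctx"
print(f"OK: {budget} tokens fit into ctx={ctx}")
```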
Non-customizable parameters:
- `image_min_tokens` = 1024 - the minimum number of tokens allocated for the image.
- `force_reasoning` = False - reasoning mode off.
- `swa_full` = True - disables Sliding Window Attention.
- `verbose` = False - doesn't clutter the console.
Speed test and full-memory issue:
An LLM and CLIP cannot be split the way a UNET can; they must be loaded in their entirety. Therefore, to get good speed, you must not overflow VRAM. Check in Task Manager whether VRAM is filling up (that causes a ~10x slowdown).
Memory usage (and speed) depends on the model size, the quantization method, the size of the input prompt and output response, and the image size, so it is difficult to estimate the speed in general. For reference: with a prompt of 377 English words, a response of 225 English words, and a 1024x1024 image, the node executes in 13 seconds on an RTX 5080 with the 8B Q8 model.
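To get a feel for why `ctx` matters for memory, here is a back-of-the-envelope KV-cache estimate. The dimensions are assumptions for a Qwen3-8B-class model (roughly 36 layers, 8 KV heads, head dim 128, f16 cache); check your GGUF's metadata for the real values, and remember that the weights, the mmproj encoder and compute buffers come on top of this:

```python
# Rough KV-cache size estimate (illustrative only).
n_layers = 36       # assumed for a Qwen3-8B-class model
n_kv_heads = 8      # assumed (GQA)
head_dim = 128      # assumed
bytes_per_elem = 2  # f16 cache
ctx = 8192

kv_bytes = 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem  # K and V
print(f"KV cache at ctx={ctx}: {kv_bytes / 1024**3:.2f} GiB")  # ~1.1 GiB with these assumed values
```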
If memory is already full before this node starts and there isn't enough left, I used to put this project in front of the node:
- https://github.com/SeanScripts/ComfyUI-Unload-Model
But sometimes a model would still get loaded between that node and mine. So I just stole the code from there and pasted it into my node behind the `unload_all_models` flag.
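For reference, the unloading step boils down to asking ComfyUI's model manager to drop everything and then clearing caches. A minimal sketch of that idea, assuming the standard `comfy.model_management` helpers inside a ComfyUI custom node (not a copy of either project's exact code):

```python
import gc
import torch
import comfy.model_management as mm

def unload_everything():
    """Free ComfyUI-managed models and cached GPU memory before heavy work."""
    mm.unload_all_models()   # ask ComfyUI to unload every loaded model
    mm.soft_empty_cache()    # let ComfyUI release cached VRAM
    gc.collect()             # drop lingering Python references
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```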
Models:
- Regular Qwen:
- https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF/tree/main
For example:
`Qwen3VL-8B-Instruct-Q8_0.gguf` + `mmproj-Qwen3VL-8B-Instruct-F16.gguf`
- Uncensored Qwen (but the model isn't trained on NSFW and doesn't understand it well):
- https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated/tree/main/GGUF
For example:
`ggml-model-q8_0.gguf` + `mmproj-model-f16.gguf`
- Old llava model (joycaption), true NSFW:
- https://huggingface.co/concedo/llama-joycaption-beta-one-hf-llava-mmproj-gguf/tree/main
For example:
`llama-joycaption-beta-one-hf-llava-q8_0.gguf` + `llama-joycaption-beta-one-llava-mmproj-model-f16.gguf`
The loader is selected based on the file name: a llava model must contain the word `llava` in its name.
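For context, this is roughly what multimodal loading looks like through llama-cpp-python for a llava-style model such as the joycaption pair above. It is a sketch of the general upstream API, not this node's internal code; the image path `H:/test.png` is hypothetical, and Qwen3-VL models need the corresponding handler from the JamePeng fork instead of `Llava15ChatHandler`:

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def image_to_data_uri(path: str) -> str:
    """Encode a local image as a base64 data URI, as expected by the chat handler."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

# mmproj supplies the vision encoder/projector; the main GGUF supplies the LLM.
chat_handler = Llava15ChatHandler(clip_model_path="H:/llama-joycaption-beta-one-llava-mmproj-model-f16.gguf")
llm = Llama(
    model_path="H:/llama-joycaption-beta-one-hf-llava-q8_0.gguf",
    chat_handler=chat_handler,
    n_ctx=8192,       # the node's ctx
    n_batch=512,      # the node's n_batch
    n_gpu_layers=-1,  # the node's gpu_layers (-1 = all layers on GPU)
)
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a highly accurate vision-language assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_uri("H:/test.png")}},
            {"type": "text", "text": "Describe this image"},
        ]},
    ],
    max_tokens=2048,   # the node's output_max_tokens
    temperature=0.6,
    top_p=0.9,
    repeat_penalty=1.2,
)
print(result["choices"][0]["message"]["content"])
```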
Recommended parameters for joycaption:
`temperature` = 0.6, `top_p` = 0.9, `repeat_penalty` = 1.2, `n_batch` = 512, `top_k` = 40
- Qwen3-VL-30B
- https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF/tree/main
For example:
`Qwen3-VL-30B-A3B-Instruct-Q4_K_S.gguf` + `mmproj-BF16.gguf`
Squeezing it into 16 GB of VRAM (1-megapixel image): the model fills the memory and runs for a long time, about 60 seconds. Offloading 5 of the 40 layers to the CPU gives roughly a 2x speedup:
`gpu_layers` = 35 <img width="739" height="624" alt="image" src="https://github.com/user-attachments/assets/c49273ed-6cbb-40fa-bc61-b2d8c164aeda" />
Implementation Features:
The node is split into two parts: all of the work is isolated in a subprocess. Why? To ensure that after this node and llama.cpp finish, everything is cleaned up and nothing unnecessary remains in memory. I've often encountered other nodes leaving something behind, and that's unacceptable to me.
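The subprocess idea in a minimal form: if the inference runs in a child process, then when that process exits the OS reclaims everything it allocated, including llama.cpp's native buffers. The sketch below illustrates the pattern with Python's `multiprocessing` (an assumption for illustration; the node's real worker passes images and many more parameters):

```python
import multiprocessing as mp

def _worker(model_path: str, prompt: str, queue) -> None:
    # Everything heavy (llama.cpp, the model, the KV cache) lives only in
    # this child process and disappears when the process exits.
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, n_ctx=8192, n_gpu_layers=-1)
    out = llm(prompt, max_tokens=256)
    queue.put(out["choices"][0]["text"])

def run_isolated(model_path: str, prompt: str) -> str:
    ctx = mp.get_context("spawn")   # a fresh interpreter, no inherited state
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(model_path, prompt, queue))
    proc.start()
    result = queue.get()            # wait for the answer
    proc.join()                     # child exits -> OS frees all its memory
    return result

if __name__ == "__main__":          # required for "spawn" on Windows
    print(run_isolated(r"H:\Qwen3VL-8B-Instruct-Q8_0.gguf", "Hello"))
```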
More options:
I wanted to keep creative freedom and full control over the LLM, so you can write any system prompt or change it on the fly.
But if anyone wants to use templates, here is a solution that doesn't take away the node's existing capabilities.
If you want to use a template prompt, add the dedicated Master Prompt Loader node. If you need new templates, you can add them to `custom_nodes\ComfyUI_Simple_Qwen3-VL-gguf\system_prompts_user.json` (the `system_prompts.json` file contains the default presets, but it may be overwritten by updates).
Just be sure not to break the JSON format, otherwise the node won't load. Double quotes inside strings must be escaped as \", and the last entry in a list must not end with a comma.
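A quick way to check the file before restarting ComfyUI is to let Python's JSON parser point at the exact line of a mistake (a small helper sketch; the path is the one mentioned above, relative to the ComfyUI folder):

```python
import json

path = r"custom_nodes\ComfyUI_Simple_Qwen3-VL-gguf\system_prompts_user.json"
try:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    print(f"OK: {path} is valid JSON with {len(data)} top-level entries")
except json.JSONDecodeError as e:
    # Points at the offending spot, e.g. an unescaped quote or a trailing comma.
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
```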
Templates stolen from here:
https://github.com/1038lab/ComfyUI-QwenVL
Simplifying the selection of models:
If you have a lot of models, you can write their paths and settings in the `system_prompts_user.json` file, as shown in `system_prompts_user.example.json`, and use the Model Preset Loader (Advanced) selector, connecting it like this:
<img width="694" height="564" alt="image" src="https://github.com/user-attachments/assets/c39d5976-d996-4bb6-ba76-5a622db5a87d" />
Then you can collapse the node and read only the important settings. <img width="586" height="480" alt="image" src="https://github.com/user-attachments/assets/c453aba1-0b86-4812-a4df-d448fd9f591b" />
Convention:
- The `system_prompts.json` file contains the project presets that I will be updating. Do not edit this file, or your changes will be deleted.
- The `system_prompts_user.json` file contains user presets in exactly the same format. This file will not be updated; edit this one.
Under construction:
- Styles - work well 90% of the time. But you cannot set up instructions that contradict each other.
- Camera settings - only work with the 30B model, but give very interesting results; some settings are useful.
- The "describe..." descriptive group works for photorealism and forces the LLM to describe more details.
- The other options - I don't see any point in them yet. The idea for this configurator is taken from here: https://huggingface.co/spaces/fancyfeast/joy-caption-beta-one
Troubleshooting:
Check that the libraries are installed and up to date. Create a test.py file with the following content:
    import llama_cpp
    print("llama-cpp-python version:", llama_cpp.__version__)
    try:
        from llama_cpp import llama_print_system_info
        info = llama_print_system_info()
        print(info.decode('utf-8'))
    except Exception as e2:
        print("Failed:", e2)
Run it using the embedded Python from the python_embeded folder:
    H:\ComfyUI128\python_embeded>python temp\test.py
    llama-cpp-python version: 0.3.17
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes
    CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Stuck issue:
If the model gets stuck on a response, you need to:
- increase `temperature`
- decrease `top_p`
- increase `repeat_penalty`
ggml memory pool overflow issue:
If the error `ggml_new_object: not enough space in the context's memory pool (needed 330192, available 16)` occurs, try to:
- increase `pool_size`
- decrease `ctx`
- decrease `image_max_tokens`
- increase `n_batch`
Maybe it will be useful to someone.
[!] Tested only on Windows, only on an RTX 5080, and only with Python 3.13.2 and PyTorch 2.10.0.dev20251121+cu130. Tested with Qwen3-VL-8B, Qwen3-VL-30B, and llama-joycaption-beta-one.
Dependencies & Thanks:
- https://github.com/JamePeng/llama-cpp-python
- https://github.com/ggml-org/llama.cpp
- https://github.com/SeanScripts/ComfyUI-Unload-Model
- https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF/tree/main
- https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated/tree/main/GGUF