ComfyUI_Simple_Qwen3-VL-gguf
Simple Qwen3-VL gguf LLM model loader for Comfy-UI.
Why is this version needed?
This version was created to meet my requirements:
- The loader must support GGUF (GGUF models run faster than Transformers models).
- It must support the Qwen3-VL multimodal model.
- After running, the node must be completely cleared from memory, leaving no garbage behind. This is important: very resource-intensive processes come next and need ALL the memory. (Yes, you have to reload the model each time, but this is faster overall, especially on good hardware with fast memory and disks.)
- No pre-downloaded models stored in some unknown location. You can use any models you already have: just download them by any convenient method (via a browser, or even on a flash drive from a friend) and simply specify their path on disk. For me, this is the most convenient approach.
- The node needs to run fast; ~10 seconds is acceptable for me. For now, only a GGUF model can provide this. There's also SDNQ, but I haven't been able to get it running yet.
What's the problem:
Qwen3-VL support hasn't been added to the standard llama-cpp-python library (the one installed via `pip install llama-cpp-python`), so that route didn't work for me.
Workaround (until support is added):
- Download this using Git:
- https://github.com/JamePeng/llama-cpp-python
- Download this using Git:
- https://github.com/ggml-org/llama.cpp
- Place the second project, `llama.cpp\`, into the `llama-cpp-python\vendor\` folder.
- Go to the `llama-cpp-python` folder and run: `set CMAKE_ARGS="-DGGML_CUDA=on"` followed by `path_to_comfyui\python_embeded\python -m pip install -e .` (if you have embedded Python, which is the usual case). A quick check that the build picked up CUDA support follows this list.
- Wait for the package to build from source. (You can also find ready-made WHL packages for your configuration.)
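To confirm that the build actually picked up CUDA, a minimal sanity check can be run with the embedded Python. This is just a sketch and assumes the fork keeps the standard `llama_cpp` module layout:

```python
# quick sanity check of the llama-cpp-python build (run with the embedded Python)
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# True only if the library was compiled with GPU offload support (e.g. CUDA)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```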
What's next:
- Use ComfyUI Manager, or clone this project with Git into the folder `path_to_comfyui\ComfyUI\custom_nodes`.
- Restart ComfyUI and check in the console that the custom nodes load without errors.
- Reload the frontend (F5).
- The following nodes now appear:
- `Qwen-VL Vision Language Model` - the main node for working with the LLM
- `Master Prompt Loader` - loads system prompt and user prompt presets
- `Master Prompt Loader (advanced)` - loads system prompt and user prompt presets; contains a number of other options that are still under development
- `Model Preset Loader (Advanced)` - more convenient model selection from a JSON file

<img width="1810" height="625" alt="+++" src="https://github.com/user-attachments/assets/b7a8605b-0f95-4751-8db1-76c043ff3309" />
Parameters (update):
- `image`, `image2`, `image3`: IMAGE - the images to analyze; you can use up to 3. For example, you can instruct Qwen to combine all the images into one scene, and it will do so. You can also pass no images and use the model simply as a text LLM.
- `system_prompt`: STRING, default: "You are a highly accurate vision-language assistant. Provide detailed, precise, and well-structured image descriptions." - role + rules + format.
- `user_prompt`: STRING, default: "Describe this image" - the specific task + input data + variable wishes.
- `model_path`: STRING, default: `H:\Qwen3VL-8B-Instruct-Q8_0.gguf` - the path to the model.
- `mmproj_path`: STRING, default: `H:\mmproj-Qwen3VL-8B-Instruct-F16.gguf` - the path to the mmproj model; it is required and is usually located in the same place as the model.
- `output_max_tokens`: INT, default: 2048, min: 64, max: 4096 - the maximum number of output tokens. A smaller number saves memory but may result in a truncated response.
- `image_max_tokens`: INT, default: 4096, min: 1024, max: 1024000 - the maximum number of tokens allocated to the image. A smaller number saves memory, but an image requires a lot of tokens, so don't set it too low.
- `ctx`: INT, default: 8192, min: 0, max: 1024000 - context size. A smaller number saves memory. Rule: `image_max_tokens + text_max_tokens + output_max_tokens <= ctx` (see the sketch after this list).
- `n_batch`: INT, default: 512, min: 64, max: 1024000 - the number of tokens processed simultaneously. A smaller number saves memory; setting `n_batch = ctx` will speed things up. Rule: `n_batch <= ctx`.
- `gpu_layers`: INT, default: -1, min: -1, max: 100 - lets you offload some layers to the CPU. If there is not enough VRAM you can use the CPU, but this significantly slows things down. -1 means all layers on the GPU, 0 means all layers on the CPU.
- `temperature`: FLOAT, default: 0.7, min: 0.0, max: 1.0
- `seed`: INT, default: 42
- `unload_all_models`: BOOLEAN, default: false - if True, clears memory before starting (code from ComfyUI-Unload-Model).
- `top_p`: FLOAT, default: 0.92, min: 0.0, max: 1.0
- `repeat_penalty`: FLOAT, default: 1.2, min: 1.0, max: 2.0
- `top_k`: INT, default: 0, min: 0, max: 32768 - 0 is recommended for Qwen-VL, 40 for llava.
- `pool_size`: INT, default: 4194304, min: 1048576, max: 10485760 - if the ggml memory pool is not large enough, increase it.
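The context rule above can be sanity-checked before running the node. A minimal illustration of the two inequalities (the variable names mirror the node parameters; `text_tokens` is a rough, assumed estimate of the prompt length, and this is not the node's internal code):

```python
# Rough token-budget check for the ctx / n_batch rules above.
output_max_tokens = 2048   # node default
image_max_tokens = 4096    # node default
text_tokens = 600          # rough estimate of system prompt + user prompt
n_batch = 512
ctx = 8192

budget = image_max_tokens + text_tokens + output_max_tokens
assert budget <= ctx, f"ctx too small: need at least {budget}, have {ctx}"
assert n_batch <= ctx, "n_batch must not exceed ctx"
print(f"OK: {budget} tokens fit into ctx={ctx}")
```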
Non-customizable parameters:
- `image_min_tokens` = 1024 - the minimum number of tokens allocated for the image.
- `force_reasoning` = False - reasoning mode off.
- `swa_full` = True - disables Sliding Window Attention.
- `verbose` = False - doesn't clutter the console.
Speed test and full-memory issue:
An LLM and CLIP cannot be split the way a UNET can; they must be loaded in their entirety. Therefore, to get good speed, you must not overflow VRAM. Check in Task Manager whether VRAM is filling up (that causes a ~10x slowdown).
Memory usage (and speed) depends on the model size, the quantization method, the size of the input prompt and output response, and the image size, so it is difficult to estimate the speed in general. For reference: with a prompt of 377 English words, a response of 225 English words, and a 1024x1024 image, the node executes in 13 seconds on an RTX 5080 with the 8B Q8 model.
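To get a feel for why `ctx` matters for memory, here is a back-of-the-envelope KV-cache estimate. The dimensions are assumptions for a Qwen3-8B-class model (roughly 36 layers, 8 KV heads, head dim 128, f16 cache); check your GGUF's metadata for the real values, and remember that the weights, the mmproj encoder and compute buffers come on top of this:

```python
# Rough KV-cache size estimate (illustrative only).
n_layers = 36       # assumed for a Qwen3-8B-class model
n_kv_heads = 8      # assumed (GQA)
head_dim = 128      # assumed
bytes_per_elem = 2  # f16 cache
ctx = 8192

kv_bytes = 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem  # K and V
print(f"KV cache at ctx={ctx}: {kv_bytes / 1024**3:.2f} GiB")  # ~1.1 GiB with these assumed values
```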
If memory is already full before this node starts and there isn't enough left, I used to put this project in front of the node:
- https://github.com/SeanScripts/ComfyUI-Unload-Model
But sometimes a model would still get loaded between that node and mine. So I just stole the code from there and pasted it into my node behind the `unload_all_models` flag.
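For reference, the unloading step boils down to asking ComfyUI's model manager to drop everything and then clearing caches. A minimal sketch of that idea, assuming the standard `comfy.model_management` helpers inside a ComfyUI custom node (not a copy of either project's exact code):

```python
import gc
import torch
import comfy.model_management as mm

def unload_everything():
    """Free ComfyUI-managed models and cached GPU memory before heavy work."""
    mm.unload_all_models()   # ask ComfyUI to unload every loaded model
    mm.soft_empty_cache()    # let ComfyUI release cached VRAM
    gc.collect()             # drop lingering Python references
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```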
Models:
- Regular Qwen:
- https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF/tree/main
For example:
`Qwen3VL-8B-Instruct-Q8_0.gguf` + `mmproj-Qwen3VL-8B-Instruct-F16.gguf`
- Uncensored Qwen (but the model isn't trained on NSFW and doesn't understand it well):
- https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated/tree/main/GGUF
For example:
`ggml-model-q8_0.gguf` + `mmproj-model-f16.gguf`
- Old llava model (joycaption), true NSFW:
- https://huggingface.co/concedo/llama-joycaption-beta-one-hf-llava-mmproj-gguf/tree/main
For example:
`llama-joycaption-beta-one-hf-llava-q8_0.gguf` + `llama-joycaption-beta-one-llava-mmproj-model-f16.gguf`
The loader is selected based on the file name: a llava model must contain the word `llava` in its name.
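For context, this is roughly what multimodal loading looks like through llama-cpp-python for a llava-style model such as the joycaption pair above. It is a sketch of the general upstream API, not this node's internal code; the image path `H:/test.png` is hypothetical, and Qwen3-VL models need the corresponding handler from the JamePeng fork instead of `Llava15ChatHandler`:

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def image_to_data_uri(path: str) -> str:
    """Encode a local image as a base64 data URI, as expected by the chat handler."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

# mmproj supplies the vision encoder/projector; the main GGUF supplies the LLM.
chat_handler = Llava15ChatHandler(clip_model_path="H:/llama-joycaption-beta-one-llava-mmproj-model-f16.gguf")
llm = Llama(
    model_path="H:/llama-joycaption-beta-one-hf-llava-q8_0.gguf",
    chat_handler=chat_handler,
    n_ctx=8192,       # the node's ctx
    n_batch=512,      # the node's n_batch
    n_gpu_layers=-1,  # the node's gpu_layers (-1 = all layers on GPU)
)
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a highly accurate vision-language assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_uri("H:/test.png")}},
            {"type": "text", "text": "Describe this image"},
        ]},
    ],
    max_tokens=2048,   # the node's output_max_tokens
    temperature=0.6,
    top_p=0.9,
    repeat_penalty=1.2,
)
print(result["choices"][0]["message"]["content"])
```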
Recommended parameters for joycaption:
`temperature` = 0.6, `top_p` = 0.9, `repeat_penalty` = 1.2, `n_batch` = 512, `top_k` = 40
- Qwen3-VL-30B
- https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF/tree/main
For example:
`Qwen3-VL-30B-A3B-Instruct-Q4_K_S.gguf` + `mmproj-BF16.gguf`
Squeezing it into 16 GB of VRAM (1-megapixel image): the model fills the memory and runs for a long time, about 60 seconds. Offloading 5 of the 40 layers to the CPU gives roughly a 2x speedup:
`gpu_layers` = 35 <img width="739" height="624" alt="image" src="https://github.com/user-attachments/assets/c49273ed-6cbb-40fa-bc61-b2d8c164aeda" />
Implementation Features:
The node is split into two parts: all of the work is isolated in a subprocess. Why? To ensure that after this node and llama.cpp finish, everything is cleaned up and nothing unnecessary remains in memory. I've often encountered other nodes leaving something behind, and that's unacceptable to me.
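The subprocess idea in a minimal form: if the inference runs in a child process, then when that process exits the OS reclaims everything it allocated, including llama.cpp's native buffers. The sketch below illustrates the pattern with Python's `multiprocessing` (an assumption for illustration; the node's real worker passes images and many more parameters):

```python
import multiprocessing as mp

def _worker(model_path: str, prompt: str, queue) -> None:
    # Everything heavy (llama.cpp, the model, the KV cache) lives only in
    # this child process and disappears when the process exits.
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, n_ctx=8192, n_gpu_layers=-1)
    out = llm(prompt, max_tokens=256)
    queue.put(out["choices"][0]["text"])

def run_isolated(model_path: str, prompt: str) -> str:
    ctx = mp.get_context("spawn")   # a fresh interpreter, no inherited state
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(model_path, prompt, queue))
    proc.start()
    result = queue.get()            # wait for the answer
    proc.join()                     # child exits -> OS frees all its memory
    return result

if __name__ == "__main__":          # required for "spawn" on Windows
    print(run_isolated(r"H:\Qwen3VL-8B-Instruct-Q8_0.gguf", "Hello"))
```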
More options:
I wanted to keep creative freedom and full control over the LLM, so you can write any system prompt or change it on the fly.
But if anyone wants to use templates, here is a solution that doesn't take away the node's existing capabilities.
If you want to use a template prompt, add the dedicated Master Prompt Loader node. If you need new templates, you can add them to `custom_nodes\ComfyUI_Simple_Qwen3-VL-gguf\system_prompts_user.json` (the `system_prompts.json` file contains the default presets, but it may be overwritten by updates).
Just be sure not to break the JSON format, otherwise the node won't load. Double quotes inside strings must be escaped as \", and the last entry in a list must not end with a comma.
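A quick way to check the file before restarting ComfyUI is to let Python's JSON parser point at the exact line of a mistake (a small helper sketch; the path is the one mentioned above, relative to the ComfyUI folder):

```python
import json

path = r"custom_nodes\ComfyUI_Simple_Qwen3-VL-gguf\system_prompts_user.json"
try:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    print(f"OK: {path} is valid JSON with {len(data)} top-level entries")
except json.JSONDecodeError as e:
    # Points at the offending spot, e.g. an unescaped quote or a trailing comma.
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
```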
Templates stolen from here:
https://github.com/1038lab/ComfyUI-QwenVL
Simplifying the selection of models:
If you have a lot of models, you can write their paths and settings in the `system_prompts_user.json` file, as shown in `system_prompts_user.example.json`, and use the Model Preset Loader (Advanced) selector, connecting it like this:
<img width="694" height="564" alt="image" src="https://github.com/user-attachments/assets/c39d5976-d996-4bb6-ba76-5a622db5a87d" />
Then you can collapse the node and read only the important settings. <img width="586" height="480" alt="image" src="https://github.com/user-attachments/assets/c453aba1-0b86-4812-a4df-d448fd9f591b" />
Convention:
- The `system_prompts.json` file contains the project presets that I will be updating. Do not edit this file, or your changes will be deleted.
- The `system_prompts_user.json` file contains user presets in exactly the same format. This file will not be updated; edit this one.
Under construction:
- Styles - work well 90% of the time. But you cannot set up instructions that contradict each other.
- Camera settings - only work with the 30B model, but give very interesting results; some settings are useful.
- The "describe..." descriptive group works for photorealism and forces the LLM to describe more details.
- The other options - I don't see any point in them yet. The idea for this configurator is taken from here: https://huggingface.co/spaces/fancyfeast/joy-caption-beta-one
Troubleshooting:
Check that the libraries are installed and up to date. Create a test.py file with the following content:
    import llama_cpp
    print("llama-cpp-python version:", llama_cpp.__version__)
    try:
        from llama_cpp import llama_print_system_info
        info = llama_print_system_info()
        print(info.decode('utf-8'))
    except Exception as e2:
        print("Failed:", e2)
Run it using the embedded Python from the python_embeded folder:
    H:\ComfyUI128\python_embeded>python temp\test.py
    llama-cpp-python version: 0.3.17
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes
    CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Stuck issue:
If the model gets stuck on a response, you need to:
- increase `temperature`
- decrease `top_p`
- increase `repeat_penalty`
ggml memory pool overflow issue:
If the error `ggml_new_object: not enough space in the context's memory pool (needed 330192, available 16)` occurs, try to:
- increase `pool_size`
- decrease `ctx`
- decrease `image_max_tokens`
- increase `n_batch`
Maybe it will be useful to someone.
[!] Tested only on Windows, only on an RTX 5080, and only with Python 3.13.2 and PyTorch 2.10.0.dev20251121+cu130. Tested with Qwen3-VL-8B, Qwen3-VL-30B, and llama-joycaption-beta-one.
Dependencies & Thanks:
- https://github.com/JamePeng/llama-cpp-python
- https://github.com/ggml-org/llama.cpp
- https://github.com/SeanScripts/ComfyUI-Unload-Model
- https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF/tree/main
- https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated/tree/main/GGUF