ComfyUI Extension: Qwen2.5-VL GGUF Nodes
ComfyUI nodes for running GGUF quantized Qwen2.5-VL models using llama.cpp
Custom Nodes (0)
README
ComfyUI-GGUF-VLM
Complete GGUF model support for ComfyUI with local and Nexa SDK inference modes.
š Features
Two Core Capabilities
- š¬ Text Models - Text-to-Text generation
- Qwen3, LLaMA3, DeepSeek-R1, Mistral, etc.
- Local GGUF models or Remote API services
- š¼ļø Vision Models - Image-Text-to-Text analysis
- Qwen2.5-VL, Qwen3-VL, LLaVA, MiniCPM-V, etc.
- Single image, multi-image comparison, video analysis
Key Features
- ā Unified interface - Simple node structure by model capability
- ā Multiple backends - GGUF (llama-cpp), Transformers, Remote API
- ā Auto model detection - Smart model loading and compatibility
- ā Thinking mode support - DeepSeek-R1, Qwen3-Thinking
- ā Multi-image analysis - Compare up to 6 images simultaneously
- ā Device optimization - CUDA, MPS, CPU with auto-detection
š¤ Supported Models
š¬ Text Models (Text-to-Text)
Qwen Series:
- Qwen3, Qwen2.5, Qwen-Chat
- Qwen3-Thinking (with thinking mode)
LLaMA Series:
- LLaMA-3.x, LLaMA-2
- Mistral, Mixtral
Other Models:
- DeepSeek-R1 (with thinking mode)
- Phi-3, Gemma, Yi
š¼ļø Vision Models (Image-Text-to-Text)
Qwen-VL Series:
- Qwen2.5-VL (recommended)
- Qwen3-VL
LLaVA Series:
- LLaVA-1.5, LLaVA-1.6
- LLaVA-NeXT
Other Vision Models:
- MiniCPM-V-2.6
- Phi-3-Vision
- InternVL
š” Note: Models must be in GGUF format for local inference, or accessible via Nexa/Ollama API for remote inference.
š¦ Installation
1. Install ComfyUI Custom Node
cd ComfyUI/custom_nodes
git clone https://github.com/walke2019/ComfyUI-GGUF-VLM.git
cd ComfyUI-GGUF-VLM
pip install -r requirements.txt
2. For Nexa SDK Mode (Optional)
# Install Nexa SDK
pip install nexaai
# Start Nexa service
nexa serve
The service will be available at http://127.0.0.1:11434
š Quick Start
Using Text Generation (Local GGUF)
Recommended for local GGUF files
[Text Model Loader]
āā model: Select your GGUF file
āā device: cuda/cpu/mps
ā
[Text Generation]
āā max_tokens: 256 ā Recommended for single paragraph
āā temperature: 0.7
āā top_p: 0.8
āā top_k: 40
āā repetition_penalty: 1.1
āā enable_thinking: False
āā prompt: "Your prompt here"
ā
Output: context, thinking
Features:
- ā Direct file access
- ā No service required
- ā Fast and simple
- ā Stop sequences prevent over-generation
- ā Automatic paragraph merging
Using Nexa SDK Mode
Recommended for Nexa SDK ecosystem
Step 1: Download Model
# Download a model using Nexa CLI
nexa pull mradermacher/Huihui-Qwen3-4B-Instruct-2507-abliterated-GGUF:Q8_0 --model-type llm
# Check downloaded models
nexa list
Step 2: Use in ComfyUI
[Nexa Model Selector]
āā base_url: http://127.0.0.1:11434
āā refresh_models: ā
āā system_prompt: (optional)
ā
[Nexa SDK Text Generation]
āā preset_model: Select from dropdown (auto-populated)
āā max_tokens: 256
āā temperature: 0.7
āā prompt: "Your prompt here"
ā
Output: context, thinking
Features:
- ā Centralized model management
- ā Auto-populated model list
- ā
Supports
nexa pullworkflow
š Available Nodes
Text Generation Nodes (Local GGUF)
š· Text Model Loader
Load GGUF models from /workspace/ComfyUI/models/LLM/GGUF/
Parameters:
model: Select from available GGUF filesdevice: cuda/cpu/mpsn_ctx: Context window (default: 8192)n_gpu_layers: GPU layers (-1 for all)
Output:
model: Model configuration
š· Text Generation
Generate text with loaded GGUF model
Parameters:
model: From Text Model Loadermax_tokens: Maximum tokens (1-8192, recommended: 256)temperature: Temperature (0.0-2.0)top_p: Top-p sampling (0.0-1.0)top_k: Top-k sampling (0-100)repetition_penalty: Repetition penalty (1.0-2.0)enable_thinking: Enable thinking modeprompt: Input prompt (at bottom for easy editing)
Outputs:
context: Generated textthinking: Thinking process (if enabled)
Features:
- ā
Stop sequences:
["User:", "System:", "\n\n\n", "\n\n##", "\n\nNote:", "\n\nThis "] - ā Automatic paragraph merging for single-paragraph prompts
- ā Detailed console logging
Nexa SDK Nodes
š· Nexa Model Selector
Configure Nexa SDK service
Parameters:
base_url: Service URL (default:http://127.0.0.1:11434)refresh_models: Refresh model listsystem_prompt: System prompt (optional)
Output:
model_config: Configuration for Text Generation
š· Nexa SDK Text Generation
Generate text using Nexa SDK
Parameters:
model_config: From Model Selectorpreset_model: Select from dropdown (auto-populated fromnexa list)custom_model: Custom model ID (format:author/model:quant)auto_download: Auto-download if missingmax_tokens: Maximum tokens (recommended: 256)temperature,top_p,top_k,repetition_penalty: Generation parametersenable_thinking: Enable thinking modeprompt: Input prompt (at bottom)
Outputs:
context: Generated textthinking: Thinking process (if enabled)
Preset Models:
DavidAU/Qwen3-8B-64k-Josiefied-Uncensored-HORROR-Max-GGUF:Q6_Kmradermacher/Huihui-Qwen3-4B-Instruct-2507-abliterated-GGUF:Q8_0prithivMLmods/Qwen3-4B-2507-abliterated-GGUF:Q8_0
š· Nexa Service Status
Check Nexa SDK service status
Parameters:
base_url: Service URLrefresh: Refresh model list
Output:
status: Service status and model list
šÆ Best Practices
For Single-Paragraph Output
System Prompt:
You are an expert prompt generator. Output ONLY in English.
**CRITICAL: Output EXACTLY ONE continuous paragraph. Maximum 400 words.**
Parameters:
max_tokens: 256 ā Key setting!
temperature: 0.7
top_p: 0.8
top_k: 20
Why max_tokens=256?
- ā Prevents over-generation
- ā Model completes task without extra commentary
- ā Reduces from ~2700 chars (11 paragraphs) to ~1300 chars (1 paragraph)
For Multi-Turn Conversations
Include history directly in prompt:
User: Hello
Assistant: Hi! How can I help?
User: Tell me a joke
No need for separate conversation history parameter.
š Thinking Mode
Automatically extracts thinking process from models like DeepSeek-R1 and Qwen3-Thinking.
Supported Tags:
<think>...</think>(DeepSeek-R1, Qwen3)<thinking>...</thinking>[THINKING]...[/THINKING]
Usage:
[Text Generation]
āā enable_thinking: True
āā prompt: "Explain your reasoning"
ā
Outputs:
āā context: Final answer (thinking tags removed)
āā thinking: Extracted thinking process
Disable Thinking:
- Set
enable_thinking: False - Or add
no_thinkto system prompt
š Mode Comparison
| Feature | Text Generation (Local) | Nexa SDK |
|---------|------------------------|----------|
| Setup | Copy GGUF file | nexa pull |
| Service | Not required | Requires nexa serve |
| Model Management | Manual | CLI (nexa list, nexa pull) |
| Use Case | Local files, production | Nexa ecosystem, shared models |
| Speed | Fast | Fast (via service) |
| Flexibility | Any GGUF file | Only nexa pull models |
Recommendation:
- Use Text Generation for local GGUF files
- Use Nexa SDK if you're already using Nexa ecosystem
š Troubleshooting
Output Too Long (Multiple Paragraphs)
Problem: Model generates 11 paragraphs instead of 1
Solution:
- Reduce max_tokens from 512 to 256
- Strengthen system prompt: Add "EXACTLY ONE paragraph"
- Stop sequences are already configured
Nexa Service Not Available
Problem: ā Nexa SDK service is not available
Solution:
- Start service:
nexa serve - Check:
curl http://127.0.0.1:11434/v1/models - Verify URL in node
Model Not in Dropdown
Problem: Downloaded model doesn't appear in Nexa SDK dropdown
Solution:
- Check:
nexa list - Click "refresh_models" in Nexa Model Selector
- Restart ComfyUI
0B Entries in nexa list
Problem: nexa list shows 0B entries
Solution:
# Clean up invalid entries
rm -rf ~/.cache/nexa.ai/nexa_sdk/models/local
rm -rf ~/.cache/nexa.ai/nexa_sdk/models/workspace
find ~/.cache/nexa.ai/nexa_sdk/models -name "*.lock" -delete
# Verify
nexa list
š Directory Structure
ComfyUI-GGUF-VLM/
āāā README.md # This file
āāā requirements.txt # Dependencies
āāā __init__.py # Node registration
āāā config/
ā āāā paths.py # Path configuration
āāā core/
ā āāā inference_engine.py # GGUF inference engine
ā āāā model_loader.py # Model loader
ā āāā inference/
ā āāā nexa_engine.py # Nexa SDK engine
ā āāā transformers_engine.py # Transformers engine
āāā nodes/
ā āāā text_node.py # Text Generation nodes
ā āāā nexa_text_node.py # Nexa SDK nodes
ā āāā vision_node.py # Vision nodes
ā āāā system_prompt_node.py # System prompt config
āāā utils/
āāā device_optimizer.py # Device optimization
āāā system_prompts.py # System prompt presets
š Recent Updates
v2.2 (2025-10-29)
- ā
Simplified Nexa Model Selector - Removed unused
models_dirandmodel_source - ā Removed unused outputs - Cleaner node interface
- ā Moved prompt to bottom - Better UX for long prompts
- ā Removed conversation_history - Use prompt directly
- ā Stop sequences - Prevent over-generation
- ā Paragraph merging - Clean single-paragraph output
- ā Dynamic model list - Auto-populated from Nexa SDK API
- ā Detailed logging - Debug-friendly console output
v2.1
- ā Nexa SDK integration
- ā Preset model list
- ā Thinking mode support
v2.0
- ā GGUF mode with llama-cpp-python
- ā ComfyUI /models/LLM integration
š Requirements
llama-cpp-python>=0.2.0
transformers>=4.30.0
torch>=2.0.0
Pillow>=9.0.0
requests>=2.25.0
nexaai # Optional, for Nexa SDK mode
š¤ Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
š License
MIT License - see LICENSE file for details
- Nexa SDK: https://github.com/NexaAI/nexa-sdk
- ComfyUI: https://github.com/comfyanonymous/ComfyUI