# ComfyUI Llama.cpp Client Node
A comprehensive ComfyUI custom node that provides complete client functionality for llama-server from llama.cpp. This node acts as a bridge between ComfyUI workflows and llama-server instances, supporting every single parameter and endpoint that llama-server offers.
## What This Repository Does
This repository provides a single, powerful ComfyUI node that can communicate with any llama-server instance and utilize all of its capabilities within ComfyUI workflows. Instead of being limited to basic text generation, you get access to:
- 8 Complete API Endpoints: Every llama-server endpoint with full parameter support
- 100+ Parameters: Every single parameter that llama-server accepts
- Advanced AI Features: Function calling, multimodal processing, structured output, and more
- Production Ready: Error handling, authentication, caching, and performance optimization
- Plug & Play: Easy integration into existing ComfyUI workflows
## Key Capabilities
### Complete API Coverage
| Endpoint | Purpose | What You Can Do |
|----------|---------|-----------------|
| `/completion` | Text generation | Stories, articles, creative writing, Q&A |
| `/v1/chat/completions` | Chat conversations | Multi-turn conversations, roleplay, assistants |
| `/v1/embeddings` | Text embeddings | Semantic search, clustering, similarity analysis |
| `/tokenize` | Text analysis | Token counting, text preprocessing |
| `/detokenize` | Token conversion | Debug tokenization, convert tokens back to text |
| `/apply-template` | Chat formatting | Test chat templates, format conversations |
| `/infill` | Code completion | Fill-in-the-middle coding, code assistance |
| `/v1/rerank` | Document ranking | Search relevance, document sorting |
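For orientation, this is roughly the request the node issues for the `/completion` row above; a minimal Python sketch, assuming a llama-server instance at `http://127.0.0.1:8080`:

```python
# Minimal sketch of a /completion request, assuming a local llama-server.
import requests

payload = {
    "prompt": "Write a short story about a robot learning to paint",
    "n_predict": 200,    # cap on the number of generated tokens
    "temperature": 0.8,
}
resp = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["content"])  # generated text lives in the "content" field
```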
### Advanced Sampling Methods
- DRY Sampling: "Don't Repeat Yourself" penalty that detects and suppresses repeated token sequences
- XTC Sampling: "Exclude Top Choices" sampling that occasionally removes the most probable tokens to encourage variety
- Mirostat: Perplexity-targeting sampling for controlled creativity
- Dynamic Temperature: Adaptive temperature that changes during generation
- Custom Sampler Chains: Define your own sampling pipeline and the order samplers are applied
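The payload below sketches how several of these samplers combine in a single `/completion` request. Parameter names (`dry_multiplier`, `xtc_probability`, `dynatemp_range`, `samplers`, and so on) follow llama-server's API in recent llama.cpp builds; sampler options evolve, so verify them against your server's documentation.

```python
# Hedged sketch: one /completion payload combining DRY, XTC, Mirostat, and
# dynamic temperature. Verify parameter names against your llama.cpp build.
payload = {
    "prompt": "Chapter 1: The Last Library",
    "n_predict": 400,
    "temperature": 1.0,
    "dynatemp_range": 0.4,     # temperature wanders within [0.6, 1.4]
    "dry_multiplier": 0.8,     # > 0 enables the DRY repetition penalty
    "dry_allowed_length": 2,   # repeated sequences longer than this are penalized
    "xtc_probability": 0.1,    # 10% chance to exclude the top choices
    "xtc_threshold": 0.1,
    "mirostat": 0,             # 0 = off, 1/2 = Mirostat v1/v2
    # Custom sampler chain: the order in which samplers are applied.
    "samplers": ["dry", "top_k", "top_p", "min_p", "xtc", "temperature"],
}
```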
### Structured Output & Constraints
- JSON Schema: Force valid JSON output with custom schemas
- BNF Grammar: Use formal grammars to constrain generation
- Logit Bias: Fine-tune token probabilities for specific words/concepts
- Stop Sequences: Precise control over when generation stops
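A minimal sketch of schema-constrained generation (assuming a server at `127.0.0.1:8080`): llama-server compiles the `json_schema` field into a grammar server-side, so the reply should always parse.

```python
# Sketch: constraining /completion output with a JSON schema.
import json
import requests

schema = {
    "type": "object",
    "properties": {
        "rating": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string"},
    },
    "required": ["rating", "summary"],
}
payload = {
    "prompt": "Generate a product review for a laptop as JSON.",
    "n_predict": 256,
    "json_schema": schema,  # the server turns this into a grammar constraint
}
resp = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=120)
review = json.loads(resp.json()["content"])  # conforms to the schema
print(review["rating"], review["summary"])
```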
### Multimodal & Function Calling
- Vision Models: Process images with base64 encoding
- Function Calling: Let models call tools and functions
- Tool Integration: OpenAI-compatible function definitions
- Image References: Embed images directly in prompts
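As a sketch of the OpenAI-compatible request involved in tool use (note that llama-server generally needs a chat template with tool support, e.g. launched with `--jinja`, before `tool_calls` appear in responses):

```python
# Hedged sketch: OpenAI-style function calling via /v1/chat/completions.
import requests

payload = {
    "messages": [{"role": "user", "content": "What's the weather like in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
}
resp = requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload, timeout=120)
message = resp.json()["choices"][0]["message"]
# The model either answers directly or requests a tool invocation.
print(message.get("tool_calls") or message.get("content"))
```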
### Performance & Production Features
- KV Cache Management: Reuse computations between requests
- Slot Management: Handle concurrent requests efficiently
- LoRA Adapters: Dynamic model adaptation per request
- Streaming: Real-time token generation
- Authentication: API key support for secure deployments
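Streaming deserves a quick illustration: with `"stream": true`, llama-server emits Server-Sent Events whose `data:` lines carry JSON chunks. A minimal consumer sketch, assuming the default address:

```python
# Sketch: consuming a streamed /completion response token by token.
import json
import requests

payload = {"prompt": "Once upon a time", "n_predict": 100, "stream": True}
with requests.post(
    "http://127.0.0.1:8080/completion", json=payload, stream=True, timeout=120
) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":  # OpenAI-style terminator, just in case
            break
        chunk = json.loads(data)
        print(chunk.get("content", ""), end="", flush=True)
        if chunk.get("stop"):  # final chunk signals the end of generation
            break
```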
## Installation
### Prerequisites
- ComfyUI installed and running
- llama-server (from llama.cpp) running somewhere accessible
- Python 3.7+ with pip
### Install the Node
1. Clone into ComfyUI's custom nodes directory:

   ```bash
   cd /path/to/ComfyUI/custom_nodes
   git clone https://github.com/fidecastro/comfyui-llamacpp-client.git
   ```

2. Install dependencies:

   ```bash
   cd comfyui-llamacpp-client
   pip install -r requirements.txt
   ```

3. Restart ComfyUI - the node will appear in the `AI/LlamaCpp` category.
### Start Your llama-server
For basic functionality:

```bash
./llama-server -m your-model.gguf -c 4096 --host 0.0.0.0 --port 8080
```

For all features:

```bash
./llama-server -m your-model.gguf -c 4096 \
    --host 0.0.0.0 --port 8080 \
    --slots --metrics --props \
    --embedding --reranking
```
## Quick Start Examples
### Basic Text Generation

```
Endpoint: completion
Server URL: http://127.0.0.1:8080
Prompt: "Write a short story about a robot learning to paint"
Temperature: 0.8
N Predict: 200
```
### Creative Writing with Advanced Sampling

```
Endpoint: completion
Prompt: "Chapter 1: The Last Library"
Temperature: 1.0
Dynamic Temperature Range: 0.4
DRY Multiplier: 0.8
XTC Probability: 0.1
Repeat Penalty: 1.05
```
### Structured JSON Output

```
Endpoint: completion
Prompt: "Generate a product review for a laptop"
JSON Schema: {
  "type": "object",
  "properties": {
    "rating": {"type": "integer", "minimum": 1, "maximum": 5},
    "title": {"type": "string"},
    "pros": {"type": "array", "items": {"type": "string"}},
    "cons": {"type": "array", "items": {"type": "string"}},
    "summary": {"type": "string"}
  }
}
```
### Chat Conversation

```
Endpoint: chat_completions
System Message: "You are a helpful coding tutor"
User Message: "Explain recursion with a simple Python example"
Temperature: 0.3
Max Tokens: 300
```
### Code Completion

```
Endpoint: infill
Input Prefix: "def fibonacci(n):\n    if n <= 1:\n        return n\n    "
Input Suffix: "\n    return fibonacci(n-1) + fibonacci(n-2)"
Temperature: 0.1
```
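Behind this example is the `/infill` endpoint, which requires a model with fill-in-the-middle (FIM) support, such as code models that define prefix/suffix special tokens. A minimal sketch of the equivalent raw request:

```python
# Sketch: fill-in-the-middle via /infill, assuming a FIM-capable model
# on the default server address used throughout these examples.
import requests

payload = {
    "input_prefix": "def fibonacci(n):\n    if n <= 1:\n        return n\n    ",
    "input_suffix": "\n    return fibonacci(n-1) + fibonacci(n-2)",
    "n_predict": 64,
    "temperature": 0.1,
}
resp = requests.post("http://127.0.0.1:8080/infill", json=payload, timeout=120)
print(resp.json()["content"])  # the generated middle section
```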
### Function Calling

```
Endpoint: chat_completions
User Message: "What's the weather like in Tokyo?"
Tools: [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}}
      }
    }
  }
]
```
## What Makes This Special
### Completeness
Unlike other ComfyUI nodes that support only basic parameters, this node exposes every single option that llama-server provides. No feature is left behind.
### Real-World Ready
Built for production use with proper error handling, authentication, timeouts, and comprehensive logging. Works reliably in complex workflows.
### Thoroughly Documented
- PARAMETERS.md: Complete reference for all 100+ parameters
- examples.md: Real-world configuration examples
- CHANGELOG.md: Version history and updates
- test_node.py: Automated testing script
### Extensible Design
Easy to extend and modify. Clean, well-commented code that follows ComfyUI conventions.
## Testing Your Setup
Run the included test script to verify everything works:

```bash
python test_node.py
```

This tests all endpoints and validates your server connection.
## Advanced Use Cases
### Content Creation Workflows
- Generate story outlines with structured JSON
- Create character dialogues with chat completions
- Fill in story gaps with infill completion
- Rank story ideas with reranking
### Code Development Workflows
- Generate code with completion endpoint
- Debug with tokenize/detokenize
- Code completion with infill
- Documentation generation with structured output
### Data Processing Workflows
- Generate embeddings for semantic search
- Rank documents by relevance
- Extract structured data with JSON schemas
- Process images with multimodal models
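As an illustration of the embeddings workflow, the sketch below requests OpenAI-style embeddings (the server must be started with its embedding flag) and compares two texts by cosine similarity:

```python
# Hedged sketch: semantic similarity via /v1/embeddings, assuming
# llama-server was launched with embedding support on localhost:8080.
import math
import requests

def embed(texts):
    """Return one embedding vector per input text (OpenAI-style response)."""
    resp = requests.post(
        "http://127.0.0.1:8080/v1/embeddings",
        json={"input": texts},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

a, b = embed(["a robot learning to paint", "an android studying art"])
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(f"cosine similarity: {dot / norm:.3f}")
```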
### Interactive AI Workflows
- Multi-turn conversations with chat completions
- Function calling for external integrations
- Dynamic model behavior with LoRA adapters
- Real-time streaming for responsive UIs
## Node Outputs
The node provides four outputs for maximum flexibility:
- Response: Clean, formatted response text
- Raw Response: Complete JSON response from server
- Error: Detailed error messages (empty if successful)
- Status Code: HTTP status code for debugging
## Parameter Categories
### Generation Control (20+ parameters)

Temperature, top-k, top-p, min-p, seed, n_predict, streaming, etc.

### Advanced Sampling (25+ parameters)

DRY, XTC, Mirostat, dynamic temperature, custom sampler chains, etc.

### Repetition Management (10+ parameters)

Repeat penalty, presence penalty, frequency penalty, DRY settings, etc.

### Constraints & Grammar (8+ parameters)

JSON schema, BNF grammar, logit bias, stop sequences, etc.

### Multimodal & Tools (12+ parameters)

Image data, function definitions, tool choice, response format, etc.

### Performance & Caching (15+ parameters)

Cache settings, slot management, timeouts, LoRA adapters, etc.

### Chat & Conversation (10+ parameters)

Messages, system prompts, chat templates, prefilling, etc.

### Specialized Endpoints (20+ parameters)

Tokenization, embeddings, infill, and reranking-specific options, etc.
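For example, the tokenization endpoints make token counting a two-call affair; a sketch of a round trip, assuming the default server address:

```python
# Sketch: token counting and a detokenize round trip. /tokenize and
# /detokenize are cheap, synchronous calls useful for debugging.
import requests

BASE = "http://127.0.0.1:8080"
text = "Hello, ComfyUI!"

tokens = requests.post(f"{BASE}/tokenize", json={"content": text}, timeout=30).json()["tokens"]
print(f"{len(tokens)} tokens: {tokens}")

# Convert the token IDs back to text to inspect how the model segments it.
back = requests.post(f"{BASE}/detokenize", json={"tokens": tokens}, timeout=30).json()["content"]
print(back)  # should match the original, modulo leading-whitespace handling
```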
## Troubleshooting
### Common Issues
- Connection refused: Check if llama-server is running and accessible
- Timeout errors: Increase timeout parameter for long generations
- Invalid JSON: Verify JSON parameter formatting in multiline fields
- Feature not working: Ensure llama-server started with required flags
### Performance Tips

- Use `cache_prompt=true` for similar prompts
- Set an appropriate `id_slot` for concurrent requests
- Configure `n_keep` to retain important context
- Use streaming for long generations
- Optimize server batch sizes for your hardware
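As an illustration, the first three tips map to request fields like this (a hedged sketch; the prompt strings here are placeholders):

```python
# Hedged sketch of the caching-related request fields from the tips above.
long_system_prompt = "You are a meticulous research assistant."  # placeholder
user_question = "Summarize the findings."                        # placeholder

payload = {
    "prompt": long_system_prompt + "\n" + user_question,
    "cache_prompt": True,  # reuse KV-cache work for the shared prompt prefix
    "id_slot": 0,          # pin related requests to one server slot
    "n_keep": 256,         # prompt tokens to keep if the context overflows
    "n_predict": 200,
}
```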
## Contributing
We welcome contributions! This project aims to maintain complete compatibility with llama-server as it evolves.
- Fork the repository
- Create a feature branch
- Test with `python test_node.py`
- Submit a pull request
## License
MIT License - see LICENSE file for details.
## Acknowledgments
- llama.cpp team for the excellent server implementation
- ComfyUI for the amazing workflow platform
- Open source community for feedback and contributions
## Project Stats
- 8 API Endpoints: Complete coverage
- 100+ Parameters: Every llama-server option
- 800+ Lines of Code: Robust implementation
- Full Documentation: Comprehensive guides and examples
- Production Ready: Error handling, testing, validation
Transform your ComfyUI workflows with the full power of llama.cpp.
This node bridges the gap between ComfyUI's visual workflow system and llama.cpp's powerful inference server, giving you access to cutting-edge AI capabilities in an intuitive, visual interface.