Shrug-Prompter: Unified VLM Integration for ComfyUI

Authored by fblissjr


A comprehensive Vision-Language Model (VLM) integration system for ComfyUI with intelligent prompt optimization, object detection, template support, and performance optimizations. Optimized for Wan2.1, Flux Kontext, and general-purpose use. Goes well with my other project, heylookitsanllm (https://github.com/fblissjr/heylookitsanllm), an MLX/llama.cpp server with hot-swappable models and ollama API compatibility.


    Shrug-Prompter

    Clean, memory-efficient VLM nodes for ComfyUI, with state management, looping, keyframe extraction, batching, and built-in templates for Wan2.1, VACE, and beyond.

    Initially built and tested with my local (OpenAI API compatible and sorta ollama compatible) vision LLM server, https://github.com/fblissjr/heylookitsanllm, which works with any GGUF or MLX model. Most features will work with any OpenAI-compatible API endpoint, but I've been deviating from strict compatibility as my performance needs changed.

    Built for Wan2.1, WAN VACE, and FLUX Kontext, but it can obviously be extended beyond that. The nodes are generic enough to work with any workflow that needs vision-to-text capabilities, and there are utility nodes to go along with them.

    Handles State Management, Looping, Batching, Keyframe Extraction, and many more edge cases that drove me crazy when building this project. If you've ever tried to get VLMs working properly inside ComfyUI loops, you know the pain.

    Built with love for Bandoco (and the broader community), where I've learned a ton over the years, and where all the amazing innovation in this space is happening right now.

    What is it?

    Shrug-prompter is a set of ComfyUI nodes that connect vision language models (VLMs) to video generation workflows. It lets you analyze keyframes and generate context-aware prompts automatically instead of typing them manually or copying and pasting from other LLMs. There are templates that have been refined with LLMs against datasets and eval sets that I've found correlate well with Wan2.1 and VACE prompts (in other words, what I've inferred to be closely aligned with the training data). I'm fairly awful at creating workflows, but I've tried to create a few to show how these nodes work together and how the various utility nodes play a role in the overall process.

    That all said, the nodes are modular, and you can load your own prompt templates, or edit mine. They're meant as a starter, and are in the templates folder as markdown files. Some are few-shot, some are not. The model family and size you use matters. I've found qwen2.5-vl 72B to work best, likely because it's large enough to handle the few-shot examples, and more importantly, it's likely in the same model family as what was used to rewrite the captions used for training Wan2.1. Model family matters here, because of all those little nuances in LLMs and vocabs and tokenization. Few-shot examples tend to work well, but latency can be brutal without prompt caching.

    Why more prompting custom nodes?

    I've now spent years obsessively bouncing back and forth between the LLM space and the diffusion space, and what's neat is how both sides are now starting to converge more and more. Prompts need to closely align with the training dataset for models that increasingly rely on automated LLM-driven captions (such as FLUX Kontext and Wan). Manual captioning doesn't really work: it's neither fun nor scalable to contort your language, style, and structure to fit the training dataset, yet some human guidance is still needed to get wherever it is you want to go. These nodes aim to help with that.

    There's tons of custom nodes out there for captioning and prompting for video and image generation, but most tend to come in two flavors: a local/on-device one that runs on the same machine as ComfyUI (and eats up resources, or is kept limited to a small model that doesn't perform very well), or cloud-based ones that hit commercial endpoints. This is squarely in the on-device / local space, but thrives in an environment where you might be driving your daily use with an Apple Silicon Mac or a Linux or Windows machine with enough RAM to run a large capable model.

    shrug-prompter was born as a more tunable alternative to those, but really was created as a testbed client for an API server I've been working on that unifies Apple's mlx-lm, mlx-vlm, and llama.cpp gguf models, under a single endpoint - called heylookitsanllm. It began as an OpenAI compatible endpoint, and slowly evolved as the features and performance I needed from ComfyUI calling it grew. I added ollama API endpoints into it as more of an afterthought, primarily because of how many folks I see running ollama models - which tend to not perform as well due to odd defaults and quirks - but are used because of the ease-of-adoption factor.

    Little quality-of-life features, like auto-resizing images by pushing the work down to the server as multipart raw messages instead of base64, increased performance by ~33% and prevented some OOMs in ComfyUI for generations that were already right on the edge of my 4090's limited VRAM. With that pain came the ability to pass params like resize_params = ['resize_max', 'resize_width', 'resize_height', 'image_quality', 'preserve_alpha'] - and suddenly I wasn't OpenAI compatible anymore, but more of a superset of sorts.
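
    To make that concrete, here is a minimal sketch of what a multipart upload with server-side resize params could look like from a client. The field names come from the resize_params list above; the endpoint path and payload layout are assumptions for illustration, not the documented heylookitsanllm API.

        import requests

        def send_multipart(image_path, prompt, server="http://localhost:8080"):
            # Raw bytes go over the wire as multipart form data (no base64 inflation)
            with open(image_path, "rb") as f:
                files = {"image": ("frame.png", f, "image/png")}
                data = {
                    "prompt": prompt,
                    # Server-side resize params named above; values are examples
                    "resize_max": 512,
                    "image_quality": 85,
                    "preserve_alpha": "false",
                }
                # Hypothetical endpoint path; a stock OpenAI-compatible server would
                # instead expect base64 data URLs inside a JSON chat payload.
                resp = requests.post(f"{server}/v1/chat/multipart", files=files, data=data, timeout=300)
            resp.raise_for_status()
            return resp.json()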

    Due to this, you're unlikely to see as much value with these nodes without also running heylookitsanllm. But maybe you will - let me know.

    Quick Start

    (This is the first custom node I've invested any real significant time into getting over the finish line - I'll look into ComfyUI Manager integration soon if there's value here.)

    1. Install into your ComfyUI custom_nodes folder
    2. Start your VLM server (heylookitsanllm or any OpenAI-compatible endpoint)
    3. Load an example workflow and go

    Example Workflows

    Basic VLM Prompting

    example_workflows/simple_vlm_prompt.json

    • Connect images → VLM → get descriptions
    • Shows basic setup with provider config

    Video Frame Interpolation

    example_workflows/video_interpolation_loop.json

    • Extract frame pairs from keyframes
    • Generate prompts for each transition
    • Works with ForLoop structures

    Batch Processing

    example_workflows/batch_vlm_processing.json

    • Process multiple images in one go
    • Automatic memory management
    • Accumulate results for downstream use

    WAN/VACE Integration

    example_workflows/wan_vace_vlm.json

    • Replace manual prompts with VLM analysis
    • Processes frame pairs for smooth transitions
    • Uses wan_vace_transition template for best results

    Core Nodes

    VLM Configuration & Processing

    VLM Provider Config - Set your API endpoint and model

    • Point to any OpenAI-compatible endpoint (local or cloud, although some features only work with heylookitsanllm)
    • Auto-detects server capabilities for optimal performance via the /capabilities endpoint
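
    As a rough sketch, the capability probe from the client side might look like this - the /capabilities path is the one mentioned above, but the response fields checked here are assumptions for illustration:

        import requests

        def detect_capabilities(base_url):
            try:
                resp = requests.get(f"{base_url}/capabilities", timeout=5)
                resp.raise_for_status()
                caps = resp.json()
            except requests.RequestException:
                # Plain OpenAI-compatible server: fall back to base64 images, no server-side resize
                caps = {}
            return {
                "multipart_images": caps.get("multipart_images", False),  # assumed field name
                "server_resize": caps.get("server_resize", False),        # assumed field name
            }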

    Shrug Prompter - Main VLM interface with template support

    • Smart batch processing - handles single images or batches automatically
    • Built-in response cleanup (trim spaces, fix unicode, etc)
    • Server-side / pushdown image resizing for ~33% faster processing
    • Template support for consistent prompting
    • Debug mode shows raw API requests/responses
    • Smart JSON parsing - automatically extracts prompts from various response formats
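
    For a sense of what the JSON parsing does, here is an illustrative sketch (not the node's actual code) that accepts a plain string, a JSON object with a likely key, or a JSON list; the key names are assumptions:

        import json

        def extract_prompt(raw):
            text = raw.strip()
            try:
                parsed = json.loads(text)
            except json.JSONDecodeError:
                return text  # plain-text response, use as-is
            if isinstance(parsed, dict):
                for key in ("prompt", "caption", "description", "text"):  # assumed key names
                    if key in parsed:
                        return str(parsed[key]).strip()
            if isinstance(parsed, list) and parsed:
                return str(parsed[0]).strip()
            return text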

    VLM Image Processor - All-in-one image prep

    • Handles any aspect ratio and size
    • Smart memory management
    • Optional preprocessing for specific models
    • Batch and sequential processing
    • Supports 1, 2, 4, or however many images make sense for your use case and the model's capabilities

    VLM Image Passthrough - Zero-copy alternative

    • Use this when you don't need preprocessing
    • Passes images directly to VLM without copies (meaning no memory overhead, theoretically)

    Video & Frame Management

    Video Frame Pair Extractor - Get consecutive frame pairs for interpolation

    • Works inside ForLoop structures
    • Handles edge cases (like odd frame counts)
    • Outputs start/end frames for each transition
    • Intended for VACE workflows

    Smart Image Range Extractor - Easier frame extraction

    • Should avoid failures on out-of-bounds indices
    • Handles single images, empty batches
    • Works with dynamic loop indices

    Video Segment Assembler - Reassemble video segments

    • Multiple streaming modes
    • Handles overlapping segments
    • Preserves temporal coherence

    Loop & Accumulation

    Loop Aware VLM Accumulator - Collect results across loop iterations

    • Works properly inside ForLoop structures
    • Handles both single and batch responses
    • Optional reset to clear previous runs
    • Extracts cleaned responses automatically

    Loop Aware Response Iterator - Access accumulated results by index

    • Syncs with ForLoop indices
    • Handles empty results gracefully
    • Backward compatible with old workflows

    Loop Safe Accumulator - Original accumulator for simple cases

    • Use when you need basic accumulation
    • No persistence between runs

    Text Processing

    Text Cleanup - Clean LLM responses

    • Remove leading/trailing spaces
    • Fix unicode characters (smart quotes → regular quotes)
    • Remove newlines, collapse whitespace
    • Strip to ASCII-only if needed
    • Custom replacements via simple patterns
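
    Roughly what those steps amount to, as a Python sketch (the node's exact behavior may differ):

        import re
        import unicodedata

        def cleanup(text, strict_ascii=False):
            # Smart quotes and long dashes -> plain ASCII equivalents
            for bad, good in {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'",
                              "\u2014": "-", "\u2013": "-"}.items():
                text = text.replace(bad, good)
            # Collapse newlines and runs of whitespace, trim the ends
            text = re.sub(r"\s+", " ", text).strip()
            if strict_ascii:
                # Drop anything that can't survive an ASCII round-trip
                text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
            return text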

    Text List Cleanup - Same but for lists of text

    • Process batch responses
    • Join with custom separators (like "|" for WAN)
    • Maintain order and indexing

    Text List Indexer - Extract single item from list

    • Essential for connecting VLM lists to nodes expecting strings
    • Handles out-of-bounds gracefully

    Utilities

    Prompt Template Loader - Load markdown templates

    • Searches templates/ directory recursively
    • Supports YAML frontmatter for metadata
    • Templates can be few-shot or zero-shot
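
    A sketch of how such a loader can work (the shipped node may differ in details) - find the file recursively, then split the YAML frontmatter from the prompt body:

        from pathlib import Path

        def load_template(name, root="templates"):
            path = next(Path(root).rglob(f"{name}.md"))  # recursive search under templates/
            text = path.read_text(encoding="utf-8")
            metadata, body = {}, text
            if text.startswith("---"):
                # Frontmatter layout: ---\nkey: value\n---\n<prompt body>
                _, frontmatter, body = text.split("---", 2)
                for line in frontmatter.strip().splitlines():
                    if ":" in line:
                        key, _, value = line.partition(":")
                        metadata[key.strip()] = value.strip()
            return metadata, body.strip()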

    Auto Memory Manager - Automatic cleanup

    • Likely will cause more harm than good given how much memory management happens in the background in most good nodes already, but here just in case
    • Place after heavy operations
    • Multiple aggression levels
    • Works with PyTorch and system memory

    Advanced VLM Sampler - Fine-tune generation

    • Extra parameters like top_k, repetition_penalty
    • Processing modes for different use cases
    • Returns config dict for reuse
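
    The config dict is just sampling parameters bundled for reuse; a hypothetical shape (only top_k and repetition_penalty are named above, the rest are common OpenAI-style fields shown as examples):

        sampler_config = {
            "temperature": 0.7,
            "top_p": 0.9,
            "top_k": 40,                 # mentioned above
            "repetition_penalty": 1.1,   # mentioned above
            "max_tokens": 512,
        }
        # Reuse the same dict across multiple prompter calls instead of re-entering values per node.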

    Templates

    Pre-built prompt templates in templates/:

    For WAN/VACE Workflows:

    • wan_vace_transition.md - Frame-to-frame transitions (recommended for VACE)
    • wan_vace_frame_description.md - Single frame analysis
    • wan_vace_batch_transitions.md - Batch process N frames → N-1 transitions
    • wan_prompt_rewriter_qwen.md - Image-grounded prompt enhancement

    General Purpose:

    • cinematographer.md - Updated for WAN-style descriptions
    • feel free to add your own or have any LLM modify existing ones to fit your needs

    Tips & Best Practices

    Getting Started

    • Start heylookitsanllm first: heylookllm --api openai (or use your own)
    • Use VLM models with vision support (look for "(Vision)" in the dropdown)
    • For video workflows: keyframes → frame pairs → prompts → video
    • Memory is managed automatically - just connect and go

    Performance Tips

    • Use VLMImagePassthrough instead of VLMImageProcessor when you don't need preprocessing
    • Enable batch_mode=true in ShrugPrompter for multiple independent images
    • Set resize_mode="max" with resize_value=512 for fast processing
    • The multipart endpoint is auto-detected and 57ms faster per image

    Multi-Image Handling in ShrugPrompter

    batch_mode=false (default):

    • All images sent in ONE message: [text, image1, image2, image3...]
    • VLM sees all images simultaneously in the same context
    • Perfect for: "Compare these frames", "Analyze the transition", "Describe changes between images"
    • Use this for WAN VACE frame-to-frame transitions

    batch_mode=true:

    • Each image gets its own separate API call
    • Returns multiple independent responses
    • Good for: Processing many unrelated images efficiently
    • Each image is analyzed in isolation

    Example for frame transitions:

    StartFrame → ImageBatch → ShrugPrompter (batch_mode=false, template=wan_vace_transition.md)
    EndFrame   ↗
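
    At the API level, the difference between the two modes amounts to the following sketch (standard OpenAI-style vision chat payloads with base64 data URLs; the node's internals may differ):

        def one_message_all_images(prompt, images_b64):
            # batch_mode=false: one request, every image in the same message and context
            content = [{"type": "text", "text": prompt}]
            content += [{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
                        for b64 in images_b64]
            return [{"role": "user", "content": content}]  # one response covering all images

        def one_request_per_image(prompt, images_b64):
            # batch_mode=true: N independent requests, N independent responses
            return [one_message_all_images(prompt, [b64]) for b64 in images_b64]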
    

    Working with Loops

    • Always connect FLOW_CONTROL from ForLoopOpen to ForLoopClose
    • Use reset=true on accumulators if you don't want persistence
    • Place AutoMemoryManager after heavy operations in loops
    • Loop indices are 0-based, plan accordingly

    Text Cleanup

    • response_cleanup="standard" should work for most cases, but validate your pre-cleanup and post-cleanup text in the server log
    • "basic" just trims whitespace
    • "strict" removes all non-ASCII characters
    • Custom cleanup with TextCleanup node for specific needs
    • Batch works too

    WAN VACE Workflows

    • VACE generates smooth video between keyframes
    • Use wan_vace_transition.md template for frame pairs
    • Each prompt describes the journey from frame A to frame B
    • Include all visible elements: subjects, objects, background
    • The text encoder expects detailed, grounded descriptions
    • For N keyframes, generate N-1 transition prompts
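
    The keyframe-to-pair bookkeeping is simple enough to spell out as a sketch:

        def transition_pairs(keyframes):
            # N keyframes -> N-1 consecutive (start, end) pairs: [k0, k1, k2] -> [(k0, k1), (k1, k2)]
            return list(zip(keyframes, keyframes[1:]))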

    Debugging

    • Enable debug_mode=true in ShrugPrompter to see API calls
    • Add ShowText nodes to inspect intermediate values
    • Check accumulator debug_info output for state tracking
    • Timeout: Default 5 minutes per request (configurable in ShrugPrompter node)

    Requirements

    • ComfyUI
    • LLM / VLM server (heylookitsanllm recommended since it's what I'm driving everything from)
      • llama.cpp & mlx + mlx-vlm unified
    • Vision-capable model (qwen2-vl, gemma3n, mistral small, etc)

    Whew, that's a lot. Hope this is helpful for some people, despite it likely being very niche.