ComfyUI-VideoDescription
Video description custom nodes for ComfyUI powered by advanced vision-language models.
Features
- Full video analysis with Qwen3-VL-8B-Instruct
- Alternative video analysis with NVIDIA DAM-3B-Video (CUDA only)
- Optimized inference with model caching
- Multiple analysis types (detailed, summary, keywords/action)
- Smart path resolution for easy video loading
- Easy integration with ComfyUI workflows
Installation
Step 1: Install Base Dependencies
cd ComfyUI/custom_nodes/ComfyUI-VideoDescription
pip install -r requirements.txt
This installs:
- transformers, tokenizers, accelerate (for Qwen3-VL)
- qwen-vl-utils (Qwen3-VL helper library)
Step 2: Install describe-anything (for DAM Node)
⚠️ IMPORTANT: describe-anything must be installed separately with the --no-deps flag to avoid dependency conflicts.
pip install --no-deps git+https://github.com/NVlabs/describe-anything.git
Why --no-deps?
- describe-anything pins `numpy<2.0.0` and `pydantic<=2.10.6` in its pyproject.toml
- ComfyUI uses `numpy 2.x` and `pydantic 2.11+`
- Installing with dependencies would break your ComfyUI environment
- However, describe-anything works fine with the newer versions (tested and verified)
- Using `--no-deps` skips the dependency check and preserves your ComfyUI packages
Complete Installation:
cd ComfyUI/custom_nodes
git clone https://github.com/IXIWORKS-KIMJUNGHO/ComfyUI-VideoDescription.git
cd ComfyUI-VideoDescription
# Install ONLY the required packages
# (ComfyUI already has torch, transformers, numpy, etc.)
pip install qwen-vl-utils opencv-python
# For DAM support (CUDA only) - use --no-deps flag
pip install --no-deps git+https://github.com/NVlabs/describe-anything.git
⚠️ IMPORTANT: In an existing ComfyUI environment, do NOT run `pip install -r requirements.txt` directly!
It will conflict with ComfyUI's existing packages. Install only the specific packages listed above.
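To confirm the manual install left the environment intact, the installed versions can be checked from ComfyUI's Python environment. This is a hedged sketch (not part of the extension); the package list mirrors the commands above plus the two pins that `--no-deps` protects:

```python
import importlib.metadata as md

def check_packages(names):
    """Return {distribution name: installed version, or None if missing}."""
    status = {}
    for name in names:
        try:
            status[name] = md.version(name)
        except md.PackageNotFoundError:
            status[name] = None
    return status

# Packages installed by the commands above, plus numpy/pydantic
# (which --no-deps is meant to leave untouched)
for pkg, ver in check_packages(
    ["qwen-vl-utils", "opencv-python", "numpy", "pydantic"]
).items():
    print(f"{pkg}: {ver or 'MISSING'}")
```

If `numpy` reports a 2.x version and `pydantic` a 2.11+ version after installing describe-anything, the `--no-deps` install worked as intended.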
Model Download
Important: You need to download the models before using these nodes.
Model Locations
Models will be downloaded to:
ComfyUI/models/video_description/
├── Qwen3-VL-8B-Instruct/   # Full video analysis model
└── DAM-3B-Video/           # Alternative video analysis model (CUDA only)
This follows ComfyUI's standard model organization structure.
Option 1: Pre-download (Recommended)
Download the models before first use to avoid waiting:
cd ComfyUI/custom_nodes/ComfyUI-VideoDescription
python download_models.py
This will download models to ComfyUI/models/video_description/
Option 2: Automatic Download
Models will download automatically on first use. However, this will cause delays:
- Qwen3-VL: 10-30 minute delay (16GB download)
- DAM: 5-15 minute delay (7GB download)
Download behavior:
- Node checks model directory
- If not found: Downloads from Hugging Face
- If found: Loads directly (fast!)
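The check-then-download behavior above can be sketched as a small helper. The function and directory names here are illustrative (the node's actual implementation may differ), though `snapshot_download` is the standard Hugging Face Hub API for fetching a model repository:

```python
from pathlib import Path

MODELS_DIR = Path("ComfyUI/models/video_description")

def ensure_model(repo_id: str, local_name: str) -> Path:
    """Return the local model directory, downloading from Hugging Face
    only when it is not already present (so cached runs stay fast)."""
    target = MODELS_DIR / local_name
    if target.exists():
        return target  # found: load directly (fast path)
    # not found: fetch the full repository snapshot (slow, first run only)
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

On the first run this pays the full download cost; on every later run the `exists()` check short-circuits and the model loads straight from disk.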
Disk Space Requirements
- Qwen3-VL-8B (FP16): ~16GB
- DAM-3B-Video (FP16): ~7GB (CUDA only)
- With 4-bit quantization: ~12GB total
- Recommended free space: 30GB+ (for both models)
Hardware Requirements
Qwen3-VL Node:
- ✅ Works on: NVIDIA CUDA, Apple Silicon (MPS), CPU
- Recommended: 16GB+ RAM, GPU with 8GB+ VRAM
DAM Node:
- ⚠️ CUDA ONLY: Requires NVIDIA GPU with CUDA support
- ❌ Does NOT work on: Mac (MPS), CPU-only systems, AMD GPUs
- Recommended: NVIDIA GPU with 8GB+ VRAM, Linux/Windows OS
Quick Start
Usage Example
1. Place your video in the `ComfyUI/input/` directory (recommended)
   - Example: `cp my_video.mp4 ComfyUI/input/`
2. Add the node in ComfyUI
   - Look for the "video" category in the node menu
   - Add the "Video Description (Qwen3-VL)" node
3. Configure the node
   - `video_path`: enter just the filename (e.g., `my_video.mp4`)
   - The node automatically searches in the `ComfyUI/input/` directory
   - Subfolders are supported: `videos/my_video.mp4`
   - Or use an absolute path: `/full/path/to/video.mp4`
4. Set your prompt and parameters
5. Run to generate the description
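The path resolution described in step 3 can be sketched roughly like this (the function name and search order are assumptions inferred from the behavior documented above, not the repo's actual code):

```python
from pathlib import Path

INPUT_DIR = Path("ComfyUI/input")

def resolve_video_path(video_path: str) -> Path:
    """Absolute paths pass through unchanged; relative paths (including
    subfolders like videos/clip.mp4) are looked up under ComfyUI/input/."""
    p = Path(video_path)
    if p.is_absolute():
        return p
    return INPUT_DIR / p

print(resolve_video_path("my_video.mp4"))       # ComfyUI/input/my_video.mp4
print(resolve_video_path("videos/scene1.mp4"))  # ComfyUI/input/videos/scene1.mp4
```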
Testing the Node
1. Restart ComfyUI
2. Copy a test video to `ComfyUI/input/`
3. Add the "Video Description (Qwen3-VL)" node
4. Enter the video filename in `video_path` (e.g., `test.mp4`)
5. Run the workflow
Current Nodes
Video Description (Qwen3-VL) - v1.2.0
Status: Fully functional with smart path resolution and analysis type presets
Required Inputs:
- `video_path` (STRING): Path to the video file
  - Just a filename: `video.mp4` → searches in `ComfyUI/input/`
  - Subfolder: `videos/scene1.mp4` → searches in `ComfyUI/input/videos/`
  - Absolute path: `/full/path/to/video.mp4`
- `analysis_type` (DROPDOWN): Type of video analysis
  - detailed: Comprehensive description with rich details (384 tokens, temp 0.7)
  - summary: Brief 2-3 sentence overview (128 tokens, temp 0.5)
  - keywords: Structured metadata extraction (256 tokens, temp 0.3)
  - Default: "detailed"
- `fps` (FLOAT): Frames per second for video sampling
  - Default: 1.0, Min: 0.1, Max: 30.0
  - Higher FPS = more frames analyzed = slower but more detailed
  - Recommended: 0.5-1.0 for most videos
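To make the fps trade-off concrete, here is a rough sketch of how a sampling rate maps to frame indices (an illustration under the assumption of evenly spaced sampling, not the node's actual code):

```python
def fps_sample_indices(video_fps: float, total_frames: int, sample_fps: float) -> list:
    """Indices of the frames kept when sampling a video at sample_fps.
    Higher sample_fps -> more indices -> slower but more detailed analysis."""
    if sample_fps <= 0:
        raise ValueError("sample_fps must be positive")
    step = video_fps / sample_fps          # source frames per sampled frame
    count = max(int(total_frames / step), 1)
    return [int(i * step) for i in range(count)]

# A 3-second clip at 30 fps, sampled at the default 1.0 fps -> 3 frames
print(fps_sample_indices(30.0, 90, 1.0))   # [0, 30, 60]
```

Doubling `sample_fps` roughly doubles the number of frames the model must process, which is why 0.5-1.0 is recommended for most videos.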
Optional Inputs:
- `custom_prompt` (STRING): Custom analysis prompt
  - Overrides the analysis_type preset
  - Use when you need specific questions answered
  - Default: "" (uses the analysis_type preset)
- `use_4bit` (BOOLEAN): Enable 4-bit quantization
  - Default: False
  - Reduces VRAM usage from ~16GB to ~8GB
- `temperature` (FLOAT): Text generation creativity (LLM sampling parameter)
  - Default: 0.7, Min: 0.0, Max: 1.0
  - Only used when custom_prompt is provided
  - Note: this is NOT the same as "denoise" in image generation
Outputs:
- `description` (STRING): Generated video description
- `info` (STRING): Processing information (duration, resolution, FPS, etc.)
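For readers extending the node, the inputs and outputs above correspond to the standard ComfyUI custom-node interface, roughly like this (a simplified sketch; the class and method names in this repo may differ):

```python
class VideoDescriptionQwen:
    """Sketch of the node interface implied by the parameter tables above."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "video_path": ("STRING", {"default": ""}),
                "analysis_type": (["detailed", "summary", "keywords"],),
                "fps": ("FLOAT", {"default": 1.0, "min": 0.1, "max": 30.0}),
            },
            "optional": {
                "custom_prompt": ("STRING", {"default": ""}),
                "use_4bit": ("BOOLEAN", {"default": False}),
                "temperature": ("FLOAT", {"default": 0.7, "min": 0.0, "max": 1.0}),
            },
        }

    RETURN_TYPES = ("STRING", "STRING")
    RETURN_NAMES = ("description", "info")
    FUNCTION = "describe"   # name of the method ComfyUI calls to execute the node
    CATEGORY = "video"      # where the node appears in the node menu
```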
How It Works:
- Resolves video path (searches in ComfyUI/input/ if relative)
- Validates video file format and accessibility
- Loads Qwen3-VL model (cached after first load)
- Extracts frames at specified FPS rate
- Generates natural language description using Vision-Language Model
- Returns description and metadata
Performance:
- 3-second video: ~14 seconds (model loading + inference)
- 30-second video: ~130 seconds
- First run includes model loading (5-6 seconds), subsequent runs reuse cached model
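The caching behind those timings is a classic singleton pattern; a minimal sketch follows (the repo's loader is more involved, but the idea is the same):

```python
_MODEL_CACHE = {}

def get_cached_model(model_id: str, loader):
    """Load a model the first time it is requested, then reuse the
    in-memory instance on every subsequent run."""
    if model_id not in _MODEL_CACHE:
        _MODEL_CACHE[model_id] = loader(model_id)   # slow path: first run only
    return _MODEL_CACHE[model_id]

loads = []
fake_loader = lambda mid: loads.append(mid) or f"model<{mid}>"
get_cached_model("Qwen3-VL-8B-Instruct", fake_loader)
get_cached_model("Qwen3-VL-8B-Instruct", fake_loader)
print(len(loads))   # 1 — the second call hit the cache
```

This is why the first run pays the 5-6 second loading cost and later runs skip straight to inference.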
Video Description (DAM) - v1.1.0
Status: Fully functional - Full video analysis (CUDA Required)
Description: Uses NVIDIA DAM-3B-Video for detailed descriptions of entire video content.
⚠️ IMPORTANT: This node requires an NVIDIA CUDA GPU. It does NOT work on:
- Mac (Apple Silicon MPS)
- CPU-only systems
- AMD GPUs
Use this node only on Linux/Windows systems with NVIDIA CUDA GPUs.
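A guard like the following is the usual way to fail fast on unsupported hardware (a hedged sketch, not the node's exact code):

```python
def dam_supported() -> bool:
    """DAM requires CUDA: returns False on Mac/MPS, CPU-only, or AMD setups."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()

if not dam_supported():
    print("DAM node unavailable: NVIDIA CUDA GPU required")
```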
Required Inputs:
- `video_path` (STRING): Path to the video file (same as Qwen3-VL)
  - Just a filename: `video.mp4` → searches in `ComfyUI/input/`
  - Subfolder: `videos/scene1.mp4` → searches in `ComfyUI/input/videos/`
  - Absolute path: `/full/path/to/video.mp4`
- `analysis_type` (DROPDOWN): Type of video analysis
  - detailed: Comprehensive full video description (512 tokens, temp 0.2)
  - summary: Brief 2-3 sentence overview (256 tokens, temp 0.2)
  - action: Focus on actions/movements (384 tokens, temp 0.2)
Optional Inputs:
- `custom_prompt` (STRING): Custom analysis prompt
- `max_frames` (INT): Maximum frames to process (default: 8, range: 1-32)
- `use_4bit` (BOOLEAN): Enable 4-bit quantization (default: False)
- `temperature` (FLOAT): Sampling temperature (default: 0.2)
Outputs:
- `description` (STRING): Full video description
- `info` (STRING): Processing metadata
How It Works:
- Resolves video path and validates file
- Loads DAM-3B-Video model (cached after first load)
- Samples frames uniformly across video
- Analyzes entire video frame (full-frame mask)
- Generates comprehensive description of video content
- Returns detailed analysis
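Step 3's uniform sampling can be sketched as follows (illustrative; the actual frame selection may differ):

```python
def uniform_frame_indices(total_frames: int, max_frames: int = 8) -> list:
    """Pick up to max_frames indices spread evenly across the video,
    matching the node's default of 8 frames (range 1-32)."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

# An 8-second clip at 30 fps (240 frames) reduced to 8 evenly spaced frames
print(uniform_frame_indices(240, 8))   # [0, 30, 60, 90, 120, 150, 180, 210]
```

Unlike the Qwen3-VL node's fps-based sampling, this caps the work per video at `max_frames` regardless of duration.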
Use Cases:
- Comprehensive video content analysis
- Scene understanding and description
- Action and event detection
- Alternative to Qwen3-VL with different model capabilities
Performance:
- Model size: ~7GB
- Inference time: ~10-20 seconds (8 frames)
- Works independently from Qwen3-VL node
Roadmap
Phase 1: Qwen3-VL Integration ✅
- ✅ Model loader implementation with caching
- ✅ Video validation and info extraction
- ✅ Inference pipeline with Qwen3-VL
- ✅ Error handling and cleanup
- ✅ Smart path resolution with ComfyUI/input/ search
- ✅ Performance optimization (removed tensor conversion overhead)
- ✅ Analysis type presets (detailed/summary/keywords)
Phase 2: DAM Full Video Analysis ✅
- ✅ DAM model loader with singleton caching
- ✅ Full video inference wrapper
- ✅ ComfyUI node integration
- ✅ Automatic full-frame mask generation
- ✅ Multi-frame video processing
- ✅ Analysis type presets (detailed/summary/action)
- ✅ Simplified node interface (removed region_points parameter)
Phase 3: Advanced Features (Planned)
- [ ] Dual-model combination node (Qwen3-VL + DAM)
- [ ] Batch processing support for multiple videos
- [ ] Video timestamp-based analysis
- [ ] Advanced prompt templates library
- [ ] Performance optimization and caching improvements
Phase 4: Production Ready (Future)
- [ ] Comprehensive testing
- [ ] Example workflows
- [ ] Performance benchmarks
- [ ] Community deployment
Requirements
- Python 3.8+
- PyTorch 2.0+
- NVIDIA GPU with 16GB+ VRAM (recommended)
- ComfyUI
Dependencies
See requirements.txt for full list:
- torch>=2.0.0
- transformers>=4.50.0
- accelerate>=0.20.0
- qwen-vl-utils
- opencv-python
- pillow
- numpy
Hardware Recommendations
| Configuration | GPU | VRAM | Performance |
|---------------|-----|------|-------------|
| Minimum | RTX 3060 | 12GB | Basic (with quantization) |
| Recommended | RTX 4090 | 24GB | Good (FP16) |
| Optimal | A100 | 40GB+ | Excellent (FP16, batch) |
License
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Support
- GitHub Issues: Report bugs or request features
- Documentation: Check README and code comments
Acknowledgments
- ComfyUI - Node-based UI for Stable Diffusion
- Qwen-VL - Vision-language model by Alibaba
- NVIDIA DAM - Description Anything Model
Changelog
v1.1.1 (2025-10-28)
- ✅ DAM Node: Removed region_points parameter for a simplified interface
- ✅ DAM Node: Changed to full video analysis (automatic full-frame mask)
- ✅ DAM Node: Updated prompts from "marked region" to "this video"
- ✅ Updated installation guide with --no-deps explanation for describe-anything
- ✅ Fixed dependency conflicts between describe-anything and ComfyUI
- ✅ Fixed DAM model parameter names and prompt format (`<image>` tag)
- ✅ Fixed video path resolution for ComfyUI Desktop App custom directories
- ✅ Node display name updated: "DAM Region" → "DAM"
v1.2.0 (2025-10-22)
- ✅ Analysis type presets: detailed / summary / keywords
- ✅ Optimized prompts and parameters for each analysis type
- ✅ Custom prompt support with override capability
- ✅ Automatic max_tokens and temperature configuration per type
- ✅ Enhanced USAGE.md with analysis type examples
v1.1.0 (2025-10-22)
- ✅ Smart video path resolution (auto-search in ComfyUI/input/)
- ✅ Performance optimization: removed tensor conversion (13x faster)
- ✅ Simplified to path-only input (cleaner architecture)
- ✅ Updated documentation with temperature vs denoise explanation
- ✅ Added comprehensive usage guide (USAGE.md)
v1.0.0 (2025-10-22)
- ✅ Full Qwen3-VL integration
- ✅ Model caching with singleton pattern
- ✅ 4-bit quantization support
- ✅ ComfyUI standard model directory structure
- ✅ Comprehensive error handling
v0.1.0-alpha (2025-10-21)
- Initial project structure
- Basic node skeleton
- Node registration system
- Documentation framework