ComfyUI Extension: ComfyUI-VideoDescription

Authored by IXIWORKS-KIMJUNGHO

Created

Updated

0 stars

Video description custom nodes for ComfyUI powered by advanced vision-language models.

Custom Nodes (0)

    README

    ComfyUI-VideoDescription

    Video description custom nodes for ComfyUI powered by advanced vision-language models.

    Features

    • šŸŽ„ Full Video Analysis with Qwen3-VL-8B-Instruct
    • šŸŽÆ Alternative Video Analysis with NVIDIA DAM-3B-Video (CUDA only)
    • ⚔ Optimized inference with model caching
    • šŸ“Š Multiple analysis types (detailed, summary, keywords/action)
    • šŸš€ Smart path resolution for easy video loading
    • šŸ”§ Easy integration with ComfyUI workflows

    Installation

    Step 1: Install Base Dependencies

    cd ComfyUI/custom_nodes/ComfyUI-VideoDescription
    pip install -r requirements.txt
    

    This installs:

    • transformers, tokenizers, accelerate (for Qwen3-VL)
    • qwen-vl-utils (Qwen3-VL helper library)

    Step 2: Install describe-anything (for DAM Node)

    āš ļø IMPORTANT: describe-anything must be installed separately with --no-deps flag to avoid dependency conflicts.

    pip install --no-deps git+https://github.com/NVlabs/describe-anything.git
    

    Why --no-deps?

    • describe-anything requires numpy<2.0.0 and pydantic<=2.10.6 in its pyproject.toml
    • ComfyUI uses numpy 2.x and pydantic 2.11+
    • Installing with dependencies would break your ComfyUI environment
    • However, describe-anything works fine with newer versions (tested and verified)
    • Using --no-deps skips the dependency check and preserves your ComfyUI packages

    Complete Installation:

    cd ComfyUI/custom_nodes
    git clone https://github.com/IXIWORKS-KIMJUNGHO/ComfyUI-VideoDescription.git
    cd ComfyUI-VideoDescription
    
    # Install ONLY the required packages
    # (ComfyUI already has torch, transformers, numpy, etc.)
    pip install qwen-vl-utils opencv-python
    
    # For DAM support (CUDA only) - use --no-deps flag
    pip install --no-deps git+https://github.com/NVlabs/describe-anything.git
    

    āš ļø IMPORTANT: Do NOT run pip install -r requirements.txt directly! This will conflict with ComfyUI's existing packages. Install only the specific packages listed above.

    Model Download

    Important: You need to download the models before using these nodes.

    Model Locations

    Models will be downloaded to:

    ComfyUI/models/video_description/
    ā”œā”€ā”€ Qwen3-VL-8B-Instruct/    # Full video analysis model
    └── DAM-3B-Video/             # Region-based analysis model
    

    This follows ComfyUI's standard model organization structure.

    Option 1: Pre-download (Recommended)

    Download the models before first use to avoid waiting:

    cd ComfyUI/custom_nodes/ComfyUI-VideoDescription
    python download_models.py
    

    This will download models to ComfyUI/models/video_description/

    Option 2: Automatic Download

    Models will download automatically on first use. However, this will cause delays:

    • Qwen3-VL: 10-30 minute delay (16GB download)
    • DAM: 5-15 minute delay (7GB download)

    Download behavior:

    1. Node checks model directory
    2. If not found: Downloads from Hugging Face
    3. If found: Loads directly (fast!)

    Disk Space Requirements

    • Qwen3-VL-8B (FP16): ~16GB
    • DAM-3B-Video (FP16): ~7GB (CUDA only)
    • With 4-bit quantization: ~12GB total
    • Recommended free space: 30GB+ (for both models)

    Hardware Requirements

    Qwen3-VL Node:

    • āœ… Works on: NVIDIA CUDA, Apple Silicon (MPS), CPU
    • Recommended: 16GB+ RAM, GPU with 8GB+ VRAM

    DAM Node:

    • āš ļø CUDA ONLY: Requires NVIDIA GPU with CUDA support
    • āŒ Does NOT work on: Mac (MPS), CPU-only, AMD GPUs
    • Recommended: NVIDIA GPU with 8GB+ VRAM, Linux/Windows OS

    Quick Start

    Usage Example

    1. Place your video in ComfyUI/input/ directory (Recommended)

      # Example: Copy video to input folder
      cp my_video.mp4 ComfyUI/input/
      
    2. Add the node in ComfyUI

      • Look for "video" category in the node menu
      • Add "Video Description (Qwen3-VL)" node
    3. Configure the node

      • video_path: Enter just the filename (e.g., my_video.mp4)
        • The node automatically searches in ComfyUI/input/ directory
        • Supports subfolders: videos/my_video.mp4
        • Or use absolute path: /full/path/to/video.mp4
      • Set your prompt and parameters
      • Run to generate description

    Testing the Node

    1. Restart ComfyUI
    2. Copy a test video to ComfyUI/input/
    3. Add "Video Description (Qwen3-VL)" node
    4. Enter the video filename in video_path (e.g., test.mp4)
    5. Run the workflow

    Current Nodes

    Video Description (Qwen3-VL) - v1.2.0

    Status: Fully functional with smart path resolution and analysis type presets

    Required Inputs:

    • video_path (STRING): Path to video file
      • Just filename: video.mp4 → searches in ComfyUI/input/
      • Subfolder: videos/scene1.mp4 → searches in ComfyUI/input/videos/
      • Absolute path: /full/path/to/video.mp4
    • analysis_type (DROPDOWN): Type of video analysis
      • detailed: Comprehensive description with rich details (384 tokens, temp 0.7)
      • summary: Brief 2-3 sentence overview (128 tokens, temp 0.5)
      • keywords: Structured metadata extraction (256 tokens, temp 0.3)
      • Default: "detailed"
    • fps (FLOAT): Frames per second for video sampling
      • Default: 1.0, Min: 0.1, Max: 30.0
      • Higher FPS = more frames analyzed = slower but more detailed
      • Recommended: 0.5-1.0 for most videos

    Optional Inputs:

    • custom_prompt (STRING): Custom analysis prompt
      • Overrides the analysis_type preset
      • Use when you need specific questions answered
      • Default: "" (uses analysis_type preset)
    • use_4bit (BOOLEAN): Enable 4-bit quantization
      • Default: False
      • Reduces VRAM usage from ~16GB to ~8GB
    • temperature (FLOAT): Text generation creativity (LLM parameter)
      • Default: 0.7, Min: 0.0, Max: 1.0
      • Only used when custom_prompt is provided
      • Note: This is NOT the same as "denoise" in image generation

    Outputs:

    • description (STRING): Generated video description
    • info (STRING): Processing information (duration, resolution, FPS, etc.)

    How It Works:

    1. Resolves video path (searches in ComfyUI/input/ if relative)
    2. Validates video file format and accessibility
    3. Loads Qwen3-VL model (cached after first load)
    4. Extracts frames at specified FPS rate
    5. Generates natural language description using Vision-Language Model
    6. Returns description and metadata

    Performance:

    • 3-second video: ~14 seconds (model loading + inference)
    • 30-second video: ~130 seconds
    • First run includes model loading (5-6 seconds), subsequent runs reuse cached model

    Video Description (DAM) - v1.1.0

    Status: Fully functional - Full video analysis (CUDA Required)

    Description: Uses NVIDIA DAM-3B-Video for detailed descriptions of entire video content.

    āš ļø IMPORTANT: This node requires NVIDIA CUDA GPU. It does NOT work on:

    • Mac (Apple Silicon MPS)
    • CPU-only systems
    • AMD GPUs

    Use this node only on Linux/Windows systems with NVIDIA CUDA GPUs.

    Required Inputs:

    • video_path (STRING): Path to video file (same as Qwen3-VL)
      • Just filename: video.mp4 → searches in ComfyUI/input/
      • Subfolder: videos/scene1.mp4 → searches in ComfyUI/input/videos/
      • Absolute path: /full/path/to/video.mp4
    • analysis_type (DROPDOWN): Type of video analysis
      • detailed: Comprehensive full video description (512 tokens, temp 0.2)
      • summary: Brief 2-3 sentence overview (256 tokens, temp 0.2)
      • action: Focus on actions/movements (384 tokens, temp 0.2)

    Optional Inputs:

    • custom_prompt (STRING): Custom analysis prompt
    • max_frames (INT): Maximum frames to process (default: 8, range: 1-32)
    • use_4bit (BOOLEAN): Enable 4-bit quantization (default: False)
    • temperature (FLOAT): Sampling temperature (default: 0.2)

    Outputs:

    • description (STRING): Full video description
    • info (STRING): Processing metadata

    How It Works:

    1. Resolves video path and validates file
    2. Loads DAM-3B-Video model (cached after first load)
    3. Samples frames uniformly across video
    4. Analyzes entire video frame (full-frame mask)
    5. Generates comprehensive description of video content
    6. Returns detailed analysis

    Use Cases:

    • Comprehensive video content analysis
    • Scene understanding and description
    • Action and event detection
    • Alternative to Qwen3-VL with different model capabilities

    Performance:

    • Model size: ~7GB
    • Inference time: ~10-20 seconds (8 frames)
    • Works independently from Qwen3-VL node

    Roadmap

    Phase 1: Qwen3-VL Integration āœ…

    • āœ… Model loader implementation with caching
    • āœ… Video validation and info extraction
    • āœ… Inference pipeline with Qwen3-VL
    • āœ… Error handling and cleanup
    • āœ… Smart path resolution with ComfyUI/input/ search
    • āœ… Performance optimization (removed tensor conversion overhead)
    • āœ… Analysis type presets (detailed/summary/keywords)

    Phase 2: DAM Full Video Analysis āœ…

    • āœ… DAM model loader with singleton caching
    • āœ… Full video inference wrapper
    • āœ… ComfyUI node integration
    • āœ… Automatic full-frame mask generation
    • āœ… Multi-frame video processing
    • āœ… Analysis type presets (detailed/summary/action)
    • āœ… Simplified node interface (removed region_points parameter)

    Phase 3: Advanced Features (Planned)

    • [ ] Dual-model combination node (Qwen3-VL + DAM)
    • [ ] Batch processing support for multiple videos
    • [ ] Video timestamp-based analysis
    • [ ] Advanced prompt templates library
    • [ ] Performance optimization and caching improvements

    Phase 4: Production Ready (Future)

    • [ ] Comprehensive testing
    • [ ] Example workflows
    • [ ] Performance benchmarks
    • [ ] Community deployment

    Requirements

    • Python 3.8+
    • PyTorch 2.0+
    • NVIDIA GPU with 16GB+ VRAM (recommended)
    • ComfyUI

    Dependencies

    See requirements.txt for full list:

    • torch>=2.0.0
    • transformers>=4.50.0
    • accelerate>=0.20.0
    • qwen-vl-utils
    • opencv-python
    • pillow
    • numpy

    Hardware Recommendations

    | Configuration | GPU | VRAM | Performance | |--------------|-----|------|-------------| | Minimum | RTX 3060 | 12GB | Basic (with quantization) | | Recommended | RTX 4090 | 24GB | Good (FP16) | | Optimal | A100 | 40GB+ | Excellent (FP16, batch) |

    License

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0
    

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

    Support

    • GitHub Issues: Report bugs or request features
    • Documentation: Check README and code comments

    Acknowledgments

    • ComfyUI - Node-based UI for Stable Diffusion
    • Qwen-VL - Vision-language model by Alibaba
    • NVIDIA DAM - Description Anything Model

    Changelog

    v1.1.1 (2025-10-28)

    • āœ… DAM Node: Removed region_points parameter for simplified interface
    • āœ… DAM Node: Changed to full video analysis (automatic full-frame mask)
    • āœ… DAM Node: Updated prompts from "marked region" to "this video"
    • āœ… Updated installation guide with --no-deps explanation for describe-anything
    • āœ… Fixed dependency conflicts between describe-anything and ComfyUI
    • āœ… Fixed DAM model parameter names and prompt format (<image> tag)
    • āœ… Fixed video path resolution for ComfyUI Desktop App custom directories
    • āœ… Node display name updated: "DAM Region" → "DAM"

    v1.2.0 (2025-10-22)

    • āœ… Analysis type presets: detailed / summary / keywords
    • āœ… Optimized prompts and parameters for each analysis type
    • āœ… Custom prompt support with override capability
    • āœ… Automatic max_tokens and temperature configuration per type
    • āœ… Enhanced USAGE.md with analysis type examples

    v1.1.0 (2025-10-22)

    • āœ… Smart video path resolution (auto-search in ComfyUI/input/)
    • āœ… Performance optimization: removed tensor conversion (13x faster)
    • āœ… Simplified to path-only input (cleaner architecture)
    • āœ… Updated documentation with temperature vs denoise explanation
    • āœ… Added comprehensive usage guide (USAGE.md)

    v1.0.0 (2025-10-22)

    • āœ… Full Qwen3-VL integration
    • āœ… Model caching with singleton pattern
    • āœ… 4-bit quantization support
    • āœ… ComfyUI standard model directory structure
    • āœ… Comprehensive error handling

    v0.1.0-alpha (2025-10-21)

    • Initial project structure
    • Basic node skeleton
    • Node registration system
    • Documentation framework