ComfyUI Extension: ComfyUI-DeepseekOCR

Authored by Geo1230

Created

Updated

0 stars

A custom node that wraps DeepSeek-OCR as a ComfyUI plugin, providing powerful OCR recognition and document parsing capabilities.

Custom Nodes (0)

    README

    ComfyUI-DeepseekOCR

    English | δΈ­ζ–‡

    A custom node that wraps DeepSeek-OCR as a ComfyUI plugin, providing powerful OCR recognition and document parsing capabilities.

    Features

    Quick Start

    cd ComfyUI/custom_nodes/
    git clone https://github.com/Geo1230/ComfyUI-DeepseekOCR.git
    cd ComfyUI-DeepseekOCR
    

    Install Dependencies

    pip install -r requirements.txt
    

    Recommended transformers 4.46.3 If you encounter compatibility issues with transformers 4.55+, downgrade:

    pip install transformers==4.46.3 tokenizers==0.20.3
    

    Download Model

    Create directories and navigate:

    # 1. Navigate to ComfyUI's models directory
    cd ComfyUI\models
    
    # 2. Create deepseek-ocr directory (if it doesn't exist)
    mkdir deepseek-ocr
    cd deepseek-ocr
    
    # 3. Create model directory
    mkdir deepseek-ai_DeepSeek-OCR
    cd deepseek-ai_DeepSeek-OCR
    

    Download model to current directory:

    huggingface-cli download deepseek-ai/DeepSeek-OCR --local-dir . --repo-type model
    

    Note: Model will be downloaded to ComfyUI\models\deepseek-ocr\deepseek-ai_DeepSeek-OCR\ directory

    Or Use Automatic Download (Not recommended, less stable):

    Model will automatically download on first run of the Load node. Download progress is shown in the console.

    To disable automatic download, set environment variable:

    # Windows PowerShell
    $env:DPSK_AUTODOWNLOAD = "0"
    

    Usage

    Node 1: DeepSeek OCR: Load Model

    Loads and caches the model, outputs a model handle for use by the Run node.

    Parameters:

    • dtype: Data precision
      • bf16 (Recommended, default) - Balance of precision and performance
      • fp16 - Use when VRAM is insufficient
      • fp32 - Best compatibility but high VRAM usage
    • device: Runtime device (default: cuda)

    Node 2: DeepSeek OCR: Run

    Performs OCR inference and outputs recognized text.

    Parameters:

    • model: Model handle (from Load node)
    • image: Input image (ComfyUI IMAGE type)
    • task: Task mode
      • Free OCR: General OCR recognition
      • Convert to Markdown: Document to Markdown conversion
      • Parse Figure: Parse charts and figures
      • Locate by Reference: Locate specified objects (requires reference_text)
    • resolution: Resolution preset
      • Gundam (Recommended for long documents): 1024/640/crop/compress
      • Tiny: 512x512
      • Small: 640x640
      • Base: 1024x1024
      • Large: 1280x1280
    • output_type: Output type (determines what is returned)
      • all (default): Output both text and visualization image
      • text: Text only, image output is original image
      • image: Visualization image only (suitable for Locate task)
    • reference_text: (Optional) Only when task=Locate by Reference, description of object to locate
    • box_color: (Optional) Detection box color, default red
      • Preset colors: red, green, blue, yellow, cyan, magenta, white, black
      • Custom RGB: e.g., "255,0,0" (red), "0,255,0" (green)
    • box_width: (Optional) Detection box width, default 2 px, range 1-10

    Outputs:

    • text: Recognized text content (STRING)
      • Contains original markers (e.g., <|ref|>...<|/ref|><|det|>[[coordinates]]<|/det|>)
    • visualization: Visualization image (IMAGE)
      • Locate by Reference task: Image with custom-styled bounding boxes
      • Other tasks: Returns original input image

    Screenshots

    Usage Guide

    πŸ’‘ Output Type Selection

    • all (default): Output both text and visualization image
    • text: Text only (OCR/Markdown conversion)
    • image: Visualization image only (Locate task)

    🎯 Locate by Reference Task

    Parameter Configuration:

    • task: Select Locate by Reference
    • reference_text: Enter the object to locate
      • Chinese examples: "δ»·ζ Ό", "ζ ‡ι’˜", "二维码"
      • English examples: "the teacher", "price", "table", "logo"

    🎨 Custom Bounding Box Style

    Supported Preset Colors (16 types):

    | Color Name | RGB | Preview | Color Name | RGB | Preview | |------------|-----|---------|------------|-----|---------| | red | 255,0,0 | πŸ”΄ Red (default) | orange | 255,165,0 | 🟠 Orange | | green | 0,255,0 | 🟒 Green | purple | 128,0,128 | 🟣 Purple | | blue | 0,0,255 | πŸ”΅ Blue | pink | 255,192,203 | 🩷 Pink | | yellow | 255,255,0 | 🟑 Yellow | lime | 0,255,0 | 🟒 Lime | | cyan | 0,255,255 | πŸ”΅ Cyan | navy | 0,0,128 | πŸ”΅ Navy | | magenta | 255,0,255 | 🟣 Magenta | teal | 0,128,128 | πŸ”΅ Teal | | white | 255,255,255 | βšͺ White | gold | 255,215,0 | 🟑 Gold | | black | 0,0,0 | ⚫ Black | silver | 192,192,192 | βšͺ Silver |

    Custom RGB Format:

    • Input format: "R,G,B" (e.g., "255,128,0" for dark orange)
    • Range: 0-255

    Box Width:

    • box_width: 1-10 pixels (default 2px)

    Example Configuration:

    box_color = "red"          β†’ Red 2px border (default)
    box_color = "orange"       β†’ Orange border
    box_color = "255,105,180"  β†’ Hot pink border
    box_width = 5              β†’ 5px thick border
    

    πŸ“Œ Basic Workflow

    LoadImage
       ↓
    DeepSeek OCR: Load Model  
       ↓
    DeepSeek OCR: Run
       β”œβ”€β†’ text β†’ Display Text / Save Text
       └─→ visualization β†’ Preview Image / Save Image
    

    πŸ“š Typical Use Cases

    1. Document to Markdown

    task = "Convert to Markdown"
    resolution = "Gundam"
    β†’ Output formatted Markdown text
    

    2. Figure Parsing

    task = "Parse Figure"
    resolution = "Base"
    β†’ Extract structured data from tables and charts
    

    3. Object Localization

    task = "Locate by Reference"
    reference_text = "哆啦Aζ’¦"
    box_color = "red"
    box_width = 2
    β†’ Text contains coordinates, image shows red box annotations
    
    ComfyUI/
    β”œβ”€ models/
    β”‚  └─ deepseek-ocr/                    # ← Fixed weights directory
    β”‚     β”œβ”€ deepseek-ai_DeepSeek-OCR/     # Model weights
    β”‚     └─ hf_cache/                     # HuggingFace cache
    β”œβ”€ output/
    β”‚  └─ DeepseekOCR/                     # Output directory (visualization results)
    β”‚     └─ 2025-11-05_20-31-00/          # Timestamp directory
    β”œβ”€ log/
    β”‚  └─ deepseek_ocr.log                 # Plugin logs
    └─ custom_nodes/
       └─ ComfyUI-DeepseekOCR/
          β”œβ”€ __init__.py
          β”œβ”€ config.py
          β”œβ”€ model_manager.py
          β”œβ”€ nodes.py
          β”œβ”€ resolver.py
          β”œβ”€ io_utils.py
          β”œβ”€ tool/
          β”‚  └─ download_weights.py
          β”œβ”€ requirements.txt
          └─ README.md
    

    Logging

    Plugin logs are located at: ComfyUI/log/deepseek_ocr.log

    Key log contents:

    • Model weight download progress
    • Model loading status (device/dtype/attn_impl)
    • Cache hit information
    • Fallback strategy trigger records
    • Error details and suggestions

    This project is licensed under the MIT License. See the LICENSE file for details.

    Acknowledgments

    • DeepSeek AI - For providing the powerful DeepSeek-OCR model
    • ComfyUI - Excellent node-based UI framework
    • All contributors and users