    ComfyUI-Sa2VA-XJ

    Simple implementation of ByteDance Sa2VA nodes for ComfyUI.

    Overview

    Sa2VA (Segment Anything 2 Video Assistant) is a multimodal large language model that combines SAM2 segmentation with vision-language understanding. This refined edition provides simple, maintainable nodes for image and video segmentation.

    Features

    • Two dedicated nodes: Separate nodes for image and video processing
    • Configurable mask threshold: Control mask quality with raw sigmoid probabilities (0.0-1.0, step 0.05)
    • 8-bit quantization: Save VRAM with proper vision component handling
    • Flash attention: Optional acceleration for faster inference
    • Model unloading: Free VRAM after inference (user-controllable)
    • Fail-fast errors: Clear error messages, no silent failures

    Installation

    1. Clone Repository

    cd ComfyUI/custom_nodes
    git clone https://github.com/alexjx/ComfyUI-Sa2VA-XJ.git
    cd ComfyUI-Sa2VA-XJ
    

    2. Install Dependencies

    pip install -r requirements.txt
    

    3. Optional Dependencies

    For 8-bit quantization (saves VRAM):

    pip install bitsandbytes
    

    For flash attention (faster inference):

    pip install flash-attn --no-build-isolation
    

    4. Restart ComfyUI

    Requirements

    • Python: 3.8+
    • PyTorch: 2.0+
    • transformers: >= 4.57.0 (critical!)
    • CUDA: 11.8+ (for GPU acceleration)
    • VRAM:
      • 8GB+ for 2B-4B models
      • 16GB+ for 7B-8B models
      • 24GB+ for 14B models
      • Use 8-bit quantization if VRAM-limited
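
    If you are not sure which bucket your GPU falls into, a quick PyTorch check (minimal sketch) prints the device name and total VRAM:

    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
    else:
        print("No CUDA GPU detected; Sa2VA inference will be very slow on CPU.")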

    Nodes

    Sa2VA Image Segmentation

    Process single images with Sa2VA.

    Inputs:

    • model_name: Model to use (default: Qwen3-VL-4B)
    • image: Input image (IMAGE type)
    • segmentation_prompt: Description of what to segment (STRING)
    • threshold: Binary threshold for masks (FLOAT, 0.0-1.0, step 0.05, default: 0.5)
    • use_8bit: Enable 8-bit quantization (BOOLEAN, default: False)
    • use_flash_attn: Enable flash attention (BOOLEAN, default: True)
    • unload: Unload model after inference (BOOLEAN, default: True)

    Outputs:

    • text_output: Generated text description (STRING)
    • masks: Segmentation masks (MASK, [B, H, W])

    Example Prompts:

    "Please segment the person in the image."
    "Provide masks for all objects."
    "Segment the car on the left side."
    

    Sa2VA Video Segmentation

    Process video frames or image batches with Sa2VA.

    Inputs:

    • model_name: Model to use (default: Qwen3-VL-4B)
    • images: Input frames (IMAGE type, batch)
    • segmentation_prompt: Description of what to segment (STRING)
    • threshold: Binary threshold for masks (FLOAT, 0.0-1.0, step 0.05, default: 0.7)
    • use_8bit: Enable 8-bit quantization (BOOLEAN, default: False)
    • use_flash_attn: Enable flash attention (BOOLEAN, default: True)
    • unload: Unload model after inference (BOOLEAN, default: True)

    Outputs:

    • text_output: Generated video description (STRING)
    • masks: Segmentation masks (MASK, [B, H, W])

    Example Prompts:

    "Segment the person throughout the video."
    "Track and segment the moving car."
    "Provide masks for all objects in this video sequence."
    

    Supported Models

    | Model                         | Parameters | VRAM (fp16) | VRAM (8-bit) |
    | ----------------------------- | ---------- | ----------- | ------------ |
    | ByteDance/Sa2VA-InternVL3-2B  | 2B         | ~6GB        | ~4GB         |
    | ByteDance/Sa2VA-Qwen2_5-VL-3B | 3B         | ~8GB        | ~5GB         |
    | ByteDance/Sa2VA-Qwen3-VL-4B   | 4B         | ~10GB       | ~6GB         |
    | ByteDance/Sa2VA-Qwen2_5-VL-7B | 7B         | ~16GB       | ~10GB        |
    | ByteDance/Sa2VA-InternVL3-8B  | 8B         | ~18GB       | ~11GB        |
    | ByteDance/Sa2VA-InternVL3-14B | 14B        | ~30GB       | ~18GB        |

    Recommended: Sa2VA-Qwen3-VL-4B (best balance of quality and VRAM)

    Usage Tips

    Save VRAM

    1. Enable use_8bit (saves ~40% VRAM)
    2. Use smaller models (2B-4B)
    3. Keep unload = True (default)
    4. Disable use_flash_attn if not installed

    Improve Quality

    1. Use specific prompts: "woman on the right" vs "person"
    2. Use descriptive text: Sa2VA handles long prompts well
    3. Adjust threshold:
      • Lower (0.3-0.4): More inclusive masks, captures low-confidence regions
      • Default (0.5): Balanced segmentation
      • Higher (0.6-0.7): Stricter masks, only high-confidence regions
    4. Use larger models: 7B-14B for complex scenes

    Speed Up Inference

    1. Enable use_flash_attn (requires flash-attn package)
    2. Enable use_8bit (slight quality trade-off)
    3. Use smaller models (2B-4B)

    Troubleshooting

    "transformers >= 4.57.0 required"

    pip install "transformers>=4.57.0" --upgrade
    

    Restart ComfyUI after upgrade.
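
    To confirm which version is actually active in the Python environment ComfyUI runs in:

    import transformers
    print(transformers.__version__)  # should print 4.57.0 or newer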

    "No module named 'qwen_vl_utils'"

    pip install qwen_vl_utils
    

    "bitsandbytes required for 8-bit quantization"

    pip install bitsandbytes
    

    Or disable use_8bit in node settings.

    "Flash attention not available"

    pip install flash-attn --no-build-isolation
    

    Or disable use_flash_attn in node settings (not required).

    "CUDA Out of Memory"

    1. Enable use_8bit
    2. Use smaller model (2B or 4B)
    3. Ensure unload = True (default)
    4. Close other programs using VRAM

    "No masks generated" (all black masks)

    1. Try more specific prompts
    2. Adjust threshold (try 0.3-0.7 in 0.05 steps)
    3. Check if objects are actually in the image
    4. Try different model variant
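
    As a quick sanity check in a script or custom node, an all-black MASK batch has an (effectively) zero maximum value (the helper below is hypothetical):

    import torch

    def masks_are_empty(masks: torch.Tensor, eps: float = 1e-6) -> bool:
        """True when every mask in the [B, H, W] batch is effectively all black."""
        return bool(masks.max().item() < eps)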

    Model loads slowly

    • The first load downloads the model weights (~8-30GB depending on size)
    • Models cache in ~/.cache/huggingface/hub/
    • Subsequent loads are faster (loaded from cache)

    Examples

    Basic Image Segmentation

    1. Load Image → Sa2VA Image Segmentation
    2. Set prompt: "Please segment all objects"
    3. Connect masks output to mask-compatible nodes

    Video Segmentation

    1. Load Video frames → Sa2VA Video Segmentation
    2. Set prompt: "Track the person throughout the video"
    3. Connect outputs to downstream nodes

    Multi-object Segmentation

    1. Use specific prompt: "Segment the red car and the person standing"
    2. Adjust threshold to 0.5-0.6 for balanced results
    3. Multiple masks will be output as batch

    Technical Details

    Mask Threshold Control

    This implementation includes a monkey-patched model that returns raw sigmoid probabilities instead of binarized masks. This makes the threshold parameter fully functional:

    • Raw probabilities: The model outputs continuous confidence values (0.0-1.0)
    • User control: Your threshold parameter determines the binarization point
    • Fine-grained adjustment: 0.05 step size allows precise control over mask quality

    See MASK_THRESHOLD_IMPLEMENTATION.md for technical details on the implementation.
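
    In effect, binarization reduces to a single comparison against the raw probabilities, along these lines:

    import torch

    def binarize(probabilities: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        """Turn raw sigmoid probabilities [B, H, W] into a binary ComfyUI MASK."""
        return (probabilities > threshold).float()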

    Cache Location

    Models are cached in the global HuggingFace cache:

    • Linux: ~/.cache/huggingface/hub/
    • Windows: C:\Users\USERNAME\.cache\huggingface\hub\
    • Respects HF_HOME environment variable
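
    To see where the weights will land on your machine (respecting HF_HOME), a small standard-library check:

    import os

    hf_home = os.environ.get(
        "HF_HOME", os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
    )
    print(os.path.join(hf_home, "hub"))  # directory holding the downloaded Sa2VA weights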

    8-bit Quantization

    8-bit quantization skips the vision components to avoid quantization errors:

    • visual: Vision encoder (contains 4D conv layers)
    • grounding_encoder: SAM2 grounding model
    • text_hidden_fcs: Vision-to-text projection

    Only the language model backbone is quantized (roughly 70% of the parameters).
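
    This kind of selective quantization can be expressed with transformers' BitsAndBytesConfig. A hedged sketch (the skipped module names come from the list above; the exact config the node builds may differ):

    from transformers import AutoModel, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        # Keep vision-side modules in full precision; only the LLM backbone is quantized.
        llm_int8_skip_modules=["visual", "grounding_encoder", "text_hidden_fcs"],
    )
    model = AutoModel.from_pretrained(
        "ByteDance/Sa2VA-Qwen3-VL-4B",
        quantization_config=quant_config,
        trust_remote_code=True,
        device_map="auto",
    )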

    Model Unloading

    When unload = True (default):

    • Model is deleted from memory
    • CUDA cache is cleared
    • Garbage collection is forced
    • VRAM is freed for other tasks
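
    The unload step amounts to the standard PyTorch cleanup sequence:

    import gc
    import torch

    del model                  # drop the last reference to the loaded Sa2VA model
    gc.collect()               # force Python garbage collection
    torch.cuda.empty_cache()   # return cached CUDA blocks to the driver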

    License

    MIT

    Credits