ComfyUI-Sa2VA-XJ
中文文档 | English
Simple implementation of ByteDance Sa2VA nodes for ComfyUI.
Features
- ✅ Three nodes: Image V1, Image V2, and Video processing
- ✅ ComfyUI-compliant model paths: Manual download support, local model priority
- ✅ VITMatte post-processing (V2): AI-powered alpha matting
- ✅ Configurable mask threshold: Control mask quality (0.0-1.0, step 0.05)
- ✅ Morphological operations (V1): Opening, closing, erode, dilate
- ✅ 8-bit quantization: Save VRAM
- ✅ Flash attention: Faster inference
- ✅ Model unloading: Free VRAM after inference
Installation
cd ComfyUI/custom_nodes
git clone https://github.com/alexjx/ComfyUI-Sa2VA-XJ.git
cd ComfyUI-Sa2VA-XJ
pip install -r requirements.txt
Optional:
pip install bitsandbytes # For 8-bit quantization
pip install flash-attn --no-build-isolation # For flash attention
pip install opencv-python # For morphological operations
Restart ComfyUI after installation.
Model Installation
You can download models manually (recommended) or let the node auto-download from HuggingFace on first use.
Option 1: Manual Download (Recommended)
Manual download gives you control over model locations, enables offline use, and avoids duplicate downloads.
Directory Structure:
ComfyUI/models/
├── sa2va/
│ ├── ByteDance/
│ │ ├── Sa2VA-Qwen3-VL-4B/
│ │ ├── Sa2VA-InternVL3-2B/
│ │ ├── Sa2VA-Qwen2_5-VL-3B/
│ │ ├── Sa2VA-Qwen2_5-VL-7B/
│ │ ├── Sa2VA-InternVL3-8B/
│ │ └── Sa2VA-InternVL3-14B/
│ └── (or without ByteDance/ prefix)
└── vitmatte/
└── hustvl/
└── vitmatte-small-composition-1k/
Method 1: Using huggingface-cli (Recommended)
# Install HuggingFace CLI
pip install -U huggingface_hub
# Download Sa2VA model (example: Qwen3-VL-4B)
huggingface-cli download ByteDance/Sa2VA-Qwen3-VL-4B \
--local-dir ComfyUI/models/sa2va/ByteDance/Sa2VA-Qwen3-VL-4B
# Download VITMatte model (for V2 node)
huggingface-cli download hustvl/vitmatte-small-composition-1k \
--local-dir ComfyUI/models/vitmatte/hustvl/vitmatte-small-composition-1k
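If you prefer to script the download, `huggingface_hub.snapshot_download` is equivalent to Method 1 (run from the ComfyUI root so the relative paths line up):

```python
from huggingface_hub import snapshot_download

# Sa2VA model (example: Qwen3-VL-4B)
snapshot_download(
    repo_id="ByteDance/Sa2VA-Qwen3-VL-4B",
    local_dir="models/sa2va/ByteDance/Sa2VA-Qwen3-VL-4B",
)

# VITMatte model (needed by the V2 node only)
snapshot_download(
    repo_id="hustvl/vitmatte-small-composition-1k",
    local_dir="models/vitmatte/hustvl/vitmatte-small-composition-1k",
)
```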
Method 2: Using git-lfs
# Install git-lfs
git lfs install
# Download Sa2VA model
cd ComfyUI/models/sa2va/ByteDance
git clone https://huggingface.co/ByteDance/Sa2VA-Qwen3-VL-4B
# Download VITMatte model
cd ComfyUI/models/vitmatte/hustvl
git clone https://huggingface.co/hustvl/vitmatte-small-composition-1k
Alternative Directory Structure:
You can also download without the organization prefix:
# Download directly to model name folder
huggingface-cli download ByteDance/Sa2VA-Qwen3-VL-4B \
--local-dir ComfyUI/models/sa2va/Sa2VA-Qwen3-VL-4B
Both structures are supported:
- `models/sa2va/ByteDance/Sa2VA-Qwen3-VL-4B/` ✅
- `models/sa2va/Sa2VA-Qwen3-VL-4B/` ✅
Option 2: Automatic Download
Models will auto-download from HuggingFace to ~/.cache/huggingface/hub/ on first use if not found locally.
Note: This may take time depending on your internet connection and will use HuggingFace's cache directory.
Model Sizes
Plan your storage accordingly:
| Model | Download Size | Disk Size (Installed) | VRAM (fp16) | VRAM (8-bit) |
| ----------------------------- | ------------- | --------------------- | ----------- | ------------ |
| **Sa2VA Models** | | | | |
| InternVL3-2B | ~4GB | ~4.5GB | ~6GB | ~4GB |
| Qwen2_5-VL-3B | ~6GB | ~6.5GB | ~8GB | ~5GB |
| Qwen3-VL-4B (default) | ~8GB | ~9GB | ~10GB | ~6GB |
| Qwen2_5-VL-7B | ~14GB | ~15GB | ~16GB | ~10GB |
| InternVL3-8B | ~16GB | ~17GB | ~18GB | ~11GB |
| InternVL3-14B | ~28GB | ~30GB | ~30GB | ~18GB |
| **VITMatte Model** | | | | |
| vitmatte-small-composition-1k | ~300MB | ~350MB | +2GB | +2GB |
Storage Tips:
- Start with Qwen3-VL-4B (default) - good balance of quality and speed
- Use 8-bit quantization to reduce VRAM usage
- VITMatte is optional (V2 node only) for enhanced edge quality
Requirements
- Python 3.8+
- PyTorch 2.0+
- transformers >= 4.57.0
- CUDA 11.8+ (GPU)
- VRAM: 8GB+ (2B-4B models), 16GB+ (7B-8B models), 24GB+ (14B models)
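To quickly verify your environment meets these requirements, a small sanity-check snippet (run it in ComfyUI's Python environment):

```python
import torch
import transformers

print("PyTorch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the default GPU
    print(f"VRAM: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")
```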
Nodes
Sa2VA Image Segmentation (V1)
Inputs:
- `model_name`: Model selection (default: Qwen3-VL-4B)
- `image`: Input image
- `segmentation_prompt`: Text description
- `threshold`: Binary threshold (0.0-1.0, default: 0.5)
- `use_8bit`: 8-bit quantization (default: True)
- `use_flash_attn`: Flash attention (default: True)
- `unload`: Unload model after inference (default: True)
- `morph`: Morphological operation (none/opening/closing/erode/dilate); see the sketch below
- `erode_kernel`, `dilate_kernel`, `iterations`: Morphology parameters
Outputs:
- `text_output`: Text description
- `masks`: Segmentation masks
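For reference, a minimal sketch of how the morphological options typically behave, using OpenCV. Parameter names mirror the node inputs, but this is illustrative only, not the extension's actual code:

```python
import cv2
import numpy as np

def apply_morphology(mask: np.ndarray, morph: str = "opening",
                     kernel_size: int = 3, iterations: int = 1) -> np.ndarray:
    """Post-process a binary uint8 mask (0/255) with a morphological operation."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    if morph == "opening":   # erode then dilate: removes small speckles
        return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel, iterations=iterations)
    if morph == "closing":   # dilate then erode: fills small holes
        return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel, iterations=iterations)
    if morph == "erode":     # shrink the mask edge inward
        return cv2.erode(mask, kernel, iterations=iterations)
    if morph == "dilate":    # grow the mask edge outward
        return cv2.dilate(mask, kernel, iterations=iterations)
    return mask              # morph == "none"
```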
Sa2VA Image Segmentation V2
VITMatte-based post-processing for smooth alpha mattes.
Inputs:
- `model_name`: Model selection (default: Qwen3-VL-4B)
- `image`: Input image
- `segmentation_prompt`: Text description
- `threshold`: Binary threshold (0.0-1.0, default: 0.5)
- `use_8bit`: 8-bit quantization (default: True)
- `use_flash_attn`: Flash attention (default: True)
- `unload`: Unload model after inference (default: True)
- `process_detail`: Enable VITMatte (default: True)
- `detail_erode`: Trimap erosion size (1-255, default: 6)
- `detail_dilate`: Trimap dilation size (1-255, default: 6)
- `black_point`: Histogram black point (0.01-0.98, default: 0.15); see the remapping sketch below
- `white_point`: Histogram white point (0.02-0.99, default: 0.99)
- `max_megapixels`: Max resolution (0.5-10.0, default: 2.0)
Outputs:
- `text_output`: Text description
- `masks`: Alpha mattes
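`black_point` and `white_point` act as a levels adjustment on the alpha matte: values at or below the black point are clamped to 0, values at or above the white point to 1, and everything in between is stretched linearly. A minimal sketch of the assumed behavior (the node's exact implementation may differ):

```python
import numpy as np

def remap_levels(alpha: np.ndarray, black_point: float = 0.15,
                 white_point: float = 0.99) -> np.ndarray:
    """Linearly stretch a float alpha matte in [0, 1]: values at or below
    black_point become 0, values at or above white_point become 1."""
    stretched = (alpha - black_point) / (white_point - black_point)
    return np.clip(stretched, 0.0, 1.0)
```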
V1 vs V2:
- V1: Fast, morphological operations, solid objects
- V2: Slower, VITMatte refinement, hair/fur/glass/complex edges
Sa2VA Video Segmentation
Process video frames or image batches.
Inputs:
- `model_name`: Model selection
- `images`: Input frames (batch)
- `segmentation_prompt`: Text description
- `threshold`: Binary threshold (0.0-1.0, default: 0.7)
- `use_8bit`, `use_flash_attn`, `unload`: Same as V1
- `morph`, `erode_kernel`, `dilate_kernel`, `iterations`: Morphology parameters
Outputs:
- `text_output`: Video description
- `masks`: Segmentation masks for all frames
Supported Models
All models are from ByteDance's Sa2VA family:
| Model | Parameters | Notes |
| --------------- | ---------- | --------------------------------- |
| InternVL3-2B | 2B | Smallest, fastest |
| Qwen2_5-VL-3B | 3B | Good for low VRAM |
| Qwen3-VL-4B | 4B | Default, best balance |
| Qwen2_5-VL-7B | 7B | Higher quality |
| InternVL3-8B | 8B | Advanced features |
| InternVL3-14B | 14B | Best quality, requires 24GB+ VRAM |
See the Model Installation section for download instructions and VRAM requirements.
Troubleshooting
"transformers >= 4.57.0 required"
pip install "transformers>=4.57.0" --upgrade
"No module named 'qwen_vl_utils'"
pip install qwen_vl_utils
Model Downloads
The node will log where it's loading models from:
- `Found local Sa2VA model at: /path/to/model` - using the local model ✅
- `Local model not found. Will download from HuggingFace: ...` - auto-downloading from HuggingFace
To verify model installation:
# Check if Sa2VA models exist
ls -la ComfyUI/models/sa2va/
# Check if VITMatte model exists
ls -la ComfyUI/models/vitmatte/
# Each model directory should contain config.json
ls ComfyUI/models/sa2va/ByteDance/Sa2VA-Qwen3-VL-4B/config.json
Slow First Load
- First run may download models from HuggingFace (can take 5-30 minutes)
- Check console logs to see download progress
- Consider manual download (see Model Installation)
"CUDA Out of Memory"
- Enable `use_8bit`
- Use a smaller model (2B/4B)
- Ensure `unload = True`
- Lower the VITMatte `max_megapixels` (V2 only)
"No masks generated"
- Try more specific prompts
- Adjust `threshold` (try 0.3-0.7)
Technical Details
Model Loading Behavior
The nodes follow ComfyUI's standard model loading pattern:
- Check local models first: looks in `ComfyUI/models/sa2va/` and `ComfyUI/models/vitmatte/`
- Fall back to HuggingFace: auto-downloads to `~/.cache/huggingface/hub/` if not found locally
- Supports both directory structures (see the sketch below):
  - With org prefix: `models/sa2va/ByteDance/Sa2VA-Qwen3-VL-4B/`
  - Without org prefix: `models/sa2va/Sa2VA-Qwen3-VL-4B/`
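A minimal sketch of this resolution order (`resolve_model_dir` is a hypothetical helper, not the extension's actual API):

```python
from pathlib import Path
from typing import Optional

def resolve_model_dir(models_root: Path, repo_id: str) -> Optional[Path]:
    """Return the local directory for repo_id (e.g. 'ByteDance/Sa2VA-Qwen3-VL-4B'),
    trying both supported layouts; None means 'fall back to HuggingFace'."""
    org, name = repo_id.split("/")
    for candidate in (models_root / org / name,  # with org prefix
                      models_root / name):       # without org prefix
        if (candidate / "config.json").is_file():
            return candidate
    return None

local = resolve_model_dir(Path("ComfyUI/models/sa2va"), "ByteDance/Sa2VA-Qwen3-VL-4B")
# Passing a repo id (instead of a local path) to from_pretrained triggers auto-download.
model_source = str(local) if local else "ByteDance/Sa2VA-Qwen3-VL-4B"
```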
Benefits:
- ✅ Control over model storage locations
- ✅ Offline use after manual download
- ✅ No duplicate downloads
- ✅ Backward compatible with auto-download
Raw Mask Probabilities
Sa2VA outputs raw sigmoid probabilities (0.0-1.0) instead of binary masks. The threshold parameter controls binarization.
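Binarization itself is a single comparison; a short illustrative torch sketch:

```python
import torch

probs = torch.rand(1, 512, 512)             # stand-in for Sa2VA's sigmoid output
threshold = 0.5                              # the node's threshold input
binary_mask = (probs >= threshold).float()   # 1.0 where confident, else 0.0
```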
8-bit Quantization
Quantizes the language-model backbone while skipping vision components (`visual`, `grounding_encoder`, `text_hidden_fcs`) to avoid errors.
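With transformers' bitsandbytes integration, this kind of selective quantization is expressed via `llm_int8_skip_modules`. A hedged sketch of such a configuration, with the module names taken from the description above (the extension's actual loading code may differ):

```python
from transformers import AutoModel, BitsAndBytesConfig

# Requires `pip install bitsandbytes`
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # Keep vision/grounding components in full precision
    llm_int8_skip_modules=["visual", "grounding_encoder", "text_hidden_fcs"],
)

model = AutoModel.from_pretrained(
    "ByteDance/Sa2VA-Qwen3-VL-4B",
    quantization_config=quant_config,
    trust_remote_code=True,  # Sa2VA ships custom modeling code
    torch_dtype="auto",
)
```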
VITMatte (V2 Only)
VITMatte is a Vision Transformer-based alpha matting model that produces smooth, professional-quality masks.
Why VITMatte produces better masks:
- Trimap-guided: Creates a 3-zone map (definite foreground, uncertain region, definite background) from Sa2VA's rough mask
- AI refinement: Neural network predicts precise alpha values in uncertain regions (edges, hair, semi-transparent areas)
- Gradient transitions: Produces smooth 0-1 gradients instead of hard 0/1 boundaries
- Detail preservation: Captures fine structures like individual hair strands, fur texture, and glass transparency
Processing pipeline:
- Generate trimap from sigmoid mask (erode/dilate; sketched below)
- VITMatte AI inference for alpha prediction in uncertain regions
- Histogram remapping for contrast enhancement
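A minimal sketch of step 1, building the trimap from the sigmoid mask with OpenCV (illustrative; the node's implementation may differ):

```python
import cv2
import numpy as np

def make_trimap(prob_mask: np.ndarray, threshold: float = 0.5,
                detail_erode: int = 6, detail_dilate: int = 6) -> np.ndarray:
    """Build a trimap (0 = background, 128 = unknown, 255 = foreground)
    from a float probability mask in [0, 1]."""
    binary = (prob_mask >= threshold).astype(np.uint8) * 255
    fg = cv2.erode(binary, np.ones((detail_erode, detail_erode), np.uint8))
    maybe = cv2.dilate(binary, np.ones((detail_dilate, detail_dilate), np.uint8))
    trimap = np.zeros_like(binary)
    trimap[maybe > 0] = 128   # uncertain band that VITMatte will resolve
    trimap[fg > 0] = 255      # definite foreground survives the erosion
    return trimap
```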
Trade-offs:
- +2-5s processing time per mask
- +2GB VRAM usage
- Best for complex edges; V1 is faster for simple objects
License
MIT
Credits
- Based on ByteDance Sa2VA
- Inspired by ComfyUI-Sa2VA
- VITMatte implementation adapted from ComfyUI_LayerStyle_Advance