ComfyUI Extension: comfy-modelopt
NVIDIA ModelOpt integration for ComfyUI - Quantize and optimize diffusion models with INT8/FP8/INT4
Custom Nodes (0)
README
NVIDIA ModelOpt for ComfyUI
⚠️ PROJECT STATUS: ON HOLD (2025-11-10)
Current Status: This project is on hold pending PyTorch compatibility fixes.
Issue: ModelOpt quantized models have known compatibility issues with PyTorch/TorchScript that prevent full end-to-end inference in ComfyUI. While quantization and save/load work correctly, encode/decode operations are currently blocked.
What Works:
- ✅ Model quantization (v0.3.0)
- ✅ Save/Load pipeline (v0.4.0)
What's Blocked:
- ❌ Full inference in ComfyUI (pending PyTorch updates)
Alternative: Works in Diffusers wrapper (confirmed by community testing)
Recommendation: Wait for PyTorch upstream fixes before production use.
See RESEARCH_NOTES.md for technical details.
Quantize and optimize Stable Diffusion models with NVIDIA ModelOpt directly in ComfyUI. Achieve ~2x faster inference with INT8/FP8 quantization while maintaining image quality. Works with various diffusion model architectures including SDXL, SD1.5, SD3, and more.
✨ Features
- 🚀 Fast Inference: ~2x speedup with INT8/FP8 quantization
- 💾 Memory Efficient: Up to 50% VRAM reduction
- 🎨 Quality Preserved: <3% quality loss with proper calibration
- 🔧 Easy to Use: Quantize models directly in ComfyUI workflows
- 💻 Cross-Platform: Works on Linux, Windows (with some limitations)
📦 Included Nodes
| Node | Description | Category | |------|-------------|----------| | ModelOptUNetLoader | Load pre-quantized UNet models | loaders/modelopt | | ModelOptQuantizeUNet | Quantize UNet to INT8/FP8/INT4 | modelopt | | ModelOptSaveQuantized | Save quantized models | modelopt | | ModelOptCalibrationHelper | Collect calibration data | modelopt |
🎯 Model Compatibility
This node pack works with various diffusion model architectures. Quantization support depends on the underlying NVIDIA ModelOpt library.
For the latest model compatibility information, see the official NVIDIA ModelOpt documentation.
Expected Performance:
- ~2x faster inference with INT8/FP8 quantization
- ~50% VRAM reduction with 8-bit formats
- ~75% VRAM reduction with INT4 (experimental)
- <3% quality loss with proper calibration
Tip: If you encounter issues with a specific model, try updating ModelOpt:
pip install --upgrade nvidia-modelopt[all]
💻 Hardware Requirements
GPU Requirements
| Quantization | Minimum GPU | Compute Capability | Example GPUs | |--------------|-------------|-------------------|--------------| | INT8 | Turing | SM 7.5+ | RTX 2060+, T4, RTX 3000+, RTX 4000+ | | FP8 | Ada Lovelace or Hopper | SM 8.9+ | RTX 4060+, H100 | | INT4 | Turing | SM 7.5+ | RTX 2060+, T4, RTX 3000+, RTX 4000+ |
Recommended for best experience:
- GPU: RTX 4070+ (for FP8 support)
- VRAM: 8GB+ (12GB+ recommended for larger models)
- System RAM: 16GB+ (32GB+ recommended)
Software Requirements
- Operating System: Linux (primary), Windows 10/11
- Python: 3.10, 3.11, or 3.12
- CUDA: 12.0 or higher
- PyTorch: 2.0 or higher with CUDA support
- ComfyUI: Latest version
🚀 Installation
Method 1: ComfyUI Manager (Recommended)
- Open ComfyUI Manager
- Search for "ModelOpt" or "NVIDIA ModelOpt"
- Click "Install"
- Restart ComfyUI
Method 2: Manual Installation
cd ComfyUI/custom_nodes
git clone https://github.com/EnragedAntelope/comfy-modelopt.git
cd comfy-modelopt
pip install -r requirements.txt
# For Linux (optional, for faster quantization):
pip install -r requirements-linux.txt
# For Windows (optional):
pip install -r requirements-windows.txt
Method 3: Install PyTorch with CUDA First (Recommended)
# Install PyTorch with CUDA 12.4
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# Then install ModelOpt and dependencies
cd ComfyUI/custom_nodes/comfy-modelopt
pip install -r requirements.txt
📖 Usage Guide
Basic Workflow: Quantize a UNet
- Load your model using standard ComfyUI checkpoint loader
- Add ModelOptQuantizeUNet node
- Connect model to the quantizer
- Configure quantization:
precision: Choose INT8 (most compatible) or FP8 (best quality, RTX 4000+)calibration_steps: 32 for testing, 64-128 for production
- Run workflow - quantization will take 2-10 minutes
- Save quantized model using ModelOptSaveQuantized node
- Reload quantized model using ModelOptUNetLoader for faster inference
Advanced Workflow: Use Real Calibration Data
For best quality, use real latent samples instead of random calibration:
- Create your normal generation workflow (prompt, latent, sampler, etc.)
- Add ModelOptCalibrationHelper node after your KSampler
- Connect latent output to calibration helper
- Generate 32-64 images to collect calibration samples
- Connect calibration data to ModelOptQuantizeUNet node
- Run quantization with your collected calibration data
Loading Quantized Models
- Use ModelOptUNetLoader instead of regular checkpoint loader
- Load VAE and CLIP separately using standard ComfyUI loaders
- ModelOpt only quantizes the UNet, not VAE/CLIP
- Connect to KSampler and generate as normal
- Enjoy ~2x faster inference!
🎨 Example Workflows
See the examples/ folder for ready-to-use workflow JSON files:
quantize_basic.json- Basic model quantization workflowquantize_with_calibration.json- Advanced calibration workflowload_quantized_model.json- Using quantized models
⚙️ Node Reference
ModelOptQuantizeUNet
Inputs:
model(MODEL): UNet model to quantizeprecision(COMBO): int8 / fp8 / int4calibration_steps(INT): Number of calibration steps (8-512)calibration_data(LATENT, optional): Calibration samplesskip_layers(STRING, optional): Comma-separated layers to skip
Outputs:
quantized_model(MODEL): Quantized UNet model
Recommended Settings:
- Quick test: INT8, 32 steps
- Production: INT8 or FP8, 64-128 steps
- Best quality: FP8, 256+ steps (requires RTX 4000+)
ModelOptUNetLoader
Inputs:
unet_name(COMBO): Select quantized UNet frommodels/modelopt_unet/precision(COMBO): auto / fp8 / fp16 / int8 / int4enable_caching(BOOLEAN): Cache model in memory
Outputs:
model(MODEL): Loaded quantized UNet
Note: Load VAE and CLIP separately using standard ComfyUI loaders.
ModelOptSaveQuantized
Inputs:
model(MODEL): Quantized model to savefilename(STRING): Output filenamesave_format(COMBO): safetensors / pytorchmetadata(STRING, optional): JSON metadata
Saves to: ComfyUI/models/modelopt_unet/
ModelOptCalibrationHelper
Inputs:
latent(LATENT): Latent samples to collectmax_samples(INT): Maximum samples to collect (8-512)
Outputs:
calibration_data(LATENT): Collected calibration samples
🔧 Troubleshooting
"No CUDA device available"
- Ensure NVIDIA GPU is installed and recognized
- Check CUDA drivers:
nvidia-smi - Reinstall PyTorch with CUDA:
pip install torch --index-url https://download.pytorch.org/whl/cu124
"FP8 requires Compute Capability 8.9+"
- FP8 only works on RTX 4000 series (Ada Lovelace) or H100 (Hopper)
- Use INT8 instead for older GPUs (RTX 2000/3000)
"Out of VRAM"
- Reduce
calibration_steps(try 16 or 32) - Close other applications
- Use INT4 for maximum VRAM savings
- Try a smaller/more efficient model architecture if available
"Quantization is very slow"
- Install Triton (Linux):
pip install triton - Reduce
calibration_stepsfor testing - Quantization is one-time process, inference will be faster
"Image quality is worse"
- Increase
calibration_steps(try 128 or 256) - Use real calibration data (ModelOptCalibrationHelper)
- Try FP8 instead of INT8 (requires RTX 4000+)
- Some quality loss (<3%) is normal with quantization
📊 Performance Benchmarks
Example benchmarks demonstrate typical performance improvements:
Example: SDXL-based model on RTX 4090 (1024x1024, 20 steps):
| Configuration | Time/Image | VRAM | Quality | |---------------|------------|------|---------| | FP16 (baseline) | 3.2s | ~8GB | 100% | | INT8 quantized | 1.7s | ~4GB | ~98% | | FP8 quantized | 1.6s | ~4GB | ~99% |
Example: SD1.5-based model on RTX 3080 (512x512, 20 steps):
| Configuration | Time/Image | VRAM | Quality | |---------------|------------|------|---------| | FP16 (baseline) | 1.1s | ~4GB | 100% | | INT8 quantized | 0.6s | ~2GB | ~97% |
Note: Performance varies significantly by model architecture, GPU, resolution, and workflow complexity. These are example benchmarks only.
⚠️ Limitations
- Component Quantization: Only UNet/diffusion model is quantized
- VAE and text encoders remain FP16 (load separately with standard nodes)
- Adapter Support: LoRAs, ControlNets, and IP-Adapters are applied after quantization
- Load the quantized model first, then apply adapters as usual
- Platform: Windows support is experimental (Triton acceleration not fully supported)
- First Run: Model quantization takes 2-10 minutes (one-time process, saves for reuse)
- Model Compatibility: Varies by ModelOpt version - see NVIDIA's documentation for current support
📚 Documentation
- Technical Guide: See docs/TECHNICAL_GUIDE.md for in-depth technical documentation
- ModelOpt Official: NVIDIA ModelOpt GitHub
- Report Issues: GitHub Issues
🤝 Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
📝 License
MIT License - see LICENSE for details
🙏 Acknowledgments
- NVIDIA for ModelOpt and TensorRT
- ComfyUI community for the excellent framework
- All contributors and testers
💬 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- ComfyUI Discord: #custom-nodes channel
⚡ Optimize your Stable Diffusion workflow with NVIDIA ModelOpt!
Made with ❤️ for the ComfyUI community