Emu3.5 ComfyUI Nodes
ComfyUI integration for BAAI's Emu3.5 multimodal models
✅ STATUS: WORKING - Text-to-image generation verified on December 7, 2025

Overview
This repository provides ComfyUI custom nodes for running BAAI's Emu3.5 models for text-to-image generation and multimodal understanding.
Models Supported:
- Emu3.5-Image (Text-to-Image) - ✅ Working
- Emu3.5-Base (Foundation model)
- Vision Tokenizer (VQ-VAE for image encoding/decoding)
Credits & Attribution
This project is built upon and inspired by:
- BAAI Emu3.5 - original model and codebase
  - Paper: Emu3.5: Native Multimodal Models are World Learners
  - Authors: Emu3.5 Team, Beijing Academy of Artificial Intelligence
  - License: Apache 2.0
- BAAI Emu3 - predecessor model

All model weights and architecture remain the property of BAAI under the Apache 2.0 license.
Installation
Prerequisites
- ComfyUI installed
- Python 3.10+
- CUDA-capable GPU with 48GB+ VRAM (BF16) or 24GB+ (NF4 quantization)
- 100GB+ disk space for model weights
Method 1: Git Clone (Recommended)
cd ComfyUI/custom_nodes
git clone https://github.com/EricRollei/Emu35-Comfyui-Nodes.git emu35
cd emu35
# Clone the official Emu3.5 repository into the node directory
git clone https://github.com/baaivision/Emu3.5.git Emu3_5_repo
# Install dependencies
pip install -r requirements.txt
Method 2: Manual Download
- Download this repository
- Extract to ComfyUI/custom_nodes/emu35/
- Download the Emu3.5 repo and place it in emu35/Emu3_5_repo/
- Install requirements: pip install -r requirements.txt
Download Model Weights
Place models in ComfyUI/models/emu35/:
Option A: Full BF16 (Recommended - 48GB+ VRAM)
# Full precision - best quality
huggingface-cli download BAAI/Emu3.5-Image --local-dir models/emu35/Emu3.5-Image
huggingface-cli download BAAI/Emu3.5-VisionTokenizer --local-dir models/emu35/vision_tokenizer
Option B: NF4 Quantized (24GB VRAM)
# NF4 quantized version - works with 24-32GB VRAM
huggingface-cli download wikeeyang/Emu35-Image-NF4 --local-dir models/emu35/Emu3.5-Image-NF4
huggingface-cli download BAAI/Emu3.5-VisionTokenizer --local-dir models/emu35/vision_tokenizer
Directory structure:
ComfyUI/
├── custom_nodes/
│ └── emu35/
│ ├── nodes.py
│ ├── __init__.py
│ ├── patched_tokenization_emu3.py
│ └── Emu3_5_repo/ (official repo)
└── models/
└── emu35/
├── Emu3.5-Image/ (or Emu3.5-Image-NF4/)
└── vision_tokenizer/
Usage
Nodes Available
1. Emu35 Loader
Loads the Emu3.5 model, tokenizer, and VQ model.
Inputs:
- model_name: Select model folder (e.g., "Emu3.5-Image")
- vq_model_name: Vision tokenizer folder (usually "vision_tokenizer")
- precision: bf16 (default), fp16, fp32, or nf4 (quantized)
Outputs:
- model: Loaded Emu3.5 model
- tokenizer: Tokenizer
- vq_model: Vision tokenizer for image encoding/decoding
2. Emu35 Sampler
Generates images from text prompts.
Inputs:
- model, tokenizer, vq_model: From the loader
- prompt: Text description of the desired image
- negative_prompt: What to avoid in generation
- width/height: Output dimensions (must be multiples of 64, range 256-2048)
- steps: Generation steps (1-200, default 50)
- cfg_scale: Classifier-free guidance (1.0-20.0, default 5.0 recommended)
- seed: Random seed for reproducibility
Outputs:
- image: Generated image
- text_response: Any text generated alongside the image
Example Workflow
[Emu35 Loader] → [Emu35 Sampler] → [Preview Image]
↑
[Text Prompt: "a red apple on a table"]
Status Update (December 2025)
✅ WORKING - Text-to-image generation is now fully functional!
Key Fixes Applied
- SDPA Attention Bug on Blackwell GPUs: SDPA attention produces noise/garbage on the Blackwell architecture (sm_120) with CUDA 12.8. Fixed by using attn_implementation="eager".
- Transformers 4.57+ Compatibility: The official Emu3.5 code was written for transformers 4.48. We've patched modeling_emu3.py to work with transformers 4.50+:
  - Added GenerationMixin inheritance for the generate() method
  - Fixed DynamicCache API changes (get_usable_length → get_seq_length)
  - Fixed attention mask size mismatches with proper -inf padding
- Attention Mask Handling: Fixed critical bugs in mask padding:
  - Use -inf (not zeros) to mask non-existent positions
  - Keep the FIRST tokens when trimming (history/prompt), not the LAST
- Prompt Format: Using the official chat-style template:
  <|extra_203|>You are a helpful assistant. USER: {prompt} ASSISTANT:
Tested Configuration
| Component | Version/Setting |
|-----------|-----------------|
| GPU | NVIDIA RTX 6000 Pro (Blackwell) |
| CUDA | 12.8 |
| PyTorch | 2.7+ |
| Transformers | 4.57.1 |
| Attention | eager (not SDPA) |
| Precision | BF16 |
Known Issues
- VRAM Requirements: the full BF16 model needs 48GB+; NF4 quantized needs 24-32GB
- Generation Speed: autoregressive generation is slower than diffusion models (~10-15 min for 512x512)
- Blackwell GPU + SDPA: SDPA attention produces corrupted outputs on Blackwell GPUs. Use eager attention (set automatically by our loader).
- NF4 Quantization: the lm_head must NOT be quantized or outputs will be garbage. Our loader verifies this; a config sketch follows below.
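As a rough illustration of the lm_head constraint, a bitsandbytes setup along these lines keeps the output head unquantized. This is a sketch of the idea, not necessarily how our loader is wired internally:

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch: NF4-quantize the model but skip lm_head, which must stay in
# full precision or generated visual tokens degrade to garbage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["lm_head"],  # never quantize the output head
)
```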
GPU Compatibility
| GPU Architecture | SDPA | Eager | Recommended |
|-----------------|------|-------|-------------|
| Ampere (sm_80) | ✅ | ✅ | SDPA |
| Ada Lovelace (sm_89) | ✅ | ✅ | SDPA |
| Blackwell (sm_120) | ❌ | ✅ | Eager |
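A minimal sketch of the backend choice the loader makes automatically. The helper name is ours, and detecting Blackwell via compute capability is an assumption that matches the table above:

```python
import torch

def pick_attn_implementation() -> str:
    # Blackwell (sm_120, i.e. compute capability 12.x) returns corrupted
    # SDPA output under CUDA 12.8, so fall back to eager attention there.
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        if major >= 12:
            return "eager"
    return "sdpa"
```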
Technical Details
Architecture
- Base: Llama-style transformer (8B parameters)
- Training: 10T+ multimodal tokens from video frames + transcripts
- Image Tokenization: VQ-VAE (IBQ) with 262,144 codebook size
- Visual Tokens: Token IDs 151855-413998 (262,144 discrete codes)
- Resolution: Supports up to 2048x2048 (128x128 latents)
Special Token IDs
BOS = 151849 # <|extra_203|> Begin generation
EOS = 151850 # <|extra_204|> End generation
IMG = 151851 # <|image|>
BOI = 151852 # <|begin_of_image|>
EOI = 151853 # <|end_of_image|>
EOL = 151846 # <|extra_200|> End of line
VISUAL_START = 151855 # First visual token
Prompt Format
<|extra_203|>{prompt}<|extra_204|>{H}*{W}<|image|>
Where H, W are latent dimensions (height/16, width/16).
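For example, a 512x512 request yields a 32x32 latent grid. A helper along these lines (the function name is hypothetical) would assemble the sequence:

```python
def build_t2i_prompt(prompt: str, width: int, height: int) -> str:
    # The latent grid is the pixel resolution divided by the VQ-VAE's
    # 16x spatial downsampling factor (2048x2048 -> 128x128 latents).
    h, w = height // 16, width // 16
    return f"<|extra_203|>{prompt}<|extra_204|>{h}*{w}<|image|>"

print(build_t2i_prompt("a red apple on a table", 512, 512))
# <|extra_203|>a red apple on a table<|extra_204|>32*32<|image|>
```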
Troubleshooting
Issue: "Model has no 'generate' method"
Solution: This is automatically fixed. The patched modeling_emu3.py adds GenerationMixin inheritance.
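The patch boils down to re-adding the mixin, roughly like this (the import path for Emu3ForCausalLM is illustrative; the actual class lives in the official modeling_emu3.py):

```python
from transformers import GenerationMixin
# from Emu3_5_repo...modeling_emu3 import Emu3ForCausalLM  # illustrative path

# transformers >= 4.50 stopped providing generate() through
# PreTrainedModel alone; models must inherit GenerationMixin explicitly.
class PatchedEmu3ForCausalLM(Emu3ForCausalLM, GenerationMixin):
    pass
```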
Issue: "Tokenizer crashes on loading"
Solution: The patched tokenizer handles missing visual tokens automatically.
Issue: "Out of memory"
Solution:
- Use nf4 precision (requires ~24GB VRAM)
- Reduce image dimensions (512x512 instead of 1024x1024)
- Enable --lowvram in the ComfyUI launch args
Issue: "Images are noise/garbage"
Solution:
- If on a Blackwell GPU: ensure attn_implementation="eager" (automatic in our loader)
- If NF4 quantized: verify lm_head is NOT quantized (check console output)
- Verify model weights with python verify_hashes.py (if available)
Issue: "Attention mask size mismatch"
Solution: Our patched modeling_emu3.py handles transformers 4.57+ cache API changes. If you still see this error, delete __pycache__ folders and restart.
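If you need the same compatibility trick in your own code, a version-agnostic shim looks roughly like this (a sketch, assuming the older get_usable_length signature):

```python
def cache_seq_length(cache) -> int:
    # transformers < 4.50: DynamicCache.get_usable_length(new_seq_length, ...)
    # transformers >= 4.50: replaced by get_seq_length()
    if hasattr(cache, "get_seq_length"):
        return cache.get_seq_length()
    return cache.get_usable_length(0)
```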
Development
Files Overview
- nodes.py: Main ComfyUI node implementations
- patched_tokenization_emu3.py: Fixed tokenizer with synthesized visual tokens
- __init__.py: ComfyUI registration
- test_*.py: Diagnostic scripts for testing model behavior
Testing
# Test basic generation (no CFG)
python test_minimal.py
# Test with logits processor
python test_with_processor.py
# Check token ID mappings
python check_special_tokens.py
Contributing
This is an experimental project. Contributions welcome, especially:
- Fixes for token generation issues
- Alternative decoding strategies
- Better prompt engineering
- Documentation improvements
License
This integration code: MIT License
Emu3.5 models and original code: Apache 2.0 (BAAI)
Citation
If you use Emu3.5 in your research, please cite:
@article{emu3.5,
title={Emu3.5: Native Multimodal Models are World Learners},
author={Emu3.5 Team},
journal={arXiv preprint arXiv:2510.26583},
year={2025}
}
@article{emu3,
title={Emu3: Next-Token Prediction is All You Need},
author={Wang, Xinlong and others},
journal={arXiv preprint arXiv:2409.18869},
year={2024}
}
Links
- Emu3.5 Project: https://emu.world/
- Paper: https://arxiv.org/pdf/2510.26583
- Official Code: https://github.com/baaivision/Emu3.5
- Model Weights: https://huggingface.co/BAAI/Emu3.5-Image
- ComfyUI: https://github.com/comfyanonymous/ComfyUI
Acknowledgments
Special thanks to:
- BAAI Team for releasing Emu3.5 and maintaining open-source models
- ComfyUI Community for the excellent node framework
- Hugging Face for model hosting infrastructure
Disclaimer: This is an unofficial, experimental integration. For official Emu3.5 usage, refer to BAAI's repository.
🎯 Emu3.5 Feature To-Do List
HIGH PRIORITY - Officially Supported, High Value
1. Image Editing / Variation
| Attribute | Status |
|-----------|--------|
| Status | Parameter exists in Emu35Sampler but not implemented |
| Official Support | ✅ Yes (in official repo) |
| Value | ⭐⭐⭐⭐⭐ (Very useful for workflows) |
| Complexity | 🔧🔧 Medium |
Description: Feed an input image to the model for editing/variation
Implementation:
- Encode the input image with the VQ-VAE
- Format as: BOI -> resolution -> IMG -> visual tokens -> EOI
- Prepend to the prompt or use a specific editing template
- Model generates a modified version/variation of the input (see the sketch after this list)
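A sketch of what that sequence assembly could look like. Here vq_encode() and encode_text() are assumed helpers, and the exact editing template would need to be taken from the official repo:

```python
# Token IDs from the "Special Token IDs" section above.
BOI, IMG, EOI = 151852, 151851, 151853

def build_edit_sequence(vq_model, image, instruction_ids, h, w):
    visual_ids = vq_encode(vq_model, image)    # assumed: image -> visual token IDs
    resolution_ids = encode_text(f"{h}*{w}")   # assumed: text -> token IDs
    return [BOI] + resolution_ids + [IMG] + visual_ids + [EOI] + instruction_ids
```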
Use Cases:
- "Make this image look like it's at sunset"
- "Add a hat to the person in this photo"
- Style transfer, variations, refinements
2. Chain-of-Thought Prompting
| Attribute | Status |
|-----------|--------|
| Status | CoT extraction implemented, but no way to REQUEST CoT |
| Official Support | ✅ Yes (special tokens set) |
| Value | ⭐⭐⭐⭐ (Research, debugging, better results) |
| Complexity | 🔧 Easy |
Description: Ask model to show reasoning BEFORE generating image
Implementation:
- Add checkbox: "Enable Chain-of-Thought"
- Modify prompt to request reasoning: "Think step by step before generating the image. {prompt}"
- Or wrap prompt in CoT tokens
Use Cases:
- Understanding why model chose certain composition
- Better results through explicit reasoning
- Research and analysis
MEDIUM PRIORITY - Officially Supported, Moderate Value
3. Multiple Images in Single Generation
| Attribute | Status |
|-----------|--------|
| Status | Not implemented |
| Official Support | ✅ Yes (multimodal_decode handles it) |
| Value | ⭐⭐⭐⭐ (Creative workflows) |
| Complexity | 🔧🔧🔧 Medium-High |
Description: Model generates multiple images in one pass
Implementation:
- Change RETURN_TYPES to support image arrays
- Parse all images from multimodal_decode
- Return a list/batch of images (see the sketch below)
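Packing the decoded images into a single ComfyUI IMAGE batch is straightforward when all outputs share one resolution; mixed sizes would need resizing or padding first. A minimal sketch:

```python
import numpy as np
import torch

def to_comfy_batch(pil_images):
    # ComfyUI IMAGE tensors are float32 [B, H, W, C] with values in 0..1.
    arrays = [np.asarray(im.convert("RGB"), dtype=np.float32) / 255.0
              for im in pil_images]
    return torch.from_numpy(np.stack(arrays, axis=0))
```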
Challenges:
- ComfyUI node output typing (might need custom output type)
- UI complexity (how to display multiple images)
Use Cases:
- "Generate a before and after comparison"
- "Show 3 variations of this concept"
- Sequential storytelling
4. Streaming Generation
| Attribute | Status |
|-----------|--------|
| Status | Not implemented |
| Official Support | ✅ Yes (streaming_generate function) |
| Value | ⭐⭐⭐ (UX improvement) |
| Complexity | 🔧🔧🔧🔧 High |
Description: Show progressive generation (text/image tokens as they're generated)
Implementation:
- Use the official streaming_generate() instead of generate()
- Implement a callback/webhook to update the UI
- Requires async/threading support
Challenges:
- ComfyUI architecture doesn't support streaming well
- Complex to implement properly
- May need custom UI components
Use Cases:
- Real-time feedback during long generations
- Early stopping if generation goes wrong
- Better UX for slow models
5. Video Generation (Emu3.5-Video)
| Attribute | Status |
|-----------|--------|
| Status | Not implemented at all |
| Official Support | ✅ Yes (separate model: Emu3.5-Video) |
| Value | ⭐⭐⭐⭐⭐ (Huge feature) |
| Complexity | 🔧🔧🔧🔧🔧 Very High |
Description: Text-to-Video and Image-to-Video generation
Implementation:
- Create separate loader for Video model
- Implement temporal dimension handling
- Video tokenization/detokenization
- Frame generation and encoding
Requirements:
- Emu3.5-Video model (separate download)
- Video codec support
- Significantly more VRAM
Use Cases:
- Text-to-video: "A cat walking across a room"
- Image-to-video: Animate a still image
- Video editing/style transfer
LOW PRIORITY - Quality of Life Improvements
6. Batch Generation
| Attribute | Status |
|-----------|--------|
| Status | Not implemented |
| Official Support | ⚠️ Partial (model supports batch, but examples don't use it) |
| Value | ⭐⭐⭐ (Efficiency) |
| Complexity | 🔧🔧 Medium |
Description: Generate multiple images in one pass (different prompts/seeds)
Implementation:
- Add "batch_size" parameter
- Process multiple prompts in parallel
- Return batched tensor
Use Cases:
- Generate 4 variations with different seeds
- Compare different prompts side-by-side
7. Advanced Sampling Parameters
| Attribute | Status |
|-----------|--------|
| Status | Using defaults |
| Official Support | ✅ Yes (many parameters available) |
| Value | ⭐⭐ (Power users) |
| Complexity | 🔧 Easy |
Description: Expose more generation parameters to users
Current: image_top_k and temperature are hardcoded
Could Add (exposed as node inputs; see the sketch below):
- text_top_k, text_top_p, text_temperature
- image_top_p, image_temperature
- use_differential_sampling toggle
- num_beams for beam search
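Exposing these is mostly a matter of widening the node's INPUT_TYPES. The ranges and defaults below are illustrative guesses, not official values:

```python
class Emu35Sampler:  # excerpt; only the new optional inputs are shown
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "optional": {
                "image_top_k": ("INT", {"default": 2048, "min": 1, "max": 262144}),
                "image_temperature": ("FLOAT", {"default": 1.0, "min": 0.1, "max": 2.0}),
                "text_top_p": ("FLOAT", {"default": 0.9, "min": 0.0, "max": 1.0}),
                "use_differential_sampling": ("BOOLEAN", {"default": False}),
            }
        }
```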
Use Cases:
- Fine-tuning generation quality
- Experimentation
- Reproducibility research
8. Negative Prompting (Better Implementation)
| Attribute | Status |
|-----------|--------|
| Status | Basic implementation in Emu35Sampler |
| Official Support | ⚠️ Indirect (CFG supports it) |
| Value | ⭐⭐⭐ (Quality control) |
| Complexity | 🔧 Easy |
Description: Better negative prompt handling
Current Issue: Negative prompt uses same template as positive
Could Improve:
- Test if different negative templates work better
- Add "negative weight" parameter
- Research optimal negative prompting for Emu3.5
Use Cases:
- "No hands, no text, no watermarks"
- Fine control over what NOT to generate
9. LoRA / Fine-tuning Support
| Attribute | Status |
|-----------|--------|
| Status | Not implemented |
| Official Support | ❓ Unknown (likely possible, not documented) |
| Value | ⭐⭐⭐⭐⭐ (Customization) |
| Complexity | 🔧🔧🔧🔧 High |
Description: Load and apply LoRA weights for custom styles
Implementation:
- LoRA weight loading
- Merging with base model
- Multi-LoRA support
Use Cases:
- Custom art styles
- Specific character/object training
- Domain adaptation
10. Model/Tokenizer Caching
| Attribute | Status |
|-----------|--------|
| Status | Loads from scratch each time |
| Official Support | N/A (implementation detail) |
| Value | ⭐⭐⭐ (Performance) |
| Complexity | 🔧 Easy |
Description: Cache loaded models between runs
Implementation:
- Global cache dict
- Check cache before loading
- LRU eviction for memory management (a minimal sketch follows)
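A minimal version of that cache, with a capacity of one since even a single Emu3.5 model is large (names and capacity are illustrative):

```python
from collections import OrderedDict

_MODEL_CACHE = OrderedDict()   # (model_name, precision) -> loaded bundle
_MAX_CACHED = 1                # these models are huge; keep at most one

def load_cached(model_name, precision, loader_fn):
    key = (model_name, precision)
    if key in _MODEL_CACHE:
        _MODEL_CACHE.move_to_end(key)       # mark most recently used
        return _MODEL_CACHE[key]
    bundle = loader_fn(model_name, precision)
    _MODEL_CACHE[key] = bundle
    while len(_MODEL_CACHE) > _MAX_CACHED:
        _MODEL_CACHE.popitem(last=False)    # evict least recently used
    return bundle
```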
Use Cases:
- Faster workflow iterations
- Multiple workflows using same model
11. Image Resolution Validation
| Attribute | Status |
|-----------|--------|
| Status | No validation |
| Official Support | N/A (safety feature) |
| Value | ⭐⭐ (UX) |
| Complexity | 🔧 Easy |
Description: Warn/prevent invalid resolutions
Implementation:
- Check width/height are multiples of 16
- Warn if too large (VRAM limits)
- Auto-adjust to the nearest valid size (see the sketch below)
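A sketch of the snapping logic. The multiple of 16 matches the VQ latent grid; note the sampler node itself currently asks for multiples of 64:

```python
def snap_resolution(width: int, height: int,
                    multiple: int = 16, lo: int = 256, hi: int = 2048):
    # Clamp each side to the supported range, then round to the
    # nearest valid multiple.
    def snap(v: int) -> int:
        v = max(lo, min(hi, v))
        return round(v / multiple) * multiple
    return snap(width), snap(height)
```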
Use Cases:
- Prevent user errors
- Better error messages
12. Prompt Templates Library
| Attribute | Status |
|-----------|--------|
| Status | Templates are hardcoded |
| Official Support | N/A (UX feature) |
| Value | ⭐⭐ (Convenience) |
| Complexity | 🔧 Easy |
Description: Preset templates for different use cases
Templates:
- Photography styles
- Art styles
- Specific domains (architecture, product, portrait)
Use Cases:
- Quick starts for new users
- Consistent results
EXPERIMENTAL - Not Officially Documented
13. Inpainting/Outpainting
| Attribute | Status |
|-----------|--------|
| Status | Not implemented |
| Official Support | ❓ Unknown |
| Value | ⭐⭐⭐⭐⭐ (Very powerful) |
| Complexity | 🔧🔧🔧🔧🔧 Very High |
Description: Edit specific regions or extend images
Would Require: Research into whether Emu3.5 supports this
14. ControlNet-style Conditioning
| Attribute | Status |
|-----------|--------|
| Status | Not implemented |
| Official Support | ❓ Unknown |
| Value | ⭐⭐⭐⭐⭐ (Precise control) |
| Complexity | 🔧🔧🔧🔧🔧 Very High |
Description: Condition generation on edge maps, depth, etc.
Would Require: Research and possibly model fine-tuning
📋 Recommended Implementation Order
Phase 1: Quick Wins (1-2 weeks)
- [ ] Chain-of-Thought prompting (add checkbox)
- [ ] Advanced sampling parameters (expose existing params)
- [ ] Model caching (performance boost)
Phase 2: High Value (2-4 weeks)
- [ ] Image editing/variation (use input_image parameter)
- [ ] Negative prompt improvements
- [ ] Batch generation
Phase 3: Major Features (1-2 months)
- [ ] Multiple images per generation
- [ ] Video generation (Emu3.5-Video model)
Phase 4: Advanced (Research required)
- [ ] Streaming generation
- [ ] LoRA support
- [ ] Inpainting/outpainting (if possible)
🎯 Top 3 Recommendations
1. Image Editing - high value, moderate effort, officially supported
2. CoT Prompting - easy win, improves quality, unique feature
3. Video Generation - huge feature, but requires a separate model