ComfyUI-Maya1_TTS
Expressive Voice Generation with Emotions for ComfyUI
A ComfyUI node pack for Maya1, a 3B-parameter speech model built for expressive voice generation with rich human emotion and precise voice design.
Features
- Voice Design through natural language descriptions
- 20+ Emotions: laugh, cry, whisper, angry, sigh, gasp, scream, and more
- Real-time Generation with the SNAC neural codec (24kHz audio)
- Multiple Attention Mechanisms: SDPA, Flash Attention 2, Sage Attention
- Quantization Support: 4-bit and 8-bit for memory-constrained GPUs
- Native ComfyUI Cancel: Stop generation at any time
- Progress Tracking: Real-time token generation speed (it/s)
- Model Caching: Fast subsequent generations
- Smart VRAM Management: Auto-clears on dtype changes
Installation
<details>
<summary><b>Quick Install (Click to expand)</b></summary>
1. Clone the Repository
cd ComfyUI/custom_nodes/
git clone https://github.com/Saganaki22/ComfyUI-Maya1_TTS.git
cd ComfyUI-Maya1_TTS
2. Install Dependencies
Core dependencies (required):
pip install torch>=2.0.0 transformers>=4.50.0 numpy>=1.21.0 snac>=1.0.0
Or install from requirements.txt:
pip install -r requirements.txt
</details>
<details>
<summary><b>Optional: Enhanced Performance (Click to expand)</b></summary>
Quantization (Memory Savings)
For 4-bit/8-bit quantization support:
pip install bitsandbytes>=0.41.0
Memory savings:
- 4-bit: ~6GB → ~3GB VRAM (slight quality loss)
- 8-bit: ~6GB → ~4GB VRAM (minimal quality loss)
Accelerated Attention
Flash Attention 2 (CUDA only, best for batch inference):
pip install flash-attn>=2.0.0
Sage Attention (memory efficient):
pip install sageattention>=1.0.0
Install All Optional Dependencies
pip install bitsandbytes flash-attn sageattention
</details>
<details>
<summary><b>Download Maya1 Model (Click to expand)</b></summary>
Model Location
Models go in: ComfyUI/models/maya1-TTS/
Expected Folder Structure
After downloading, your model folder should look like this:
ComfyUI/
└── models/
    └── maya1-TTS/
        └── maya1/                               # Model name (can be anything)
            ├── chat_template.jinja              # Chat template
            ├── config.json                      # Model configuration
            ├── generation_config.json           # Generation settings
            ├── model-00001-of-00002.safetensors # Model weights (shard 1)
            ├── model-00002-of-00002.safetensors # Model weights (shard 2)
            ├── model.safetensors.index.json     # Weight index
            ├── special_tokens_map.json          # Special tokens
            └── tokenizer/                       # Tokenizer subfolder
                ├── chat_template.jinja          # Chat template (duplicate)
                ├── special_tokens_map.json      # Special tokens (duplicate)
                ├── tokenizer.json               # Tokenizer vocabulary (22.9 MB)
                └── tokenizer_config.json        # Tokenizer config
Critical files required:
- config.json - Model architecture configuration
- generation_config.json - Default generation parameters
- model-00001-of-00002.safetensors & model-00002-of-00002.safetensors - Model weights (2 shards)
- model.safetensors.index.json - Weight index mapping
- chat_template.jinja & special_tokens_map.json - In root folder
- tokenizer/ folder with all 4 tokenizer files
Note: You can have multiple models by creating separate folders like maya1, maya1-finetuned, etc.
Option 1: Hugging Face CLI (Recommended)
# Install HF CLI
pip install huggingface-hub
# Create directory
cd ComfyUI
mkdir -p models/maya1-TTS
# Download model
hf download maya-research/maya1 --local-dir models/maya1-TTS/maya1
Option 2: Python Script
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="maya-research/maya1",
    local_dir="ComfyUI/models/maya1-TTS/maya1",
    local_dir_use_symlinks=False,
)
```
Option 3: Manual Download
- Go to Maya1 on HuggingFace
- Download all files to ComfyUI/models/maya1-TTS/maya1/
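After downloading by any of these options, a quick sanity check like the following can confirm the critical files listed above are in place. This is a standalone snippet (not part of the node); the install path is an assumption based on the default layout.

```python
from pathlib import Path

# Standalone check (assumed default install path): verify the critical files exist.
model_dir = Path("ComfyUI/models/maya1-TTS/maya1")
required = [
    "config.json",
    "generation_config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "model.safetensors.index.json",
    "chat_template.jinja",
    "special_tokens_map.json",
    "tokenizer/tokenizer.json",
    "tokenizer/tokenizer_config.json",
]
missing = [name for name in required if not (model_dir / name).exists()]
print("All critical files present." if not missing else f"Missing files: {missing}")
```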
Restart ComfyUI to load the new nodes. The node will appear under:
Add Node → audio → Maya1 TTS (AIO)
</details>

Usage
Node: Maya1 TTS (AIO)
All-in-one node for loading models and generating speech.
<details>
<summary><b>Model Settings</b></summary>
model_name (dropdown)
- Select from models in ComfyUI/models/maya1-TTS/
- Models are auto-discovered on startup
dtype (dropdown)
- 4bit: NF4 quantization (~6GB VRAM, requires bitsandbytes, SLOWER)
- 8bit: INT8 quantization (~7GB VRAM, requires bitsandbytes, SLOWER)
- float16: 16-bit half precision (~8-9GB VRAM, FAST, good quality)
- bfloat16: 16-bit brain float (~8-9GB VRAM, FAST, recommended)
- float32: 32-bit full precision (~16GB VRAM, highest quality, slower)
⚠️ IMPORTANT: Quantization (4-bit/8-bit) is SLOWER than float16/bfloat16!
- Only use quantization if you have limited VRAM (<10GB)
- If you have 10GB+ VRAM, use float16 or bfloat16 for best speed
attention_mechanism (dropdown)
- sdpa: PyTorch SDPA (default, fastest for single TTS)
- flash_attention_2: Flash Attention 2 (batch inference)
- sage_attention: Sage Attention (memory efficient)
device (dropdown)
- cuda: Use GPU (recommended)
- cpu: Use CPU (slower)
voice_description (multiline text)
Describe the voice using natural language:
Realistic male voice in the 30s with American accent. Normal pitch, warm timbre, conversational pacing.
Voice Components:
- Age: in their 20s, 30s, 40s, 50s
- Gender: Male voice, Female voice
- Accent: American, British, Australian, Indian, Middle Eastern
- Pitch: high pitch, normal pitch, low pitch
- Timbre: warm, gravelly, smooth, raspy
- Pacing: fast pacing, conversational, slow pacing
- Tone: happy, angry, curious, energetic, calm
text (multiline text)
Text to synthesize with optional emotion tags:
Hello! This is Maya1 <laugh> the best open source voice AI!
</details>
<details>
<summary><b>Generation Settings</b></summary>
keep_model_in_vram (boolean)
- True: Keep model loaded for faster repeated generations
- False: Clear VRAM after generation (saves memory)
- Auto-clears when dtype changes
temperature (0.1-2.0, default: 0.4)
- Lower = more consistent
- Higher = more varied/creative
top_p (0.1-1.0, default: 0.9)
- Nucleus sampling parameter
- 0.9 recommended for natural speech
max_tokens (100-8000, default: 2000)
- Maximum audio tokens to generate
- Higher = longer audio
repetition_penalty (1.0-2.0, default: 1.1)
- Reduces repetitive speech
- 1.1 is good default
seed (integer, default: 0)
- Use same seed for reproducible results
- Use ComfyUI's control_after_generate for random/increment
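To make the sampling settings above concrete, here is an illustrative sketch of how parameters like these are typically passed to a Hugging Face transformers generate() call. It is not the node's internal code; the model path and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; adjust to your install.
model_dir = "ComfyUI/models/maya1-TTS/maya1"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16).to("cuda")

prompt = "Hello! This is Maya1 <laugh> the best open source voice AI!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.manual_seed(0)  # fixed seed -> reproducible sampling
output_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.4,         # lower = more consistent
    top_p=0.9,               # nucleus sampling
    max_new_tokens=2000,     # more tokens = longer audio
    repetition_penalty=1.1,  # reduces repetitive speech
)
```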
audio (ComfyUI AUDIO type)
- 24kHz mono audio
- Compatible with all ComfyUI audio nodes
- Connect to PreviewAudio, SaveAudio, etc.
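The ComfyUI AUDIO type is a dictionary carrying a waveform tensor and a sample rate. As a rough illustration of what downstream code receives (assuming the usual {"waveform": [batch, channels, samples], "sample_rate": int} layout and torchaudio installed), the output could be saved outside of ComfyUI like this:

```python
import torchaudio

def save_comfy_audio(audio: dict, path: str = "maya1_output.wav") -> None:
    # Assumed layout of ComfyUI's AUDIO type.
    waveform = audio["waveform"][0]     # first item in the batch -> [channels, samples]
    sample_rate = audio["sample_rate"]  # 24000 Hz for Maya1
    torchaudio.save(path, waveform.cpu(), sample_rate)
```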
</details>

Emotion Tags
Add emotions anywhere in your text using <tag> syntax:
Hello! This is amazing <laugh> I can't believe it!
After all we went through <cry> I can't believe he was the traitor.
Wow! <gasp> This place looks incredible!
<details>
<summary><b>All 17 Available Emotions (Click to expand)</b></summary>
Laughter & Joy:
- <laugh> - Normal laugh
- <laugh_harder> - Intense laughing
- <giggle> - Light giggling
- <chuckle> - Soft chuckle
Sadness & Sighs:
- <cry> - Crying
- <sigh> - Sighing
Surprise & Breath:
- <gasp> - Surprised gasp
- <breathing> - Heavy breathing
Intensity & Emotion:
- <whisper> - Whispering
- <angry> - Angry tone
- <scream> - Screaming
Natural Sounds:
- <snort> - Snorting
- <yawn> - Yawning
- <cough> - Coughing
- <sneeze> - Sneezing
- <humming> - Humming
- <throat_clearing> - Clearing throat
Tip: Hover over the node title to see all emotion tags and usage examples!
</details>
Example Character Speeches
<details>
<summary><b>Generative AI & ComfyUI Examples (Click to expand)</b></summary>
Example 1: Excited AI Researcher
Voice Description:
Female voice in her 30s with American accent. High pitch, energetic tone at high intensity, fast pacing.
Text:
Oh my god! <laugh> Have you seen the new Stable Diffusion model in ComfyUI? The quality is absolutely incredible! <gasp> I just generated a photorealistic portrait in like 20 seconds. This is game-changing for our workflow!
Example 2: Skeptical Developer
Voice Description:
Male voice in his 40s with British accent. Low pitch, calm tone, conversational pacing.
Text:
I've been testing this new node pack in ComfyUI <sigh> and honestly, I'm impressed. At first I was skeptical about the whole generative AI hype, but <gasp> the control you get with custom nodes is remarkable. This changes everything.
Example 3: Enthusiastic Tutorial Creator
Voice Description:
Female voice in her 20s with Australian accent. Normal pitch, warm timbre, energetic tone at medium intensity.
Text:
Hey everyone! <laugh> Welcome back to my ComfyUI tutorial series! Today we're diving into the most powerful image generation workflow I've ever seen. <gasp> You're not gonna believe how easy this is! Let's get started!
Example 4: Frustrated Beginner
Voice Description:
Male voice in his 30s with American accent. Normal pitch, stressed tone at medium intensity, fast pacing.
Text:
Why won't this workflow run? <angry> I've connected all the nodes exactly like the tutorial showed! <sigh> Wait... Oh no. <laugh> I forgot to load the checkpoint model. Classic beginner mistake! Okay, let's try this again.
Example 5: Amazed AI Artist
Voice Description:
Female voice in her 40s with Indian accent. Normal pitch, curious tone, slow pacing, dramatic delivery.
Text:
When I first discovered ComfyUI <whisper> I thought it was just another image generator. But then <gasp> I realized you can chain workflows together, use custom models, and <laugh> even generate animations! This is the future of digital art!
Example 6: Confident AI Entrepreneur
Voice Description:
Male voice in his 50s with Middle Eastern accent. Low pitch, gravelly timbre, slow pacing, confident tone at high intensity.
Text:
The generative AI revolution is here. <dramatic pause> ComfyUI gives us the tools to build production-ready workflows. <chuckle> While others are still playing with web UIs, we're automating entire creative pipelines. This is how you stay ahead of the curve.
</details>
Advanced Configuration
<details>
<summary><b>Attention Mechanisms Comparison</b></summary>

| Mechanism | Speed | Memory | Best For | Requirements |
|-----------|-------|--------|----------|--------------|
| SDPA | ⚡⚡⚡ | Good | Single TTS generation | PyTorch ≥2.0 |
| Flash Attention 2 | ⚡⚡ | Good | Batch processing | flash-attn, CUDA |
| Sage Attention | ⚡⚡ | Excellent | Long sequences | sageattention |
Why is SDPA fastest for TTS?
- Optimized for single-sequence autoregressive generation
- Lower kernel launch overhead (~20μs vs 50-60μs)
- Flash/Sage Attention shine with batch size ≥8
Recommendation: Use SDPA (default) for single audio generation.
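For reference, attention backends are typically chosen when loading a transformers model via the attn_implementation argument. The sketch below is illustrative only (not the node's internal code): "sdpa" and "flash_attention_2" are standard transformers options, while Sage Attention is normally enabled separately through the sageattention package rather than this argument.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative sketch: selecting an attention backend at load time.
model = AutoModelForCausalLM.from_pretrained(
    "ComfyUI/models/maya1-TTS/maya1",    # placeholder local path
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",          # or "flash_attention_2" (requires flash-attn + CUDA)
).to("cuda")
```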
</details>

<details>
<summary><b>Quantization Details</b></summary>

⚠️ CRITICAL: Quantization is SLOWER than fp16/bf16!
Memory Usage (Maya1 3B Model)
| Dtype | VRAM Usage | Speed | Quality |
|-------|------------|-------|---------|
| 4-bit NF4 | ~6GB | Slow ⚡ | Good (slight loss) |
| 8-bit INT8 | ~7GB | Slow ⚡ | Excellent (minimal loss) |
| float16 | ~8-9GB | Fast ⚡⚡⚡ | Excellent |
| bfloat16 | ~8-9GB | Fast ⚡⚡⚡ | Excellent |
| float32 | ~16GB | Medium ⚡⚡ | Perfect |
4-bit NF4 Quantization
Features:
- Uses NormalFloat4 (NF4) for best 4-bit quality
- Double quantization (nested) for better accuracy
- Memory usage: ~6GB (vs ~8-9GB for fp16)
When to use:
- You have limited VRAM (8GB or less GPU)
- Speed is not critical (inference is slower due to dequantization)
- Need to fit model in smaller VRAM
When NOT to use:
- You have 10GB+ VRAM โ Use float16/bfloat16 instead for better speed!
8-bit INT8 Quantization
Features:
- Standard 8-bit integer quantization
- Memory usage: ~7GB (vs ~8-9GB for fp16)
- Minimal quality impact
When to use:
- You have moderate VRAM constraints (8-10GB GPU)
- Want good quality with some memory savings
- Speed is not critical
When NOT to use:
- You have 10GB+ VRAM โ Use float16/bfloat16 instead for better speed!
Why is Quantization Slower?
Quantized models require dequantization on every forward pass:
- Model weights stored in 4-bit/8-bit
- Weights dequantized to fp16 for computation
- Computation happens in fp16
- Extra overhead = slower inference
Recommendation: Only use quantization if you truly need the memory savings!
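For reference, here is a minimal sketch of how NF4 quantization with double quantization is typically configured through transformers and bitsandbytes. It illustrates the features described above and is not necessarily the exact configuration the node uses; the model path is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative NF4 setup matching the features above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # nested quantization for better accuracy
    bnb_4bit_compute_dtype=torch.float16,  # weights are dequantized to fp16 for compute
)

model = AutoModelForCausalLM.from_pretrained(
    "ComfyUI/models/maya1-TTS/maya1",      # placeholder local path
    quantization_config=bnb_config,
    device_map="auto",
)
```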
Automatic Dtype Switching
The node automatically clears VRAM when you switch dtypes:
Dtype changed from bfloat16 to 4bit
Clearing cache to reload model...
This prevents dtype mismatch errors and ensures correct quantization.
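Under the hood, clearing a cached model typically amounts to dropping the reference and emptying the CUDA cache before reloading with the new dtype. A minimal sketch (not the node's exact code, and the cache dict is a hypothetical stand-in):

```python
import gc
import torch

def clear_cached_model(cache: dict) -> None:
    """Release a cached model so it can be reloaded with a different dtype."""
    cache.pop("model", None)    # drop the reference so it can be garbage collected
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return freed blocks to the driver
```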
</details>

<details>
<summary><b>Console Progress Output</b></summary>
Real-time generation statistics in the console:

Seed: 1337
Generating speech (max 2000 tokens)...
Tokens: 500/2000 | Speed: 12.45 it/s | Elapsed: 40.2s
Generated 1500 tokens in 120.34s (12.47 it/s)
it/s = iterations per second (tokens/second)
</details>

Troubleshooting
<details>
<summary><b>Model Not Found</b></summary>
Error: No valid Maya1 models found
Solutions:
- Check the model location: ComfyUI/models/maya1-TTS/
- Download the model (see Installation section)
- Restart ComfyUI
- Check the console for model discovery messages
Error: CUDA out of memory
Memory requirements:
- 4-bit: ~6GB VRAM (slower)
- 8-bit: ~7GB VRAM (slower)
- float16/bfloat16: ~8-9GB VRAM (fast, recommended)
- float32: ~16GB VRAM
Solutions (try in order):
- Use 4-bit dtype if you have ≤8GB VRAM (~6GB usage)
- Use 8-bit dtype if you have ~8-10GB VRAM (~7GB usage)
- Use float16 if you have 10GB+ VRAM (faster than quantization!)
- Set keep_model_in_vram=False to free VRAM after generation
- Reduce max_tokens to 1000-1500
- Close other VRAM-heavy applications
- Use CPU (much slower, but it works)
Note: If you have 10GB+ VRAM, use float16/bfloat16 for best speed!
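If you are unsure how much VRAM is free before picking a dtype, a quick standalone check like the following (not part of the node) can help:

```python
import torch

# Report free and total VRAM so you can pick a dtype per the guidance above.
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"Free VRAM:  {free_bytes / 1024**3:.1f} GB")
    print(f"Total VRAM: {total_bytes / 1024**3:.1f} GB")
else:
    print("No CUDA device detected; the node will fall back to CPU (much slower).")
```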
</details>

<details>
<summary><b>Quantization Errors</b></summary>
Error: bitsandbytes not found
Solution:
pip install bitsandbytes>=0.41.0
Error: Quantization requires CUDA
Solution:
- 4-bit/8-bit only work on CUDA
- Switch to float16/bfloat16 for CPU
Error: No SNAC audio tokens generated!
Solutions:
- Increase max_tokens to 2000-4000
- Adjust temperature to 0.3-0.5
- Simplify the voice description
- Check that the text isn't too long
- Try a different seed value
Error: flash-attn won't install
Solution:
- Flash Attention requires CUDA and specific setup
- Just use SDPA instead (works great, actually faster for TTS!)
- SDPA is the recommended default
Issue: Can't see the "?" or "i" icon, only hover tooltip
Answer: This is normal and working correctly!
- ComfyUI's DESCRIPTION creates a hover tooltip
- Some ComfyUI versions show no visible icon
- Just hover over the node title area to see help
- Contains all emotion tags and usage examples
</details>

Performance Tips
- Use float16/bfloat16 if you have 10GB+ VRAM (fastest!)
- Use quantization (4-bit/8-bit) ONLY if limited VRAM (<10GB) - slower but fits in memory
- Keep SDPA as attention mechanism (fastest for single TTS)
- Enable model caching (keep_model_in_vram=True) for multiple generations
- Optimize max_tokens: Start with 1500-2000
- Batch similar requests with same voice description for efficiency
⚠️ Speed ranking: float16/bfloat16 (fastest) > float32 > 8-bit > 4-bit (slowest)
Technical Details
<details>
<summary><b>Architecture</b></summary>
- Model: 3B-parameter Llama-based transformer
- Audio Codec: SNAC (Speech Neural Audio Codec)
- Sample Rate: 24kHz mono
- Frame Structure: 7 tokens per frame (3 hierarchical levels)
- Token Ranges:
- SNAC tokens: 128266-156937
- Text EOS: 128009
- SNAC EOS: 128258
- Compression: ~0.98 kbps streaming
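To illustrate the frame structure, here is a rough sketch of how 7-code SNAC frames are commonly split across the codec's 3 hierarchical levels and decoded with the snac package. The per-frame positions and the assumption that token IDs have already had the SNAC offset (128266) removed are illustrative; the node's internal mapping may differ.

```python
import torch
from snac import SNAC

def decode_frames(frames: list[list[int]]) -> torch.Tensor:
    """Decode a list of 7-code SNAC frames (offset already removed) to a 24kHz waveform."""
    # Assumed interleaving: 1 + 2 + 4 codes per frame across the 3 levels.
    level0 = [f[0] for f in frames]
    level1 = [c for f in frames for c in (f[1], f[4])]
    level2 = [c for f in frames for c in (f[2], f[3], f[5], f[6])]
    codes = [torch.tensor(lvl, dtype=torch.long).unsqueeze(0) for lvl in (level0, level1, level2)]

    codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    with torch.no_grad():
        audio = codec.decode(codes)  # waveform tensor at 24kHz
    return audio
```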
ComfyUI-Maya1_TTS/
├── __init__.py                  # Node registration
├── nodes/
│   ├── __init__.py
│   └── maya1_tts_combined.py    # AIO node
├── core/
│   ├── model_wrapper.py         # Model loading & quantization
│   ├── snac_decoder.py          # SNAC audio decoding
│   └── utils.py                 # Utilities & cancel support
├── resources/
│   ├── emotions.txt             # 17 emotion tags
│   └── prompt_examples.txt      # Voice description examples
├── pyproject.toml               # Package metadata
├── requirements.txt             # Dependencies
└── README.md                    # This file
</details>
<details>
<summary><b>ComfyUI Integration</b></summary>
- Cancel Support: Native execution.interruption_requested()
- Progress Bars: comfy.utils.ProgressBar
- Audio Format: ComfyUI AUDIO type (24kHz mono)
- Model Caching: Automatic with dtype change detection
- VRAM Management: Manual control via toggle
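As a rough illustration of these hooks (runs only inside ComfyUI), the sketch below advances a ProgressBar per generated token and checks for user interruption. It uses comfy.model_management.throw_exception_if_processing_interrupted() as a stand-in for the interruption check named above; the node's actual loop may differ.

```python
import comfy.utils
import comfy.model_management

def generation_loop(max_tokens: int) -> None:
    # Progress bar shown in the ComfyUI UI; one step per generated token.
    pbar = comfy.utils.ProgressBar(max_tokens)
    for _ in range(max_tokens):
        # Raises if the user pressed Cancel, aborting generation cleanly.
        comfy.model_management.throw_exception_if_processing_interrupted()
        # ... generate one audio token here ...
        pbar.update(1)
```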
</details>

https://github.com/user-attachments/assets/7a5b0f96-8d59-4e32-870b-03017ecc111f
Credits
- Maya1 Model: Maya Research
- HuggingFace: maya-research/maya1
- SNAC Codec: hubertsiuzdak/snac
- ComfyUI: comfyanonymous/ComfyUI
License
Apache 2.0 - See LICENSE
Maya1 model is also licensed under Apache 2.0 by Maya Research.
Links
- Issues: GitHub Issues
- Maya Research: Website | Twitter
- Model Page: HuggingFace
Citation
If you use Maya1 in your research, please cite:
@misc{maya1voice2025,
title={Maya1: Open Source Voice AI with Emotional Intelligence},
author={Maya Research},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/maya-research/maya1}},
}
Bringing expressive voice AI to everyone through open source.