ComfyUI-ThinkSound_Wrapper
A ComfyUI wrapper implementation of ThinkSound - an advanced AI model for generating high-quality audio from text descriptions and video content using Chain-of-Thought (CoT) reasoning.
<img width="2989" height="1324" alt="image" src="https://github.com/user-attachments/assets/aae9c5a4-6113-4d20-809f-307aa2202086" />https://github.com/user-attachments/assets/b3f090a7-fe58-4bb0-8e21-cb19377aa9cf
14.02.25 - Added support for the full-size model, thinksound.ckpt:
- Download it from: https://huggingface.co/FunAudioLLM/ThinkSound/resolve/main/thinksound.ckpt?download=true
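If you prefer scripting the download, the same checkpoint can be fetched with `huggingface_hub` (a minimal sketch; the `repo_id` and `filename` come from the URL above, and the target directory assumes the model layout described under Installation):

```python
# Sketch: fetch the full-size checkpoint directly into the ComfyUI model folder.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="FunAudioLLM/ThinkSound",
    filename="thinksound.ckpt",
    local_dir="ComfyUI/models/thinksound",  # adjust to your ComfyUI root
)
print(f"Checkpoint saved to {ckpt_path}")
```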
🎵 Features
- Text-to-Audio Generation: Create audio from detailed text descriptions
- Video-to-Audio Generation: Generate synchronized audio that matches video content
- Chain-of-Thought Reasoning: Use detailed CoT prompts for precise audio control
- Multimodal Understanding: Combines visual and textual information for better results
- ComfyUI Integration: Easy-to-use nodes that integrate seamlessly with ComfyUI workflows
🎬 What Makes ThinkSound Special
ThinkSound uses multimodal AI to understand both text and video:
- MetaCLIP for visual scene understanding
- Synchformer for temporal motion analysis
- T5 for detailed language understanding
- Advanced diffusion model for high-quality audio synthesis
📋 Requirements
System Requirements
- NVIDIA GPU with at least 8GB VRAM (12GB+ recommended; see Memory Requirements below)
- Python 3.8+
- ComfyUI installed and working
- Windows/Linux (tested on Windows)
Dependencies
The following Python packages will be installed automatically:
```
torch>=2.0.1
torchaudio>=2.0.2
torchvision>=0.15.0
transformers>=4.20.0
accelerate>=0.20.0
alias-free-torch==0.0.6
descript-audio-codec==1.0.0
vector-quantize-pytorch==1.9.14
einops==0.7.0
open-clip-torch>=2.20.0
huggingface_hub
safetensors
sentencepiece>=0.1.99
```
🚀 Installation
Step 1: Install the Custom Node

1. Navigate to your ComfyUI custom nodes folder:

   ```
   cd ComfyUI/custom_nodes/
   ```

2. Clone this repository and enter it:

   ```
   git clone https://github.com/ShmuelRonen/ComfyUI-ThinkSound_Wrapper.git
   cd ComfyUI-ThinkSound_Wrapper
   ```

3. Your folder structure should look like:

   ```
   ComfyUI-ThinkSound_Wrapper/
   ├── __init__.py
   ├── nodes.py
   ├── requirements.txt
   ├── thinksound/
   │   ├── data/
   │   ├── models/
   │   ├── inference/
   │   └── ...
   └── README.md
   ```
Step 2: Install Dependencies
Option A: Install all dependencies (recommended)

```
pip install -r requirements.txt
```

Option B: Install minimal dependencies

```
pip install torch torchaudio torchvision transformers accelerate
pip install alias-free-torch==0.0.6 descript-audio-codec==1.0.0 vector-quantize-pytorch==1.9.14
pip install einops open-clip-torch huggingface_hub safetensors sentencepiece
```
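After installing, you can confirm the environment resolves all imports (a minimal sketch; note that several pip package names differ from their import names, e.g. descript-audio-codec imports as `dac` and open-clip-torch as `open_clip`):

```python
# Sketch: verify the wrapper's dependencies import cleanly.
# Run from the same Python environment that launches ComfyUI.
import importlib

modules = [
    "torch", "torchaudio", "torchvision", "transformers", "accelerate",
    "alias_free_torch", "dac", "vector_quantize_pytorch", "einops",
    "open_clip", "huggingface_hub", "safetensors", "sentencepiece",
]
for name in modules:
    try:
        importlib.import_module(name)
        print(f"OK   {name}")
    except ImportError as exc:
        print(f"FAIL {name}: {exc}")
```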
Step 3: Download Models

1. Download the models pack from Google Drive.

2. Create the thinksound models folder if it doesn't exist:

   ```
   mkdir -p ComfyUI/models/thinksound
   ```

3. Extract the downloaded archive and place the models in:

   ```
   ComfyUI/models/thinksound/
   ├── thinksound_light.ckpt
   ├── vae.ckpt
   ├── synchformer_state_dict.pth
   └── (other model files)
   ```
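Before restarting ComfyUI, a quick script can confirm the checkpoints are where the loaders expect them (a minimal sketch; the file names follow the layout above):

```python
# Sketch: check that the expected ThinkSound checkpoints are in place.
from pathlib import Path

models_dir = Path("ComfyUI/models/thinksound")  # adjust to your ComfyUI root
expected = ["thinksound_light.ckpt", "vae.ckpt", "synchformer_state_dict.pth"]

models_dir.mkdir(parents=True, exist_ok=True)
for name in expected:
    path = models_dir / name
    print(f"{'found  ' if path.is_file() else 'MISSING'}: {path}")
```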
Step 4: Restart ComfyUI

- Restart ComfyUI completely
- Check the console for successful loading messages:

```
🎉 ThinkSound modules imported successfully!
✅ SUCCESS: Found FeaturesUtils in thinksound.data.v2a_utils.feature_utils_224
```
🎛️ Usage
Available Nodes
After installation, you'll find these nodes in ComfyUI:
ThinkSound Model Loader
- Loads the main ThinkSound diffusion model
- Input: `thinksound_model` (select your .ckpt file)
- Output: `thinksound_model`

ThinkSound Feature Utils Loader
- Loads the VAE and Synchformer models
- Inputs: `vae_model`, `synchformer_model`
- Output: `feature_utils`

ThinkSound Sampler
- The main generation node
- Generates audio from text and/or video
Basic Workflow
```
ThinkSound Model Loader ────────┐
                                ├── ThinkSound Sampler ── Audio Output
ThinkSound Feature Utils Loader ┘
```
Sampler Node Parameters
- Duration: Audio length in seconds (1.0 - 30.0)
- Steps: Denoising steps (24-32 recommended; see Quality Settings below)
- CFG Scale: Guidance strength (5.0 recommended)
- Seed: Random seed for reproducibility
- Caption: Short audio description
- CoT Description: Detailed Chain-of-Thought prompt
- Video: Optional video input for video-to-audio generation
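For headless use, a generation can also be queued through ComfyUI's HTTP API. The sketch below is illustrative only: the `class_type` strings and input keys are guesses based on the node and parameter names above, so check `nodes.py` for the exact identifiers before relying on them:

```python
# Sketch: queue a text-to-audio run via ComfyUI's /prompt endpoint.
# Node class names and input keys are hypothetical; verify against nodes.py.
import json
import urllib.request

workflow = {
    "1": {"class_type": "ThinkSoundModelLoader",         # hypothetical name
          "inputs": {"thinksound_model": "thinksound_light.ckpt"}},
    "2": {"class_type": "ThinkSoundFeatureUtilsLoader",  # hypothetical name
          "inputs": {"vae_model": "vae.ckpt",
                     "synchformer_model": "synchformer_state_dict.pth"}},
    "3": {"class_type": "ThinkSoundSampler",             # hypothetical name
          "inputs": {"thinksound_model": ["1", 0],
                     "feature_utils": ["2", 0],
                     "duration": 8.0, "steps": 24, "cfg_scale": 5.0,
                     "seed": 42,
                     "caption": "Dog barking",
                     "cot_description": "A dog barking outdoors, 3-4 barks."}},
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # default ComfyUI address
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```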
🎵 Examples
Text-to-Audio Examples
Example 1: Simple Audio
Caption: "Dog barking"
CoT Description: "Generate the sound of a medium-sized dog barking outdoors. The barking should be natural and energetic, with slight echo to suggest an open space. Include 3-4 distinct barks with realistic timing between them."
Example 2: Complex Scene
Caption: "Ocean waves at beach"
CoT Description: "Create gentle ocean waves lapping against the shore. Add subtle sounds of water receding over sand and pebbles. Include distant seagull calls and a light ocean breeze for natural ambiance."
Example 3: Musical Content
Caption: "Jazz piano"
CoT Description: "Generate a smooth jazz piano melody in a minor key. Include syncopated rhythms, bluesy chord progressions, and subtle improvisation. The tempo should be moderate and relaxing, perfect for a late-night cafe atmosphere."
Video-to-Audio Generation
- Load a video using ComfyUI's video loader nodes
- Connect the video to the ThinkSound Sampler's video input
- Add descriptive text to guide the audio generation
- Generate audio that syncs with the video content
⚠️ Important Notes
Model Precision
- ThinkSound requires fp32 precision for stable operation
- The nodes automatically use fp32 (no precision selection needed)
- Do not force fp16 as it may cause tensor dimension errors
Memory Requirements
- 8GB VRAM minimum for basic operation
- 12GB+ VRAM recommended for longer audio generation
- Enable "force_offload" to save VRAM (enabled by default)
Video Input Format
- Supported: MP4, AVI, MOV (any format ComfyUI can load)
- Recommended: 8-30 seconds duration
- Processing: Automatically handled by the node
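If you want to confirm a clip is suitable before wiring it in, a short check with torchvision's video reader works (a minimal sketch; `input.mp4` is a placeholder path):

```python
# Sketch: inspect a video's frame count, fps, and duration before generation.
import torchvision.io as tvio

frames, _, info = tvio.read_video("input.mp4", pts_unit="sec")
fps = info["video_fps"]
duration = frames.shape[0] / fps
print(f"{frames.shape[0]} frames @ {fps:.1f} fps = {duration:.1f} s")
if not 8.0 <= duration <= 30.0:
    print("Warning: outside the recommended 8-30 second range")
```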
🐛 Troubleshooting
Common Issues
Issue: "ThinkSound source code not installed"
Solution: Ensure you've downloaded the ThinkSound repository to the 'thinksound' folder
Issue: "ImportError: No module named 'alias_free_torch'"
Solution: Install the missing dependencies:

```
pip install alias-free-torch==0.0.6 descript-audio-codec==1.0.0 vector-quantize-pytorch==1.9.14
```
Issue: "Input type (float) and bias type (struct c10::Half) should be the same"
Solution: This is resolved automatically with fp32 precision. Restart ComfyUI if you see this error.
Issue: "Tensors must have same number of dimensions"
Solution: Update to the latest version of the nodes. This was fixed in recent updates.
Issue: Models not loading
Solution:
1. Check that models are in ComfyUI/models/thinksound/
2. Verify model file names match the dropdown options
3. Check ComfyUI console for specific error messages
Performance Tips
- Start with shorter durations (8-10 seconds) for testing
- Use lower step counts (12-16) for faster generation during testing
- Enable force_offload to manage VRAM usage
- Close other GPU-intensive applications while generating
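To see how close a run comes to your VRAM limit, PyTorch's CUDA statistics can bracket a generation (a minimal sketch; it must run in the same process as ComfyUI, e.g. from a custom script node):

```python
# Sketch: measure peak VRAM around a generation to judge headroom.
import torch

torch.cuda.reset_peak_memory_stats()
# ... trigger a ThinkSound generation here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM during generation: {peak_gb:.2f} GB")
```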
📊 Expected Performance
Generation Times (approximate)
- 8 seconds audio: 30-60 seconds on RTX 3080
- 15 seconds audio: 60-120 seconds on RTX 3080
- Video analysis: Additional 10-20 seconds
Quality Settings
- Steps 12-16: Fast, good quality
- Steps 24: Recommended balance
- Steps 32+: High quality, slower
🔄 Updates
To update the project:
- Pull the latest changes:

```
git pull origin main
```

- Update the ThinkSound source:

```
cd thinksound && git pull
```
- Restart ComfyUI
📄 License
This project is a wrapper implementation based on ThinkSound by FunAudioLLM. Please refer to the original ThinkSound repository for licensing information.
🤝 Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
📞 Support
If you encounter issues:
- Check the troubleshooting section above
- Review ComfyUI console output for error messages
- Open an issue on GitHub with detailed error information
🎉 Acknowledgments
- ThinkSound Team for the original model and research
- ComfyUI Community for the excellent framework
- Contributors who helped test and improve this wrapper implementation
Enjoy creating amazing audio with ThinkSound! 🎵✨