ComfyUI-AudioX
A powerful audio generation extension for ComfyUI that integrates AudioX models, a fine-tuned version of stable-audio-tools, for high-quality audio synthesis from text and video inputs. Currently requires a minimum of 16 GB VRAM; tested on a single RTX 4090.
Features
- Text to Audio: Generate high-quality audio from text descriptions with enhanced conditioning
- Text to Music: Create musical compositions with style, tempo, and mood controls
- Video to Audio: Extract and generate audio from video content with advanced conditioning
- Enhanced Conditioning: Separate CFG scales, conditioning weights, negative prompting, and prompt enhancement
- Professional Audio Processing: Volume control with LUFS normalization, limiting, and precise gain staging
- Video Processing: Mute videos and combine with generated audio
Installation
1. System Dependencies (Required)
Install these system dependencies first:
Windows:
# Install ffmpeg (required for video processing)
# Download from: https://ffmpeg.org/download.html
# Or use chocolatey: choco install ffmpeg
# Install Microsoft Visual C++ Build Tools (if not already installed)
# Download from: https://visualstudio.microsoft.com/visual-cpp-build-tools/
Linux/Ubuntu:
sudo apt update
sudo apt install ffmpeg libsndfile1-dev build-essential
macOS:
brew install ffmpeg libsndfile
2. Clone Repository and Install Python Dependencies
cd ComfyUI/custom_nodes
git clone https://github.com/lum3on/ComfyUI-StableAudioX.git
cd ComfyUI-StableAudioX
# Install Python dependencies
pip install -r requirements.txt
# Optional: Run dependency checker to verify installation
python install_dependencies.py
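After installing, a quick sanity check from Python can confirm the system dependencies are reachable. This is an illustrative sketch (the `check_dependencies` helper is not part of the extension):

```python
import shutil

def check_dependencies():
    """Report which system dependencies are reachable from Python."""
    status = {"ffmpeg": shutil.which("ffmpeg") is not None}
    try:
        import torch  # ComfyUI installs torch; the CUDA check is optional
        status["cuda"] = torch.cuda.is_available()
    except ImportError:
        status["cuda"] = False
    return status

if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'OK' if ok else 'missing'}")
```

If ffmpeg reports as missing here, video-related nodes will fail even if audio generation works.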
Model Setup: AudioX
- Model File: Download model.ckpt from the HKUSTAudio/AudioX repository on Hugging Face
- Config File: Download config.json from the same repository
- Place both files in:
ComfyUI/models/diffusion_models/
Rename the model.ckpt file to AudioX.ckpt.
Alternative Download via Hugging Face CLI
# Install huggingface-hub if not already installed
pip install huggingface-hub
# Download AudioX model files
huggingface-cli download HKUSTAudio/AudioX model.ckpt --local-dir ComfyUI/models/diffusion_models/
huggingface-cli download HKUSTAudio/AudioX config.json --local-dir ComfyUI/models/diffusion_models/
Model Directory Structure:
ComfyUI/models/diffusion_models/
├── AudioX.ckpt    # AudioX model (renamed from model.ckpt)
└── config.json    # Model configuration file
System Requirements
- VRAM: 16GB+ (currently the tested minimum for these models)
- RAM: 16GB+ recommended
- Storage: ~5GB for model files
- GPU: CUDA-compatible GPU recommended (CPU supported but slower)
Available Nodes
Core Generation Nodes
- AudioX Model Loader: Load AudioX models with device configuration and auto-detect config files
- AudioX Text to Audio: Basic text-to-audio generation with automatic prompt enhancement
- AudioX Text to Music: Basic text-to-music generation with automatic prompt enhancement
- AudioX Video to Audio: Basic video-to-audio generation with automatic prompt enhancement
- AudioX Video to Music: Generate musical soundtracks for videos
Enhanced Generation Nodes
- AudioX Enhanced Text to Audio: Advanced text-to-audio with negative prompting, templates, style modifiers, and conditioning modes
- AudioX Enhanced Text to Music: Advanced music generation with style, tempo, mood controls, and musical enhancement
- AudioX Enhanced Video to Audio: Advanced video-to-audio with separate CFG scales, conditioning weights, and enhanced prompting
Processing & Utility Nodes
- AudioX Audio Processor: Process and enhance audio
- AudioX Volume Control: Basic volume adjustment with precise dB gain and configurable step size
- AudioX Advanced Volume Control: Professional volume control with LUFS normalization, soft limiting, and fade controls
- AudioX Video Muter: Remove audio from video files
- AudioX Video Audio Combiner: Combine video with generated audio
- AudioX Multi-Modal Generation: Advanced multi-modal audio generation
- AudioX Prompt Helper: Utility for creating better audio prompts with templates
Quick Start
Basic Text to Audio
- Add AudioX Model Loader node and select your model from diffusion_models/
- Add AudioX Text to Audio node
- Connect model output to audio generation node
- Enter your text prompt (automatic enhancement applied)
- Execute workflow
Enhanced Text to Audio with Advanced Controls
- Add AudioX Model Loader node
- Add AudioX Enhanced Text to Audio node
- Configure advanced options:
- Negative Prompt: Specify what to avoid (e.g., "muffled, distorted")
- Prompt Template: Choose from predefined templates (action, nature, music, etc.)
- Style Modifier: cinematic, realistic, ambient, dramatic, peaceful, energetic
- Conditioning Mode: standard, enhanced, super_enhanced, multi_aspect
- Adaptive CFG: Automatically adjusts CFG based on prompt specificity
- Execute for enhanced audio generation
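To make the template and style options concrete, here is an illustrative sketch of how they could combine into a final prompt. The template strings and the `build_prompt` function are hypothetical, not the extension's actual implementation:

```python
# Hypothetical template/style tables, for illustration only.
TEMPLATES = {
    "nature": "natural soundscape, {prompt}, field recording ambience",
    "action": "intense action scene, {prompt}, impactful sound design",
}
STYLE_MODIFIERS = {
    "cinematic": "cinematic, wide dynamic range",
    "ambient": "soft, atmospheric, ambient",
}

def build_prompt(prompt, template=None, style=None):
    """Wrap a raw prompt in a template, then append a style modifier."""
    text = TEMPLATES.get(template, "{prompt}").format(prompt=prompt)
    if style in STYLE_MODIFIERS:
        text = f"{text}, {STYLE_MODIFIERS[style]}"
    return text
```

The negative prompt is handled separately at sampling time rather than being merged into this string.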
Enhanced Video to Audio with Separate Controls
- Add AudioX Model Loader node
- Add AudioX Enhanced Video to Audio node
- Configure separate conditioning:
- Text CFG Scale: Control text conditioning strength (0.1-20.0)
- Video CFG Scale: Control video conditioning strength (0.1-20.0)
- Text Weight: Influence of text conditioning (0.0-2.0)
- Video Weight: Influence of video conditioning (0.0-2.0)
- Negative Prompt: Avoid unwanted audio characteristics
- Fine-tune balance between text prompts and video content
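The general arithmetic behind separate CFG scales and conditioning weights can be sketched as classifier-free guidance with two conditional terms. This is a simplified illustration; the exact formula used by AudioX/stable-audio-tools may differ:

```python
import numpy as np

def combine_guidance(uncond, text_cond, video_cond,
                     text_cfg=7.0, video_cfg=7.0,
                     text_weight=1.0, video_weight=1.0):
    """Classifier-free guidance with separate text and video terms.

    Each conditional prediction is pushed away from the unconditional
    one by its own CFG scale, then blended by its conditioning weight.
    """
    text_term = text_cfg * (text_cond - uncond)
    video_term = video_cfg * (video_cond - uncond)
    return uncond + text_weight * text_term + video_weight * video_term
```

With `video_weight=0.0` this reduces to ordinary text-only CFG, which is why lowering one weight shifts the balance toward the other modality.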
Professional Audio Workflow with Volume Control
- Generate audio using any AudioX generation node
- Add AudioX Advanced Volume Control for professional features:
- LUFS Normalization: Auto-normalize to broadcast standards (-23 LUFS)
- Soft Limiting: Prevent clipping with configurable threshold
- Fade In/Out: Add smooth fades to audio
- Precise Step Control: Ultra-fine volume adjustments (0.001 dB steps)
- Enable auto_normalize_lufs for automatic loudness normalization
- Set limiter_threshold_db to prevent clipping (default: -1.0 dB)
- Add fade_in_ms/fade_out_ms for smooth transitions
Enhanced Music Generation
- Add AudioX Enhanced Text to Music node
- Configure musical attributes:
- Music Style: classical, jazz, electronic, ambient, rock, folk, cinematic
- Tempo: slow, moderate, fast, very_fast
- Mood: happy, sad, peaceful, energetic, mysterious, dramatic
- Negative Prompt: Avoid discordant, harsh, or atonal characteristics
- Use automatic music context enhancement for better results
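As a rough illustration, the musical attributes might decorate the prompt along these lines. The function and the exact phrasing are hypothetical, not the node's real implementation:

```python
def build_music_prompt(prompt, style="ambient", tempo="moderate", mood="peaceful"):
    """Hypothetical sketch: prepend style/tempo/mood descriptors to a prompt."""
    return f"{style} music, {tempo} tempo, {mood} mood, {prompt}"
```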
Example Workflows
The repository includes example workflows:
- example_workflow.json - Basic text to audio
- audiox_video_to_audio_workflow.json - Video processing
- simple_video_to_audio_workflow.json - Simplified video to audio
Requirements
- ComfyUI (latest version recommended)
- Python 3.8+
- CUDA-compatible GPU (recommended) or CPU
- Sufficient disk space for model downloads (models can be several GB)
- AudioX model files and config.json (must be downloaded separately)
Configuration
Model Storage
Important: Models must be manually placed in the correct directory:
- Required Location: ComfyUI/models/diffusion_models/
- Required Files:
  - AudioX model file (.safetensors or .ckpt)
  - config.json configuration file
- Auto-Detection: The AudioX Model Loader automatically detects config files
Device Selection
- Automatic device detection (CUDA/MPS/CPU)
- Manual device specification available in Model Loader
- Memory-efficient processing options
Node Appearance
- AudioX nodes feature a distinctive light purple color (#ddaeff) for easy identification
- All nodes are categorized under "AudioX/" in the node browser
Enhanced Features
Advanced Conditioning Controls
- Separate CFG Scales: Independent control over text and video conditioning strength
- Conditioning Weights: Fine-tune the balance between text prompts and video content
- Negative Prompting: Specify audio characteristics to avoid for better results
- Prompt Enhancement: Automatic addition of audio-specific keywords and context
Professional Audio Processing
- Volume Control with Step Size: Configurable precision from coarse (1.0 dB) to ultra-fine (0.001 dB)
- LUFS Normalization: Automatic loudness normalization to broadcast standards
- Soft Limiting: Intelligent limiting to prevent clipping while preserving dynamics
- Fade Controls: Smooth fade-in and fade-out with millisecond precision
Intelligent Prompt Processing
- Template System: Pre-defined templates for common audio scenarios (action, nature, music, urban)
- Style Modifiers: Cinematic, realistic, ambient, dramatic, peaceful, energetic
- Conditioning Modes: Standard, enhanced, super_enhanced, and multi_aspect processing
- Adaptive CFG: Automatically adjusts CFG scale based on prompt specificity
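One plausible way adaptive CFG could work is a heuristic that raises the scale for longer, more specific prompts. The extension's actual logic is not documented here; this is purely illustrative:

```python
def adaptive_cfg(prompt, min_cfg=3.0, max_cfg=12.0):
    """Heuristic sketch: longer, more specific prompts get a higher CFG.

    Maps word count onto [min_cfg, max_cfg], saturating at 15 words.
    """
    specificity = min(len(prompt.split()) / 15.0, 1.0)  # 0..1
    cfg = min_cfg + specificity * (max_cfg - min_cfg)
    return max(min_cfg, min(cfg, max_cfg))
```

The intuition: a vague one-word prompt leaves the model room to improvise (low CFG), while a detailed description should be followed closely (high CFG).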
Troubleshooting
Common Issues
Installation Problems:
- Missing ffmpeg: Install ffmpeg system dependency (see installation steps above)
- Build errors on Windows: Install Microsoft Visual C++ Build Tools
- Package conflicts: Use a fresh virtual environment:
python -m venv audiox_env && audiox_env\Scripts\activate
- Dependency failures: Run python install_dependencies.py to check and install missing packages
Model Not Found: If AudioX Model Loader shows no models:
- Ensure model files are in ComfyUI/models/diffusion_models/
- Verify both the model file and config.json are present
- Check file permissions and naming
- Accept the license agreement on Hugging Face before downloading
Frontend Errors: If you encounter "beforeQueued" errors:
- Refresh browser (Ctrl+R)
- Clear browser cache
- Restart ComfyUI
- Check ComfyUI console for dependency errors
Memory Issues: For VRAM/RAM problems:
- Reduce batch sizes and duration_seconds
- Use CPU mode for large models
- Close other applications
- Try lower CFG scales (3.0-5.0)
- Ensure you have enough VRAM (16GB+ is the tested minimum)
Audio Processing Errors:
- Verify ffmpeg is properly installed and in PATH
- Check that libsndfile is installed (Linux/macOS)
- For LUFS normalization issues, ensure pyloudnorm is installed
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
License
MIT License - see LICENSE file for details.
Acknowledgments
- AudioX team for original models and research
- ComfyUI community for the excellent framework
- All contributors and testers
Version History
Current Version: v1.1.0
- Enhanced Conditioning: Added separate CFG scales, conditioning weights, and negative prompting
- Advanced Volume Control: LUFS normalization, soft limiting, and configurable step precision
- Enhanced Generation Nodes: Advanced text-to-audio, text-to-music, and video-to-audio nodes
- Intelligent Prompting: Template system, style modifiers, and adaptive CFG
- Professional Audio Processing: Fade controls, precise gain staging, and broadcast-standard normalization
- Improved UI: Distinctive node appearance with light purple color scheme
- Better Model Management: Auto-detection of config files and improved error handling
Previous Version: v1.0.9
- Fixed beforeQueued frontend errors
- Improved workflow execution stability
- Enhanced video processing capabilities
- Better error handling and user experience
Audio Quality Features
Enhanced Conditioning
- Better Prompt Adherence: Enhanced conditioning modes ensure generated audio closely matches your descriptions
- Negative Prompting: Avoid unwanted audio characteristics like "muffled", "distorted", or "low quality"
- Balanced Generation: Fine-tune the balance between text prompts and video content for optimal results
Professional Audio Standards
- LUFS Normalization: Automatic loudness normalization to -23 LUFS (broadcast standard)
- Dynamic Range Preservation: Soft limiting maintains audio dynamics while preventing clipping
- Precise Control: Volume adjustments from coarse (1.0 dB) to ultra-fine (0.001 dB) steps
Roadmap
Upcoming Features
- Audio Inpainting: Fill gaps or replace sections in existing audio with AI-generated content
- LoRA Training: Lightweight fine-tuning for custom audio styles and characteristics
- Full Fine-tune Training: Complete model training pipeline for custom datasets and specialized audio domains
- Extended Model Support: Integration with additional AudioX model variants and architectures
Development Timeline
- Phase 1 (Current): Enhanced conditioning and professional audio processing (complete)
- Phase 2 (Next): Audio inpainting capabilities and LoRA training infrastructure
- Phase 3 (Future): Full fine-tuning pipeline and extended model support
We welcome community feedback and contributions to help prioritize these features!
For support and updates, visit the GitHub repository.