# Audio Separation 🎤🎛️
## Overview
AudioSeparation is both a group of ComfyUI nodes and a command-line tool for audio demixing, also known as audio separation.
Given an audio track, the objective is to separate the vocals, instruments, drums, bass, etc. from the rest of the sounds.
To achieve this we use MDX-Net and Demucs neural networks (models). AudioSeparation currently supports 46 models, mostly collected by the UVR5 project.
The MDX models are small (from 21 MB to 65 MB) but really efficient; the Demucs models are bigger (from 84 MB to 870 MB) but slightly better, and support 4 stems. MDX models specialized in different stems are provided. We support more than one model for each task because sometimes a model will perform better on one song and worse on others.
We support two karaoke models. They are slightly different from regular models because they try to keep the secondary vocals along with the instruments.
The objectives for these nodes are:
✅ Multiple stems (Vocals, Instruments, Drums, Bass, etc.)
✅ Ease of use
✅ Clear download (with progress and known destination)
✅ Support for all possible input audio formats (mono/stereo, any sample rate, any batch size)
✅ Good quality vs size, or you can choose better quality using Demucs
✅ Reduced dependencies, no need to install extra Python modules when using ComfyUI
✅ Multiple examples
## 📜 Table of Contents
- Overview
- 🚀 Installation
- 📦 Dependencies
- 🖼️ Usage
- ✨ Nodes
- 🖼️ Examples
- 📝 Usage Notes
- 📜 Project History
- ⚖️ License
- 🙏 Attributions
## 🚀 Installation
### ComfyUI Nodes
You can install the nodes from the ComfyUI nodes manager (the name is *Audio Separation*), or just do it manually:

- Clone this repository into your `ComfyUI/custom_nodes/` directory:

  ```bash
  cd ComfyUI/custom_nodes/
  git clone https://github.com/set-soft/AudioSeparation
  ```
- Restart ComfyUI.
The nodes should then appear under the "audio/separation" category in the "Add Node" menu. You don't need to install extra dependencies.
### Command Line Tool
- Clone this repository
- Change to its directory (`cd AudioSeparation`)
- Install the dependencies (only needed if ComfyUI isn't installed):

  ```bash
  pip3 install -r requirements.txt
  ```

  or

  ```bash
  pip install -r requirements.txt
  ```
- Run the scripts like this:

  ```bash
  python3 tool/demix.py AUDIO_FILE
  ```

  or

  ```bash
  python tool/demix.py AUDIO_FILE
  ```

You don't need to install the tool; you could even add a symlink to it in `/usr/bin`.
A list of all the available tools can be found here.
### Models
Models are automatically downloaded.
When using ComfyUI they are downloaded to `ComfyUI/models/audio/MDX` and `ComfyUI/models/audio/Demucs`.
When using the command line the default is `../models` relative to the script, but you can specify another directory.
If you want to download the models manually you can get the safetensors files from here.
For the command line you can also download the ONNX files from other repos.
## 📦 Dependencies
These nodes just use `torchaudio` (part of PyTorch), `numpy` for math, `safetensors` to load models and `tqdm` for progress bars.
All of them are already used by ComfyUI, so you don't need to install any additional dependency on a ComfyUI setup.
The following are optional dependencies:

- `onnxruntime`: if you want to use the ONNX version of the models you'll need the ONNX runtime, but I don't see any advantage. Note that only the command line can handle them.
- `requests`: a robust internet connection module. If not installed we use Python's `urllib`, which is fine.
- `colorama`: helps to generate colored messages (for debug, warnings and errors). If not installed we use the most common ANSI escape sequences, supported by most terminals. You might need it on Windows.
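As an illustration of how these fallbacks typically work, here is a minimal sketch of the pattern (the `fetch` helper is hypothetical, not this repo's actual downloader):

```python
try:
    import requests

    def fetch(url: str) -> bytes:
        # Preferred path: requests handles redirects and HTTP errors nicely.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.content
except ImportError:
    from urllib.request import urlopen

    def fetch(url: str) -> bytes:
        # Fallback path: the standard library is enough for simple downloads.
        with urlopen(url, timeout=30) as response:
            return response.read()
```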
## 🖼️ Usage
### ComfyUI
You can start with the template workflows: go to the ComfyUI *Workflow* menu, choose *Browse Templates* and look for *Audio Separation*.

If you want to do it manually you'll find the nodes in the *audio/separation* category. Or you can use the search menu: double-click on the canvas and then type MDX (or Demucs):

Choose a node to extract what you want, e.g. Vocals. Its *Complement* output will contain the instruments, but a node for Instrumental separation will usually give a better result than the *Complement* output. In the case of Demucs models you get 4 or 6 stems at a time; the "UVR Demucs" model is an exception, it just supports Vocals and Other.

Then simply connect your audio input to the node (e.g. the LoadAudio node from Comfy core) and connect its output to some audio node (e.g. the PreviewAudio or SaveAudio nodes from Comfy core). You'll get something like this:
Now choose a model in the MDX node. On a fresh install all models will show an arrow (⬇️) indicating they must be downloaded.
Now just run the workflow. That's all.
Note that after downloading a model, its entry will change to show a disk (💾).
If for some reason the list of models gets out of sync, just press `R` to refresh ComfyUI and select the correct name from the list.
> [!TIP]
> Models are trained using 44.1 kHz audio, so to get optimal results use this sample rate. The node will convert any other sample rate to it.
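For reference, this is roughly what that conversion looks like with `torchaudio` (a minimal sketch, assuming the AUDIO dictionary format described in the Usage Notes; `resample_to_44k` is a hypothetical helper, not the node's actual code):

```python
import torchaudio.functional as F

TARGET_SR = 44100  # the sample rate the models were trained with

def resample_to_44k(audio: dict) -> dict:
    """Resample an AUDIO dict to 44.1 kHz if needed."""
    waveform = audio["waveform"]  # (batch, channels, samples)
    sr = audio["sample_rate"]
    if sr != TARGET_SR:
        waveform = F.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
    return {"waveform": waveform, "sample_rate": TARGET_SR}
```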
### Command Line
Use the `demix.py` tool like this:

```bash
$ python3 tool/demix.py AUDIO_FILE
```

To get all the available options use:

```bash
$ python3 tool/demix.py --help
```
## ✨ Nodes
*Vocals using MDX*, *Instrumental using MDX*, *Bass using MDX*, *Drums using MDX* and *Various using MDX* share the same structure, so here is the first:
### Vocals using MDX
- Display Name: `Vocals using MDX`
- Internal Name: `AudioSeparateVocals`
- Category: `audio/separation`
- Description: Takes one audio input (which can be a batch) and separates the vocals from the rest of the sounds.
- Inputs:
  - `input_sound` (AUDIO): The audio input. Can be a single audio item or a batch.
  - `model` (COMBO): The name of the model to use. Choose one from the list.
  - `segments` (INT): How many segments to process at once. More segments need more VRAM, but the audio might have fewer discontinuities.
  - `taget_device` (COMBO): The device where we will run the neural network.
- Outputs:
  - `Vocals` (AUDIO): The separated stem.
  - `Complement` (AUDIO): The input audio minus the `Vocals` (see the sketch below).
- Behavior Details:
  - Sample Rate: The sample rate of the input is adjusted to 44.1 kHz.
  - Channels: Mono audios are converted to fake stereo (left == right).
  - Input Batch Handling: If `input_sound` is a batch the outputs will be batches. The process is sequential, not parallel.
  - Missing Models: They are downloaded and stored under `models/audio/MDX` of the ComfyUI installation.
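The `Complement` output mentioned above is obtained by subtraction; a minimal sketch of the idea (a hypothetical helper, not the node's actual code):

```python
import torch

def complement(mix: torch.Tensor, vocals: torch.Tensor) -> torch.Tensor:
    """Everything in the mix that isn't in the separated stem.

    Both tensors are (batch, channels, samples) at the same sample rate.
    The residual inherits any error in the vocals estimate, which is why
    a dedicated Instrumental model usually sounds better.
    """
    return mix - vocals
```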
And here is the Demucs node:
### Demucs Audio Separator
- Display Name: `Demucs Audio Separator`
- Internal Name: `AudioSeparateDemucs`
- Category: `audio/separation`
- Description: Takes one audio input (which can be a batch) and separates the vocals, drums and bass from the rest of the sounds. The node also has outputs for guitar and piano, which can only be separated by the Hybrid Transformer 6 sources model, which is quite experimental.
- Inputs:
  - `input_sound` (AUDIO): The audio input. Can be a single audio item or a batch.
  - `model` (COMBO): The name of the model to use. Choose one from the list.
  - `shifts` (INT): Number of random shifts for equivariant stabilization. It does extra passes using slightly shifted audio, which can produce better results. Higher values improve quality but are slower. 0 disables it.
  - `overlap` (FLOAT): Amount of overlap between audio chunks, expressed as a fraction of the chunk, i.e. 0.25 is 25%. Higher values can reduce stitching artifacts but are slower.
  - `custom_segment` (BOOLEAN): Enable to override the model's default segment length. When disabled, the recommended length from the model file is used. Useful for HDemucs and Demucs models, not so much for HTDemucs.
  - `segment` (INT): Length of the audio chunks to process at a time (in seconds). Higher values need more VRAM but can improve quality.
  - `taget_device` (COMBO): The device where we will run the neural network.
- Outputs:
  - `Vocals` (AUDIO): The separated vocals.
  - `Drums` (AUDIO): The separated drums. Not for the "UVR" version.
  - `Bass` (AUDIO): The separated bass. Not for the "UVR" version.
  - `Other` (AUDIO): The separated stuff that doesn't fit in the other outputs.
  - `Guitar` (AUDIO): The separated guitar, only for the Hybrid Transformer 6 sources model, which is quite experimental.
  - `Piano` (AUDIO): The separated piano, only for the Hybrid Transformer 6 sources model, which is quite experimental.
- Behavior Details:
  - Sample Rate: The sample rate of the input is adjusted to 44.1 kHz.
  - Channels: Mono audios are converted to fake stereo (left == right).
  - Input Batch Handling: If `input_sound` is a batch the outputs will be batches.
  - Missing Models: They are downloaded and stored under `models/audio/Demucs` of the ComfyUI installation.
  - Models: Note that most models are a bag of models, that is, four models working together (see the sketch below).
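To give an idea of what a "bag of models" means in practice, here is a rough sketch of the averaging idea (not Demucs' actual implementation, which can also weight each sub-model per stem):

```python
import torch

def bag_of_models(models, mix: torch.Tensor) -> torch.Tensor:
    """Combine the stem estimates of several sub-models by averaging.

    `models` is a list of callables mapping a (batch, channels, samples)
    mix to (batch, stems, channels, samples) estimates.
    """
    with torch.no_grad():
        estimates = [model(mix) for model in models]
    return torch.stack(estimates).mean(dim=0)
```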
## 🖼️ Examples
Once installed, the examples are available in the ComfyUI workflow templates, in the *audio-separation* section.

Note that we have two versions of each example: the regular one and the quick one. The quick version is ideal for quick tests: the input files are downloaded and, in most cases, only 10 seconds of audio are processed.
- 00_Vocals_simple.json Quick: Example to get vocals using MDX
- 01_Vocals_Drums_Bass.json Quick: Example to get vocals, drums, bass and others using MDX
- 02_Batch.json Quick: Shows how to apply MDX demix to a batch of audios, using Audio Batch nodes.
- 03_Instrumental_keep.json Quick: Shows how to extract vocals maintaining the same number of channels and sample rate, using Audio Batch nodes.
- 04_Demucs.json Quick: Separates vocals, drums, bass and others using Demucs, better quality.
- 05_Demix_and_Remix.json Quick: Example to separate vocals and others to the left channel and drums and bass to the right channel, using Audio Batch nodes.
## 📝 Usage Notes
- AUDIO Type: These nodes work with ComfyUI's standard "AUDIO" data type, which is a Python dictionary containing (see the sketch after this list):
  - `'waveform'`: A `torch.Tensor` of shape `(batch_size, num_channels, num_samples)`.
  - `'sample_rate'`: An `int` representing the sample rate in Hz.
- Logging: 🔊 The nodes use Python's `logging` module. Debug messages can be helpful for understanding the transformations being applied. You can control log verbosity through ComfyUI's startup arguments (e.g., `--preview-method auto --verbose DEBUG` for more detailed ComfyUI logs, which might also affect custom node loggers if they are configured to inherit levels). The logger name used is "AudioSeparation". You can force the debugging level for these nodes by defining the `AUDIOSEPARATION_NODES_DEBUG` environment variable as `1`.
- Models format: We use safetensors because this format is safer than PyTorch files (.pth, .th, etc.) and doesn't need an extra runtime (like ONNX does).
- No quantized Demucs: These models just save download time, but pull in an extra dependency (diffq), and they are just lower-quality versions of their non-quantized counterparts.
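As referenced above, here is a minimal sketch of what a valid AUDIO dictionary looks like when built by hand (`"song.flac"` is just a placeholder file name):

```python
import torchaudio

# Load a file: torchaudio returns a (channels, samples) tensor and the rate.
waveform, sample_rate = torchaudio.load("song.flac")

# ComfyUI's AUDIO type expects a batch dimension in front.
audio = {
    "waveform": waveform.unsqueeze(0),  # (1, num_channels, num_samples)
    "sample_rate": sample_rate,
}
```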
## 📜 Project History
- 1.0.0 2025-07-02: Initial release. MDX-Net models support.
- 1.1.0 2025-07-11: Demucs models support.
- 1.1.1 2025-07-21: More examples. One more Demucs model.
## ⚖️ License
The models are under the MIT license, but they aren't hosted in this repo.
## 🙏 Attributions
- Main author: Salvador E. Tropea
- Assisted by Gemini 2.5 Pro, which wrote most of the inference class and the ONNX to safetensors conversion
- Various ideas from DeepExtract by Abdullah Ozmantar
- Models collected by the UVR5 project and found in the UVR Resources by Artyom Bebroy
- Demucs models are from Meta Platforms, Inc., except for the "UVR" version
- The logo image was created using text generated using Text Studio and resources from Vecteezy by: