    Audio Separation 🎤🎛️

    Overview

    AudioSeparation is both a set of ComfyUI nodes and a command-line tool for audio demixing, also known as audio separation.

    Given an audio track, the objective is to separate the vocals, instruments, drums, bass, etc. from the rest of the sounds.

    To achieve this we use MDX Net neural networks (models). AudioSeparation currently supports 39 models collected by the UVR5 project.

    The models are small (from 21 MB to 65 MB), but really efficient. Models specialized in different stems are provided. We support more than one model for each task because sometimes a model performs better on one song and worse on another.

    We support two karaoke models. They are slightly different from the regular models: they try to keep the backing vocals together with the instruments.

    The objectives for these nodes are:

    • Multiple stems (Vocals, Instruments, Drums, Bass, etc.)
    • Ease of use
    • Clear download (with progress and known destination)
    • Support for all possible input audio formats (mono/stereo, any sample rate, any batch size)
    • Good quality vs size
    • Reduced dependencies



    🚀 Installation

    ComfyUI Nodes

    You can install the nodes from the ComfyUI nodes manager (the name is Audio Separation), or just do it manually:

    1. Clone this repository into your ComfyUI/custom_nodes/ directory:
      cd ComfyUI/custom_nodes/
      git clone https://github.com/set-soft/AudioSeparation
      
    2. Restart ComfyUI.

    The nodes should then appear under the "audio/separation" category in the "Add Node" menu. You don't need to install extra dependencies.

    Command Line Tool

    1. Clone this repository

    2. Change to its directory (cd AudioSeparation)

    3. Install the dependencies (only needed if ComfyUI isn't installed):

    pip3 install -r requirements.txt
    

    or

    pip install -r requirements.txt
    
    4. Run the scripts like this:
    python3 tool/demix.py AUDIO_FILE
    

    or

    python tool/demix.py AUDIO_FILE
    

    You don't need to install it; you could even add a symlink in /usr/bin.

    A list of all the available tools can be found here.

    Models

    Models are automatically downloaded.

    When using ComfyUI they are downloaded to ComfyUI/models/audio/MDX.

    When using the command line the default is ../models relative to the script, but you can specify a different directory.

    If you want to download the models manually you can get the safetensors files from here.

    For the command line you can also download the ONNX files from other repos.

    📦 Dependencies

    These nodes just use torchaudio (part of PyTorch), numpy for math and tqdm for progress bars. All of them are already used by ComfyUI, so you don't need to install any additional dependencies on a ComfyUI setup.

    The following are optional dependencies:

    • onnxruntime: if you want to use the ONNX version of the models you'll need the ONNX runtime, but I don't see any advantage. Note that only the command line can handle them.
    • requests: a robust module for downloads over the internet. If it isn't installed we fall back to Python's urllib, which is fine (see the sketch after this list).
    • colorama: helps generate colored messages (for debug, warnings and errors). If it isn't installed we use the most common ANSI escape sequences, which most terminals support. You might need it on Windows.
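
    To give an idea of the requests/urllib behavior, here is a minimal sketch of the fallback, assuming a hypothetical fetch() helper; it is not the project's actual download code.

    ```python
    # Minimal sketch of the "requests, with urllib fallback" idea described above.
    # Not the project's actual downloader; fetch() is a hypothetical helper name.
    try:
        import requests

        def fetch(url: str) -> bytes:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.content
    except ImportError:
        from urllib.request import urlopen

        def fetch(url: str) -> bytes:
            with urlopen(url, timeout=30) as resp:
                return resp.read()
    ```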

    🖼️ Usage

    ComfyUI

    You can start with the template workflows: go to the ComfyUI Workflow menu, choose Browse Templates and look for Audio Separation.

    If you want to do it manually you'll find the nodes in the audio/separation category. Or you can use the search menu: double click on the canvas and then type MDX:

    (Screenshot: searching for the MDX nodes in ComfyUI)

    Choose a node to extract what you want, e.g. Vocals. Its Complement output will be the instruments, but a dedicated Instrumental separation node will usually give a better result than the Complement output.

    Then simply connect your audio input to the node (e.g. the LoadAudio node from Comfy core) and connect its output to some audio node (e.g. the PreviewAudio or SaveAudio nodes from Comfy core). You'll get something like this:

    (Screenshot: example separation workflow)

    Now choose a model in the MDX node. On a fresh install all models will show an arrow (⬇️) indicating they must be downloaded.

    Now just run the workflow. That's all.

    Note that after downloading a model its name will change to show a disk (💾). If for some reason the list of models gets out of sync, just press R to refresh ComfyUI and select the correct name from the list.

    > [!TIP]
    > Models are trained on 44.1 kHz audio, so for optimal results use this sample rate. The node will convert any other sample rate to 44.1 kHz.
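
    For reference, this is roughly what that conversion looks like with torchaudio. It is a minimal sketch, not the node's actual code, and song.mp3 is just a placeholder file name.

    ```python
    # Rough sketch of resampling an input to the 44.1 kHz expected by the models.
    # Not the node's actual code; "song.mp3" is a placeholder file name.
    import torchaudio

    waveform, sample_rate = torchaudio.load("song.mp3")  # (channels, samples)
    if sample_rate != 44100:
        resample = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=44100)
        waveform = resample(waveform)
        sample_rate = 44100
    ```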

    Command Line

    Use the demix.py tool like this:

    $ python3 tool/demix.py AUDIO_FILE
    

    To get all the available options use:

    $ python3 tool/demix.py --help
    

    ✨ Nodes

    Vocals using MDX, Instrumental using MDX, Bass using MDX, Drums using MDX and Various using MDX share the same structure, so here is the first:

    Vocals using MDX

    • Display Name: Vocals using MDX
    • Internal Name: AudioSeparateVocals
    • Category: audio/separation
    • Description: Takes one audio input (which can be a batch) and separates the vocals from the rest of the sounds.
    • Inputs:
      • input_sound (AUDIO): The audio input. Can be a single audio item or a batch.
      • model (COMBO): The name of the model to use. Choose one from the list.
      • segments (INT): How many segments to process at once. More segments need more VRAM, but the audio might have fewer discontinuities.
      • target_device (COMBO): The device where the neural network will run.
    • Output:
      • Vocals (AUDIO): The separated stem
      • Complement (AUDIO): The input audio with the Vocals subtracted
    • Behavior Details:
      • Sample Rate: The sample rate of the input is adjusted to 44.1 kHz
      • Channels: Mono audio is converted to fake stereo (left == right), as sketched below
      • Input Batch Handling: If input_sound is a batch the outputs will be batches. The process is sequential, not parallel.
      • Missing Models: They are downloaded and stored under models/audio/MDX of the ComfyUI installation
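
    As a rough sketch of the channel and batch handling above (an illustration, not the node's actual code):

    ```python
    # Illustration only (not the node's actual code) of the behavior described above:
    # mono is duplicated into fake stereo and the batch is processed sequentially.
    import torch

    waveform = torch.randn(3, 1, 220500)  # (batch_size, num_channels, num_samples), mono batch

    if waveform.shape[1] == 1:
        waveform = waveform.repeat(1, 2, 1)  # fake stereo: left == right

    for item in waveform:                 # one item at a time, not in parallel
        model_input = item.unsqueeze(0)   # (1, 2, num_samples) is what the model sees
    ```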

    📝 Usage Notes

    • AUDIO Type: These nodes work with ComfyUI's standard "AUDIO" data type, which is a Python dictionary containing:
      • 'waveform': A torch.Tensor of shape (batch_size, num_channels, num_samples).
      • 'sample_rate': An int representing the sample rate in Hz.
    • Logging: 🔊 The nodes use Python's logging module. Debug messages can be helpful for understanding the transformations being applied. You can control log verbosity through ComfyUI's startup arguments (e.g., --preview-method auto --verbose DEBUG for more detailed ComfyUI logs, which might also affect custom node loggers if they are configured to inherit levels). The logger name used is "AudioSeparation". You can force the debug level for these nodes by setting the AUDIOSEPARATION_NODES_DEBUG environment variable to 1. See the sketch below.
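
    As a reference, here is a minimal sketch of building such an AUDIO dictionary by hand and forcing debug output; song.wav is a placeholder file name and the snippet is not part of the nodes themselves.

    ```python
    # Sketch of ComfyUI's AUDIO dictionary and of forcing debug logs for these nodes.
    # "song.wav" is a placeholder file name; this snippet is not part of the nodes.
    import os
    import logging
    import torchaudio

    os.environ["AUDIOSEPARATION_NODES_DEBUG"] = "1"      # set before the nodes are loaded
    logging.getLogger("AudioSeparation").setLevel(logging.DEBUG)

    waveform, sample_rate = torchaudio.load("song.wav")  # (num_channels, num_samples)
    audio = {
        "waveform": waveform.unsqueeze(0),  # (batch_size, num_channels, num_samples)
        "sample_rate": sample_rate,
    }
    ```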

    ⚖️ License

    GPL-3.0

    The models are under the MIT license, but they aren't part of this repo.

    🙏 Attributions

    • Main author: Salvador E. Tropea
    • Assisted by Gemini 2.5 Pro (most of the inference class and the ONNX to safetensors conversion)