ComfyUI Extension: ComfyUI-FreeVC_wrapper

Authored by ShmuelRonen



    Support My Work

    If you find this project helpful, consider buying me a coffee:

    Buy Me A Coffee

    A voice conversion extension node for ComfyUI based on FreeVC, enabling high-quality voice conversion capabilities within the ComfyUI framework.


    Features

    • Support for multiple FreeVC models:
      • Standard models (16kHz): FreeVC, FreeVC-s
      • High-quality model (24kHz): FreeVC (24kHz)
    • Enhanced voice mimicry capabilities
    • Advanced audio pre- and post-processing options
    • Stereo and mono audio support
    • Automatic audio resampling (see the preprocessing sketch after this list)
    • Integrated with ComfyUI's audio processing pipeline
    • GPU acceleration support (CUDA)
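    The node handles the resampling and mono/stereo conversion listed above automatically. Purely as an illustration of what that preprocessing involves, here is a minimal sketch using librosa (already one of the extension's dependencies); the helper name and normalization step are my own additions, not part of the node:

    # Illustrative only -- the node resamples and downmixes for you.
    import librosa
    import numpy as np

    def prepare_audio(path: str, target_sr: int = 16000) -> np.ndarray:
        # mono=True downmixes stereo; sr=target_sr resamples to the rate
        # expected by the chosen model.
        audio, _ = librosa.load(path, sr=target_sr, mono=True)
        peak = np.abs(audio).max()
        return audio / peak if peak > 0 else audio  # simple peak normalization

    source = prepare_audio("source.wav")
    reference = prepare_audio("reference.wav")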

    Installation

    1. Clone the extension into your ComfyUI custom_nodes directory:
    cd ComfyUI/custom_nodes
    git clone https://github.com/ShmuelRonen/ComfyUI-FreeVC_wrapper.git
    cd ComfyUI-FreeVC_wrapper

    2. Install the required Python packages:
    pip install librosa transformers numpy torch noisereduce

    3. Download the required checkpoints:

    a. Voice Conversion Models: All model checkpoint files (3 models) are available in a single Google Drive folder: Download All Model Checkpoints (Google Drive)

    After downloading, extract the file and place the checkpoints folder in the freevc directory:

    ComfyUI-FreeVC_wrapper/freevc/
    

    b. Speaker Encoder: Download the speaker encoder checkpoint from HuggingFace and place it in the custom_nodes/ComfyUI-FreeVC_wrapper/freevc/speaker_encoder/ckpt directory:

    | Component | Filename | Required For |
    |-----------|----------|--------------|
    | Speaker Encoder | pretrained_bak_5805000.pt | FreeVC, FreeVC (24kHz), D-FreeVC, and D-FreeVC (24kHz) models |

    Direct download link:

    Your final directory structure should look like this:

    ComfyUI-FreeVC_wrapper/
    └── freevc/
        ├── checkpoints/
        │   ├── freevc.pth         # Standard 16kHz model
        │   ├── freevc-s.pth       # Source-filtering based model
        │   └── freevc-24.pth      # High-quality 24kHz model
        └── speaker_encoder/
            └── ckpt/
                └── pretrained_bak_5805000.pt  # Speaker encoder checkpoint
    
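    To confirm everything landed in the right place, you can run a small sanity check from your ComfyUI root. This script is not part of the extension, just a convenience sketch; adjust the base path if your layout differs:

    # Checks that the checkpoint files listed above exist where the node expects them.
    from pathlib import Path

    base = Path("custom_nodes/ComfyUI-FreeVC_wrapper/freevc")
    expected = [
        base / "checkpoints" / "freevc.pth",
        base / "checkpoints" / "freevc-s.pth",
        base / "checkpoints" / "freevc-24.pth",
        base / "speaker_encoder" / "ckpt" / "pretrained_bak_5805000.pt",
    ]

    missing = [p for p in expected if not p.is_file()]
    if missing:
        print("Missing checkpoint files:")
        for p in missing:
            print(f"  {p}")
    else:
        print("All FreeVC checkpoint files found.")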

    Usage

    1. In ComfyUI, locate the "FreeVC Voice Converter v2 🎤" node under the "audio/voice conversion" category

    2. Connect your inputs (the expected audio format is sketched after these steps):

      • Source audio: The audio you want to convert
      • Reference audio: The target voice style
      • (Optional) Secondary reference: Additional reference for more robust voice matching
      • Select model type: Choose between standard and diffusion-enhanced models
    3. Configure the conversion parameters:

      • Source processing: Noise reduction, source neutralization, clarity enhancement
      • Conversion settings: Temperature, diffusion parameters (for diffusion models)
      • Post-processing: Voice matching strength, presence boost, normalization
    4. Connect the output to your desired audio output node
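    The audio inputs and output go through ComfyUI's audio pipeline, so the built-in Load Audio node is the easiest way to feed the node. If you prepare audio programmatically instead, the following sketch shows one way to build the input, assuming ComfyUI's usual AUDIO dict layout (a [batch, channels, samples] waveform tensor plus a sample rate); the helper is hypothetical, not part of the extension:

    # Illustrative only: constructing a ComfyUI-style AUDIO object by hand.
    import librosa
    import torch

    def load_as_comfy_audio(path: str) -> dict:
        data, sr = librosa.load(path, sr=None, mono=False)  # keep native rate/channels
        waveform = torch.from_numpy(data).float()
        if waveform.dim() == 1:            # mono file -> add a channel dimension
            waveform = waveform.unsqueeze(0)
        return {"waveform": waveform.unsqueeze(0), "sample_rate": sr}  # add batch dim

    source_audio = load_as_comfy_audio("source.wav")
    reference_audio = load_as_comfy_audio("reference.wav")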

    Model Selection Guide

    • FreeVC: Good for general purpose voice conversion at 16kHz
    • FreeVC-s: Better preservation of source speech content, recommended for maintaining clarity
    • FreeVC (24kHz): Higher quality output with better audio fidelity

    Tips for Better Voice Conversion

    1. Use longer reference samples: 5-10 seconds of clean speech works best
    2. Try multiple reference samples: Use the secondary reference input for more robust voice profiles
    3. Adjust voice mimicry settings:
      • Increase voice_match_strength (0.6-0.8) for stronger character matching
      • Use neutralize_source (0.3-0.5) to reduce source voice influence
      • Add presence_boost (0.3-0.5) for more "in the room" sound
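    As a starting point for tip 3, the suggested ranges translate into settings like these (the values are only a midpoint of each range; exact parameter names and defaults may differ between node versions):

    # Hypothetical starting values reflecting the suggested ranges above.
    suggested_settings = {
        "voice_match_strength": 0.7,  # 0.6-0.8 for stronger character matching
        "neutralize_source": 0.4,     # 0.3-0.5 to reduce source voice influence
        "presence_boost": 0.4,        # 0.3-0.5 for a more "in the room" sound
    }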

    Known Issues and Troubleshooting

    1. File Not Found Errors:

      • Ensure all checkpoint files are in the correct directory
      • Verify file names match exactly (case-sensitive)
    2. CUDA Out of Memory:

      • Try processing shorter audio clips
      • Use CPU if GPU memory is insufficient (see the device sketch after this list)
      • Lower diffusion steps for diffusion-based models
    3. Audio Quality Issues:

      • Try different models - each has strengths for different source/target voices
      • For diffusion models, lower the noise coefficient if there's static
      • Increase clarity_enhancement for better intelligibility
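    For the memory-related tips above, the standard PyTorch pattern is sketched below. This is illustrative only; the node selects its device internally, but the same calls work from your own scripts:

    # Fall back to CPU when CUDA is unavailable, and release cached GPU memory
    # between conversions.
    import torch

    def pick_device() -> torch.device:
        # Prefer CUDA when available; otherwise run on the CPU (slower, but
        # avoids out-of-memory failures on small GPUs).
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def free_gpu_memory() -> None:
        # Clears PyTorch's CUDA allocator cache so the next clip starts fresh.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()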

    Contributing

    Contributions are welcome! Please feel free to submit a Pull Request.

    License

    This project is licensed under the MIT License - see the LICENSE file for details.

    Acknowledgments

    This extension builds on the original FreeVC project; thanks to its authors for releasing the code and pretrained models.

    Citation

    If you use this in your research, please cite:

    @article{li2022freevc,
      title={FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion},
      author={Li, Jingyi and Tu, Weiping and Xiao, Li},
      journal={arXiv preprint arXiv:2210.15418},
      year={2022}
    }