ComfyUI Extension: ComfyUI FLOAT

Authored by yuvraj108c


This project provides an unofficial ComfyUI implementation of FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait.




    For a more advanced and actively maintained version, check out ComfyUI-FLOAT_Optimized.

    <div align="center"> <video src="https://github.com/user-attachments/assets/36626b4a-d3e5-4db9-87a7-ca0e949daee0" /> </div>

    ⭐ Support

    If you like my projects and wish to see updates and new features, please consider supporting me. It helps a lot!

    ComfyUI-Depth-Anything-Tensorrt ComfyUI-Upscaler-Tensorrt ComfyUI-Dwpose-Tensorrt ComfyUI-Rife-Tensorrt

    ComfyUI-Whisper ComfyUI_InvSR ComfyUI-FLOAT ComfyUI-Thera ComfyUI-Video-Depth-Anything ComfyUI-PiperTTS


    🚀 Installation

    # run inside your ComfyUI/custom_nodes directory
    git clone https://github.com/yuvraj108c/ComfyUI-FLOAT.git
    cd ComfyUI-FLOAT
    pip install -r requirements.txt
    

    ☀️ Usage

    • Load the example workflow
    • Upload a driving image and audio, then click Queue
    • Models download automatically to ComfyUI/models/float
    • The models are organized as follows:
      |-- float.pth                                       # main model
      |-- wav2vec2-base-960h/                             # audio encoder
      |   |-- config.json
      |   |-- model.safetensors
      |   |-- preprocessor_config.json
      |-- wav2vec-english-speech-emotion-recognition/     # emotion encoder
          |-- config.json
          |-- preprocessor_config.json
          |-- pytorch_model.bin
      
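    As a quick sanity check after the automatic download, the layout above can be verified with a short script. This is an illustrative sketch, not part of the extension: the file list mirrors the tree shown above, and the base directory (`ComfyUI/models/float`) and the helper name are assumptions.

    ```python
    from pathlib import Path

    # Expected files, mirroring the model tree shown in the README above.
    EXPECTED_FILES = [
        "float.pth",
        "wav2vec2-base-960h/config.json",
        "wav2vec2-base-960h/model.safetensors",
        "wav2vec2-base-960h/preprocessor_config.json",
        "wav2vec-english-speech-emotion-recognition/config.json",
        "wav2vec-english-speech-emotion-recognition/preprocessor_config.json",
        "wav2vec-english-speech-emotion-recognition/pytorch_model.bin",
    ]

    def missing_models(base_dir):
        """Return the expected model files not yet present under base_dir."""
        base = Path(base_dir)
        return [f for f in EXPECTED_FILES if not (base / f).is_file()]
    ```

    If `missing_models("ComfyUI/models/float")` returns an empty list, all expected files are in place; otherwise it names exactly what still needs to download.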

    🛠️ Parameters

    • ref_image: Reference image containing a face (must have batch size 1)

    • ref_audio: Reference audio. For long audio clips (e.g. 3+ minutes), ensure you have enough RAM/VRAM

    • a_cfg_scale: Audio classifier-free guidance scale (default: 2)

    • r_cfg_scale: Reference classifier-free guidance scale (default: 1)

    • emotion: One of none, angry, disgust, fear, happy, neutral, sad, surprise (default: none)

    • e_cfg_scale: Intensity of the emotion (default: 1). For a more emotionally intense video, try larger values, from 5 to 10

    • crop: Enable only if the reference image does not have a centered face

    • fps: Frame rate of the output video (default: 25)
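    For scripted use, the parameters above can be bundled into a small settings helper. This is an illustrative sketch, not the node's actual API: the keys and defaults follow the list above, but the helper name and dictionary shape are assumptions.

    ```python
    # Valid emotion choices and defaults, taken from the parameter list above.
    EMOTIONS = ("none", "angry", "disgust", "fear", "happy", "neutral", "sad", "surprise")

    DEFAULTS = {
        "a_cfg_scale": 2.0,   # audio classifier-free guidance scale
        "r_cfg_scale": 1.0,   # reference classifier-free guidance scale
        "emotion": "none",
        "e_cfg_scale": 1.0,   # try 5-10 for a more emotionally intense video
        "crop": False,        # enable only for uncentered faces
        "fps": 25,            # output video frame rate
    }

    def make_settings(**overrides):
        """Merge user overrides into the defaults, validating the emotion choice."""
        settings = {**DEFAULTS, **overrides}
        if settings["emotion"] not in EMOTIONS:
            raise ValueError(f"unknown emotion: {settings['emotion']!r}")
        return settings
    ```

    For example, `make_settings(emotion="happy", e_cfg_scale=5)` keeps all other values at their documented defaults while rejecting emotion names outside the supported set.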

    Citation

    @article{ki2024float,
      title={FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait},
      author={Ki, Taekyung and Min, Dongchan and Chae, Gyeongsu},
      journal={arXiv preprint arXiv:2412.01064},
      year={2024}
    }
    

    Acknowledgments

    Thanks to simplepod.ai for providing GPU servers.

    License

    Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)