ComfyUI Extension: ComfyUI-SRT-subtitles-VoxCPM
ComfyUI nodes for generating and editing speech from SRT subtitle files using the VoxCPM text-to-speech model, supporting multi-speaker dialogue and audio replacement workflows.
ComfyUI-SRTVoxCPM
A set of ComfyUI nodes based on VoxCPM for generating and editing speech from SRT subtitle files.
Chinese documentation (中文说明)
Instructions
Node Descriptions
1. VoxCPM Loader
- Function: Loads the VoxCPM TTS model.
- Parameters:
  - `model_name`: Select the model to load. The default option, `openbmb/VoxCPM-0.5B (Auto-Download)`, will automatically download the model from Hugging Face.
  - `optimize`: Enables `torch.compile` for optimization. This feature may not work directly on Windows due to some bugs, but it is retained as it might be fixed by the community.
2. VoxCPM Cache Builder
- Function: Creates a voiceprint (feature cache) for a specified speaker.
- Parameters:
  - `speaker_name`: Manually enter a unique identifier for the speaker (e.g., `speaker1`). This name must exactly match the speaker prefix used in the corresponding lines of the SRT file.
  - `prompt_audio`: Input the reference audio. You can use the built-in `Load Audio` node in ComfyUI or use the `Audio Trimmer` node to extract a specific segment from a longer audio file.
  - `prompt_text`: Enter the transcript of the reference audio.
3. VoxCPM Cache Combiner (Chainable)
- Function: Combines the voice caches of multiple speakers to support multi-character dialogues.
- How to use:
  - This node can be chained, with each node adding one speaker.
  - The `cache_group` input of the first `Cache Combiner` node should be left unconnected.
  - The `cache_group` input of subsequent nodes should be connected to the output of the previous `Cache Combiner` node.
  - The output of the final node connects to the `cache_group` input of the `SRT Processor` or `SRT Dubber` node.
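The chaining pattern can be pictured as each node taking an optional incoming group and returning a new group with one more speaker. The following is a minimal Python sketch of that idea only. The function name `combine_cache`, the dictionary representation of a group, and the placeholder cache values are all assumptions made for illustration, not the node's actual implementation.

```python
from typing import Any, Dict, Optional


def combine_cache(
    speaker_name: str,
    cache: Any,
    cache_group: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a new group holding the incoming speakers plus one more.

    Hypothetical helper: `cache` stands in for whatever the Cache Builder
    outputs, and `cache_group` is None for the first node in the chain.
    """
    # Copy so the upstream node's group is not modified in place.
    group = dict(cache_group) if cache_group else {}
    group[speaker_name] = cache
    return group


# First node: cache_group is left unconnected (None).
first = combine_cache("speaker1", {"voice": "A"})
# Second node: cache_group is connected to the previous node's output.
second = combine_cache("speaker2", {"voice": "B"}, cache_group=first)
print(sorted(second))
```

Returning a copy rather than mutating the input keeps each link in the chain independent, which is one reasonable choice for a node graph where outputs may be reused.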
4. Audio Trimmer (by Timestamp)
- Function: Precisely trims an audio clip based on a timestamp.
- Parameters:
  - `timestamp`: Enter a timestamp in the format `00:00:06,500 --> 00:00:08,000`. The node will output only the audio within this time range.
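To make the timestamp format concrete, here is a small Python sketch that converts an SRT-style time range into seconds and then into sample indices for slicing. The helper names `parse_range` and `sample_bounds` are hypothetical, and the rounding behaviour is an assumption; the node's own code may handle this differently.

```python
import re
from typing import Tuple

# HH:MM:SS,mmm --> HH:MM:SS,mmm
_RANGE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def parse_range(timestamp: str) -> Tuple[float, float]:
    """Convert an SRT time range into (start, end) in seconds."""
    match = _RANGE.fullmatch(timestamp.strip())
    if match is None:
        raise ValueError(f"not an SRT time range: {timestamp!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    if end <= start:
        raise ValueError("end time must be after start time")
    return start, end


def sample_bounds(timestamp: str, sample_rate: int) -> Tuple[int, int]:
    """Map the time range to [start, end) sample indices (assumed rounding)."""
    start, end = parse_range(timestamp)
    return round(start * sample_rate), round(end * sample_rate)


print(parse_range("00:00:06,500 --> 00:00:08,000"))  # (6.5, 8.0)
```

With the indices from `sample_bounds`, trimming a waveform array amounts to slicing along its time axis.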
5. VoxCPM SRT Processor (from Scratch)
- Function: Generates a complete dialogue audio from scratch based on an SRT subtitle file and the voice caches.
- SRT Format:
  - Multi-speaker mode: Before each line of dialogue in the SRT file, use the format `SpeakerID + Space` to differentiate characters. For example:

        1
        00:00:00,500 --> 00:00:05,000
        speaker1 Hello world!

        2
        00:00:06,500 --> 00:00:08,000
        speaker2 Hello, world!

  - Single-speaker mode: When there is only one speaker, no prefix is needed before the subtitle text. For example:

        1
        00:00:00,500 --> 00:00:05,000
        Hello world!
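The two SRT modes can be illustrated with a short parser. This is a hedged Python sketch, not the node's actual parsing code: the `Entry` type and `parse_srt` function are hypothetical, and it assumes that a leading word is treated as a speaker prefix only when it matches a known speaker name, so that single-speaker text such as "Hello world!" is not split.

```python
import re
from typing import List, NamedTuple, Optional, Set


class Entry(NamedTuple):
    number: int
    start: str
    end: str
    speaker: Optional[str]  # None in single-speaker mode
    text: str


def parse_srt(srt_text: str, speakers: Set[str]) -> List[Entry]:
    """Parse SRT text, splitting off a 'SpeakerID ' prefix when present."""
    entries = []
    # Subtitle blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 3 or not lines[0].isdigit() or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = (part.strip() for part in lines[1].split("-->", 1))
        text = " ".join(lines[2:])
        first, _, rest = text.partition(" ")
        if first in speakers and rest:
            speaker, text = first, rest
        else:
            speaker = None
        entries.append(Entry(int(lines[0]), start, end, speaker, text))
    return entries


SAMPLE = """1
00:00:00,500 --> 00:00:05,000
speaker1 Hello world!

2
00:00:06,500 --> 00:00:08,000
speaker2 Hello, world!
"""

for entry in parse_srt(SAMPLE, {"speaker1", "speaker2"}):
    print(entry.number, entry.speaker, entry.text)
```

Passing an empty speaker set gives single-speaker behaviour: every entry keeps its full text and `speaker` is `None`.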
6. VoxCPM SRT Dubber (Replace Audio)
- Function: Replaces the speech for specific subtitle entries in an existing audio file. This can be used to correct pronunciation, change lines, or replace a character's voice entirely.
- Parameters:
  - `entries_to_replace`: Enter the subtitle numbers from the SRT file that you want to replace, separated by spaces (e.g., `1 3 5`). The node will generate new audio using the provided voice and the text from the corresponding subtitle number, then replace it at the correct time in the original audio.
    - Note: This refers to the subtitle number, not its order in the file. For example, in a non-standard subtitle file with numbers `1 2 3 5 6`, entering `4` will not match anything. If a number is duplicated (e.g., `1 2 3 4 4 5`), entering `4` will process both entries numbered `4`.
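The matching rule in the note (match by subtitle number, not by position) can be sketched in a few lines of Python. The function `select_entries` and the `(number, text)` tuple layout are hypothetical and only illustrate the described behaviour.

```python
from typing import List, Tuple


def select_entries(
    entries: List[Tuple[int, str]], entries_to_replace: str
) -> List[Tuple[int, str]]:
    """Keep every entry whose subtitle number appears in the input string."""
    wanted = {int(token) for token in entries_to_replace.split()}
    return [entry for entry in entries if entry[0] in wanted]


# Duplicate number 4: both entries are selected.
duplicated = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (4, "e"), (5, "f")]
print(select_entries(duplicated, "4"))  # [(4, 'd'), (4, 'e')]

# Missing number 4: nothing is selected.
gapped = [(1, "a"), (2, "b"), (3, "c"), (5, "d"), (6, "e")]
print(select_entries(gapped, "4"))  # []
```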
Common Parameters
- `normalize_text`: Normalizes the text before synthesis. For example, when enabled, the number `50` will be read as "fifty" instead of "five zero".
- `stretch_method`: Method for time-stretching the audio to align the generated speech with the subtitle's duration.
  - `none`: No stretching is applied. If the generated audio is longer than the subtitle duration, it will overlap with the next line.
  - `librosa`: Uses the `librosa` library for time-stretching. The quality can be inconsistent; you can adjust `stretch_n_fft` and `stretch_hop_length` to mitigate artifacts like "metallic" sounds.
  - `pydub`: Uses the `pydub` library, which generally produces better results than `librosa`. This method requires FFmpeg to be installed and configured in your system's PATH.
- `cfg_value`: Defaults to `2.0`, which is a balanced setting. Higher values can sometimes improve results but may lead to instability.
- `inference_timesteps`: The number of inference steps. `10` steps can produce good results, but more steps can further improve audio quality.
- `retry_threshold`: The threshold for triggering a retry. The model compares the length ratio of the generated audio to the input text. If this ratio exceeds the threshold (meaning the audio is too long for the text), it's considered a failure and triggers a retry. For very slow speakers, you may need to increase this value (e.g., to `8.0` or `10.0`).
- `retry_max_attempts`: The maximum number of retries. When a generation fails, the model discards the result and tries again with a new random seed. Set to `0` to disable this feature.
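The interaction between `retry_threshold` and `retry_max_attempts` can be sketched as a simple loop. This Python example is an illustration under stated assumptions, not the model's real code: `synthesize` is a hypothetical stand-in that returns an audio length, the ratio is assumed to be audio length divided by the number of text characters (the actual units are not specified here), and the sketch assumes the last result is kept once retries run out.

```python
import random
from typing import Callable, Optional, Tuple


def generate_with_retry(
    synthesize: Callable[[str, int], float],
    text: str,
    retry_threshold: float,
    retry_max_attempts: int,
    rng: Optional[random.Random] = None,
) -> Tuple[float, int]:
    """Return (audio_length, attempts_used).

    `synthesize(text, seed)` is a hypothetical callable returning the
    length of the generated audio in some fixed unit.
    """
    rng = rng or random.Random()
    attempts = 0
    while True:
        attempts += 1
        seed = rng.randrange(2**32)  # new random seed for each attempt
        audio_length = synthesize(text, seed)
        # Assumed ratio: audio length per character of input text.
        ratio = audio_length / max(len(text), 1)
        retries_used = attempts - 1
        # Accept a good result, or give up once the retry budget is spent
        # (assumption: the last result is kept in that case).
        if ratio <= retry_threshold or retries_used >= retry_max_attempts:
            return audio_length, attempts


# Fake synthesizer: two overlong results, then a reasonable one.
lengths = iter([100.0, 100.0, 20.0])
result = generate_with_retry(lambda t, s: next(lengths), "hello", 6.0, 3)
print(result)  # (20.0, 3)
```

With `retry_max_attempts=0` the loop returns after the first attempt regardless of the ratio, which matches the "set to `0` to disable" behaviour described above.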
Model Download
- Auto-Download (Recommended): In the `VoxCPM Loader` node, select `openbmb/VoxCPM-0.5B (Auto-Download)`. The model will be automatically downloaded and cached in ComfyUI's `models/TTS` folder.
- Manual Download: You can also download all model files from VoxCPM-0.5B on Hugging Face and place them in a subfolder within ComfyUI's `models/TTS` directory, for example: `\ComfyUI\models\TTS\VoxCPM-0.5B`.
Example Workflows
1. Generate Audio from SRT (SRT to Speech)
- Workflow File: SRT Processor Workflow
2. Edit Audio with SRT (Dubbing)
- Workflow File: SRT Dubber Workflow
Example
Additional Notes
I have only a little knowledge of Python. This node is based on VoxCPM, and the code was written with the assistance of Gemini 2.5 Pro.
The node still has some areas for improvement (like torch.compile compatibility and the model offloading mechanism). Due to limitations in my personal time and skills, I warmly welcome anyone in the community to freely use, modify, and improve this node, while respecting the VoxCPM license. Your contributions are appreciated!