Generate expressive speech using VoxCPM model, mimicking or creating unique voices from text with advanced machine learning.
VoxCPM_TTS generates speech or clones voices using VoxCPM, a tokenizer-free speech generation model. The node belongs to the audio/tts category and transforms text into expressive, natural-sounding speech. It can either mimic a reference voice or synthesize a new voice from the input text alone, making it useful for AI artists and developers who want realistic voice synthesis in their projects without deep TTS expertise.
This parameter allows you to select the specific VoxCPM model to use for speech generation. The choice of model can affect the style and quality of the generated speech, providing flexibility in tailoring the output to your needs. The default model is the first option in the list of available models.
The text parameter is the primary input for the text-to-speech conversion. It accepts multiline text, where each line is processed as a separate chunk. This allows for the synthesis of longer passages of text in a coherent manner. The default text is "VoxCPM is an innovative TTS model designed to generate highly expressive speech."
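The per-line chunking described above can be sketched as follows; `split_into_chunks` is a hypothetical helper for illustration, not the node's actual internal function:

```python
# Sketch of per-line chunking: each non-empty line of the multiline text
# input is treated as a separate chunk for synthesis.
def split_into_chunks(text: str) -> list[str]:
    """Return each non-empty, stripped line of the input as one chunk."""
    return [line.strip() for line in text.splitlines() if line.strip()]

chunks = split_into_chunks(
    "VoxCPM is an innovative TTS model\n\n"
    "designed to generate highly expressive speech."
)
# chunks → ['VoxCPM is an innovative TTS model',
#           'designed to generate highly expressive speech.']
```

Synthesizing each chunk separately and concatenating the resulting waveforms is what lets longer passages be processed coherently.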
The prompt_audio parameter is optional and is used for voice cloning. When a reference audio file is provided, the node mimics the voice characteristics of that audio in the generated speech. This is particularly useful for creating personalized or character-specific voices.
The prompt_text parameter is optional and is used in conjunction with prompt_audio for voice cloning. It should contain the transcript of the reference audio, enabling the model to better understand and replicate the voice characteristics.
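How the node's inputs might map onto a VoxCPM-style generation call can be sketched with a small helper. Note that `build_generate_kwargs` and the keyword names it emits (`prompt_wav_path`, `normalize`, etc.) are assumptions for illustration, not the library's confirmed API:

```python
# Hypothetical mapping from node inputs to keyword arguments for a
# VoxCPM-style generate() call. Argument names are illustrative assumptions.
def build_generate_kwargs(text, cfg_value=2.0, inference_timesteps=10,
                          normalize_text=True, prompt_audio_path=None,
                          prompt_text=""):
    kwargs = {
        "text": text,
        "cfg_value": cfg_value,
        "inference_timesteps": inference_timesteps,
        "normalize": normalize_text,
    }
    # Voice cloning needs both the reference audio and its transcript.
    if prompt_audio_path is not None:
        kwargs["prompt_wav_path"] = prompt_audio_path
        kwargs["prompt_text"] = prompt_text
    return kwargs

clone_kwargs = build_generate_kwargs(
    "Hello world",
    prompt_audio_path="ref.wav",
    prompt_text="Reference transcript",
)
```

Passing the transcript alongside the reference audio is what lets the model align the audio with its text and replicate the voice more faithfully.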
The cfg_value parameter controls the guidance scale, which influences how closely the generated speech adheres to the input text or prompt. Higher values result in speech that is more faithful to the prompt but may sound less natural. The default value is 2.0, with a range from 1.0 to 10.0.
This parameter determines the number of diffusion steps used during the synthesis process. More steps can improve the quality of the generated audio but will increase the processing time. The default is 10 steps, with a range from 1 to 100.
The normalize_text parameter enables text normalization, which is recommended for general text to ensure consistent and natural-sounding speech. It can be toggled on or off, with normalization enabled by default.
The seed parameter is used for reproducibility, allowing you to generate the same audio output from the same input parameters. A value of -1 will result in a random seed, while any other value will produce consistent results.
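The seed behavior can be illustrated with a short sketch; `resolve_seed` is a hypothetical helper, assuming a 32-bit seed range:

```python
import random

def resolve_seed(seed: int) -> int:
    """A seed of -1 picks a fresh random seed; any other value is used
    as-is, so repeated runs with the same inputs are reproducible."""
    if seed == -1:
        return random.randint(0, 2**32 - 1)
    return seed
```

Fixing the seed (together with identical text and parameters) reproduces the same audio; -1 gives a different result on each run.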
This parameter controls whether the VoxCPM model is offloaded from VRAM after generation. Enabling this can help manage memory usage, especially in environments with limited resources. By default, the model is auto-managed.
The waveform output is a tensor representing the generated audio signal. It is the primary output of the node, containing the synthesized speech in a format that can be played back or further processed.
The sample_rate output indicates the sample rate of the generated audio, which is set at 16000 Hz. This is a standard sample rate for speech audio, ensuring compatibility with most audio playback and processing systems.
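Inside ComfyUI the waveform is handled as a tensor, but the fixed 16 kHz rate can be illustrated with a stdlib-only sketch that writes float samples to a mono PCM WAV file; `save_wav` is a hypothetical helper, not part of the node:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # the node's fixed output sample rate

def save_wav(samples, path):
    """Write float samples in [-1, 1] as 16-bit mono PCM at 16 kHz."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit
        wf.setframerate(SAMPLE_RATE)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

# One second of a 440 Hz test tone at the node's sample rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE)]
save_wav(tone, "test_tone.wav")
```

Any downstream tool consuming the node's output should interpret the waveform at 16,000 samples per second, or resample it for pipelines that expect a different rate.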
For voice cloning, provide a reference audio via prompt_audio along with its transcript in prompt_text.
Experiment with different cfg_value settings to find the right balance between adherence to the input text and naturalness of the speech.
Adjust the inference_timesteps parameter to fine-tune the quality and processing time of the audio generation, especially for longer or more complex text inputs.
If the generated audio is unsatisfactory, increase the retry_max_attempts parameter to allow more retries for generating acceptable audio. Additionally, ensure that the input text and reference audio (if used) are well-aligned and of good quality.
To explore different outputs, vary the cfg_value, inference_timesteps, and seed values.