Generate expressive speech using VoxCPM model, mimicking or creating unique voices from text with advanced machine learning.
VoxCPM_TTS generates speech or clones voices using VoxCPM, a tokenizer-free speech generation model. The node belongs to the audio/tts category and transforms text into expressive, natural-sounding speech. It can either mimic a reference voice or synthesize a new voice from the input text alone, making it useful for AI artists and developers who want realistic voice synthesis in their projects without deep TTS expertise.
This parameter allows you to select the specific VoxCPM model to use for speech generation. The choice of model can affect the style and quality of the generated speech, providing flexibility in tailoring the output to your needs. The default model is the first option in the list of available models.
The text parameter is the primary input for the text-to-speech conversion. It accepts multiline text, where each line is processed as a separate chunk. This allows for the synthesis of longer passages of text in a coherent manner. The default text is "VoxCPM is an innovative TTS model designed to generate highly expressive speech."
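The per-line chunking described above can be sketched as follows; `split_into_chunks` is a hypothetical helper for illustration, not the node's actual internal function:

```python
# Sketch of per-line chunking: each non-empty line of the multiline text
# input is treated as a separate chunk for synthesis.
def split_into_chunks(text: str) -> list[str]:
    """Return each non-empty, stripped line of the input as one chunk."""
    return [line.strip() for line in text.splitlines() if line.strip()]

chunks = split_into_chunks(
    "VoxCPM is an innovative TTS model\n\n"
    "designed to generate highly expressive speech."
)
# chunks → ['VoxCPM is an innovative TTS model',
#           'designed to generate highly expressive speech.']
```

Synthesizing each chunk separately and concatenating the resulting waveforms is what lets longer passages be processed coherently.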
The prompt_audio parameter is optional and is used for voice cloning. When a reference audio file is provided, the node mimics the voice characteristics of that audio in the generated speech. This is particularly useful for creating personalized or character-specific voices.
The prompt_text parameter is optional and is used in conjunction with prompt_audio for voice cloning. It should contain the transcript of the reference audio, enabling the model to better understand and replicate the voice characteristics.
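How the node's inputs might map onto a VoxCPM-style generation call can be sketched with a small helper. Note that `build_generate_kwargs` and the keyword names it emits (`prompt_wav_path`, `normalize`, etc.) are assumptions for illustration, not the library's confirmed API:

```python
# Hypothetical mapping from node inputs to keyword arguments for a
# VoxCPM-style generate() call. Argument names are illustrative assumptions.
def build_generate_kwargs(text, cfg_value=2.0, inference_timesteps=10,
                          normalize_text=True, prompt_audio_path=None,
                          prompt_text=""):
    kwargs = {
        "text": text,
        "cfg_value": cfg_value,
        "inference_timesteps": inference_timesteps,
        "normalize": normalize_text,
    }
    # Voice cloning needs both the reference audio and its transcript.
    if prompt_audio_path is not None:
        kwargs["prompt_wav_path"] = prompt_audio_path
        kwargs["prompt_text"] = prompt_text
    return kwargs

clone_kwargs = build_generate_kwargs(
    "Hello world",
    prompt_audio_path="ref.wav",
    prompt_text="Reference transcript",
)
```

Passing the transcript alongside the reference audio is what lets the model align the audio with its text and replicate the voice more faithfully.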
The cfg_value parameter controls the guidance scale, which influences how closely the generated speech adheres to the input text or prompt. Higher values result in speech that is more faithful to the prompt but may sound less natural. The default value is 2.0, with a range from 1.0 to 10.0.
This parameter determines the number of diffusion steps used during the synthesis process. More steps can improve the quality of the generated audio but will increase the processing time. The default is 10 steps, with a range from 1 to 100.
The normalize_text parameter enables text normalization, which is recommended for general text to ensure consistent and natural-sounding speech. It can be toggled on or off, with normalization enabled by default.
The seed parameter is used for reproducibility, allowing you to generate the same audio output from the same input parameters. A value of -1 will result in a random seed, while any other value will produce consistent results.
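The seed behavior can be illustrated with a short sketch; `resolve_seed` is a hypothetical helper, assuming a 32-bit seed range:

```python
import random

def resolve_seed(seed: int) -> int:
    """A seed of -1 picks a fresh random seed; any other value is used
    as-is, so repeated runs with the same inputs are reproducible."""
    if seed == -1:
        return random.randint(0, 2**32 - 1)
    return seed
```

Fixing the seed (together with identical text and parameters) reproduces the same audio; -1 gives a different result on each run.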
This parameter controls whether the VoxCPM model is offloaded from VRAM after generation. Enabling this can help manage memory usage, especially in environments with limited resources. By default, the model is auto-managed.
The waveform output is a tensor representing the generated audio signal. It is the primary output of the node, containing the synthesized speech in a format that can be played back or further processed.
The sample_rate output indicates the sample rate of the generated audio, which is set at 16000 Hz. This is a standard sample rate for speech audio, ensuring compatibility with most audio playback and processing systems.
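Inside ComfyUI the waveform is handled as a tensor, but the fixed 16 kHz rate can be illustrated with a stdlib-only sketch that writes float samples to a mono PCM WAV file; `save_wav` is a hypothetical helper, not part of the node:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # the node's fixed output sample rate

def save_wav(samples, path):
    """Write float samples in [-1, 1] as 16-bit mono PCM at 16 kHz."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit
        wf.setframerate(SAMPLE_RATE)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

# One second of a 440 Hz test tone at the node's sample rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE)]
save_wav(tone, "test_tone.wav")
```

Any downstream tool consuming the node's output should interpret the waveform at 16,000 samples per second, or resample it for pipelines that expect a different rate.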
For voice cloning, provide a reference audio via prompt_audio along with its transcript in prompt_text.
Experiment with different cfg_value settings to find the right balance between adherence to the input text and naturalness of the speech.
Adjust the inference_timesteps parameter to fine-tune the quality and processing time of the audio generation, especially for longer or more complex text inputs.
If the generated audio is unsatisfactory, increase the retry_max_attempts parameter to allow more retries for generating acceptable audio. Additionally, ensure that the input text and reference audio (if used) are well-aligned and of good quality.
To explore different outputs, vary the cfg_value, inference_timesteps, and seed values.