ComfyUI Node: StepAudioEditX - Clone 🎤

Class Name
StepAudio_VoiceClone

Category
audio/step_audio
Author
saganaki22 (Account age: 1683 days)
Extension
Step Audio EditX TTS
Latest Updated
2025-12-04
Github Stars
0.05K

How to Install Step Audio EditX TTS

Install this extension via the ComfyUI Manager by searching for Step Audio EditX TTS:
  • 1. Click the Manager button in the main menu.
  • 2. Select the Custom Nodes Manager button.
  • 3. Enter Step Audio EditX TTS in the search bar.
After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.

StepAudioEditX - Clone 🎤 Description

Zero-shot voice cloning tool in ComfyUI for generating new audio from reference samples.

StepAudioEditX - Clone 🎤:

The StepAudio_VoiceClone node performs zero-shot voice cloning: it replicates a voice from a short reference audio sample and generates new speech in that voice from text you supply. The node is implemented natively in ComfyUI and needs no JavaScript dependencies. The model captures the characteristics of the reference voice from the clip and its transcript, applies them to the target text, and produces 24 kHz audio through the CosyVoice vocoder. This is useful for AI artists and creators who want consistent voiceovers or personalized audio content without collecting extensive voice samples.

StepAudioEditX - Clone 🎤 Input Parameters:

prompt_text

The prompt_text parameter is the transcript of the reference audio. It serves as a textual representation of the audio content you wish to clone. This parameter is crucial as it helps the model understand the phonetic and linguistic characteristics of the reference voice. The input should be a non-empty string, and providing an accurate transcript will enhance the quality of the voice cloning process.

target_text

The target_text parameter specifies the text you want to generate in the cloned voice. This is the new content that will be spoken in the voice captured from the reference audio. It is essential to provide a clear and concise text input, as this will directly influence the output audio. The input should be a non-empty string to ensure successful execution.

model_path

The model_path parameter indicates the location of the voice cloning model to be used. This path must be correctly specified to load the appropriate model for the cloning process. If the model is not found, the node will not function correctly, so ensure the path is accurate and the model is available.

device

The device parameter determines the hardware on which the model will run, such as cpu or cuda for GPU acceleration. Selecting the appropriate device can significantly impact the performance and speed of the voice cloning process.

torch_dtype

The torch_dtype parameter specifies the data type used by PyTorch during model execution, such as float32 or float16. This can affect the precision and memory usage of the model, with lower precision types potentially offering faster performance at the cost of some accuracy.

quantization

The quantization parameter controls whether model quantization is applied, which can reduce the model size and speed up inference. This is particularly useful for running models on devices with limited resources.
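
To make the memory trade-off between torch_dtype and quantization concrete, the following is a rough sketch of how weight storage scales with the bytes used per weight. The function name, the parameter count, and the int8/int4 entries are illustrative assumptions; this page does not list the node's actual quantization options or the model's size, and the estimate ignores activations and other runtime overhead.

```python
# Approximate bytes needed to store one model weight in each format.
BYTES_PER_WEIGHT = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,   # assumed quantized format, for illustration only
    "int4": 0.5,   # assumed quantized format, for illustration only
}


def estimate_weight_memory_gib(num_parameters: int, weight_format: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return num_parameters * BYTES_PER_WEIGHT[weight_format] / 1024**3


if __name__ == "__main__":
    # Hypothetical parameter count, not the real size of the Step Audio model.
    hypothetical_parameters = 3_000_000_000
    for weight_format in BYTES_PER_WEIGHT:
        size = estimate_weight_memory_gib(hypothetical_parameters, weight_format)
        print(f"{weight_format:>8}: {size:.2f} GiB")
```

Halving the bytes per weight halves the weight footprint, which is why a lower-precision torch_dtype or quantization is the first thing to try when VRAM is tight.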

attention_mechanism

The attention_mechanism parameter defines the type of attention mechanism used in the model, which can influence the model's ability to focus on different parts of the input text during generation. This can affect the naturalness and coherence of the output audio.

temperature

The temperature parameter is a float value that controls the randomness of the audio generation process. Lower values result in more deterministic outputs, while higher values introduce more variability and creativity in the generated audio.

do_sample

The do_sample parameter is a boolean that determines whether sampling is used during audio generation. Enabling sampling can lead to more diverse outputs, while disabling it results in more consistent and predictable audio.
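
The interaction between temperature and do_sample can be sketched with a toy token picker. This is a generic illustration of temperature sampling versus greedy decoding, not the node's actual generation code; pick_token and its arguments are hypothetical names.

```python
import math
import random


def pick_token(logits, temperature=1.0, do_sample=True, rng=None):
    """Pick one index from a list of logits (unnormalized scores)."""
    if not do_sample or temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With do_sample disabled the result never changes for the same logits. With sampling enabled, a low temperature concentrates probability on the top-scoring token, while a high temperature spreads it across the alternatives.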

max_new_tokens

The max_new_tokens parameter sets the maximum number of tokens to generate in the output audio. This limits the length of the generated content and can be adjusted based on the desired output length.

longform_chunking

The longform_chunking parameter is a boolean that enables chunking for long-form audio generation. This helps manage memory usage and maintain quality when generating extended audio content.
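
This page does not describe how the node splits long text, so the following is only a minimal sketch of one common approach: grouping whole sentences into chunks under a character budget. The function name, the max_chars limit, and the sentence-boundary rule are all assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # The budget would be exceeded: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized separately and the resulting audio concatenated, which keeps the per-call sequence length bounded.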

seed

The seed parameter is an integer used to initialize the random number generator, ensuring reproducibility of the audio generation process. By setting a specific seed, you can achieve consistent results across different runs.
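
The effect of a fixed seed can be shown with Python's standard library alone. The node presumably seeds PyTorch's generators rather than the random module, so treat this as an analogy under that assumption; draw_samples is a hypothetical helper.

```python
import random


def draw_samples(seed: int, count: int = 5) -> list:
    """Draw `count` pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.randrange(1000) for _ in range(count)]


if __name__ == "__main__":
    # The same seed yields the same sequence on every call.
    print(draw_samples(123) == draw_samples(123))
```

Reproducibility in the node also requires the other settings (texts, reference audio, temperature, and so on) to stay unchanged between runs.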

keep_model_in_vram

The keep_model_in_vram parameter is a boolean that determines whether the model should remain in VRAM after execution. Keeping the model in VRAM can speed up subsequent operations but may consume more memory.

prompt_audio

The prompt_audio parameter is the reference audio input, provided as a ComfyUI AUDIO dictionary. This audio sample is analyzed to extract the voice characteristics needed for cloning. It is a mandatory input, and the absence of this parameter will result in an error.

StepAudioEditX - Clone 🎤 Output Parameters:

audio

The audio output parameter is a ComfyUI AUDIO dictionary containing the generated audio in the cloned voice. This output represents the successful application of the voice cloning process, where the target text is spoken in the voice captured from the reference audio. The quality and fidelity of this output depend on the accuracy of the input parameters and the model's capabilities.
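
In ComfyUI, an AUDIO value is a dictionary with a "waveform" tensor shaped [batch, channels, samples] and an integer "sample_rate". The sketch below inspects such a dictionary; it uses nested Python lists as a stand-in for the tensor so it runs without PyTorch, and describe_audio and example_audio are hypothetical names.

```python
def describe_audio(audio: dict) -> dict:
    """Summarise an AUDIO-style dict: batch size, channels, samples, duration."""
    waveform = audio["waveform"]  # indexed as [batch][channel][sample]
    sample_rate = audio["sample_rate"]
    samples = len(waveform[0][0])
    return {
        "batch": len(waveform),
        "channels": len(waveform[0]),
        "samples": samples,
        "duration_seconds": samples / sample_rate,
    }


# Stand-in for the node's output: one mono clip, 2 seconds of silence at 24 kHz.
example_audio = {"waveform": [[[0.0] * 48_000]], "sample_rate": 24_000}

if __name__ == "__main__":
    print(describe_audio(example_audio))
```

At the node's 24 kHz output rate, every second of generated speech corresponds to 24,000 samples per channel.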

StepAudioEditX - Clone 🎤 Usage Tips:

  • Ensure that the prompt_text accurately reflects the content of the prompt_audio to improve the quality of the voice cloning.
  • Use a higher temperature value for more creative and varied audio outputs, but lower it for more consistent and predictable results.
  • If you experience memory issues, consider enabling quantization or using a lower precision torch_dtype to reduce resource usage.
  • For long-form audio generation, enable longform_chunking to manage memory and maintain audio quality.

StepAudioEditX - Clone 🎤 Common Errors and Solutions:

"prompt_audio is required. Please connect an audio source."

  • Explanation: This error occurs when no audio is connected to the prompt_audio input, which the voice cloning process requires.
  • Solution: Ensure that you connect a valid audio source to the prompt_audio parameter before executing the node.

"prompt_text cannot be empty. Please provide the transcript of the reference audio."

  • Explanation: This error indicates that the prompt_text parameter is empty. The transcript is required so the model can interpret the reference audio.
  • Solution: Provide a non-empty transcript of the reference audio in the prompt_text parameter.

"target_text cannot be empty. Please provide the text to generate."

  • Explanation: This error occurs when the target_text parameter is empty. Without target text there is nothing for the node to generate.
  • Solution: Enter the text you wish to generate in the target_text parameter.

"Step Audio not available: <error_msg>"

  • Explanation: This error suggests that the Step Audio installation is incomplete or incorrect.
  • Solution: Verify the installation of Step Audio and ensure all dependencies are correctly set up.

"Model not found: <model_path>"

  • Explanation: This error indicates that the specified model path is incorrect or the model is missing.
  • Solution: Check the model_path parameter to ensure it points to a valid and existing model file.
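
The input checks behind several of the errors above could look roughly like the sketch below. The error messages are taken from this page, but the function name, the exception types, the order of the checks, and the treatment of whitespace-only text as empty are assumptions, not the node's confirmed behavior.

```python
import os


def validate_clone_inputs(prompt_audio, prompt_text, target_text, model_path):
    """Raise an error with the documented message if an input is unusable."""
    if prompt_audio is None:
        raise ValueError("prompt_audio is required. Please connect an audio source.")
    if not prompt_text or not prompt_text.strip():
        raise ValueError(
            "prompt_text cannot be empty. "
            "Please provide the transcript of the reference audio."
        )
    if not target_text or not target_text.strip():
        raise ValueError(
            "target_text cannot be empty. Please provide the text to generate."
        )
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"Model not found: {model_path}")
```

Running checks like these before loading the model means a missing input fails fast instead of after a long model load.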

Copyright 2025 RunComfy. All Rights Reserved.