StepAudioEditX - Clone 🎤

Zero-shot voice cloning tool in ComfyUI for generating new audio from reference samples.

StepAudioEditX - Clone 🎤:

The StepAudio_VoiceClone node performs zero-shot voice cloning: it captures a voice from a short reference audio clip and generates new speech in that voice from text you supply. It is implemented natively in ComfyUI and requires no JavaScript dependencies. Output is 24 kHz audio produced with the CosyVoice vocoder. This is useful for AI artists and creators who want to keep voiceovers consistent or create personalized audio without collecting extensive voice samples.

StepAudioEditX - Clone 🎤 Input Parameters:

prompt_text

The prompt_text parameter is the transcript of the reference audio, that is, the words actually spoken in the clip you want to clone. It helps the model relate the phonetic and linguistic content of the clip to the reference voice, so an accurate transcript improves cloning quality. It must be a non-empty string.

target_text

The target_text parameter is the text to be spoken in the cloned voice. It directly determines the content of the output audio, so keep it clear and free of stray characters. It must be a non-empty string.
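The non-empty requirement on prompt_text and target_text can be checked before a workflow is queued. The following is a minimal sketch, not the node's own validation: the helper name validate_clone_texts is hypothetical, and treating whitespace-only strings as empty is an assumption.

```python
def validate_clone_texts(prompt_text, target_text):
    """Return both strings stripped, or raise if either is empty.

    Hypothetical helper: the node's real validation may differ.
    Assumption: whitespace-only input counts as empty.
    """
    cleaned = []
    for name, value in (("prompt_text", prompt_text), ("target_text", target_text)):
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be a non-empty string")
        cleaned.append(value.strip())
    return cleaned[0], cleaned[1]


if __name__ == "__main__":
    print(validate_clone_texts(" Hello there. ", "This is new speech."))
```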

model_path

The model_path parameter specifies the location of the voice cloning model to load. If the path is wrong or the model is missing, the node cannot load the model and will not run correctly, so verify the path and confirm the model is available before executing.

device

The device parameter selects the hardware on which the model runs, such as cpu or cuda for GPU acceleration. Running on a GPU is typically much faster than running on the CPU, so choose cuda when a compatible GPU is available.
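A common pattern when choosing a device is to prefer cuda and fall back to cpu when no GPU is usable. This is a sketch of that pattern only; pick_device is a hypothetical helper, not part of the node, and the fallback when PyTorch is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device(preferred="cuda"):
    """Return "cuda" only if it was requested and is actually available."""
    try:
        import torch  # optional: without PyTorch we can only report "cpu"
    except ImportError:
        return "cpu"
    if preferred == "cuda" and torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```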

torch_dtype

The torch_dtype parameter specifies the PyTorch data type used for model execution, such as float32 or float16. It trades precision against memory: float16 stores each value in 2 bytes instead of 4, roughly halving the memory needed for weights and often running faster on GPUs, at the cost of some numerical accuracy.
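The memory effect of the data type is simple arithmetic: parameter count times bytes per value. The sketch below estimates weight memory only (activations and caches are ignored). The parameter count used in the example is a placeholder assumption, not a figure taken from this node's model.

```python
# Bytes needed to store one value of each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    hypothetical_params = 1_000_000_000  # placeholder, not the real model size
    for dtype in BYTES_PER_PARAM:
        print(dtype, round(weight_memory_gib(hypothetical_params, dtype), 2), "GiB")
```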

quantization

The quantization parameter controls whether model quantization is applied. Quantization stores weights at reduced precision, which shrinks the model's memory footprint and may speed up inference, usually with some loss of accuracy. It is most useful on devices with limited memory.

attention_mechanism

The attention_mechanism parameter selects the attention implementation used by the model. Attention is how the model weighs different parts of its input during generation. Depending on the option chosen, this setting may affect speed and memory use, and it can influence the naturalness and coherence of the output audio.

temperature

The temperature parameter is a float that controls the randomness of generation. Lower values concentrate probability on the most likely tokens and give more deterministic output; higher values flatten the distribution and give more varied, less predictable audio.
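The usual way temperature works in token-based generation is to divide the model's logits by the temperature before the softmax. The sketch below shows that standard technique on made-up logits; it illustrates the general mechanism and is not taken from this node's code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.5]  # illustrative values only
    for t in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(example_logits, t)
        print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's probability shrinking as the temperature rises, which is why higher settings sound less predictable.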

do_sample

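The do_sample parameter is a boolean that determines whether tokens are sampled during generation. When enabled, each token is drawn at random from the model's probability distribution, which gives more diverse output. When disabled, generation is deterministic and yields more consistent, predictable audio; in that mode, randomness settings such as temperature typically have no effect.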
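The difference between the two modes can be shown on a single decoding step. This sketch assumes that disabling sampling means greedy decoding (always taking the most probable token), which is the common convention but is not confirmed for this node. pick_token and the probabilities are illustrative.

```python
import random


def pick_token(probs, do_sample, rng=None):
    """Choose one token index from a probability list."""
    if not do_sample:
        # Greedy: always the index with the highest probability.
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random.Random()
    # Sampling: draw an index in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    probs = [0.1, 0.7, 0.2]  # illustrative distribution
    print("greedy:", pick_token(probs, do_sample=False))
    print("sampled:", [pick_token(probs, do_sample=True) for _ in range(10)])
```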

max_new_tokens

The max_new_tokens parameter sets the maximum number of tokens the model may generate, which caps the length of the output audio. Increase it for longer target text; if the limit is too low, the generated audio may be cut off before the text is fully spoken.

longform_chunking

The longform_chunking parameter is a boolean that enables chunking for long-form audio generation. This helps manage memory usage and maintain quality when generating extended audio content.
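One plausible way to chunk long text is to split on sentence boundaries and pack sentences into pieces under a size limit, then synthesize each piece separately. The sketch below illustrates that idea only. How the node actually splits text is not documented here, and chunk_text and max_chars are hypothetical names.

```python
import re


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the full chunk
            current = sentence      # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("First sentence. Second one! A third? Done.", max_chars=30):
        print(repr(piece))
```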

seed

The seed parameter is an integer used to initialize the random number generator, ensuring reproducibility of the audio generation process. By setting a specific seed, you can achieve consistent results across different runs.
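Reproducibility from a seed can be demonstrated with Python's own random number generator: the same seed yields the same sequence of draws. This is an analogy for how seeded sampling behaves, not the node's implementation; sample_tokens is a hypothetical helper.

```python
import random


def sample_tokens(probs, count, seed):
    """Draw `count` token indices using a generator initialized with `seed`."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=count)


if __name__ == "__main__":
    probs = [0.2, 0.5, 0.3]  # illustrative distribution
    print(sample_tokens(probs, 8, seed=42))
    print(sample_tokens(probs, 8, seed=42))  # identical to the line above
    print(sample_tokens(probs, 8, seed=7))   # a different seed may differ
```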

keep_model_in_vram

The keep_model_in_vram parameter is a boolean that determines whether the model should remain in VRAM after execution. Keeping the model in VRAM can speed up subsequent operations but may consume more memory.

prompt_audio

The prompt_audio parameter is the reference audio input, provided as a ComfyUI AUDIO dictionary. The node analyzes this sample to extract the voice characteristics used for cloning. It is a mandatory input; omitting it results in an error.
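A ComfyUI AUDIO dictionary holds a "waveform" tensor shaped [batch, channels, samples] and an integer "sample_rate". The sketch below checks that structure and reports the clip length. describe_audio is a hypothetical helper, and it only inspects the .shape attribute so it can be tried without PyTorch installed.

```python
def describe_audio(audio):
    """Validate an AUDIO-style dict and return its duration in seconds."""
    if not isinstance(audio, dict):
        raise TypeError("AUDIO must be a dictionary")
    for key in ("waveform", "sample_rate"):
        if key not in audio:
            raise KeyError(f"AUDIO is missing '{key}'")
    shape = tuple(audio["waveform"].shape)
    if len(shape) != 3:
        raise ValueError("waveform must be shaped [batch, channels, samples]")
    if audio["sample_rate"] <= 0:
        raise ValueError("sample_rate must be positive")
    return shape[2] / audio["sample_rate"]


if __name__ == "__main__":
    class FakeWaveform:
        """Stand-in for a tensor; only the shape matters here."""
        shape = (1, 1, 48000)

    print(describe_audio({"waveform": FakeWaveform(), "sample_rate": 24000}))
```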