Voice Clone (QwenTTS):
The AILab_Qwen3TTSVoiceClone node is designed to facilitate the creation of synthetic voices that closely mimic a reference audio sample. This node leverages advanced voice cloning technology to generate speech that retains the unique characteristics and nuances of the original speaker's voice. By inputting a reference audio and target text, the node synthesizes a new audio output that sounds as if the original speaker is delivering the new content. This capability is particularly beneficial for applications requiring personalized voice synthesis, such as virtual assistants, audiobooks, and other multimedia content where maintaining a consistent voice identity is crucial. The node's functionality is enhanced by its ability to handle various languages and adjust parameters like temperature and repetition penalty to fine-tune the output's naturalness and variability.
Voice Clone (QwenTTS) Input Parameters:
reference_audio
The reference_audio parameter is used to provide the audio sample that the node will use as a reference for cloning the voice. This audio should be a clear recording of the voice you wish to replicate. The quality and clarity of this audio directly impact the accuracy and quality of the cloned voice. There is no specific minimum or maximum length for the audio, but longer samples may provide better results.
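Before wiring a clip into reference_audio, it can help to sanity-check its basic properties with the Python standard library. This is a minimal illustrative sketch; the node itself does not expose such a helper, and no particular sample rate or channel count is documented as a requirement:

```python
import wave

def describe_wav(path):
    """Return (channels, sample_rate, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return channels, rate, duration
```

A clean, mono recording of a few seconds or more is a reasonable starting point; inspecting the file first avoids feeding the node a clip that is silent, truncated, or in an unexpected layout.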
target_text
The target_text parameter specifies the text that you want the cloned voice to speak. This text will be synthesized into speech using the voice characteristics extracted from the reference_audio. There are no restrictions on the length of the text, but longer texts may require more processing time.
model_size
The model_size parameter determines the size of the model used for voice cloning. Larger models may provide more accurate and natural-sounding results but require more computational resources; typical options include "small", "medium", and "large".
device
The device parameter specifies the hardware on which the model will run. Options typically include "cpu" or "gpu", with "auto" allowing the system to choose the best available option. Using a GPU can significantly speed up processing times.
precision
The precision parameter controls the numerical precision used during processing, with options like "bf16" (bfloat16) offering a balance between performance and accuracy. This setting can affect the speed and memory usage of the node.
language
The language parameter indicates the language of the target_text. This ensures that the synthesized speech uses appropriate phonetic and linguistic rules for the specified language. It is important to match this parameter with the language of the text for optimal results.
reference_text
The reference_text parameter is an optional input that provides the text content of the reference_audio. This can help improve the accuracy of the voice cloning process, especially if the reference audio is not entirely clear.
x_vector_only
The x_vector_only parameter is a boolean flag that, when set to true, limits the processing to extracting the x-vector from the reference audio. This is useful for scenarios where only the voice characteristics are needed without generating new speech.
voice
The voice parameter allows you to specify a pre-existing voice model to use as a base for cloning. This can be useful if you have a specific voice model that you want to adapt or modify.
unload_models
The unload_models parameter is a boolean flag that, when set to true, unloads the models from memory after processing. This can help manage memory usage, especially when working with large models or limited resources.
seed
The seed parameter is used to set the random seed for the generation process, ensuring reproducibility of results. A value of -1 indicates that no specific seed is set, allowing for variability in the output.
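The -1 convention can be implemented by drawing a fresh seed whenever none is supplied. The sketch below uses Python's stdlib random module for illustration; the node itself presumably seeds its own framework's RNG, so these function names are hypothetical:

```python
import random

def resolve_seed(seed: int) -> int:
    """Return a concrete seed: -1 means 'pick one at random'."""
    if seed == -1:
        return random.randrange(2**32)
    return seed

def generate_with_seed(seed: int, n: int = 4):
    """Draw n values from an RNG seeded via resolve_seed."""
    rng = random.Random(resolve_seed(seed))
    return [rng.random() for _ in range(n)]
```

With a fixed seed the same inputs reproduce the same output; with -1 each run may vary.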
max_new_tokens
The max_new_tokens parameter defines the maximum number of new tokens (subword units, not whole words) that can be generated in the output. This caps the length of the synthesized speech and can be adjusted based on the desired output length.
do_sample
The do_sample parameter is a boolean flag that, when set to true, enables sampling during generation, allowing for more varied and creative outputs. When false, the output is more deterministic.
top_p
The top_p parameter is used in nucleus sampling to control the diversity of the output. It specifies the cumulative probability threshold for token selection, with lower values leading to more conservative outputs.
top_k
The top_k parameter limits the number of tokens considered at each step during generation. A lower value results in more focused outputs, while a higher value allows for more diversity.
temperature
The temperature parameter controls the randomness of the output. Higher values result in more varied and creative outputs, while lower values produce more deterministic results.
repetition_penalty
The repetition_penalty parameter discourages the model from repeating the same phrases or words, enhancing the naturalness of the output. A value of 1.0 means no penalty, while higher values increase the penalty.
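The sampling knobs above compose in a standard order: penalize repeats, scale by temperature, filter with top-k, then keep the top-p nucleus. The following self-contained sketch mirrors common Transformers-style generation code over a toy logit vector; it illustrates the mechanics, not the node's exact internals:

```python
import math

def filter_logits(logits, generated, temperature=1.0,
                  repetition_penalty=1.0, top_k=0, top_p=1.0):
    """Return per-token probabilities after applying the sampling knobs."""
    logits = list(logits)
    # Repetition penalty: push down tokens that were already generated.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Temperature: values < 1 sharpen the distribution, > 1 flatten it.
    logits = [l / temperature for l in logits]
    # Top-k: keep only the k highest logits.
    if top_k > 0:
        kth = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
        logits = [l if l >= kth else float("-inf") for l in logits]
    # Softmax to probabilities.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p (nucleus): keep the smallest set whose mass reaches top_p.
    if top_p < 1.0:
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        kept, cum = set(), 0.0
        for i in order:
            kept.add(i)
            cum += probs[i]
            if cum >= top_p:
                break
        probs = [p if i in kept else 0.0 for i, p in enumerate(probs)]
        total = sum(probs)
        probs = [p / total for p in probs]
    return probs
```

For example, a low top_p concentrates all probability on the strongest candidates, while repetition_penalty > 1.0 demotes tokens that have already appeared, which is why raising it tends to reduce stuttering or looping in the synthesized speech.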
attention
The attention parameter specifies the attention mechanism used during processing. The "auto" setting allows the system to choose the best option based on the available resources and model configuration.
Voice Clone (QwenTTS) Output Parameters:
audio
The audio output parameter provides the synthesized speech audio that mimics the voice characteristics of the reference_audio while delivering the content of the target_text. This output is crucial for applications requiring personalized and consistent voice synthesis, as it allows you to generate new speech content that sounds as if it were spoken by the original speaker.
Voice Clone (QwenTTS) Usage Tips:
- Ensure that the reference_audio is of high quality and free from background noise to achieve the best voice cloning results.
- Experiment with the temperature and top_p parameters to find the right balance between creativity and naturalness in the synthesized speech.
- Use the language parameter to match the language of the target_text for accurate phonetic rendering.
- Consider using a GPU by setting the device parameter to "gpu" for faster processing times, especially with larger models.
Voice Clone (QwenTTS) Common Errors and Solutions:
"Invalid reference audio format"
- Explanation: The provided reference_audio is not in a supported format or is corrupted.
- Solution: Ensure that the audio file is in a compatible format such as WAV or MP3 and is not corrupted. Re-record the audio if necessary.
"Model size not supported"
- Explanation: The specified model_size is not available or supported by the system.
- Solution: Check the available model sizes and select one that is supported. Common options might include "small", "medium", or "large".
"Language not recognized"
- Explanation: The language parameter is set to a language that is not supported by the model.
- Solution: Verify the list of supported languages and ensure that the language parameter matches one of them.
"Out of memory error"
- Explanation: The system ran out of memory while processing the request, possibly due to large model size or insufficient resources.
- Solution: Try reducing the model_size, or ensure that the device is set to "gpu" if available. Additionally, consider closing other applications to free up memory.
