Advanced voice cloning node with pitch and speed control for personalized text-to-speech outputs in English and Chinese.
The SparkTTS_AdvVoiceClone node is a powerful tool designed for advanced voice cloning, allowing you to replicate a voice from a reference audio sample with additional control over pitch and speed. This node is particularly beneficial for creating personalized and dynamic text-to-speech outputs, as it enables you to fine-tune the vocal characteristics to match specific needs or artistic visions. By leveraging the capabilities of SparkTTS, this node supports both English and Chinese languages, making it versatile for a wide range of applications. The main goal of this node is to provide a high-quality voice cloning experience that can be customized to suit various creative projects, ensuring that the synthesized speech closely resembles the original speaker's voice while allowing for creative adjustments in tone and tempo.
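If you drive ComfyUI programmatically, a workflow containing this node can be queued over the server's /prompt HTTP endpoint. The sketch below is illustrative only: the upstream LoadAudio node and the exact input socket names (text, reference_text, and so on) are assumptions, so check them against the node as it appears in your ComfyUI installation.

```python
import json
import urllib.request

# Hedged sketch: queue a workflow containing SparkTTS_AdvVoiceClone via ComfyUI's
# HTTP /prompt endpoint. The input field names below ("text", "reference_text")
# and the LoadAudio loader are assumptions; verify them in your own install.
workflow = {
    "1": {
        "class_type": "LoadAudio",            # assumed upstream audio loader
        "inputs": {"audio": "reference_voice.wav"},
    },
    "2": {
        "class_type": "SparkTTS_AdvVoiceClone",
        "inputs": {
            "text": "Hello there.\n\nThis is the second paragraph.",
            "reference_audio": ["1", 0],       # link to the loader's audio output
            "reference_text": "Exact transcript of the reference clip.",
            "pitch": "moderate",
            "speed": "moderate",
            "max_tokens": 3000,
        },
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
```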
This parameter is the text you wish to synthesize using the cloned voice. It supports multiline input, allowing you to enter longer passages of text. The default text is a placeholder that explains the node's function. You can separate paragraphs with double line breaks to structure the output speech. This input is crucial as it defines the content of the synthesized speech.
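As a small illustration of the paragraph convention, the snippet below shows text containing a double line break and how it divides into two passages; whether the node inserts an audible pause between them is up to SparkTTS itself.

```python
# Illustration: double line breaks delimit paragraphs in the text input,
# so each block below is treated as a separate passage in the output speech.
text = (
    "Welcome to the demo. This is the first paragraph.\n\n"
    "And this is the second paragraph, spoken as its own passage."
)
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
print(paragraphs)  # two entries, one per paragraph
```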
The reference_audio parameter is an audio sample from which the voice will be cloned. This audio serves as the basis for capturing the unique vocal characteristics of the speaker, such as tone, accent, and style. Providing a clear and high-quality audio sample will significantly enhance the accuracy of the voice cloning process.
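A quick way to sanity-check a reference clip before wiring it into the node is sketched below using torchaudio. The mono downmix and the rough duration check are assumptions about sensible preprocessing, not documented requirements of the node.

```python
import torchaudio

# Hedged sketch: inspect a reference clip before feeding it to the node.
# A clean clip of a few seconds is a reasonable target; the exact requirements
# depend on the SparkTTS model you are running.
waveform, sample_rate = torchaudio.load("reference_voice.wav")
duration_s = waveform.shape[-1] / sample_rate
print(f"channels={waveform.shape[0]}, sample_rate={sample_rate}, duration={duration_s:.1f}s")

# Downmix to mono if the clip is stereo (an assumed preprocessing step,
# not a documented requirement of the node).
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
```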
This parameter requires the exact text spoken in the reference audio. By providing this text, you help the model understand the speaker's pronunciation patterns, which significantly improves the quality of the voice cloning. It is especially important for capturing nuances in speech and ensuring that the synthesized voice closely matches the original.
The pitch parameter allows you to adjust the pitch of the synthesized voice. You can choose from options like "very_low," "low," "moderate," "high," and "very_high," with "moderate" being the default. Adjusting the pitch can help match the emotional tone or artistic style you are aiming for in your project.
This parameter controls the speed of the synthesized speech. Similar to pitch, you can select from "very_low," "low," "moderate," "high," and "very_high," with "moderate" as the default. Modifying the speed can be useful for creating different pacing effects, such as a slow, dramatic narration or a fast-paced, energetic delivery.
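For reference, pitch and speed expose the same five discrete levels. The snippet below simply collects them and picks a slower, deeper delivery; how each label maps to a concrete pitch shift or speaking rate is internal to SparkTTS and not assumed here.

```python
# The discrete control levels exposed by the node, per the descriptions above.
PITCH_LEVELS = ["very_low", "low", "moderate", "high", "very_high"]
SPEED_LEVELS = ["very_low", "low", "moderate", "high", "very_high"]

# Example choice: a slow, dramatic narration with a slightly deeper voice.
settings = {"pitch": "low", "speed": "very_low"}
assert settings["pitch"] in PITCH_LEVELS and settings["speed"] in SPEED_LEVELS
```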
The max_tokens parameter determines the maximum length of the generated speech in terms of tokens. It ranges from 500 to 5000, with a default value of 3000. Higher values allow for longer text synthesis but require more memory. If you encounter out-of-memory errors, consider reducing this value. Conversely, increase it for very long texts to ensure the entire content is synthesized.
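If you want a starting value rather than always using the default, a rough heuristic like the one below scales max_tokens with the input length and clamps it to the node's 500 to 5000 range. The tokens-per-character ratio is a guess for illustration, not a property of SparkTTS.

```python
# Hedged heuristic: pick a max_tokens budget from the text length, staying within
# the node's documented 500..5000 range. The tokens-per-character ratio is an
# assumption for illustration only.
def suggest_max_tokens(text: str, tokens_per_char: float = 2.0) -> int:
    estimate = int(len(text) * tokens_per_char)
    return max(500, min(5000, estimate))

print(suggest_max_tokens("A short sentence."))        # clamps to the 500 floor
print(suggest_max_tokens("A long script... " * 300))  # clamps to the 5000 ceiling
```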
The synthesized_audio output is the final audio file generated by the node, containing the text-to-speech synthesis based on the input parameters. This audio reflects the cloned voice characteristics, adjusted pitch, and speed settings, providing a customized and high-quality speech output that can be used in various creative projects.
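If you need the result outside ComfyUI's built-in audio preview and save nodes, the sketch below writes it to a WAV file. It assumes the common ComfyUI AUDIO convention of a dict holding a [batch, channels, samples] waveform tensor plus a sample rate; verify that against what this node actually returns.

```python
import torchaudio

# Hedged sketch: persist the node's audio output outside ComfyUI. Assumes the
# common ComfyUI AUDIO dict layout {"waveform": [batch, channels, samples],
# "sample_rate": int}; check the node's actual return value before relying on it.
def save_synthesized_audio(audio: dict, path: str = "cloned_voice.wav") -> None:
    waveform = audio["waveform"][0]          # drop the batch dimension
    torchaudio.save(path, waveform, audio["sample_rate"])
```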