Powerful node for realistic and expressive speech synthesis from text, customizable for various projects and devices.
KaniTTS is a powerful node designed to generate speech from text using the KaniTTS model. This node is particularly beneficial for AI artists and developers who wish to incorporate realistic and expressive speech synthesis into their projects. By leveraging advanced text-to-speech technology, KaniTTS can transform written text into natural-sounding audio, making it an essential tool for creating voiceovers, virtual assistants, and interactive media. The node supports various configurations, allowing users to customize the speech output to suit their specific needs, such as adjusting the randomness of the speech or selecting different speakers. Its ability to handle different devices and manage resources efficiently ensures smooth performance, even on systems with limited computational power.
The model_name parameter selects the specific KaniTTS model to use for speech generation. The available models may vary in their capabilities, such as support for different speakers. The default is typically the 370m model, which supports speaker selection. Choosing the right model affects the quality and character of the generated speech.
The speaker parameter lets you choose a specific speaker's voice for the speech synthesis. This option is only applicable when using models that support multiple speakers, such as the 370m model. The default value is "None," which means no specific speaker is selected. Selecting a speaker can add a personalized touch to the generated audio.
This is the text input that you want to convert into speech. It supports multiline input, allowing you to synthesize longer passages of text. The default text is "Hello world! My name is Kani, I'm a speech generation model!" The text input must not be empty, as it is the primary content for speech generation.
The temperature parameter controls the randomness of the speech generation process. A higher temperature value results in more creative and varied speech, while a lower value produces more deterministic and consistent output. The default value is 1.4, with a range from 0.1 to 2.0, adjustable in steps of 0.05.
The top_p parameter sets the nucleus sampling threshold, which influences the diversity of the generated speech. A higher top_p value allows for more diverse outputs by considering a larger set of possible tokens. The default value is 0.95, with a range from 0.1 to 1.0, adjustable in steps of 0.05.
The repetition_penalty parameter applies a penalty to repeated tokens in the generated speech, helping to reduce redundancy and improve the naturalness of the output. The default value is 1.1, with a range from 1.0 to 2.0, adjustable in steps of 0.05.
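Taken together, temperature, top_p, and repetition_penalty shape each sampling step during generation. The sketch below is a generic, self-contained illustration of how such a step typically works in autoregressive models; it is not KaniTTS's actual implementation, and the function name and list-based logits are illustrative only.

```python
import math
import random

def sample_token(logits, history, temperature=1.4, top_p=0.95,
                 repetition_penalty=1.1, rng=None):
    """Illustrative sampling step combining the three knobs described above.

    `logits` is a list of raw scores, `history` the set of token ids
    already generated. A generic sketch, not KaniTTS's actual code.
    """
    rng = rng or random.Random()
    # Repetition penalty: push down scores of tokens already emitted.
    scores = [
        l / repetition_penalty if i in history and l > 0
        else l * repetition_penalty if i in history
        else l
        for i, l in enumerate(logits)
    ]
    # Temperature: higher values flatten the distribution (more varied output).
    scores = [s / temperature for s in scores]
    m = max(scores)
    probs = [math.exp(s - m) for s in scores]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Nucleus (top-p) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then sample within that set.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a very small top_p the nucleus collapses to the single most likely token, which is why lower values produce more deterministic speech.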
This parameter defines the maximum number of audio tokens to generate, effectively setting the length of the synthesized speech. The default value is 1200, with a range from 100 to 2000, adjustable in steps of 50. Adjusting this parameter can help control the duration of the output audio.
The seed parameter is used for reproducibility, ensuring that the same input parameters produce the same output. A value of -1 indicates a random seed, while other values can be used to generate consistent results. The default value is -1, with a range from -1 to 0xFFFFFFFFFFFFFFFF.
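The seed semantics described above (-1 means "pick a fresh random seed," anything else is used verbatim) can be sketched as follows. The helper names are hypothetical, and a toy integer generator stands in for the model so the reproducibility behavior is easy to see.

```python
import random

def resolve_seed(seed: int) -> int:
    """Hypothetical helper mirroring the documented semantics:
    -1 draws a fresh random seed; any other value is used as-is
    so repeated runs reproduce the same output."""
    if seed == -1:
        return random.randrange(0, 2**64)
    return seed

def generate_ids(seed: int, n: int = 5):
    # Seeding a dedicated Random instance makes generation deterministic
    # for a fixed seed without disturbing global RNG state.
    rng = random.Random(resolve_seed(seed))
    return [rng.randrange(0, 100) for _ in range(n)]
```

Running `generate_ids` twice with the same non-negative seed yields identical token sequences, which is exactly the reproducibility guarantee the seed parameter provides.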
This parameter determines whether the KaniTTS model is forcefully offloaded from VRAM after generation. The default setting, "Auto-Manage," lets the system manage resources automatically; choosing "Force Offload" frees up VRAM, which can be useful in resource-constrained environments.
The device parameter specifies the hardware device to use for running the inference, such as "cuda" for GPU or "cpu" for CPU. The default device is determined by the system's available hardware. Selecting the appropriate device can significantly impact the performance and speed of the speech generation process.
This parameter sets the data type for the model's computations, affecting precision and performance. Supported types include "float16" and "float32," with "float16" being the default for MPS devices. Choosing the right dtype can optimize the balance between speed and accuracy.
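The device and dtype choices described above can be summarized as a small selection routine. This is a hypothetical sketch, not the node's actual code: the real node would query the hardware directly (e.g. via torch), while plain booleans keep this example self-contained. The CUDA default shown is an assumption; only the float16-on-MPS default comes from the documentation above.

```python
def pick_device_and_dtype(cuda_available: bool, mps_available: bool):
    """Hypothetical selection logic: prefer CUDA, then Apple's MPS
    (where float16 is the stated default), then fall back to CPU."""
    if cuda_available:
        return "cuda", "float16"   # assumption: half precision on GPU
    if mps_available:
        return "mps", "float16"    # float16 is the documented MPS default
    return "cpu", "float32"        # full precision is the safe CPU choice
```

Lower-precision dtypes roughly halve memory use and often speed up inference, at a small potential cost in audio fidelity.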
The waveform output parameter contains the generated audio data in the form of a tensor. This tensor represents the synthesized speech waveform, which can be used for playback or further processing. The waveform is crucial for converting the text input into audible speech.
The sample_rate parameter indicates the audio sample rate of the generated waveform. This value is essential for ensuring that the audio is played back at the correct speed and quality. The sample rate is determined by the KaniTTS model's configuration and is typically set to a standard value for speech synthesis.
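The two outputs relate directly: the playback duration of the generated speech is the number of samples in the waveform divided by the sample rate. A minimal sketch, assuming the waveform is passed as a flat sequence of samples (ComfyUI audio tensors are typically batched and multi-channel, in which case the length of the last dimension is what matters):

```python
def audio_duration_seconds(waveform, sample_rate: int) -> float:
    """Compute playback duration from the waveform and sample_rate outputs.

    `waveform` is any flat sequence of samples (list, array, 1-D tensor).
    For a [batch, channels, samples] tensor, pass the last dimension.
    """
    return len(waveform) / sample_rate

# Example: 22050 samples at a 22.05 kHz sample rate play for one second.
```

This is handy for sanity-checking that the max-token setting produced audio of the expected length before wiring the output into a save or preview node.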