Chatterbox TTS 📢:
ChatterboxTTS is a sophisticated text-to-speech (TTS) node designed to convert written text into natural-sounding speech. It leverages advanced machine learning models to generate high-quality audio outputs that can mimic human speech with remarkable accuracy. The node is particularly beneficial for applications requiring dynamic voice synthesis, such as virtual assistants, audiobooks, and interactive media. By utilizing a combination of voice encoding, text tokenization, and speech generation techniques, ChatterboxTTS can produce speech that reflects various emotional tones and speaker characteristics. This flexibility allows users to create personalized and contextually appropriate audio content. The node also includes features like conditional generation and watermarking to ensure the integrity and authenticity of the generated audio.
Chatterbox TTS 📢 Input Parameters:
text
The text parameter is the primary input for the ChatterboxTTS node, representing the written content you wish to convert into speech. This parameter accepts a string of text, which the node processes to generate corresponding audio. The quality and clarity of the output speech are directly influenced by the input text, so it's important to ensure that the text is well-structured and free of errors. There are no strict minimum or maximum length constraints, but excessively long texts may require more processing time.
repetition_penalty
The repetition_penalty parameter helps control the tendency of the model to repeat phrases or words in the generated speech. A value greater than 1.0 discourages repetition, while a value less than 1.0 encourages it. The default value is 1.2, which generally provides a good balance for most applications.
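The node's internal implementation is not shown here, but repetition penalties in autoregressive models are commonly applied to the logits as in this illustrative, pure-Python sketch (the function name and the divide/multiply convention follow the widely used formulation, not necessarily Chatterbox's exact code):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Penalize the logits of tokens that already appear in the output.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so any penalty > 1.0 makes repeats less likely.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out

# Token 1 was already generated, so its logit is pushed down.
logits = [2.0, 3.0, -1.0]
result = apply_repetition_penalty(logits, [1])  # [2.0, 2.5, -1.0]
```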
min_p
The min_p parameter sets a threshold for the probability of selecting tokens during speech generation. It helps filter out less likely token sequences, ensuring more coherent and natural-sounding speech. The default value is 0.05, with a range typically between 0.0 and 1.0.
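The standard min_p technique keeps only tokens whose probability is at least min_p times the probability of the most likely token. This small sketch shows that filtering in isolation (illustrative only; the real node applies it inside its sampling loop):

```python
def min_p_filter(probs, min_p=0.05):
    """Zero out tokens whose probability falls below min_p times the
    top token's probability, then renormalize the survivors."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.70, 0.25, 0.04, 0.01]
filtered = min_p_filter(probs, min_p=0.1)
# threshold is 0.1 * 0.70 = 0.07, so the last two tokens are dropped
```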
top_p
The top_p parameter controls nucleus sampling, which sets a cumulative probability threshold for token selection. The model is restricted to the smallest set of most probable tokens whose cumulative probability reaches top_p, which helps keep the generated speech coherent. The default value is 1.0, which means all tokens are considered.
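Nucleus sampling can be sketched as follows (a minimal, pure-Python illustration of the general technique, not the node's actual sampler):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens, in descending probability order,
    whose cumulative probability reaches top_p; renormalize the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0
            for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, top_p=0.8)
# The first two tokens (0.5 + 0.3) already cover 0.8, so only they survive.
```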
audio_prompt_path
The audio_prompt_path parameter allows you to specify a path to an audio file that serves as a reference for the desired voice characteristics in the generated speech. This can be particularly useful for creating speech that matches a specific speaker's voice or style. If not provided, the node uses default voice settings.
exaggeration
The exaggeration parameter adjusts the emotional intensity of the generated speech. A higher value results in more pronounced emotional expression, while a lower value produces more neutral speech. The default value is 0.5, providing a balanced emotional tone.
cfg_weight
The cfg_weight parameter influences the strength of the conditional generation features, allowing you to control how closely the generated speech adheres to the specified conditions. The default value is 0.5, offering a moderate level of adherence.
temperature
The temperature parameter controls the randomness of the speech generation process. A higher temperature value results in more varied and creative outputs, while a lower value produces more deterministic and consistent speech. The default value is 0.8.
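Temperature works by dividing the logits before the softmax, as in this generic sketch (illustrative of the standard technique, not Chatterbox-specific code):

```python
import math

def softmax_with_temperature(logits, temperature=0.8):
    """Lower temperature sharpens the distribution (more deterministic
    output); higher temperature flattens it (more varied output)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
# The top token gets more probability mass at low temperature than at high.
```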
pbar
The pbar parameter is used to display a progress bar during the speech generation process. This can be helpful for monitoring the progress of longer text inputs. It is optional and typically used in interactive environments.
max_new_tokens
The max_new_tokens parameter sets the maximum number of tokens that can be generated for the speech output. This limits the length of the generated audio, ensuring it remains within a manageable duration. The default value is 1000 tokens.
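Conceptually, max_new_tokens caps an autoregressive generation loop like the one below. The `generate_step` callable and the `stop_token` value are hypothetical stand-ins for the model's real sampler and end-of-speech token:

```python
def generate(generate_step, max_new_tokens=1000, stop_token=0):
    """Generate tokens one at a time, stopping at the stop token or
    after max_new_tokens steps, whichever comes first."""
    tokens = []
    for _ in range(max_new_tokens):
        tok = generate_step(tokens)
        if tok == stop_token:
            break
        tokens.append(tok)
    return tokens

# A dummy sampler that never emits the stop token: the cap kicks in.
tokens = generate(lambda ts: 7, max_new_tokens=5)
print(len(tokens))  # 5
```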
flow_cfg_scale
The flow_cfg_scale parameter adjusts the scale of the flow configuration used in the speech generation process. It influences the smoothness and coherence of the generated audio. The default value is 0.7, providing a good balance for most use cases.
Chatterbox TTS 📢 Output Parameters:
waveform
The waveform output parameter contains the generated audio data in the form of a waveform tensor. This tensor represents the synthesized speech corresponding to the input text, ready for playback or further processing. The waveform is typically accompanied by a sample rate, ensuring compatibility with standard audio playback systems.
sample_rate
The sample_rate output parameter specifies the sample rate of the generated audio waveform. This value is crucial for ensuring that the audio is played back at the correct speed and quality. The sample rate is typically set to match the capabilities of the audio playback system or the requirements of the application.
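If you need to persist the output outside of the node graph, the waveform and sample rate pair can be written to disk with Python's standard `wave` module. This is a generic sketch: the `save_waveform` helper is hypothetical, the 24 kHz rate is only an example, and a synthetic test tone stands in for real node output:

```python
import math
import struct
import wave

def save_waveform(path, samples, sample_rate=24000):
    """Write a list of float samples in [-1.0, 1.0] as a 16-bit mono WAV.

    The sample_rate must match the one returned by the node, or the
    audio will play back at the wrong speed and pitch.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

# One second of a 440 Hz test tone in place of real node output.
sr = 24000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
save_waveform("out.wav", tone, sample_rate=sr)
```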
Chatterbox TTS 📢 Usage Tips:
- To achieve the best results, ensure that your input text is clear and well-structured, as this directly impacts the quality of the generated speech.
- Experiment with the temperature and top_p parameters to find the right balance between creativity and coherence in the generated speech.
- Use the audio_prompt_path parameter to match the voice characteristics of a specific speaker, enhancing the personalization of the generated audio.
Chatterbox TTS 📢 Common Errors and Solutions:
Error during TTS generation
- Explanation: This error occurs when there is an issue during the text-to-speech generation process, possibly due to incorrect input parameters or model configuration.
- Solution: Check the input parameters for any inconsistencies or errors. Ensure that the audio prompt path, if used, is valid and accessible. Review the model configuration and ensure that all necessary files and dependencies are correctly loaded.
Please prepare_conditionals first or specify audio_prompt_path
- Explanation: This error indicates that the node requires conditional preparation or an audio prompt path to proceed with the speech generation.
- Solution: Either prepare the necessary conditionals using the appropriate method or provide a valid audio prompt path to guide the speech generation process.
