Convert written text to spoken dialogue with customizable voices and expressions for dynamic audio content creation.
The Dia text to speech node converts written text into spoken dialogue, leveraging an open-weights model that gives users full control over scripts and voices. It is particularly useful for AI artists and developers who want to integrate realistic, customizable speech synthesis into their projects. The node generates high-quality audio that can include vocal expression tags, such as laughter or sighs, to enhance the naturalness and expressiveness of the dialogue. Its flexibility and ease of use make it well suited to creating dynamic audio content for artistic, educational, or entertainment purposes.
The model_path parameter specifies the file path to the pre-trained model used for text-to-speech conversion. The default path is models/Dia/dia-v0_1.pth. It is crucial for loading the correct model weights needed to generate speech.
The seed parameter is an integer that initializes the random number generator, ensuring reproducibility of the audio output. It has a default value of 12345, with a minimum of 0 and a maximum defined by MAX_SEED. Adjusting the seed can lead to variations in the generated speech.
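The general principle behind the seed parameter can be illustrated with Python's standard random module (a sketch of the concept, not the node's internal seeding code):

```python
import random

def generate_values(seed, n=5):
    """Return n pseudo-random values from a generator initialized with seed."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.random() for _ in range(n)]

# Identical seeds reproduce identical outputs; changing the seed
# produces a different (but still reproducible) variation.
a = generate_values(12345)
b = generate_values(12345)
c = generate_values(54321)
assert a == b  # same seed -> same output
assert a != c  # different seed -> variation
```

The same logic applies to the node: rerunning with an unchanged seed reproduces the same audio, while sweeping seeds explores alternative renditions of the same script.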
This boolean parameter determines whether the generated audio should be saved as a file. By default, it is set to True, meaning the audio will be saved automatically.
The filename_prefix parameter is a string that sets the prefix for the saved audio file's name. The default prefix is audio/dia, which helps in organizing and identifying the generated audio files.
The speech parameter is a multiline string input where you can specify the text to be converted into speech. It includes a default script with multiple speakers and expressions, allowing for a demonstration of the node's capabilities. This parameter is essential for defining the content of the audio output.
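As an illustration, a multi-speaker script with expression tags might look like the following. Note that the exact speaker-tag syntax shown here ([S1]/[S2]) is an assumption for illustration; consult the node's default script for the format it actually expects:

```python
# Hypothetical dialogue script: bracketed tags mark speaker turns,
# parenthesized tags (see available_tags) add vocal expressions.
speech = (
    "[S1] Welcome back to the show! (laughs) "
    "[S2] Thanks for having me. (sighs) It has been a long week. "
    "[S1] Well, let's dive right in."
)
```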
The cfg_scale parameter is a float that controls the classifier-free guidance (CFG) scale, influencing how strongly the model conditions on the input during speech generation. It ranges from 0.0 to 10.0, with a default value of 3.0. Adjusting this scale can affect the creativity and variability of the generated speech.
The temperature parameter is a float that affects the randomness of the speech generation process. It has a range from 0.0 to 10.0, with a default value of 1.3. Higher values result in more diverse outputs, while lower values produce more deterministic results.
The top_p parameter, a float ranging from 0.0 to 10.0 with a default of 0.95, is used for nucleus sampling during speech generation. It sets the cumulative probability threshold for selecting the next token, balancing diversity against coherence.
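Conceptually, temperature rescales the model's logits before sampling, and top_p then truncates the candidate pool to the smallest set of tokens whose cumulative probability reaches the threshold. A rough sketch of that sampling step in plain Python (not the node's actual implementation):

```python
import math
import random

def sample_token(logits, temperature=1.3, top_p=0.95, rng=random):
    # Temperature scaling: >1 flattens the distribution, <1 sharpens it.
    scaled = [l / temperature for l in logits]
    # Softmax to probabilities (shifted by the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p) filtering: keep the smallest set of tokens
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the renormalized nucleus.
    mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a low temperature and a small top_p, one dominant candidate crowds out the rest and sampling becomes nearly deterministic; raising either value widens the pool and increases variety.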
A boolean parameter that, when set to True (default), applies a classifier-free guidance (CFG) filter to the speech generation process, potentially improving the quality of the output.
This boolean parameter, defaulting to False, indicates whether to use Torch's compile feature for optimizing the model's performance. Enabling it may speed up the generation process but could increase the initial computation time.
An integer parameter that specifies top-k filtering for the CFG filter, with a default value of 35 and a range from 0 to 100. It restricts sampling to the k most likely candidates, refining token selection during speech generation.
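Top-k filtering is simpler than top-p: only the k highest-probability candidates are kept before sampling. A minimal sketch of the idea (hypothetical helper, not the node's internal code):

```python
def top_k_filter(probs, k=35):
    """Zero out all but the k largest probabilities and renormalize."""
    if k <= 0 or k >= len(probs):
        return probs[:]  # in this sketch, k=0 means no filtering
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# e.g. keeping only the 2 most likely of four candidates
# redistributes all mass onto those two:
# top_k_filter([0.4, 0.3, 0.2, 0.1], k=2) -> [0.571..., 0.428..., 0.0, 0.0]
```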
This optional parameter allows you to provide an audio file as input, which can be used as a reference or prompt for the speech generation process.
A multiline string parameter for inputting the transcript of the provided audio. It is optional and can be used to align the generated speech with the input audio.
This parameter lists the available vocal expression tags, such as (laughs) or (sighs), that can be included in the speech text to enhance expressiveness. It is a multiline string with a default set of tags.
The primary output of the Dia text to speech node is the generated audio, which is an array representing the synthesized speech. This output is crucial for applications requiring realistic and expressive audio content, as it embodies the text input transformed into spoken dialogue.
Usage tips:
- Experiment with different seed values to explore variations in the generated speech and find the most suitable output for your project.
- Use the tags listed in available_tags to add expressive elements to your speech, making it more engaging and natural.
- Tune the temperature and top_p parameters to balance between creativity and coherence in the speech output, depending on the desired level of randomness.
- Enable use_torch_compile for potentially faster performance, especially when generating longer audio sequences.

Common errors and solutions:
- Model file not found: occurs when model_path does not point to a valid file. Verify that model_path is correct and that the model file exists at the specified location.
- Invalid seed value: ensure the seed is between 0 and MAX_SEED and adjust it accordingly.
- Audio could not be saved: check that the directory implied by filename_prefix exists and has write permissions.
- cfg_scale out of range: occurs when the cfg_scale parameter is set outside its valid range. Set cfg_scale to be within the range of 0.0 to 10.0.
- Errors during compiled generation: try disabling the use_torch_compile feature.