Generate multi-speaker speech with dynamic voices for immersive audio experiences in AI projects.
The VibeVoiceMultipleSpeakersNode is designed to generate multi-speaker speech from text using the VibeVoice system. This node is particularly useful for creating dynamic and engaging audio content where multiple speakers are involved, such as dialogues or interviews. By leveraging advanced speech synthesis models, it allows you to assign different voices to different speakers within a text, providing a rich and immersive auditory experience. The node supports up to four speakers, and you can either provide actual voice samples for each speaker or let the system generate synthetic voices. This flexibility makes it a powerful tool for AI artists looking to add realistic and varied vocal elements to their projects.
This parameter accepts a string containing the text to be converted into speech, with speaker labels in the format [N]:, where N is a number from 1 to 4. The labels let you specify which part of the text is spoken by which speaker, and the text can span multiple lines. This parameter is crucial for defining the dialogue structure and ensuring that the correct voice is assigned to each part of the text.
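The default example is a short exchange between two speakers, with each labeled utterance on its own line:

```
[1]: Hello, this is the first speaker.
[2]: Hi there, I'm the second speaker.
[1]: Nice to meet you!
[2]: Nice to meet you too!
```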
This parameter allows you to select a speech synthesis model from the available options in the ComfyUI/models/vibevoice/ folder. The default model is recommended, but you can choose others if available. The tooltip suggests that a large model is recommended for multi-speaker scenarios to ensure high-quality output. The choice of model can significantly impact the quality and characteristics of the generated speech.
This parameter specifies the attention implementation used during speech synthesis. Options include auto, eager, sdpa, flash_attention_2, and sage. The default is auto, which automatically selects the best available option. Each type involves trade-offs: for example, sdpa uses PyTorch's built-in scaled dot-product attention, while flash_attention_2 can be faster but requires a compatible GPU. The choice of attention type can affect the performance and speed of the node.
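As a rough illustration of how such a setting is typically applied, the sketch below assumes the model is loaded through Hugging Face transformers and passes the chosen option as attn_implementation. The model-loading call, the auto fallback logic, and the omission of sage (which usually comes from the separate sageattention package) are assumptions for illustration, not the node's actual code.

```python
# Illustrative sketch only: applying an attention_type choice when loading a
# model with Hugging Face transformers. Names and fallback logic are assumed.
import torch
from transformers import AutoModel

def load_speech_model(model_path: str, attention_type: str = "auto"):
    if attention_type == "auto":
        try:
            import flash_attn  # noqa: F401 -- present only if flash-attn is installed
            attention_type = "flash_attention_2"  # prefer FlashAttention 2 on supported GPUs
        except ImportError:
            attention_type = "sdpa"  # PyTorch's built-in scaled dot-product attention
    return AutoModel.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation=attention_type,  # "eager", "sdpa", or "flash_attention_2"
    )
```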
This optional parameter allows you to provide an audio sample for Speaker 1. If not provided, a synthetic voice will be used. This parameter is useful if you want to maintain consistency with a specific voice or if you have a preferred voice sample for the speaker.
Similar to speaker1_voice, this optional parameter allows you to provide an audio sample for Speaker 2. If not provided, a synthetic voice will be used. This helps in customizing the voice for the second speaker in your text.
This optional parameter allows you to provide an audio sample for Speaker 3. If not provided, a synthetic voice will be used. This is useful for dialogues involving three speakers, allowing for more personalized voice assignments.
This optional parameter allows you to provide an audio sample for Speaker 4. If not provided, a synthetic voice will be used. This parameter is essential for scenarios involving four speakers, ensuring each has a distinct voice.
This optional parameter accepts a LoRA configuration from the VibeVoice LoRA node. It allows for additional customization and fine-tuning of the speech synthesis process, potentially enhancing the quality and characteristics of the generated voices.
This parameter is a float value used when sampling is enabled, with a default of 0.95. It ranges from 0.1 to 2.0, with a step of 0.05. The temperature controls the randomness of the speech generation, with lower values resulting in more deterministic outputs and higher values allowing for more variation.
This parameter is a float value used when sampling is enabled, with a default of 0.95. It ranges from 0.1 to 1.0, with a step of 0.05. The top-p parameter, also known as nucleus sampling, controls the diversity of the generated speech by limiting the sampling to a subset of the most probable outputs.
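To make the effect of these two settings concrete, here is a minimal, self-contained PyTorch sketch of temperature scaling and nucleus (top-p) sampling over a vector of logits. It is purely illustrative and is not the node's internal sampler.

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 0.95, top_p: float = 0.95) -> int:
    # Lower temperature sharpens the distribution (more deterministic output);
    # higher temperature flattens it (more variation).
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus (top-p) sampling: keep only the smallest set of candidates whose
    # cumulative probability reaches top_p, then renormalize and sample.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # the top candidate is always kept
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()

# Example: sample from a toy 8-way distribution.
token = sample_token(torch.randn(8), temperature=0.95, top_p=0.95)
```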
This parameter is a float value that adjusts the speed of the generated speech, with a default of 1.0. It ranges from 0.8 to 1.2, with a step of 0.01. A value of 1.0 represents normal speed, values less than 1.0 slow down the speech, and values greater than 1.0 speed it up. This parameter applies to all speakers and can be used to match the desired pacing of the dialogue.
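The node's internal speed-adjustment method is not detailed here; the hypothetical sketch below simply illustrates the direction of the parameter using librosa time-stretching, where a factor above 1.0 shortens (speeds up) the audio and a factor below 1.0 lengthens (slows) it.

```python
# Illustrative only: one way to apply a global speed factor to generated speech
# with librosa time-stretching. The node's actual method may differ.
import numpy as np
import librosa

def apply_speed(waveform: np.ndarray, speed_factor: float) -> np.ndarray:
    # rate > 1.0 shortens the audio (faster speech); rate < 1.0 lengthens it.
    return librosa.effects.time_stretch(waveform, rate=speed_factor)

faster = apply_speed(np.random.randn(24000).astype(np.float32), 1.1)  # ~10% faster
slower = apply_speed(np.random.randn(24000).astype(np.float32), 0.9)  # ~10% slower
```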
The output parameter voice_samples is a list of audio samples generated for each speaker in the text. Each sample corresponds to a segment of the text spoken by a specific speaker, as defined by the speaker labels. This output is crucial for obtaining the final multi-speaker audio, allowing you to integrate it into your projects or further process it as needed.
Adjust the voice_speed_factor to match the pacing of your project, especially if the dialogue needs to fit a specific timing. Also keep the speaker labels and voice inputs consistent: the node will flag a mismatch if the text references a different number of speakers than the voice samples you provide.