Generate multi-speaker speech with dynamic voices for immersive audio experiences in AI projects.
The VibeVoiceMultipleSpeakersNode is designed to generate multi-speaker speech from text using the VibeVoice system. This node is particularly useful for creating dynamic and engaging audio content where multiple speakers are involved, such as dialogues or interviews. By leveraging advanced speech synthesis models, it allows you to assign different voices to different speakers within a text, providing a rich and immersive auditory experience. The node supports up to four speakers, and you can either provide actual voice samples for each speaker or let the system generate synthetic voices. This flexibility makes it a powerful tool for AI artists looking to add realistic and varied vocal elements to their projects.
This parameter accepts a string containing the text to be converted into speech, with speaker labels in the format [N]:, where N is a number from 1 to 4. The labels let you specify which part of the text is spoken by which speaker, and the text can span multiple lines. This parameter is crucial for defining the dialogue structure and ensuring that the correct voice is assigned to each part of the text.
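The default example is a short exchange between two speakers, with each labeled utterance on its own line:

```
[1]: Hello, this is the first speaker.
[2]: Hi there, I'm the second speaker.
[1]: Nice to meet you!
[2]: Nice to meet you too!
```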
This parameter allows you to select a speech synthesis model from the available options in the ComfyUI/models/vibevoice/ folder. The default model is recommended, but you can choose others if available. The tooltip suggests that a large model is recommended for multi-speaker scenarios to ensure high-quality output. The choice of model can significantly impact the quality and characteristics of the generated speech.
This parameter specifies the attention implementation used during speech synthesis. Options include auto, eager, sdpa, flash_attention_2, and sage. The default is auto, which automatically selects the best available option. Each type involves trade-offs: for example, sdpa uses PyTorch's built-in scaled dot-product attention, while flash_attention_2 can be faster but requires a compatible GPU. The choice of attention type can affect the performance and speed of the node.
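As a rough illustration of how such a setting is typically applied, the sketch below assumes the model is loaded through Hugging Face transformers and passes the chosen option as attn_implementation. The model-loading call, the auto fallback logic, and the omission of sage (which usually comes from the separate sageattention package) are assumptions for illustration, not the node's actual code.

```python
# Illustrative sketch only: applying an attention_type choice when loading a
# model with Hugging Face transformers. Names and fallback logic are assumed.
import torch
from transformers import AutoModel

def load_speech_model(model_path: str, attention_type: str = "auto"):
    if attention_type == "auto":
        try:
            import flash_attn  # noqa: F401 -- present only if flash-attn is installed
            attention_type = "flash_attention_2"  # prefer FlashAttention 2 on supported GPUs
        except ImportError:
            attention_type = "sdpa"  # PyTorch's built-in scaled dot-product attention
    return AutoModel.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation=attention_type,  # "eager", "sdpa", or "flash_attention_2"
    )
```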
This optional parameter allows you to provide an audio sample for Speaker 1. If not provided, a synthetic voice will be used. This parameter is useful if you want to maintain consistency with a specific voice or if you have a preferred voice sample for the speaker.
Similar to speaker1_voice, this optional parameter allows you to provide an audio sample for Speaker 2. If not provided, a synthetic voice will be used. This helps in customizing the voice for the second speaker in your text.
This optional parameter allows you to provide an audio sample for Speaker 3. If not provided, a synthetic voice will be used. This is useful for dialogues involving three speakers, allowing for more personalized voice assignments.
This optional parameter allows you to provide an audio sample for Speaker 4. If not provided, a synthetic voice will be used. This parameter is essential for scenarios involving four speakers, ensuring each has a distinct voice.
This optional parameter accepts a LoRA configuration from the VibeVoice LoRA node. It allows for additional customization and fine-tuning of the speech synthesis process, potentially enhancing the quality and characteristics of the generated voices.
This parameter is a float value used when sampling is enabled, with a default of 0.95. It ranges from 0.1 to 2.0, with a step of 0.05. The temperature controls the randomness of the speech generation, with lower values resulting in more deterministic outputs and higher values allowing for more variation.
This parameter is a float value used when sampling is enabled, with a default of 0.95. It ranges from 0.1 to 1.0, with a step of 0.05. The top-p parameter, also known as nucleus sampling, controls the diversity of the generated speech by limiting the sampling to a subset of the most probable outputs.
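To make the effect of these two settings concrete, here is a minimal, self-contained PyTorch sketch of temperature scaling and nucleus (top-p) sampling over a vector of logits. It is purely illustrative and is not the node's internal sampler.

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 0.95, top_p: float = 0.95) -> int:
    # Lower temperature sharpens the distribution (more deterministic output);
    # higher temperature flattens it (more variation).
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus (top-p) sampling: keep only the smallest set of candidates whose
    # cumulative probability reaches top_p, then renormalize and sample.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # the top candidate is always kept
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()

# Example: sample from a toy 8-way distribution.
token = sample_token(torch.randn(8), temperature=0.95, top_p=0.95)
```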
This parameter is a float value that adjusts the speed of the generated speech, with a default of 1.0. It ranges from 0.8 to 1.2, with a step of 0.01. A value of 1.0 represents normal speed, values less than 1.0 slow down the speech, and values greater than 1.0 speed it up. This parameter applies to all speakers and can be used to match the desired pacing of the dialogue.
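The node's internal speed-adjustment method is not detailed here; the hypothetical sketch below simply illustrates the direction of the parameter using librosa time-stretching, where a factor above 1.0 shortens (speeds up) the audio and a factor below 1.0 lengthens (slows) it.

```python
# Illustrative only: one way to apply a global speed factor to generated speech
# with librosa time-stretching. The node's actual method may differ.
import numpy as np
import librosa

def apply_speed(waveform: np.ndarray, speed_factor: float) -> np.ndarray:
    # rate > 1.0 shortens the audio (faster speech); rate < 1.0 lengthens it.
    return librosa.effects.time_stretch(waveform, rate=speed_factor)

faster = apply_speed(np.random.randn(24000).astype(np.float32), 1.1)  # ~10% faster
slower = apply_speed(np.random.randn(24000).astype(np.float32), 0.9)  # ~10% slower
```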
The output parameter voice_samples is a list of audio samples generated for each speaker in the text. Each sample corresponds to a segment of the text spoken by a specific speaker, as defined by the speaker labels. This output is crucial for obtaining the final multi-speaker audio, allowing you to integrate it into your projects or further process it as needed.
Adjust the voice_speed_factor to match the pacing of your project, especially if the dialogue needs to fit a specific timing. Also keep the speaker labels and voice inputs consistent: the node will flag a mismatch if the text references a different number of speakers than the voice samples you provide.