Note:
This is the upgraded Multi-Person version of our ComfyUI MultiTalk workflow.
It now supports multi-person conversational video generation while still including the single-person mode from our previous version.
The workflow is ideal for social content, product explainers, character dialogue, and rapid previz. It pairs MultiTalk audio embeddings with video diffusion so the lips, jaw, and subtle facial cues follow speech. Use it as a drop-in path for both MeiGen MultiTalk multi-speaker scenes and lean, single-speaker clips.
Wan 2.1 video diffusion model
Drives the core text- and image-conditioned video generation. It handles scene appearance, camera, and motion while accepting additional guidance for conversation dynamics.
Wav2Vec 2.0
Extracts robust speech representations that MultiTalk converts into talking-specific embeddings.
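For readers who want to see what this step looks like outside the graph, here is a minimal sketch of Wav2Vec 2.0 feature extraction using the Hugging Face transformers implementation. The checkpoint name, file path, and 16 kHz resampling are assumptions for illustration; inside the workflow, the MultiTalk nodes handle this for you.

```python
# Sketch: extract Wav2Vec 2.0 speech features for one dialogue clip.
# Checkpoint and file name are placeholder assumptions.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

waveform, sr = torchaudio.load("speaker1.wav")                     # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)  # mono, 16 kHz

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state   # (1, time_steps, 768)
print(features.shape)
```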
MultiTalk (MeiGen-AI)
Research method for audio-driven multi-person conversational video.
ComfyUI Wan Video Wrapper
ComfyUI integration that exposes Wan 2.1 loading, encoders, and the video sampler, plus the MultiTalk embedding node.
Index-TTS (optional)
Text-to-speech with voice reference for generating clean dialogue tracks inside the workflow.
This workflow runs end-to-end: you prepare speakers and audio, set a short scene prompt, then render. It supports both multi-person and single-person setups. Groups in the graph keep things organized; the most important ones are described below.
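If you prefer to drive renders from a script rather than the UI, a minimal sketch along these lines queues the graph through ComfyUI's HTTP API. It assumes you exported the workflow with "Save (API Format)" as multitalk_api.json and that ComfyUI is listening on the default 127.0.0.1:8188; file names and node IDs are whatever your own export contains.

```python
# Sketch: queue an exported (API-format) workflow on a local ComfyUI instance.
import json
import urllib.request

with open("multitalk_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))   # response includes a prompt_id you can poll
```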
Load identity images for your speaker faces and preview masks, then mux the final frames with audio. The LoadImage nodes accept your portraits, while VHS_VideoCombine assembles the rendered frames with the selected audio track into an MP4. You can scrub audio with PreviewAudio during setup to confirm levels and duration.
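Outside the graph, the mux step that VHS_VideoCombine performs is roughly equivalent to an ffmpeg call like the sketch below; the frame rate, file names, and codecs are illustrative assumptions.

```python
# Sketch: join rendered PNG frames with the chosen audio track into an MP4.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-framerate", "25",                 # match the workflow's output fps
        "-i", "frames/frame_%05d.png",      # rendered frame sequence
        "-i", "dialogue_mix.wav",           # the selected audio track
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                        # stop at the shorter of video/audio
        "multitalk_clip.mp4",
    ],
    check=True,
)
```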
Get_WanModel, Get_WanTextEncoder, and WanVideoModelLoader initialize Wan 2.1 along with the text and VAE components. Think of this as the engine room: once loaded, the video sampler can accept image, text, and conversation embeddings. You rarely need to change anything here beyond ensuring the correct Wan weights are selected.
You can bring your own dialogue tracks or synthesize them. Use LoadAudio to import each speaker's line; if a clip is mixed with music or noise, pass it through AudioSeparation and route the clean Vocals output forward. Alternatively, use Speaker 1 - Text and Speaker 2 - Text with IndexTTSNode to synthesize voices from typed lines, optionally giving reference_audio for the desired timbre.

MultiTalkWav2VecEmbeds converts speech into MultiTalk embeddings that capture timing and articulation cues for each speaker. Feed it one audio stream for single-person, or two streams for multi-person dialogue. If your scene needs face-specific targeting, provide clean face masks as ref_target_masks so each voice drives the correct person.
A short scene prompt via Prompt and WanVideoTextEncodeSingle sets the visual mood and environment. Keep prompts concise and descriptive (location, tone, lighting); for example, "two friends chatting in a sunlit kitchen, soft morning light, shallow depth of field." The text encoder generates semantic guidance that Wan uses alongside identity and conversation signals.
The Uni3C group prepares global context embeddings that help stabilize identity, framing, and composition over time. The Resize group ensures source images and masks are scaled to model-friendly dimensions so the sampler receives consistent inputs.
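As a rough illustration of what the Resize group guarantees, the sketch below scales a portrait to even, model-friendly dimensions. The 832x480 target and multiple-of-16 constraint are assumptions for illustration; defer to whatever the workflow's resize nodes are actually configured to.

```python
# Sketch: scale a source image to dimensions the sampler can accept.
from PIL import Image

def resize_to_multiple(path: str, target_w: int = 832, target_h: int = 480, multiple: int = 16) -> Image.Image:
    # Snap the target size down to the nearest multiple (assumed constraint).
    w = (target_w // multiple) * multiple
    h = (target_h // multiple) * multiple
    img = Image.open(path).convert("RGB")
    return img.resize((w, h), Image.LANCZOS)

portrait = resize_to_multiple("speaker1.png")
portrait.save("speaker1_resized.png")
```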
WanVideoSampler is where everything meets: identity image embeddings, text embeddings, and MultiTalk audio embeddings combine to produce the final frames. The downstream Sampling processing group applies any post-processing steps needed for smoothness and consistency before handoff to the video combiner.
For multi-person clips, draw one mask per face in ComfyUI’s mask editor. Keep masks separated so they never touch. If you only provide one mask and one audio track, the workflow automatically behaves as a single-person MultiTalk setup.
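The sketch below illustrates the non-overlap requirement with two placeholder rectangular masks. In practice you draw the masks in ComfyUI's mask editor; the frame size and rectangles here are assumptions.

```python
# Sketch: two binary face masks, one per speaker, that never overlap.
import numpy as np
from PIL import Image

H, W = 480, 832
mask_left = np.zeros((H, W), dtype=np.uint8)
mask_right = np.zeros((H, W), dtype=np.uint8)
mask_left[80:400, 60:360] = 255      # speaker 1's face region (placeholder box)
mask_right[80:400, 470:770] = 255    # speaker 2's face region (placeholder box)

# The two regions must stay separated so each voice drives only its own face.
assert not np.any((mask_left > 0) & (mask_right > 0)), "masks must not touch"

Image.fromarray(mask_left).save("speaker1_mask.png")
Image.fromarray(mask_right).save("speaker2_mask.png")
```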
MultiTalkWav2VecEmbeds (#79/#162) Converts one or more dialogue tracks into MultiTalk conversation embeddings. Start with one audio input for single-person or two for multi-person; add masks when you need per-face routing. Adjust only what matters: the number of frames to match your planned clip length, and whether to provide ref_target_masks for precise speaker-to-face alignment.
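To pick a frame count that matches your audio, a small helper like the sketch below works; the 25 fps value and file names are assumptions, so use whatever frame rate your output node is set to.

```python
# Sketch: derive a frame count from the longest dialogue track.
import math
import torchaudio

def frames_for(audio_path: str, fps: int = 25) -> int:
    info = torchaudio.info(audio_path)
    duration_s = info.num_frames / info.sample_rate
    return math.ceil(duration_s * fps)

num_frames = max(frames_for("speaker1.wav"), frames_for("speaker2.wav"))
print(num_frames)
```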
AudioSeparation (#88/#160/#161) Optional cleanup for noisy inputs. Route your noisy clip into this node and forward the Vocals output. Use it when field recordings include background music or chatter; skip it if you already have clean voice tracks.
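If you would rather clean up audio before importing it, a stand-in for this separation step is the open-source Demucs separator, sketched below. Demucs is an assumption here, not necessarily the backend the AudioSeparation node uses, and requires pip install demucs.

```python
# Sketch: isolate vocals from a noisy recording with the Demucs CLI.
import subprocess
from pathlib import Path

noisy_clip = Path("speaker1_raw.wav")
subprocess.run(
    ["demucs", "--two-stems=vocals", "-o", "separated", str(noisy_clip)],
    check=True,
)
# Demucs writes vocals.wav / no_vocals.wav under separated/<model>/<track name>/;
# route the vocals file forward as the clean dialogue track.
```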
IndexTTSNode (#163/#164) Turns Speaker 1 - Text and Speaker 2 - Text into dialogue audio. Provide a short reference_audio to clone tone and pacing, then supply text lines. Keep sentences brief and natural for best lip timing in MultiTalk.
WanVideoTextEncodeSingle (#18) Encodes your scene prompt for Wan 2.1. Favor simple, concrete descriptions of place, lighting, and style. Avoid long lists; one or two sentences are enough for the sampler to work with.
Original Research: MultiTalk is developed by MeiGen-AI with collaboration from leading researchers in the field. The original paper "Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation" presents the research behind this technology.
ComfyUI Integration: The ComfyUI implementation is provided by Kijai through the ComfyUI-WanVideoWrapper repository, making this advanced technology accessible to the broader creative community.
Base Technology: Built upon the Wan2.1 video diffusion model and incorporates audio processing techniques from Wav2Vec, representing a synthesis of cutting-edge AI research.