Note:
This is the upgraded Multi-Person version of our ComfyUI MultiTalk workflow.
It now supports multi-person conversational video generation while still including the single-person mode from our previous version.
The workflow is ideal for social content, product explainers, character dialogue, and rapid previz. It pairs MultiTalk audio embeddings with video diffusion so the lips, jaw, and subtle facial cues follow speech. Use it as a drop-in path for both MeiGen MultiTalk multi-speaker scenes and lean, single-speaker clips.
Wan 2.1 video diffusion model
Drives the core text- and image-conditioned video generation. It handles scene appearance, camera, and motion while accepting additional guidance for conversation dynamics.
Wav2Vec 2.0
Extracts robust speech representations that MultiTalk converts into talking-specific embeddings.
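For readers who want to see what this step looks like outside the graph, here is a minimal sketch of Wav2Vec 2.0 feature extraction using the Hugging Face transformers implementation. The checkpoint name, file path, and 16 kHz resampling are assumptions for illustration; inside the workflow, the MultiTalk nodes handle this for you.

```python
# Sketch: extract Wav2Vec 2.0 speech features for one dialogue clip.
# Checkpoint and file name are placeholder assumptions.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

waveform, sr = torchaudio.load("speaker1.wav")                     # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)  # mono, 16 kHz

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state   # (1, time_steps, 768)
print(features.shape)
```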
MultiTalk (MeiGen-AI)
Research method for audio-driven multi-person conversational video.
ComfyUI Wan Video Wrapper
ComfyUI integration that exposes Wan 2.1 loading, encoders, and the video sampler, plus the MultiTalk embedding node.
Index-TTS (optional)
Text-to-speech with voice reference for generating clean dialogue tracks inside the workflow.
This workflow runs end-to-end: you prepare speakers and audio, set a short scene prompt, then render. It supports both multi-person and single-person setups. Groups in the graph keep things organized; the most important ones are described below.
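If you prefer to drive renders from a script rather than the UI, a minimal sketch along these lines queues the graph through ComfyUI's HTTP API. It assumes you exported the workflow with "Save (API Format)" as multitalk_api.json and that ComfyUI is listening on the default 127.0.0.1:8188; file names and node IDs are whatever your own export contains.

```python
# Sketch: queue an exported (API-format) workflow on a local ComfyUI instance.
import json
import urllib.request

with open("multitalk_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))   # response includes a prompt_id you can poll
```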
Load identity images for your speaker faces and preview masks, then mux the final frames with audio. The LoadImage nodes accept your portraits, while VHS_VideoCombine assembles the rendered frames with the selected audio track into an MP4. You can scrub audio with PreviewAudio during setup to confirm levels and duration.
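Outside the graph, the mux step that VHS_VideoCombine performs is roughly equivalent to an ffmpeg call like the sketch below; the frame rate, file names, and codecs are illustrative assumptions.

```python
# Sketch: join rendered PNG frames with the chosen audio track into an MP4.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-framerate", "25",                 # match the workflow's output fps
        "-i", "frames/frame_%05d.png",      # rendered frame sequence
        "-i", "dialogue_mix.wav",           # the selected audio track
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                        # stop at the shorter of video/audio
        "multitalk_clip.mp4",
    ],
    check=True,
)
```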
Get_WanModel, Get_WanTextEncoder, and WanVideoModelLoader initialize Wan 2.1 along with the text and VAE components. Think of this as the engine room: once loaded, the video sampler can accept image, text, and conversation embeddings. You rarely need to change anything here beyond ensuring the correct Wan weights are selected.
You can bring your own dialogue tracks or synthesize them. Use LoadAudio to import each speaker's line; if a clip is mixed with music or noise, pass it through AudioSeparation and route the clean Vocals output forward. Alternatively, use Speaker 1 - Text and Speaker 2 - Text with IndexTTSNode to synthesize voices from typed lines, optionally giving reference_audio for the desired timbre.

MultiTalkWav2VecEmbeds converts speech into MultiTalk embeddings that capture timing and articulation cues for each speaker. Feed it one audio stream for single-person, or two streams for multi-person dialogue. If your scene needs face-specific targeting, provide clean face masks as ref_target_masks so each voice drives the correct person.
A short scene prompt via Prompt and WanVideoTextEncodeSingle sets the visual mood and environment. Keep prompts concise and descriptive (location, tone, lighting); for example, "two friends chatting in a sunlit kitchen, soft morning light, shallow depth of field." The text encoder generates semantic guidance that Wan uses alongside identity and conversation signals.
The Uni3C group prepares global context embeddings that help stabilize identity, framing, and composition over time. The Resize group ensures source images and masks are scaled to model-friendly dimensions so the sampler receives consistent inputs.
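As a rough illustration of what the Resize group guarantees, the sketch below scales a portrait to even, model-friendly dimensions. The 832x480 target and multiple-of-16 constraint are assumptions for illustration; defer to whatever the workflow's resize nodes are actually configured to.

```python
# Sketch: scale a source image to dimensions the sampler can accept.
from PIL import Image

def resize_to_multiple(path: str, target_w: int = 832, target_h: int = 480, multiple: int = 16) -> Image.Image:
    # Snap the target size down to the nearest multiple (assumed constraint).
    w = (target_w // multiple) * multiple
    h = (target_h // multiple) * multiple
    img = Image.open(path).convert("RGB")
    return img.resize((w, h), Image.LANCZOS)

portrait = resize_to_multiple("speaker1.png")
portrait.save("speaker1_resized.png")
```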
WanVideoSampler is where everything meets: identity image embeddings, text embeddings, and MultiTalk audio embeddings combine to produce the final frames. The downstream Sampling processing group applies any post-processing steps needed for smoothness and consistency before handoff to the video combiner.
For multi-person clips, draw one mask per face in ComfyUI’s mask editor. Keep masks separated so they never touch. If you only provide one mask and one audio track, the workflow automatically behaves as a single-person MultiTalk setup.
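The sketch below illustrates the non-overlap requirement with two placeholder rectangular masks. In practice you draw the masks in ComfyUI's mask editor; the frame size and rectangles here are assumptions.

```python
# Sketch: two binary face masks, one per speaker, that never overlap.
import numpy as np
from PIL import Image

H, W = 480, 832
mask_left = np.zeros((H, W), dtype=np.uint8)
mask_right = np.zeros((H, W), dtype=np.uint8)
mask_left[80:400, 60:360] = 255      # speaker 1's face region (placeholder box)
mask_right[80:400, 470:770] = 255    # speaker 2's face region (placeholder box)

# The two regions must stay separated so each voice drives only its own face.
assert not np.any((mask_left > 0) & (mask_right > 0)), "masks must not touch"

Image.fromarray(mask_left).save("speaker1_mask.png")
Image.fromarray(mask_right).save("speaker2_mask.png")
```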
MultiTalkWav2VecEmbeds (#79/#162) Converts one or more dialogue tracks into MultiTalk conversation embeddings. Start with one audio input for single-person or two for multi-person; add masks when you need per-face routing. Adjust only what matters: the number of frames to match your planned clip length, and whether to provide ref_target_masks for precise speaker-to-face alignment.
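To pick a frame count that matches your audio, a small helper like the sketch below works; the 25 fps value and file names are assumptions, so use whatever frame rate your output node is set to.

```python
# Sketch: derive a frame count from the longest dialogue track.
import math
import torchaudio

def frames_for(audio_path: str, fps: int = 25) -> int:
    info = torchaudio.info(audio_path)
    duration_s = info.num_frames / info.sample_rate
    return math.ceil(duration_s * fps)

num_frames = max(frames_for("speaker1.wav"), frames_for("speaker2.wav"))
print(num_frames)
```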
AudioSeparation (#88/#160/#161) Optional cleanup for noisy inputs. Route your noisy clip into this node and forward the Vocals output. Use it when field recordings include background music or chatter; skip it if you already have clean voice tracks.
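If you would rather clean up audio before importing it, a stand-in for this separation step is the open-source Demucs separator, sketched below. Demucs is an assumption here, not necessarily the backend the AudioSeparation node uses, and requires pip install demucs.

```python
# Sketch: isolate vocals from a noisy recording with the Demucs CLI.
import subprocess
from pathlib import Path

noisy_clip = Path("speaker1_raw.wav")
subprocess.run(
    ["demucs", "--two-stems=vocals", "-o", "separated", str(noisy_clip)],
    check=True,
)
# Demucs writes vocals.wav / no_vocals.wav under separated/<model>/<track name>/;
# route the vocals file forward as the clean dialogue track.
```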
IndexTTSNode (#163/#164) Turns Speaker 1 - Text and Speaker 2 - Text into dialogue audio. Provide a short reference_audio to clone tone and pacing, then supply text lines. Keep sentences brief and natural for best lip timing in MultiTalk.
WanVideoTextEncodeSingle (#18) Encodes your scene prompt for Wan 2.1. Favor simple, concrete descriptions of place, lighting, and style. Avoid long lists; one or two sentences are enough for the sampler to work with.
Original Research: MultiTalk is developed by MeiGen-AI with collaboration from leading researchers in the field. The original paper "Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation" presents the research behind this technology.
ComfyUI Integration: The ComfyUI implementation is provided by Kijai through the ComfyUI-WanVideoWrapper repository, making this advanced technology accessible to the broader creative community.
Base Technology: Built upon the Wan2.1 video diffusion model and incorporates audio processing techniques from Wav2Vec, representing a synthesis of cutting-edge AI research.