
Wan 2.2 Animate | Character Swap & Lip-Sync

Workflow Name: RunComfy/Wan-2.2-Animate-Swap-Lip-Sync
Workflow ID: 0000...1307
With this ComfyUI setup, you can transform actors or avatars by automatically matching their expressions, speech, and gestures. It uses pose tracking and face detection to animate replacement characters with synced lip movement and identical body motion. Perfect for video designers wanting realistic swaps and natural animation. You can easily customize speech synchronization and visual accuracy. The process is fully automated yet offers control for nuanced performance tuning. Achieve studio-grade animation quality from single reference clips in minutes. Note: The Medium machine supports input videos of about 10 seconds. If you want to generate longer videos, it’s recommended to use a 2XL or larger machine.

Workflow Tutorial

Wan 2.2 Animate: Swap Characters & Lip-Sync

Swap any on-camera speaker into your own character while keeping motion, expressions, and mouth shapes aligned to the original audio. This ComfyUI workflow, built around Wan 2.2 Animate: Swap Characters & Lip-Sync, detects body pose and face frames from an input video, retargets them to a single reference image, and renders a coherent, speech‑synchronous result.

The workflow suits editors, creators, and researchers who want reliable character replacement for interviews, reels, VTubing, slides, or dubbed shorts. Provide a source clip and one clean reference image; the pipeline recreates pose and lip articulation on the new character and muxes the original soundtrack into the final export.

Key models in the ComfyUI Wan 2.2 Animate: Swap Characters & Lip-Sync workflow

  • Wan 2.2 Animate 14B (FP8 scaled): the core video generator that synthesizes the retargeted character across frames using pose, face, and context signals. Model hub
  • Wan 2.1 VAE (bf16): encodes/decodes video latents used by Wan during sampling and output. Weights
  • UMT5‑XXL Text Encoder (bf16): builds text embeddings for light prompting or shot descriptors. Weights
  • CLIP Vision H: extracts robust image features from the reference portrait to preserve identity. Weights
  • Lightx2v I2V 14B LoRA: improves image‑to‑video stability and fidelity when driving with reference frames. LoRA
  • Wan22 Relight LoRA: helps keep consistent shading and relighting across the shot. LoRA
  • YOLOv10m (ONNX): fast person/face detection used before pose estimation. Model
  • ViTPose WholeBody Large (ONNX): high‑quality skeletal keypoints for full‑body motion transfer. Model
  • Segment Anything 2.1: segmentation for clean foreground masks that guide replacement. Repo

How to use the ComfyUI Wan 2.2 Animate: Swap Characters & Lip-Sync workflow

The graph moves through seven groups: load inputs, build a reference, preprocess pose/face and masks, load generation models, run character replacement, preview diagnostics, then export with audio.

Load video

Import your source clip with VHS_LoadVideo (#63). The node exposes optional width/height for resizing and outputs video frames, audio, and frame count for downstream use. Keep the clip trimmed near the speaking part if you want faster processing. The audio is passed through to the exporter so the final video stays aligned with the original soundtrack.
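
If your source clip is long, trimming it to the spoken segment before loading keeps the frame count small and speeds up every downstream node. The sketch below is one way to do that with ffmpeg from Python; it assumes ffmpeg is on your PATH, and the timestamps and filenames are placeholders, not values from the workflow.

```python
# Minimal pre-trim sketch (assumes ffmpeg is installed and on PATH;
# paths and timestamps are illustrative placeholders).
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-ss", "00:00:03",   # start of the speaking segment
        "-t", "8",           # keep ~8 seconds (Medium machines handle ~10 s)
        "-i", "source.mp4",  # original clip
        "-c:v", "libx264",   # re-encode video for frame-accurate cuts
        "-c:a", "copy",      # keep the original soundtrack untouched
        "trimmed.mp4",
    ],
    check=True,
)
```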

Reference image

Provide a single, clean portrait of the target character. The image is resized with ImageResizeKJv2 (#64) to match your working resolution and stored as the canonical reference used by CLIP Vision and the generator. Favor a sharp, forward‑facing image under lighting similar to your source shot to reduce color and shading drift.
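
As a rough stand-in for what the resize step does, the sketch below scales a portrait to a working resolution while preserving its aspect ratio. This is not the ImageResizeKJv2 internals; the target dimensions and filenames are assumptions for illustration.

```python
# Aspect-preserving resize of the reference portrait (illustrative only;
# 832x480 is an example working resolution, not a value from the graph).
from PIL import Image

ref = Image.open("reference.png").convert("RGB")
target_w, target_h = 832, 480
scale = min(target_w / ref.width, target_h / ref.height)
resized = ref.resize(
    (round(ref.width * scale), round(ref.height * scale)),
    Image.LANCZOS,  # high-quality downsampling filter
)
resized.save("reference_resized.png")
```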

Preprocessing

OnnxDetectionModelLoader (#178) loads YOLO and ViTPose, then PoseAndFaceDetection (#172) analyzes each frame to produce full‑body keypoints and per‑frame face crops. Sam2Segmentation (#104) creates a foreground mask using either detected bounding boxes or keyframe points; if one hint fails, switch to the other for better separation. The mask is refined with GrowMaskWithBlur (#182) and blockified with BlockifyMask (#108) to give the generator a stable, unambiguous subject region. Optional overlays (DrawViTPose (#173) and DrawMaskOnImage (#99)) help you visually verify pose coverage and mask quality before generation.
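
To make the grow/blur/blockify sequence concrete, here is a conceptual OpenCV sketch of the same three operations on a saved mask. It approximates the idea behind GrowMaskWithBlur and BlockifyMask rather than reproducing the node internals; filenames, kernel sizes, and the block size are assumptions.

```python
# Conceptual mask-refinement sketch: grow the SAM mask, feather its edge,
# then quantize it into coarse blocks so the generator sees a stable,
# unambiguous subject region. Parameters are illustrative.
import cv2
import numpy as np

mask = cv2.imread("sam_mask.png", cv2.IMREAD_GRAYSCALE)
assert mask is not None, "mask image not found"

# GrowMaskWithBlur-style step: dilate, then Gaussian-blur the boundary.
grown = cv2.dilate(mask, np.ones((15, 15), np.uint8))
feathered = cv2.GaussianBlur(grown, (21, 21), 0)

# BlockifyMask-style step: downsample to a block grid, threshold, upsample.
block = 32  # assumed block size
h, w = feathered.shape
small = cv2.resize(feathered, (w // block, h // block),
                   interpolation=cv2.INTER_AREA)
blocky = cv2.resize((small > 127).astype(np.uint8) * 255, (w, h),
                    interpolation=cv2.INTER_NEAREST)
cv2.imwrite("refined_mask.png", blocky)
```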

Models

WanVideoModelLoader (#22) loads Wan 2.2 Animate 14B, and WanVideoVAELoader (#38) provides the VAE. Identity features from the reference portrait are encoded by CLIPVisionLoader (#71) and WanVideoClipVisionEncode (#70). Style and stability are tuned with WanVideoLoraSelectMulti (#171), while WanVideoSetLoRAs (#48) and WanVideoSetBlockSwap (#50) apply LoRAs and block‑swap settings to the model; these tools come from the Wan wrapper library. See ComfyUI‑WanVideoWrapper for implementation details.
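
Before a first run it can help to confirm the weights are actually in place. The sketch below checks for files in common ComfyUI model folders; the folder names follow typical ComfyUI conventions and are assumptions here, since exact locations depend on your ComfyUI version and the loader nodes' expectations.

```python
# Sanity-check sketch: verify expected model files exist before launching.
# Folder names are assumed ComfyUI conventions; exact filenames depend on
# which builds of each model you downloaded.
from pathlib import Path

MODELS = Path("ComfyUI/models")
expected = {
    "diffusion_models": "Wan 2.2 Animate 14B (FP8 scaled)",
    "vae": "Wan 2.1 VAE (bf16)",
    "text_encoders": "UMT5-XXL text encoder (bf16)",
    "clip_vision": "CLIP Vision H",
    "loras": "Lightx2v I2V LoRA + Wan22 Relight LoRA",
}
for folder, desc in expected.items():
    files = list((MODELS / folder).glob("*"))
    status = "ok" if files else "MISSING"
    print(f"{folder:18s} {desc:40s} {status}")
```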

Character replacement

WanVideoTextEncodeCached (#65) accepts a short descriptive prompt if you want to nudge appearance or shot mood. WanVideoAnimateEmbeds (#62) fuses the reference image, per‑frame pose, face crops, background, and mask into image embeddings that preserve identity while matching motion and mouth shapes. WanVideoSampler (#27) then renders the frames; its scheduler and steps control the sharpness‑motion tradeoff. The decoded frames from WanVideoDecode (#28) are handed to size/count inspectors so you can confirm dimensions before export.

Result collage

For quick QA, the workflow concatenates the key inputs with ImageConcatMulti (#77, #66) to form a simple comparison strip of the reference, face crops, pose visualization, and a raw frame. Use it to sanity‑check identity cues and mouth shapes right after a test pass.
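
If you prefer to build the same comparison strip outside the graph, the sketch below concatenates saved images horizontally with PIL, similar in spirit to what ImageConcatMulti assembles; the filenames are hypothetical placeholders for the reference, face crop, pose visualization, and a raw frame.

```python
# QA-strip sketch: horizontally concatenate saved images for a quick
# side-by-side identity and mouth-shape check. Paths are placeholders.
from PIL import Image

paths = ["reference_resized.png", "face_crop.png", "pose_vis.png", "frame.png"]
imgs = [Image.open(p).convert("RGB") for p in paths]

# Normalize heights so the strip lines up cleanly.
h = min(im.height for im in imgs)
imgs = [im.resize((round(im.width * h / im.height), h)) for im in imgs]

strip = Image.new("RGB", (sum(im.width for im in imgs), h))
x = 0
for im in imgs:
    strip.paste(im, (x, 0))
    x += im.width
strip.save("qa_strip.png")
```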

Output

VHS_VideoCombine (#30) produces the final video and muxes the original audio for perfect timing. Additional exporters are included so you can save intermediate diagnostics or alt cuts if needed. For best results on longer clips, export a short test first, then iterate on LoRA mixes and masks before committing to a full render.
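
For batch testing outside the browser, ComfyUI also exposes an HTTP endpoint for queueing prompts. The sketch below assumes you have exported this workflow via ComfyUI's "Save (API Format)" and that the server runs at the default local address; adjust the host, port, and filename to your setup.

```python
# Queue an exported API-format workflow against a local ComfyUI server.
# Assumes the default address (127.0.0.1:8188) and that
# "wan_animate_api.json" was saved via "Save (API Format)".
import json
import urllib.request

with open("wan_animate_api.json") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # returned prompt_id confirms the job queued
```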

Key nodes in the ComfyUI Wan 2.2 Animate: Swap Characters & Lip-Sync workflow

VHS_LoadVideo (#63)
Loads frames and the original audio in one step. Use it to set a working resolution that fits your GPU budget and to confirm the frame count that downstream nodes will consume. From ComfyUI‑VideoHelperSuite.

PoseAndFaceDetection (#172)
Runs YOLO and ViTPose to extract person boxes, whole‑body keypoints, and per‑frame face crops. Good keypoints are the backbone of believable motion transfer and are directly reused for lip articulation. From ComfyUI‑WanAnimatePreprocess.

Sam2Segmentation (#104)
Builds a foreground mask around the subject using either bounding boxes or keyframe point hints. If hair or hands are missed, switch hint type or expand the blur/grow settings before blockifying. From ComfyUI‑segment‑anything‑2.

WanVideoLoraSelectMulti (#171)
Lets you mix LoRAs such as Lightx2v and Wan22 Relight to balance motion stability, lighting consistency, and identity strength. Increase a LoRA’s weight for more influence, but watch for over‑stylization on faces. From ComfyUI‑WanVideoWrapper.

WanVideoAnimateEmbeds (#62)
Combines the reference portrait, pose images, face crops, background frames, and mask into a compact representation that conditions Wan 2.2 Animate. Ensure width, height, and num_frames match your intended export to avoid resampling artifacts. From ComfyUI‑WanVideoWrapper.

WanVideoSampler (#27)
Generates the final frames. Use higher steps and a steadier scheduler when you need crisper details, or a lighter schedule for fast previews. For very long clips, you can optionally introduce context‑window controls by wiring in WanVideoContextOptions (#110) to maintain temporal consistency across windows.

VHS_VideoCombine (#30)
Exports the finished video and muxes the original audio so lip movements remain in sync. The trim‑to‑audio option keeps duration aligned with the soundtrack. From ComfyUI‑VideoHelperSuite.

Optional extras

  • Use a sharp, front‑facing reference with neutral lips for the cleanest identity transfer; avoid heavy makeup or occlusions.
  • If segmentation misses hair or accessories, try switching Sam2Segmentation hints between bounding boxes and keyframe points, then slightly grow the mask before blockifying.
  • Lightx2v LoRA improves I2V steadiness; Wan22 Relight LoRA helps match inconsistent lighting. Small weight changes can resolve flicker without over‑baking a look.
  • Block‑swap can reduce identity drift on long shots; if faces soften over time, enable it in WanVideoSetBlockSwap (#50) and retest.
  • Keep working resolution proportional to the source to prevent aspect distortion; upsize only when the reference image is detailed enough to support it. A small sizing helper follows this list.
  • For capable runtimes, enabling torch compile and efficient attention in the wrapper nodes can speed up sampling; see ComfyUI‑WanVideoWrapper for guidance.
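
The helper below picks a working resolution that keeps the source aspect ratio while staying divisible by 16; the divisibility requirement is an assumption common to video latent models, so check your nodes' documentation, and the default long-edge value is illustrative.

```python
# Helper sketch: choose a working resolution that preserves the source
# aspect ratio and keeps both sides divisible by 16 (assumed latent
# constraint; verify against your node's docs).
def working_resolution(src_w: int, src_h: int, target_long: int = 832) -> tuple[int, int]:
    scale = target_long / max(src_w, src_h)
    w = int(round(src_w * scale / 16)) * 16
    h = int(round(src_h * scale / 16)) * 16
    return max(w, 16), max(h, 16)

print(working_resolution(1920, 1080))  # -> (832, 464)
```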

This Wan 2.2 Animate: Swap Characters & Lip-Sync workflow delivers consistent motion transfer and speech‑synchronous mouth shapes with minimal setup, making high‑quality character swaps fast and repeatable inside ComfyUI.

Acknowledgements

This workflow implements and builds upon the following works and resources. We gratefully acknowledge @MDMZ for building the complete workflow; Kijai for WAN 2.2 Animate and the related ComfyUI nodes; Wan-AI for the Wan2.2-Animate assets, including YOLOv10m detection; and Comfy-Org for the Wan 2.1 CLIP Vision model. We thank them all for their contributions and ongoing maintenance. For authoritative details, please refer to the original documentation and repositories linked below.

Resources

  • Workflow Tutorial
    • YouTube: ComfyUI-Tutorial from @MDMZ

Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
