Swap any on-camera speaker into your own character while keeping motion, expressions, and mouth shapes aligned to the original audio. This ComfyUI workflow, built around Wan 2.2 Animate: Swap Characters & Lip-Sync, detects body pose and face frames from an input video, retargets them to a single reference image, and renders a coherent, speech‑synchronous result.
The workflow suits editors, creators, and researchers who want reliable character replacement for interviews, reels, VTubing, slides, or dubbed shorts. Provide a source clip and one clean reference image; the pipeline recreates pose and lip articulation on the new character and muxes the original soundtrack into the final export.
Note: A Medium machine handles input videos of about 10 seconds. For longer videos, use a 2XL or larger machine.
The graph moves through seven groups: load inputs, build a reference, preprocess pose/face and masks, load generation models, run character replacement, preview diagnostics, then export with audio.
Import your source clip with VHS_LoadVideo (#63). The node exposes optional width/height for resizing and outputs video frames, audio, and frame count for downstream use. Trim the clip to the speaking segment you need; shorter clips process faster. The audio is passed through to the exporter so the final video stays aligned with the original soundtrack.
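If your source footage runs long, a quick pre-trim outside ComfyUI keeps you inside the machine's time budget before the clip ever reaches VHS_LoadVideo. The sketch below uses ffmpeg through Python's subprocess; the file names and timestamps are placeholders, and because it stream-copies, the cut may snap to the nearest keyframe.

```python
# Hypothetical pre-trim step before loading the clip with VHS_LoadVideo (#63).
# Paths, start time, and duration are placeholders; stream copy (-c copy)
# is fast but may align the cut to the nearest keyframe.
import subprocess

def trim_clip(src: str, dst: str, start: str = "00:00:03", duration: float = 10.0) -> None:
    """Trim `src` to `duration` seconds starting at `start` without re-encoding."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", start, "-i", src, "-t", str(duration), "-c", "copy", dst],
        check=True,
    )

trim_clip("interview_full.mp4", "interview_10s.mp4")
```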
Provide a single, clean portrait of the target character. The image is resized with ImageResizeKJv2 (#64) to match your working resolution and stored as the canonical reference used by CLIP Vision and the generator. Favor a sharp, forward‑facing image under lighting similar to your source shot to reduce color and shading drift.
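If you want to prepare the portrait locally before uploading it, a minimal Pillow sketch like the one below scales and center-crops it to a working resolution. The 832×480 target is only an example; match whatever resolution you set on ImageResizeKJv2 (#64).

```python
# Optional local prep of the reference portrait. The target size is an
# assumption, not a requirement of the workflow; keep it consistent with
# the resolution configured on ImageResizeKJv2 (#64).
from PIL import Image

TARGET_W, TARGET_H = 832, 480  # example working resolution

img = Image.open("reference_portrait.png").convert("RGB")
# Scale so the image covers the target, then center-crop the excess.
scale = max(TARGET_W / img.width, TARGET_H / img.height)
img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
left = (img.width - TARGET_W) // 2
top = (img.height - TARGET_H) // 2
img.crop((left, top, left + TARGET_W, top + TARGET_H)).save("reference_832x480.png")
```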
OnnxDetectionModelLoader (#178) loads YOLO and ViTPose, then PoseAndFaceDetection (#172) analyzes each frame to produce full‑body keypoints and per‑frame face crops. Sam2Segmentation (#104) creates a foreground mask using either detected bounding boxes or keyframe points; if one hint fails, switch to the other for better separation. The mask is refined with GrowMaskWithBlur (#182) and blockified with BlockifyMask (#108) to give the generator a stable, unambiguous subject region. Optional overlays (DrawViTPose (#173) and DrawMaskOnImage (#99)) help you visually verify pose coverage and mask quality before generation.
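As a mental model for the two refinement steps, the sketch below approximates GrowMaskWithBlur as dilation plus a Gaussian blur and BlockifyMask as snapping the mask to a coarse block grid. The real nodes expose their own parameters; the kernel and block sizes here are placeholders chosen for illustration.

```python
# Rough approximation of the mask-refinement stages, assuming
# GrowMaskWithBlur ~ dilate + Gaussian blur and BlockifyMask ~ block-grid
# snapping. `mask` is a single-channel uint8 array (0 background, 255 subject).
import numpy as np
import cv2

def grow_with_blur(mask: np.ndarray, grow_px: int = 8, blur_px: int = 9) -> np.ndarray:
    kernel = np.ones((grow_px, grow_px), np.uint8)
    grown = cv2.dilate(mask, kernel)
    return cv2.GaussianBlur(grown, (blur_px, blur_px), 0)  # blur_px must be odd

def blockify(mask: np.ndarray, block: int = 32) -> np.ndarray:
    h, w = mask.shape
    out = np.zeros_like(mask)
    for y in range(0, h, block):
        for x in range(0, w, block):
            # Mark the whole block as foreground if any pixel in it is.
            if mask[y:y + block, x:x + block].max() > 127:
                out[y:y + block, x:x + block] = 255
    return out
```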
WanVideoModelLoader (#22) loads Wan 2.2 Animate 14B, and WanVideoVAELoader (#38) provides the VAE. Identity features from the reference portrait are encoded by CLIPVisionLoader (#71) and WanVideoClipVisionEncode (#70). Style and stability are tuned with WanVideoLoraSelectMulti (#171), while WanVideoSetLoRAs (#48) and WanVideoSetBlockSwap (#50) apply LoRAs and block‑swap settings to the model; these tools come from the Wan wrapper library. See ComfyUI‑WanVideoWrapper for implementation details.
WanVideoTextEncodeCached (#65) accepts a short descriptive prompt if you want to nudge appearance or shot mood. WanVideoAnimateEmbeds (#62) fuses the reference image, per‑frame pose, face crops, background, and mask into image embeddings that preserve identity while matching motion and mouth shapes. WanVideoSampler (#27) then renders the frames; its scheduler and steps control the sharpness‑motion tradeoff. The decoded frames from WanVideoDecode (#28) are handed to size/count inspectors so you can confirm dimensions before export.
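Before committing to a full render, it helps to confirm that the numbers you type into WanVideoAnimateEmbeds (#62) agree with the clip you loaded. The sketch below is an offline sanity check; the divisibility-by-16 resolution rule and the 4k + 1 frame-count pattern are common conventions for Wan-style latents, not values quoted from the node's documentation.

```python
# Offline sanity check for the values fed to WanVideoAnimateEmbeds (#62).
# The %16 rule and the 4k + 1 frame count are assumptions based on common
# Wan latent layouts; verify against the node's own tooltips.
fps = 24                     # frame rate reported by VHS_LoadVideo
clip_seconds = 10.0          # trimmed clip length
width, height = 832, 480     # intended export resolution

raw_frames = int(fps * clip_seconds)
num_frames = 4 * round((raw_frames - 1) / 4) + 1   # nearest 4k + 1 count

assert width % 16 == 0 and height % 16 == 0, "pick a latent-friendly resolution"
print(f"{raw_frames} source frames -> set num_frames to {num_frames}")
```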
For quick QA, the workflow concatenates the key inputs with ImageConcatMulti (#77, #66) to form a simple comparison strip of the reference, face crops, pose visualization, and a raw frame. Use it to sanity‑check identity cues and mouth shapes right after a test pass.
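If you prefer to assemble the same kind of strip outside the graph, a short Pillow sketch does the job; the panel file names below are placeholders for images you export from the preview nodes.

```python
# Build a horizontal QA strip from exported diagnostics. File names are
# placeholders; substitute whatever you saved from the preview nodes.
from PIL import Image

panels = ["reference_832x480.png", "face_crop_0001.png",
          "pose_overlay_0001.png", "raw_frame_0001.png"]
imgs = [Image.open(p).convert("RGB") for p in panels]

# Normalize heights, then paste panels side by side.
h = min(i.height for i in imgs)
imgs = [i.resize((round(i.width * h / i.height), h)) for i in imgs]
strip = Image.new("RGB", (sum(i.width for i in imgs), h))
x = 0
for i in imgs:
    strip.paste(i, (x, 0))
    x += i.width
strip.save("qa_strip.png")
```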
VHS_VideoCombine (#30) produces the final video and muxes the original audio for perfect timing. Additional exporters are included so you can save intermediate diagnostics or alt cuts if needed. For best results on longer clips, export a short test first, then iterate on LoRA mixes and masks before committing to a full render.
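Should you save silent diagnostics and need to re-attach the soundtrack by hand, a plain ffmpeg mux reproduces what VHS_VideoCombine (#30) already does inside the graph; the file names below are placeholders.

```python
# Manual re-mux of the original audio onto a silent render. This mirrors,
# not replaces, what VHS_VideoCombine (#30) does in the workflow.
import subprocess

subprocess.run(
    ["ffmpeg", "-y",
     "-i", "swap_render_silent.mp4",   # frames rendered by the sampler
     "-i", "interview_10s.mp4",        # source clip carrying the audio
     "-map", "0:v:0", "-map", "1:a:0",
     "-c:v", "copy", "-c:a", "aac",
     "-shortest", "swap_render_final.mp4"],
    check=True,
)
```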
VHS_LoadVideo (#63)
Loads frames and the original audio in one step. Use it to set a working resolution that fits your GPU budget and to confirm the frame count that downstream nodes will consume. From ComfyUI‑VideoHelperSuite.
PoseAndFaceDetection (#172)
Runs YOLO and ViTPose to extract person boxes, whole‑body keypoints, and per‑frame face crops. Good keypoints are the backbone of believable motion transfer and are directly reused for lip articulation. From ComfyUI‑WanAnimatePreprocess.
Sam2Segmentation (#104)
Builds a foreground mask around the subject using either bounding boxes or keyframe point hints. If hair or hands are missed, switch hint type or expand the blur/grow settings before blockifying. From ComfyUI‑segment‑anything‑2.
WanVideoLoraSelectMulti (#171)
Lets you mix LoRAs such as Lightx2v and Wan22 Relight to balance motion stability, lighting consistency, and identity strength. Increase a LoRA’s weight for more influence, but watch for over‑stylization on faces. From ComfyUI‑WanVideoWrapper.
WanVideoAnimateEmbeds (#62)
Combines the reference portrait, pose images, face crops, background frames, and mask into a compact representation that conditions Wan 2.2 Animate. Ensure width, height, and num_frames match your intended export to avoid resampling artifacts. From ComfyUI‑WanVideoWrapper.
WanVideoSampler (#27)
Generates the final frames. Use higher steps and a steadier scheduler when you need crisper details, or a lighter schedule for fast previews. For very long clips, you can optionally introduce context‑window controls by wiring in WanVideoContextOptions (#110) to maintain temporal consistency across windows.
VHS_VideoCombine (#30)
Exports the finished video and muxes the original audio so lip movements remain in sync. The trim‑to‑audio option keeps duration aligned with the soundtrack. From ComfyUI‑VideoHelperSuite.
If the subject mask is imperfect, switch the Sam2Segmentation hints between bounding boxes and keyframe points, then slightly grow the mask before blockifying. If you run into memory limits, adjust WanVideoSetBlockSwap (#50) and retest.
This Wan 2.2 Animate: Swap Characters & Lip-Sync workflow delivers consistent motion transfer and speech‑synchronous mouth shapes with minimal setup, making high‑quality character swaps fast and repeatable inside ComfyUI.
This workflow implements and builds upon the following works and resources. We gratefully acknowledge @MDMZ for building the full workflow, Kijai for WAN 2.2 Animate and the related ComfyUI nodes, Wan-AI for the Wan2.2-Animate assets including YOLOv10m detection, and Comfy-Org for the Wan 2.1 CLIP Vision model, and we thank them for their contributions and ongoing maintenance. For authoritative details, please refer to the original documentation and repositories linked below.
Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
RunComfy is the premier ComfyUI platform, offering ComfyUI online environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI Playground, enabling artists to harness the latest AI tools to create incredible art.