Wan 2.2 Animate V2 pose‑driven video generation workflow for ComfyUI
Wan 2.2 Animate V2 is a pose‑driven video generation workflow that turns a single reference image plus a driving pose video into a lifelike, identity‑preserving animation. It builds on the first version with higher fidelity, smoother motion, and better temporal consistency, all while closely following full‑body movement and expressions from the source video.
This ComfyUI workflow is designed for creators who want fast, reliable results for character animation, dance clips, and performance‑driven storytelling. It combines robust pre‑processing (pose, face, and subject masking) with the Wan 2.2 model family and optional LoRAs, so you can dial in style, lighting, and background handling with confidence.
Key models in ComfyUI Wan 2.2 Animate V2 workflow
- Wan 2.2 Animate 14B. Core video diffusion model that synthesizes temporally consistent frames from multimodal embeddings. Weights: Kijai/WanVideo_comfy_fp8_scaled (Wan22Animate).
- Wan 2.1 VAE. Latent video decoder/encoder used by the Wan family to reconstruct RGB frames with minimal loss. Weights: Wan2_1_VAE_bf16.safetensors.
- UMT5‑XXL text encoder. Encodes prompts that guide look, scene, and cinematics. Weights: umt5‑xxl‑enc‑bf16.safetensors.
- CLIP Vision (ViT‑H/14). Extracts identity‑preserving features from the reference image. Paper: CLIP.
- ViTPose Whole‑Body (ONNX). Estimates dense body keypoints that drive motion transfer. Models: ViTPose‑L WholeBody and ViTPose‑H WholeBody. Paper: ViTPose.
- YOLOv10 detector. Supplies person boxes to stabilize pose detection and segmentation. Example: yolov10m.onnx.
- Segment Anything 2. High‑quality subject masks for background preservation, compositing, or relighting previews. Repo: facebookresearch/segment-anything-2.
- Optional LoRAs for style and light transport. Useful for relighting and texture detail in Wan 2.2 Animate V2 outputs. Examples: Lightx2v and Wan22_relight.
How to use ComfyUI Wan 2.2 Animate V2 workflow
At a high level, the pipeline extracts pose and face cues from the driving video, encodes identity from a single reference image, optionally isolates the subject with a SAM 2 mask, and then synthesizes a video that matches the motion while preserving identity. The workflow is organized into four groups that collaborate to produce the final result and two convenience outputs for quick QA (pose and mask previews).
Reference Image
This group loads your portrait or full‑body image, resizes it to the target resolution, and makes it available across the graph via Get_reference_image; a preview lets you quickly check framing. Identity features are encoded by WanVideoClipVisionEncode (CLIP Vision) (#70), and the same image feeds WanVideoAnimateEmbeds (#62) as ref_images for stronger identity preservation. Provide a clear, well‑lit reference that matches the subject type in the driver video for best results. Headroom and minimal occlusions help Wan 2.2 Animate V2 lock onto face structure and clothing.
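To see what "resizes it to the target resolution" involves, here is a minimal sketch of aspect‑preserving fit math. This is a hypothetical helper for illustration, not the resize node's actual code; the even‑dimension rounding is an assumption borrowed from common video‑codec requirements.

```python
# Hypothetical sketch: fit a reference image inside a target resolution
# while preserving aspect ratio. The actual ComfyUI resize node has its
# own options; this only illustrates the math.

def fit_resolution(src_w: int, src_h: int, target_w: int, target_h: int) -> tuple[int, int]:
    """Scale (src_w, src_h) to fit inside (target_w, target_h), keeping aspect."""
    scale = min(target_w / src_w, target_h / src_h)
    # Round to even numbers, a common requirement for video codecs.
    new_w = max(2, int(round(src_w * scale / 2)) * 2)
    new_h = max(2, int(round(src_h * scale / 2)) * 2)
    return new_w, new_h
```

For example, a 1080x1920 portrait fit into an 832x480 target scales by 0.25, giving 270x480.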
Preprocessing
The driver video is loaded with VHS_LoadVideo (#191), which exposes frames, audio, frame count, and source fps for later use. Pose and face cues are extracted by OnnxDetectionModelLoader (#178) and PoseAndFaceDetection (#172), then visualized with DrawViTPose (#173) so you can confirm tracking quality. Subject isolation is handled by Sam2Segmentation (#104), followed by GrowMaskWithBlur (#182) and BlockifyMask (#108) to produce a clean, stable mask; a helper DrawMaskOnImage (#99) previews the matte. The group also standardizes width, height, and frame count from the driver video, so Wan 2.2 Animate V2 can match spatial and temporal settings without guesswork. Quick checks export as short videos: a pose overlay and a mask preview for fast validation before a full render.
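The "grow" step behind GrowMaskWithBlur can be pictured as a binary dilation: every masked pixel expands by a radius so the matte fully covers the subject before the edges are softened. The toy below is a conceptual sketch in pure Python, not the node's implementation.

```python
# Conceptual sketch of the "grow" step behind GrowMaskWithBlur:
# dilate a binary mask by `radius` pixels so the matte fully covers
# the subject before edges are blurred. Pure-Python toy, not the
# node's actual tensor implementation.

def grow_mask(mask: list[list[int]], radius: int = 1) -> list[list[int]]:
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Mark every pixel within `radius` (Chebyshev distance).
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out
```

A single masked pixel grows into a 3x3 block with radius 1, which is why a small grow value is usually enough to cover tracking jitter on fast motion.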
Models
WanVideoVAELoader (#38) loads the Wan VAE and WanVideoModelLoader (#22) loads the Wan 2.2 Animate backbone. Optional LoRAs are chosen in WanVideoLoraSelectMulti (#171) and applied via WanVideoSetLoRAs (#48); WanVideoBlockSwap (#51), enabled through WanVideoSetBlockSwap (#50), offloads transformer blocks to system RAM to reduce VRAM usage at some speed cost. Prompts are encoded by WanVideoTextEncodeCached (#65), while WanVideoClipVisionEncode (#70) turns the reference image into robust identity embeddings. WanVideoAnimateEmbeds (#62) fuses the CLIP features, reference image, pose images, face crops, optional background frames, the SAM 2 mask, and the chosen resolution and frame count into a single animation embedding. That embedding drives WanVideoSampler (#27), which synthesizes latent video consistent with your prompt, identity, and motion cues; WanVideoDecode (#28) then converts latents back to RGB frames.
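A practical detail behind "the chosen resolution and frame count": Wan‑family video models typically expect spatial dimensions in multiples of 16 and frame counts of the form 4k + 1. Treat both constraints as assumptions here (exact requirements may differ by version); the sketch below just shows the snapping arithmetic.

```python
def snap_settings(width: int, height: int, frames: int) -> tuple[int, int, int]:
    """Round settings to values the model family typically accepts.
    Assumption: spatial dims in multiples of 16, temporal length 4k + 1."""
    w = max(16, round(width / 16) * 16)
    h = max(16, round(height / 16) * 16)
    f = max(1, ((frames - 1) // 4) * 4 + 1)
    return w, h, f
```

For instance, an 830x478 driver clip of 83 frames would snap to 832x480 and 81 frames, which is why matching the embed settings to the standardized driver values avoids artifacts.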
Result collage
To help compare outputs, the workflow assembles a simple side‑by‑side: the generated video alongside a vertical strip that shows the reference image, face crops, pose overlay, and a frame from the driver video. ImageConcatMulti (#77, #66) builds the visual collage, then VHS_VideoCombine (#30) renders a “Compare” mp4. The final clean output is rendered by VHS_VideoCombine (#189), which also carries over audio from the driver for quick review cuts. These exports make it easy to judge how well Wan 2.2 Animate V2 followed motion, preserved identity, and maintained the intended background.
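The compare collage is conceptually a concatenation of equal‑height frames along the width axis. The toy below illustrates that operation with nested lists standing in for image tensors; the real ImageConcatMulti node operates on batched tensors with its own options.

```python
# Toy illustration of the side-by-side compare collage built by
# ImageConcatMulti: concatenate two frames of equal height along
# the width axis. Nested lists stand in for image tensors here.

def hconcat(left: list[list[int]], right: list[list[int]]) -> list[list[int]]:
    assert len(left) == len(right), "frames must share the same height"
    return [lrow + rrow for lrow, rrow in zip(left, right)]
```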
Key nodes in ComfyUI Wan 2.2 Animate V2 workflow
VHS_LoadVideo (#191) Loads the driving video and exposes frames, audio, and metadata used across the graph. Keep the subject fully visible with minimal motion blur for stronger keypoint tracking. If you want shorter tests, limit the number of frames loaded; keep the source fps consistent downstream to avoid audio desync in the final combine.
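The fps warning above is simple arithmetic: if the frame count was captured at one rate but the final combine renders at another, the video's duration shifts relative to the carried‑over audio. This illustrative helper (not part of the graph) quantifies the drift.

```python
# Quick sanity check for audio sync: if fps changes between load and
# export, clip duration drifts relative to the audio track.
# Illustrative helper, not a workflow node.

def duration_s(frames: int, fps: float) -> float:
    return frames / fps

def desync_s(frames: int, src_fps: float, out_fps: float) -> float:
    """Seconds of drift if frames captured at src_fps are exported at out_fps."""
    return duration_s(frames, out_fps) - duration_s(frames, src_fps)
```

Exporting a 240‑frame clip loaded at 24 fps as 30 fps shortens it by two full seconds, more than enough to break lip sync.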
PoseAndFaceDetection (#172) Runs YOLO and ViTPose to produce whole‑body keypoints and face crops that directly guide motion transfer. Feed it the images from the loader and the standardized width and height; the optional retarget_image input allows adapting poses to a different framing when needed. If the pose overlay looks noisy, consider a higher‑quality ViTPose model and ensure the subject is not heavily occluded. Reference: ComfyUI‑WanAnimatePreprocess.
Sam2Segmentation (#104) Generates a subject mask that can preserve background or localize relighting in Wan 2.2 Animate V2. You can use the detected bounding boxes from PoseAndFaceDetection or draw quick positive points if needed to refine the matte. Pair it with GrowMaskWithBlur for cleaner edges on fast motion and review the result with the mask preview export. Reference: Segment Anything 2.
WanVideoClipVisionEncode (#70) Encodes the reference image with CLIP Vision to capture identity cues like facial structure, hair, and clothing. You can average multiple reference images to stabilize identity or use a negative image to suppress unwanted traits. Centered crops with consistent lighting help produce stronger embeddings.
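The idea behind averaging multiple reference images is that the mean of several embedding vectors gives a more stable identity signal than any single shot. A minimal sketch, with plain lists standing in for CLIP embedding tensors:

```python
# Sketch of embedding averaging for identity stabilization: the mean of
# several CLIP vectors smooths out per-image noise (lighting, angle).
# Plain lists stand in for real embedding tensors.

def average_embeddings(embeds: list[list[float]]) -> list[float]:
    n = len(embeds)
    dim = len(embeds[0])
    return [sum(e[i] for e in embeds) / n for i in range(dim)]
```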
WanVideoAnimateEmbeds (#62) Fuses identity features, pose images, face crops, optional background frames, and the SAM 2 mask into a single animation embedding. Align width, height, and num_frames with your driver video for fewer artifacts. If you see background drift, provide clean background frames and a solid mask; if the face drifts, ensure face crops are present and well lit.
WanVideoSampler (#27) Produces the actual video latents guided by your prompt, LoRAs, and the animation embedding. For long clips, choose between a sliding‑window strategy or the model’s context options; match the windowing to clip length to balance motion sharpness and long‑range consistency. Adjust the scheduler and guidance strength to trade off fidelity, style adherence, and motion smoothness, and enable block swap if VRAM becomes a constraint.
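To make the sliding‑window idea concrete, here is a hypothetical scheduler that splits a long frame range into overlapping windows, so each window shares context with the previous one. The window and overlap sizes are illustrative, not the sampler's defaults.

```python
# Hypothetical sketch of a sliding-window schedule for long clips:
# overlapping windows let each chunk share context with the last,
# trading extra compute for long-range consistency. Sizes are
# illustrative, not the sampler's actual defaults.

def sliding_windows(num_frames: int, window: int = 81, overlap: int = 16) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, covering num_frames."""
    if num_frames <= window:
        return [(0, num_frames)]
    step = window - overlap
    windows = []
    start = 0
    while start + window < num_frames:
        windows.append((start, start + window))
        start += step
    windows.append((num_frames - window, num_frames))  # final window flush to the end
    return windows
```

A 100‑frame clip with an 81‑frame window and 16‑frame overlap yields two windows, (0, 81) and (19, 100), whose shared frames anchor motion across the seam.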
Optional extras
- Start with a clean driver clip: steady camera, simple lighting, and minimal occlusion give Wan 2.2 Animate V2 the best chance to track motion cleanly.
- Use a reference that matches the target outfit and framing; avoid extreme angles or heavy filters that conflict with your prompt or LoRAs.
- Preserve or replace backgrounds with the SAM 2 mask; when compositing, keep edges soft enough to avoid haloing on fast motion.
- Keep fps consistent from load to export to maintain lip sync and beat alignment when carrying over audio.
- For quick iteration, test a short segment first, then extend the frame range once pose, identity, and lighting look right.
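For the "test a short segment first" tip, a preview duration in seconds converts directly into the frame cap to set on the video loader (the frame_load_cap parameter name follows VideoHelperSuite conventions; verify against your node's inputs).

```python
# Small helper for quick iteration: convert a preview duration in
# seconds to a loader frame cap (e.g. VHS_LoadVideo's frame_load_cap;
# verify the parameter name against your node version).

def frame_cap(seconds: float, fps: float) -> int:
    return max(1, int(round(seconds * fps)))
```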
Helpful resources used in this workflow:
- Preprocess nodes: kijai/ComfyUI‑WanAnimatePreprocess
- ViTPose ONNX models: ViTPose‑L, ViTPose‑H model and data
- YOLOv10 detector: yolov10m.onnx
- Wan 2.2 Animate 14B weights: Wan22Animate
- LoRAs: Lightx2v, Wan22_relight
Acknowledgements
This workflow implements and builds upon the following works and resources. We gratefully acknowledge Benji’s AI Playground for the original workflow and the Wan team for the Wan 2.2 Animate V2 model, and thank them for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories linked below.
Resources
- Wan team/Wan 2.2 Animate V2
- Docs / Release Notes: YouTube @Benji’s AI Playground
Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
