Pose Control LipSync with Wan2.2 S2V turns a single image, an audio clip, and a pose reference video into a synchronized talking performance. The character in your reference image follows the reference video’s body motion while lip movements match the audio. This ComfyUI workflow is ideal for avatars, story scenes, trailers, explainers, and music videos where you want tight control over pose, expression, and speech timing.
Built on the Wan 2.2 S2V 14B model family, the workflow fuses text prompts, clean vocal features, and pose maps to generate cinematic motion with stable identity. It is designed to be simple to operate while giving creators fine control over look, pacing, and framing.
The workflow combines five parts: model loading, audio preparation, image and pose inputs, conditioning, and generation. Groups run in a left‑to‑right flow, with audio length automatically setting clip duration at 16 fps.
This group loads the Wan 2.2 S2V model, its VAE, the UMT5‑XXL text encoder, and a LightX2V LoRA. The base transformer is initialized in UNETLoader (#37) and adapted with LoraLoaderModelOnly (#61) for faster low‑step sampling. The Wan VAE is supplied by VAELoader (#39), and the text encoder comes from CLIPLoader (#38), which loads the UMT5‑XXL weights referenced by Wan. You rarely need to touch this group unless you swap model files.
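If you are staging the model files yourself, they live in the standard ComfyUI model folders. The layout below is a sketch; the filenames are placeholders and depend on which Wan 2.2 S2V release you download:

```
ComfyUI/models/
├── diffusion_models/  wan2.2_s2v_14b_*.safetensors   → UNETLoader (#37)
├── loras/             lightx2v_*.safetensors         → LoraLoaderModelOnly (#61)
├── vae/               wan_*_vae.safetensors          → VAELoader (#39)
└── text_encoders/     umt5_xxl_*.safetensors         → CLIPLoader (#38)
```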
Drop in an audio file with LoadAudio (#58). AudioSeparation (#85) isolates the vocal stem so the lips follow clear speech or singing rather than background instruments. Audio Duration (mtb) (#70) measures the clip, and SimpleMath+ (#71) converts that duration to a frame count at 16 fps so the video length matches your audio. AudioEncoderEncode (#56) feeds a Wav2Vec2‑Large encoder so Wan can map phonemes to mouth shapes for accurate lip sync.
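The duration‑to‑frames conversion is simple arithmetic. Here is a minimal Python sketch of what SimpleMath+ computes; whether the graph rounds up or down is a node‑setting detail, so the ceiling here is an assumption:

```python
import math

FPS = 16  # Wan 2.2 S2V generates at 16 frames per second

def frames_for_audio(duration_seconds: float) -> int:
    """Frame count that makes the video as long as the audio clip."""
    return math.ceil(duration_seconds * FPS)

print(frames_for_audio(7.5))  # 7.5 s of vocals -> 120 frames
```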
LoadImage (#52) provides the subject still that carries identity, clothing, and camera setup. ImageResizeKJv2 (#69) reads dimensions from the image so the pipeline consistently derives the target width and height for all later stages. Use a sharp, front‑facing image with an unobstructed mouth for the most faithful lip movements.
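Video diffusion models generally want dimensions that divide evenly into the latent grid, and ImageResizeKJv2 can snap to a multiple for you. A rough Pillow equivalent, assuming a multiple of 16 (the exact requirement is model‑dependent):

```python
from PIL import Image

def snap_to_multiple(path: str, multiple: int = 16) -> Image.Image:
    """Resize so both sides are divisible by `multiple`, staying close
    to the original size; a crude stand-in for ImageResizeKJv2."""
    img = Image.open(path)
    w = max(multiple, round(img.width / multiple) * multiple)
    h = max(multiple, round(img.height / multiple) * multiple)
    return img.resize((w, h), Image.LANCZOS)

ref = snap_to_multiple("subject.png")  # placeholder filename
print(ref.size)
```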
VHS_LoadVideo (#80) imports your pose reference video. ImageResizeKJv2 (#83) adapts frames to the target size, and DWPreprocessor (#78) turns them into pose maps with YOLOX detection plus DWPose keypoints. A final ImageResizeKJv2 (#81) aligns pose frames to the generation resolution before they are passed forward as the control video. You can preview pose outputs by routing them to VHS_VideoCombine (#95), which helps confirm that the reference framing and timing fit your subject.
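For intuition, the alignment work this group performs — sampling the pose clip down to the generation frame count and resizing to the target resolution — looks roughly like this outside ComfyUI (OpenCV sketch; the path, frame count, and size are placeholders):

```python
import cv2
import numpy as np

def sample_pose_frames(video_path: str, n_frames: int, size: tuple[int, int]) -> list[np.ndarray]:
    """Evenly sample n_frames from the pose video and resize them
    to the generation resolution so they align with the latent grid."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size, interpolation=cv2.INTER_AREA))
    cap.release()
    return frames

pose_frames = sample_pose_frames("pose_ref.mp4", 120, (832, 480))
```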
Write the style and scene intent in CLIP Text Encode (Positive Prompt) (#6), and use CLIP Text Encode (Negative Prompt) (#7) to discourage unwanted artifacts. Prompts steer high‑level aesthetics and background motion, while the audio drives lip movements and the pose reference governs body dynamics. Keep prompts concise and aligned with your target camera angle and mood, as in the example below.
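As an illustration only (not text taken from this graph), a prompt pair for a talking‑head shot might read:

```
Positive: cinematic medium close-up of a person speaking to camera, soft
window light, shallow depth of field, natural skin texture, steady framing

Negative: blurry, deformed hands, extra fingers, watermark, text overlays,
jittery motion, distorted face
```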
WanSoundImageToVideo (#55) fuses text, audio features, the reference image, and the pose control video, then prepares a latent sequence. KSamplerAdvanced (#64) performs low‑step denoising suited to LightX2V‑style acceleration, and VAEDecode (#8) reconstructs the frames. VHS_VideoCombine (#62) assembles the frames into an MP4 and attaches your original audio so the output is ready to review or edit.
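VHS_VideoCombine handles this muxing inside the graph; if you ever need to redo it from exported frames, an equivalent ffmpeg invocation (paths are placeholders) looks like this:

```python
import subprocess

# Rebuild the MP4 at 16 fps and remux the original audio.
# "-shortest" trims to the shorter of the two streams.
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "16", "-i", "frames/%05d.png",  # decoded frames
    "-i", "input_audio.wav",                      # original audio
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-shortest",
    "out.mp4",
], check=True)
```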
WanSoundImageToVideo (#55) is the heart of the workflow: it conditions Wan2.2‑S2V with your prompt, vocals, subject image, and pose control video. Adjust only what matters: set width, height, and length to match your subject image and audio length, and plug in a preprocessed pose video for motion control. Leave ref_motion empty unless you plan to inject a separate camera track. The model’s speech‑to‑video behavior is described in Wan‑AI/Wan2.2‑S2V‑14B and Wan‑Video/Wan2.2.
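A pre‑flight check on these three values can save a wasted run. The helper below is hypothetical — it simply mirrors the constraints described above (frame count derived from audio at 16 fps, with divisibility by 16 as an assumption):

```python
import math

def check_s2v_inputs(image_size, audio_seconds, width, height, length, fps=16):
    """Hypothetical sanity check mirroring the node's key inputs."""
    expected = math.ceil(audio_seconds * fps)
    assert length == expected, f"length {length} != {expected} frames from audio"
    assert width % 16 == 0 and height % 16 == 0, "dims not divisible by 16 (assumed requirement)"
    if abs(image_size[0] / image_size[1] - width / height) > 0.05:
        print("warning: aspect ratio differs from subject image; expect crop/stretch")

check_s2v_inputs(image_size=(1024, 576), audio_seconds=7.5,
                 width=832, height=480, length=120)
```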
DWPreprocessor (#78) generates pose maps using YOLOX for detection and DWPose for whole‑body keypoints. Strong pose cues help Wan follow limbs and torso while the audio controls lips and expressions. If your reference has heavy camera motion, use a pose video whose viewpoint and timing align with the intended performance. DWPose and its variants are documented in IDEA‑Research/DWPose.
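Concretely, a pose map is nothing more than keypoints rendered as a skeleton on a black canvas. The toy sketch below uses made‑up keypoints and connectivity, not the actual 133‑point DWPose whole‑body layout:

```python
import cv2
import numpy as np

# Toy keypoints as (x, y, confidence); real DWPose emits 133 whole-body points.
keypoints = [(416, 120, 0.99), (400, 200, 0.97), (432, 200, 0.96), (416, 320, 0.95)]
limbs = [(0, 1), (0, 2), (1, 3), (2, 3)]  # illustrative connectivity only

canvas = np.zeros((480, 832, 3), dtype=np.uint8)
for a, b in limbs:
    if keypoints[a][2] > 0.3 and keypoints[b][2] > 0.3:  # skip low-confidence joints
        cv2.line(canvas, keypoints[a][:2], keypoints[b][:2], (0, 255, 0), 3)
for x, y, conf in keypoints:
    if conf > 0.3:
        cv2.circle(canvas, (x, y), 5, (0, 0, 255), -1)

cv2.imwrite("pose_map.png", canvas)
```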
KSamplerAdvanced (#64) executes denoising for the latent sequence. With a LightX2V LoRA loaded, you can keep steps low for fast previews while retaining motion coherence; increase steps when pushing for maximum detail. Scheduler choice trades motion smoothness against crispness and should be tuned together with LoRA usage, as outlined for Wan in the Diffusers documentation.
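As a starting point, settings along these lines are commonly used with LightX2V‑style distillation LoRAs; they are community defaults rather than values read from this graph, so treat them as assumptions to tune:

```
steps:     4–8    (low-step distilled sampling; raise for finer detail)
cfg:       1.0    (distill LoRAs are typically run without guidance)
sampler:   euler
scheduler: simple
```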
VHS_LoadVideo (#80) imports and scrubs your pose reference. Use its in‑node frame selection tools to pick the exact segment that matches your audio. Keeping framing and subject size consistent with the reference image stabilizes motion transfer. The node is part of VideoHelperSuite: ComfyUI‑VideoHelperSuite.
VHS_VideoCombine (#62) combines the generated frames and your audio into an MP4 and saves workflow metadata. Set the output frame rate to 16 fps to match the frame count computed from the audio duration. Enable or disable metadata saving depending on your asset‑management needs. See the VideoHelperSuite documentation: ComfyUI‑VideoHelperSuite.
AudioSeparation (#85) isolates vocals so Wav2Vec2 features drive mouth shapes without interference from instruments or effects. If your input is already clean speech, you can bypass separation. For best results, keep audio levels consistent and minimize reverb.
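If you would rather pre‑separate the vocals before loading the audio, Demucs offers a comparable two‑stem split and, per its README, can be called from Python (the filename is a placeholder):

```python
import demucs.separate

# Split into "vocals" and "no_vocals" stems; output lands under ./separated/
demucs.separate.main(["--two-stems", "vocals", "song.mp3"])
```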
This Pose Control LipSync with Wan2.2 S2V workflow gives you a fast path from audio and a still image to a controllable, on‑beat performance that looks cohesive and feels expressive.
This workflow implements and builds upon the following works and resources. We gratefully acknowledge @ArtOfficialLabs, author of the Pose Control LipSync with Wan2.2 S2V demo, for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories linked below.
Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
RunComfy is the premier ComfyUI platform, offering ComfyUI online environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI Playground, enabling artists to harness the latest AI tools to create incredible art.