SCAIL pose‑guided character animation in ComfyUI
This workflow brings SCAIL to ComfyUI for pose‑guided, reference‑based character animation. By combining a single reference image with extracted human poses, SCAIL maintains subject identity, body structure, and coherent motion across frames while you control style with prompts. It supports either an input video for motion transfer or images plus rendered poses for choreography, then outputs multi‑frame videos with optional audio passthrough.
Use this SCAIL workflow for dance and action motion transfer, stylized character animation, and consistent multi‑shot sequences where temporal stability and accurate poses matter. Under the hood it runs on WanVideo for diffusion‑transformer video generation, augments identity via CLIP vision, and drives structure with NLF and ViTPose/DWPose pose signals, all wired for efficient long‑sequence sampling.
Note: Due to compatibility limitations, the 2XL machine cannot be used with the current ComfyUI workflow.
Key models in the ComfyUI SCAIL workflow
- SCAIL: Studio‑grade character animation via full‑context pose injection and a 3D‑consistent pose representation; the core of this workflow’s identity preservation and pose fidelity. GitHub, arXiv
- Wan 2.x Image‑to‑Video backbone: large video diffusion models used here as the sampler backbone for SCAIL‑conditioned generation; supports high‑quality I2V and animation tasks. Examples: Wan‑AI/Wan2.1‑I2V‑14B‑480P, Wan‑AI/Wan2.2‑Animate‑14B
- UMT5‑XXL text encoder: multilingual T5 variant used by Wan pipelines to turn prompts into conditioning embeddings. Hugging Face
- CLIP ViT‑H/14 vision encoder: extracts robust reference image features to anchor identity during video synthesis. GitHub
- ViTPose (Whole‑Body): high‑quality 2D human pose estimator that supplies dense keypoints for body, hands, and face used by SCAIL’s alignment and drawing utilities. GitHub
- DWPose: whole‑body keypoint format and models leveraged for optional face/hands detail and pose alignment. GitHub
- NLF (Neural Localizer Fields): predicts continuous human pose/shape cues that render into the SCAIL 3D‑aware pose images used for strong structural control. GitHub
- YOLOv10: fast detector used in the pose pre‑processing chain for person localization. GitHub
How to use the ComfyUI SCAIL workflow
Overall flow: load a reference image and an optional driving video; extract and render poses; encode the reference with CLIP vision; add SCAIL reference and SCAIL pose embeddings; assemble text conditioning; sample frames with WanVideo; decode and export the video. The graph includes public “Set_” variables so width, height, CFG, and frame count propagate automatically.
Key nodes in the ComfyUI SCAIL workflow
- WanVideoAddSCAILReferenceEmbeds (#350): Adds identity and appearance conditioning from the reference image into the image‑embedding stream. Increase its influence when the character’s face or clothing drifts; decrease if the model refuses to adapt to large body rotations or dramatic lighting.
- WanVideoAddSCAILPoseEmbeds (#324): Injects rendered SCAIL pose images as structural guidance. Raise its influence for stricter limb placement and silhouette stability; lower if motion looks too rigid or if you want more freedom for style prompts to bend the pose slightly.
- RenderNLFPoses (#362): Renders continuous NLF predictions into SCAIL‑style pose images, optionally overlaying DWPose face/hands and performing pose‑to‑reference alignment. Keep the internal pose render at half the target resolution to match SCAIL’s design and avoid aliasing; disconnect DWPose to remove alignment.
- WanVideoSamplerv2 (#348): Drives the main diffusion sampling with model, image/text embeds, scheduler, extra args, and cfg. If you see temporal wobble, use a steadier scheduler or more steps; if details overshoot the reference, lower cfg so SCAIL’s identity cues lead.
- WanVideoSchedulerv2 (#349): Controls denoising schedule behavior. Choose schedules that balance detail and stability; slower schedules often improve temporal consistency for sweeping motions and long sequences.
- WanVideoClipVisionEncode (#327): Encodes the reference image with ViT‑H/14 and outputs CLIP image embeddings for identity. Use high‑quality, well‑lit references; frontal or 3/4 views tend to anchor faces and hair better.
- Dimensions must be divisible by 32. Keep reference and output aspect ratios aligned to avoid warping.
- SCAIL expects pose renders at half the generation resolution; this workflow auto‑calculates it so you do not need to manage it manually.
- For precise hands and expressions, keep DWPose connected to enable face/hands cues; to disable alignment only, disconnect the DWPose link but keep the rendered pose images.
- Long sequences: use the context options node to window generation for memory efficiency while keeping overlap for smooth transitions.
- SCAIL preview weights repackaged for ComfyUI are available from community distributions. Example preview packs: Kijai/WanVideo_comfy SCAIL and Kijai/WanVideo_comfy_fp8_scaled SCAIL.
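The sizing and windowing tips above can be illustrated with a small Python sketch. It is not part of the workflow and does not call any ComfyUI or WanVideo API; the function names (snap_to_multiple, plan_resolution, context_windows) are hypothetical, and the windowing logic is a generic overlapping‑window scheme, not necessarily the one the context options node uses. It only shows the arithmetic: dimensions snapped down to a multiple of 32, a pose render at half the generation resolution, and frame windows that share an overlap.

```python
from typing import Dict, List, Tuple


def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round value down to the nearest multiple (hypothetical helper)."""
    if value < multiple:
        raise ValueError(f"value must be at least {multiple}")
    return (value // multiple) * multiple


def plan_resolution(width: int, height: int) -> Dict[str, Tuple[int, int]]:
    """Return a generation size divisible by 32 and a half-size pose render."""
    gen_w = snap_to_multiple(width)
    gen_h = snap_to_multiple(height)
    # Assumption: "half the generation resolution" means half of each side.
    return {
        "generation": (gen_w, gen_h),
        "pose_render": (gen_w // 2, gen_h // 2),
    }


def context_windows(num_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Split num_frames into [start, end) windows that share `overlap` frames.

    Generic illustration only; the real context options node may schedule
    windows differently.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return windows


if __name__ == "__main__":
    print(plan_resolution(850, 480))
    print(context_windows(200, window=81, overlap=16))
```

In this sketch a requested 850×480 frame is snapped to 832×480, giving a 416×240 pose render, and a 200‑frame sequence with an 81‑frame window and 16‑frame overlap is covered by three windows. The window and overlap values are placeholders chosen for illustration, not recommended settings.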
Acknowledgements
This workflow implements and builds upon the following works and resources. We gratefully acknowledge Ai Verse Z.ai (zai-org) for SCAIL (official implementation) and teal024 for the SCAIL project page, and thank them for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories listed below.
Resources
- zai-org/SCAIL
- teal024/SCAIL Project Page
Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.