SCAIL Model | Pose-Guided Animation Maker

Workflow Name: RunComfy/SCAIL
Workflow ID: 0000...1323
This pose-driven model enables creators to animate still characters using reference images and extracted human poses. You can transfer movement, maintain subject consistency, and control structure across video frames. Designed for animators and motion designers, it supports both image-to-video and video-to-video workflows. It ensures coherent motion and visual stability even during complex transformations. Perfect for crafting stylized character animations or motion studies with structural precision.

SCAIL pose‑guided character animation in ComfyUI

This workflow brings SCAIL to ComfyUI for pose‑guided, reference‑based character animation. By combining a single reference image with extracted human poses, SCAIL maintains subject identity, body structure, and coherent motion across frames while you control style with prompts. It supports either an input video for motion transfer or images plus rendered poses for choreography, then outputs multi‑frame videos with optional audio passthrough.

Use this SCAIL workflow for dance and action motion transfer, stylized character animation, and consistent multi‑shot sequences where temporal stability and accurate poses matter. Under the hood it runs on WanVideo for diffusion‑transformer video generation, augments identity via CLIP vision, and drives structure with NLF and ViTPose/DWPose pose signals, all wired for efficient long‑sequence sampling.

Note: Due to compatibility limitations, the 2XL machine cannot be used with the current ComfyUI workflow.

Key models in the ComfyUI SCAIL workflow

  • SCAIL: Studio‑grade character animation via full‑context pose injection and a 3D‑consistent pose representation; the core of this workflow’s identity preservation and pose fidelity. GitHub, arXiv
  • Wan 2.x Image‑to‑Video backbone: large video diffusion models used here as the sampler backbone for SCAIL‑conditioned generation; supports high‑quality I2V and animation tasks. Examples: Wan‑AI/Wan2.1‑I2V‑14B‑480P, Wan‑AI/Wan2.2‑Animate‑14B
  • UMT5‑XXL text encoder: multilingual T5 variant used by Wan pipelines to turn prompts into conditioning embeddings. Hugging Face
  • CLIP ViT‑H/14 vision encoder: extracts robust reference image features to anchor identity during video synthesis. GitHub
  • ViTPose (Whole‑Body): high‑quality 2D human pose estimator that supplies dense keypoints for body, hands, and face used by SCAIL’s alignment and drawing utilities. GitHub
  • DWPose: whole‑body keypoint format and models leveraged for optional face/hands detail and pose alignment. GitHub
  • NLF (Neural Localizer Fields): predicts continuous human pose/shape cues that render into the SCAIL 3D‑aware pose images used for strong structural control. GitHub
  • YOLOv10: fast detector used in the pose pre‑processing chain for person localization. GitHub

How to use the ComfyUI SCAIL workflow

Overall flow: load a reference image and an optional driving video; extract and render poses; encode the reference with CLIP vision; add SCAIL reference and SCAIL pose embeddings; assemble text conditioning; sample frames with WanVideo; decode and export the video. The graph includes public “Set_” variables so width, height, CFG, and frame count propagate automatically.

  • Inputs and sizing

    • Load a reference character image and, for motion transfer, a driving video. The workflow resizes the reference to the generation size and ensures the target dimensions are divisible by 32. If you load a video, its audio is available for passthrough to the final export.
    • Set width, height, and frame count once; the values feed the sampler, decoder, and exporter via shared getters and setters. Keep aspect ratio consistent between reference and output to minimize stretching artifacts.
  • Pose extraction (group: Pose extraction)

    • The input video frames or images are resized for analysis and fed to an NLF pose predictor and a ViTPose detector. The ViTPose output is converted into DWPose format for optional face/hands detail and for aligning the global pose to the reference subject.
    • Rendered SCAIL pose images are produced at half the generation resolution internally for efficiency, then composed to the target size, preserving depth cues and occlusions. Face/hands drawing can be toggled while still using alignment; disconnect DWPose if you want pose alignment disabled.
  • Reference identity encoding

    • The reference image is encoded with CLIP ViT‑H/14 and converted into WanVideo image embeddings. These embeddings capture color, texture, and local structure so SCAIL can keep the character consistent through challenging motion.
    • If identity drifts in long or stylized shots, keep a clean, front‑facing reference and avoid heavy crops; this strengthens the CLIP signal used downstream.
  • SCAIL pose conditioning

    • The SCAIL pose renders are injected as additional image embeddings. They act as strong structural guidance that enforces limb placement, depth ordering, and silhouette stability across frames.
    • You can swap the driving source at this stage: use extracted poses from a video for motion transfer or feed pre‑rendered SCAIL pose images to choreograph sequences without a driver.
  • Text prompt conditioning

    • Prompts are encoded to text embeddings that bias style, wardrobe, lighting, and environment. Use concise descriptors that complement the reference image; negative text can reduce over‑saturation, artifacts, or clutter.
    • Prompts are optional when you want the output to follow the reference look closely under SCAIL control.
  • Sampling and scheduling

    • The WanVideo sampler runs the diffusion‑transformer with model, scheduler, image embeds (reference + SCAIL pose), text embeds, and CFG guidance. A context options node can window long sequences for memory‑friendly generation while preserving temporal continuity.
    • If you notice flicker or soft edges, consider a slower scheduler or slightly stronger CFG; if motion feels over‑constrained, reduce overall guidance so SCAIL structure and appearance cues balance naturally.
  • Decode and export

    • Latents are decoded to frames using the Wan VAE, and the video is written with your chosen frame rate and filename prefix. The workflow can concatenate visuals for A/B slices and passes audio through when connected.
    • Inspect the output; if arms or legs clip during fast turns, revisit pose extraction quality or alignment inputs, then requeue with the same seed for controlled iteration.
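
The sizing rules above (dimensions divisible by 32, output aspect ratio matched to the reference, pose renders at half the generation resolution) can be sketched in a few lines of Python. The workflow already handles this automatically, so this is only an illustration: the function names and the 832-pixel long side are assumptions chosen for the example, not values taken from the graph.

```python
def snap_to_multiple(value: float, multiple: int = 32) -> int:
    """Round to the nearest positive multiple of `multiple`."""
    return max(multiple, int(round(value / multiple)) * multiple)


def plan_sizes(ref_width: int, ref_height: int, long_side: int = 832) -> dict:
    """Pick a generation size that follows the reference aspect ratio.

    `long_side` is a hypothetical target for the longer edge. Returns the
    generation size (both sides divisible by 32) and the half-resolution
    size used for the internal pose render.
    """
    scale = long_side / max(ref_width, ref_height)
    width = snap_to_multiple(ref_width * scale)
    height = snap_to_multiple(ref_height * scale)
    return {
        "generation": (width, height),
        "pose_render": (width // 2, height // 2),
    }


# A 1080x1920 portrait reference, scaled to a long side of 832 pixels.
sizes = plan_sizes(1080, 1920, long_side=832)
print(sizes)  # {'generation': (480, 832), 'pose_render': (240, 416)}
```

Snapping each side independently can shift the aspect ratio slightly (here 480:832 instead of exactly 9:16), which is usually small enough to avoid visible stretching.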

Key nodes in the ComfyUI SCAIL workflow

  • WanVideoAddSCAILReferenceEmbeds (#350)

    • Adds identity and appearance conditioning from the reference image into the image‑embedding stream. Increase its influence when the character’s face or clothing drifts; decrease if the model refuses to adapt to large body rotations or dramatic lighting.
  • WanVideoAddSCAILPoseEmbeds (#324)

    • Injects rendered SCAIL pose images as structural guidance. Raise its influence for stricter limb placement and silhouette stability; lower if motion looks too rigid or if you want more freedom for style prompts to bend the pose slightly.
  • RenderNLFPoses (#362)

    • Renders continuous NLF predictions into SCAIL‑style pose images, optionally overlaying DWPose face/hands and performing pose‑to‑reference alignment. Keep the internal pose render at half the target resolution to match SCAIL’s design and avoid aliasing; disconnect DWPose to remove alignment.
  • WanVideoSamplerv2 (#348)

    • Drives the main diffusion sampling with model, image/text embeds, scheduler, extra args, and cfg. If you see temporal wobble, use a steadier scheduler or more steps; if details overshoot the reference, lower cfg so SCAIL’s identity cues lead.
  • WanVideoSchedulerv2 (#349)

    • Controls denoising schedule behavior. Choose schedules that balance detail and stability; slower schedules often improve temporal consistency for sweeping motions and long sequences.
  • WanVideoClipVisionEncode (#327)

    • Encodes the reference image with ViT‑H/14 and outputs CLIP image embeddings for identity. Use high‑quality, well‑lit references; frontal or 3/4 views tend to anchor faces and hair better.
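
If you export the graph in ComfyUI's API format (a JSON object keyed by node ID, where each node has a `class_type` and an `inputs` mapping), the tuning advice above can be applied from a script. The sketch below is a minimal helper for that. The node IDs and class names come from the list above, but the input names (`cfg`, `strength`) and their values are hypothetical placeholders; check the exported JSON for the real input names before relying on them.

```python
import copy


def set_node_input(workflow: dict, node_id: str, class_type: str, name: str, value) -> dict:
    """Return a copy of an API-format workflow with one node input changed.

    Raises KeyError or ValueError instead of silently editing the wrong node.
    """
    node = workflow[node_id]  # KeyError if the ID is missing
    if node["class_type"] != class_type:
        raise ValueError(f"node {node_id} is {node['class_type']}, expected {class_type}")
    updated = copy.deepcopy(workflow)
    updated[node_id]["inputs"][name] = value
    return updated


# Toy stand-in for an exported graph; input names and values are hypothetical.
workflow = {
    "348": {"class_type": "WanVideoSamplerv2", "inputs": {"cfg": 5.0}},
    "324": {"class_type": "WanVideoAddSCAILPoseEmbeds", "inputs": {"strength": 1.0}},
}

# Lower guidance so the reference identity cues lead.
softer = set_node_input(workflow, "348", "WanVideoSamplerv2", "cfg", 4.0)
print(softer["348"]["inputs"]["cfg"])    # 4.0
print(workflow["348"]["inputs"]["cfg"])  # 5.0 (original left untouched)
```

Checking the class name before writing guards against node IDs shifting when the graph is edited and re-exported.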

Optional extras

  • Dimensions must be divisible by 32. Keep reference and output aspect ratios aligned to avoid warping.
  • SCAIL expects pose renders at half the generation resolution; this workflow auto‑calculates it so you do not need to manage it manually.
  • For precise hands and expressions, keep DWPose connected to enable face/hands cues; to disable alignment, disconnect the DWPose link while keeping the rendered pose images.
  • Long sequences: use the context options node to window generation for memory efficiency while keeping overlap for smooth transitions.
  • SCAIL preview weights repackaged for ComfyUI are available from community distributions, for example Kijai/WanVideo_comfy SCAIL and Kijai/WanVideo_comfy_fp8_scaled SCAIL.
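
To make the idea of windowed generation with overlap concrete, here is a plain sliding-window sketch in Python. It is not the context options node's implementation, which may schedule windows differently; the function name and the frame, window, and overlap values are assumptions picked for illustration.

```python
def context_windows(num_frames: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Split frame indices [0, num_frames) into overlapping (start, end) windows."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if num_frames <= window:
        return [(0, num_frames)]
    stride = window - overlap
    windows = []
    start = 0
    while start + window < num_frames:
        windows.append((start, start + window))
        start += stride
    # Anchor the final window to the end so every frame is covered.
    windows.append((num_frames - window, num_frames))
    return windows


print(context_windows(num_frames=161, window=81, overlap=16))
# [(0, 81), (65, 146), (80, 161)]
```

Each window shares at least `overlap` frames with the one before it, which is what gives a windowed sampler shared frames to blend across the seam.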

Acknowledgements

This workflow implements and builds upon the following works and resources. We gratefully acknowledge Ai Verse Z.ai (zai-org) for the official SCAIL implementation and teal024 for the SCAIL project page, and thank both for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories listed below.

Resources

  • zai-org/SCAIL
    • GitHub: zai-org/SCAIL
    • Hugging Face: zai-org/SCAIL-Preview
    • arXiv: arXiv:2512.05905
  • teal024/SCAIL Project Page
    • Docs / Release Notes: Project Page

Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.

Copyright 2025 RunComfy. All Rights Reserved.
