SkyReels V3 ComfyUI Workflow


SkyReels V3 ComfyUI: identity‑faithful image, video, and audio to video creation

SkyReels V3 ComfyUI is a production‑ready workflow that brings the SkyReels V3 multimodal video model into ComfyUI so you can animate still images, extend existing shots, and build audio‑driven talking avatars with precise lip‑sync. It is designed for creators who want cinematic motion, strong subject identity, and temporal coherence while staying inside a flexible node graph.

The workflow ships with four focused pipelines that can be run independently or chained: image to video character animation, video to video continuation, audio to video talking avatars, and next‑shot generation for story flow. Each path includes clear input points and sensible defaults so you can drop in your assets and render high‑quality SkyReels V3 outputs quickly.

Note for 2X Large and larger machines (R2V workflow): Before running, set sage_attention to disabled on Patch Sage Attention KJ (#240). Leaving it enabled can trigger “SM90 kernel is not available” errors.
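If you drive the graph from a script instead of the UI, the same switch can be flipped in an exported API-format workflow. The sketch below assumes the export is a JSON object keyed by node ID with an inputs dictionary per node, that the node keeps ID 240 in your export, and that disabled is the literal option value. The helper name disable_sage_attention and the toy workflow fragment are hypothetical.

```python
import copy


def disable_sage_attention(workflow: dict, node_id: str = "240") -> dict:
    """Return a copy of an API-format workflow with sage_attention disabled.

    Assumes each node is stored as {"class_type": ..., "inputs": {...}}.
    """
    patched = copy.deepcopy(workflow)
    node = patched.get(node_id)
    if node is None:
        raise KeyError(f"node {node_id} not found in workflow")
    node.setdefault("inputs", {})["sage_attention"] = "disabled"
    return patched


# Toy fragment for illustration; the "auto" value is a placeholder
# for whatever your own export contains.
example = {
    "240": {
        "class_type": "PathchSageAttentionKJ",
        "inputs": {"sage_attention": "auto"},
    }
}
patched = disable_sage_attention(example)
```

The function copies the workflow before editing, so the original export stays untouched and can be reused on machines where the setting is safe to leave enabled.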

Key models in the SkyReels V3 ComfyUI workflow

SkyReels V3 video backbones (R2V, V2V Shot, A2V) from the WanVideo FP8 pack. These are the core generators that handle identity‑aware motion, video continuation, and audio‑conditioned lip‑sync. The SkyReels V3 weights are available in the WanVideo pack on Hugging Face.
OpenCLIP Vision ViT models for image guidance and reference embedding. They provide robust visual features that help preserve look and style across frames. Project page: open_clip.
UMT5 text encoder for prompt understanding. It supplies rich language conditioning to steer style, scene, and actions. Repo: umt5.
Wav2Vec2 speech features for lip‑sync and audio analysis. The Chinese base variant is supported out of the box and similar English variants work as well. Model card: TencentGameMate/chinese-wav2vec2-base.
Qwen3‑ASR‑1.7B for speech‑to‑text. Used to transcribe reference audio and bootstrap voice‑cloned TTS prompts. Model card: Qwen/Qwen3-ASR-1.7B.
MelBandRoFormer for vocal separation. Helpful when you need clean speech tracks before lip‑sync embedding. Model card: Kijai/MelBandRoFormer_comfy.
MiniCPM‑V for shot‑aware prompt generation. It analyzes prior footage and proposes the next shot for story continuity. Model hub: OpenBMB/MiniCPM-V.

How to use the SkyReels V3 ComfyUI workflow

The graph is organized into four pipelines. You can run any one on its own or in sequence to build longer edits.

Image to Video character animation

Models. Load the UNet, CLIP, and VAE in the Models group using UNETLoader (#241), CLIPLoader (#242), and VAELoader (#194). The model patch nodes PathchSageAttentionKJ (#240) and ModelPatchTorchSettings (#239) optimize attention and math settings, while LoraLoaderModelOnly (#250) lets you optionally blend a style or motion LoRA into the SkyReels model.
Load reference images. Use the three “Load reference images” groups to import 1–3 portraits or poses. The resize helpers ImageResizeKJv2 (#291, #298, #299, #304) align aspect ratio and batch them; cleaner identity photos yield more stable results.
Prompt. Enter scene and action text in the Prompt group with CLIPTextEncode (#6) and an optional negative text encoder CLIPTextEncode (#7) to nudge away undesired traits. Keep language concise and specific to motion and framing.
Sampling and decoding. WanPhantomSubjectToVideo (#249) fuses your references and prompts into an identity‑aware latent that feeds KSampler (#149) through ModelSamplingSD3 (#48). The decoded frames from VAEDecode (#264) are packaged into a movie with VHS_VideoCombine (#280); set your target frame rate and file format there.
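For scripted runs, the prompt and frame‑rate steps above can be sketched against an API‑format export. This is a minimal sketch, not the workflow author's tooling: it assumes nodes #6, #7, and #280 keep those IDs in your export, that the text encoders expose a text input and the video combiner a frame_rate input, and that a ComfyUI server listens on the default local address. The helper names build_i2v_payload and queue_prompt are hypothetical.

```python
import json
import urllib.request


def build_i2v_payload(workflow, positive, negative="", fps=24):
    """Fill prompt text and frame rate, then wrap the graph for the /prompt endpoint."""
    graph = json.loads(json.dumps(workflow))    # deep copy of JSON-like data
    graph["6"]["inputs"]["text"] = positive     # positive CLIPTextEncode (assumed ID)
    graph["7"]["inputs"]["text"] = negative     # negative CLIPTextEncode (assumed ID)
    graph["280"]["inputs"]["frame_rate"] = fps  # VHS_VideoCombine (assumed ID)
    return json.dumps({"prompt": graph}).encode("utf-8")


def queue_prompt(payload, server="http://127.0.0.1:8188"):
    """POST the payload to a running ComfyUI server (not called in this sketch)."""
    request = urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Stub graph with only the three nodes this sketch touches.
stub = {
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
    "7": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
    "280": {"class_type": "VHS_VideoCombine", "inputs": {"frame_rate": 16}},
}
payload = build_i2v_payload(
    stub, "a woman turns toward the camera, slow dolly in", "blurry", fps=24
)
```

Building the payload separately from sending it makes it easy to inspect the edited graph before anything is queued.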

Video to Video extend loop

Input video and settings. Bring in your source clip with VHS_LoadVideo (#329). Set how many extra segments to generate and how much overlap between segments using the integer helpers “Number of Extend” (#342) and “Overlapping Frames” (#341). ImageResizeKJv2 (#327) standardizes resolution for the sampler.
Loop sampling extend video. The loop pair easy forLoopStart (#331) and easy forLoopEnd (#332) walks over the clip in windows to stabilize transitions. Each window is encoded with WanVideoEncode (#326), receives neutral or control embeds via WanVideoEmptyEmbeds (#328), and is denoised by WanVideoSampler (#320) from WanVideoModelLoader (#319). Frames are decoded with WanVideoDecode (#321) and previewed or saved with VHS_VideoCombine (#322, #335).
Performance helpers. WanVideoTorchCompileSettings (#323) and WanVideoBlockSwap (#325) enable compile and memory tricks for longer or higher‑resolution runs.
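To budget render time, it helps to estimate how long the extended clip will be. The arithmetic below is a rough sketch under one assumption: each generated segment re‑renders the overlapping frames of the previous tail, so it contributes only its length minus the overlap in new frames. The parameter segment_frames (frames produced per loop pass) is a hypothetical name; check the actual value in your sampler settings.

```python
def extended_frame_count(source_frames, segment_frames, num_extend, overlap):
    """Estimate total frames after the extend loop.

    Assumes every pass adds (segment_frames - overlap) new frames.
    """
    if not 0 <= overlap < segment_frames:
        raise ValueError("overlap must be in [0, segment_frames)")
    return source_frames + num_extend * (segment_frames - overlap)


# Example: an 81-frame clip, two extensions of 81 frames, 9 overlapping frames.
total = extended_frame_count(source_frames=81, segment_frames=81, num_extend=2, overlap=9)
seconds_at_24fps = total / 24
```

Under this assumption, raising the overlap smooths transitions but shortens the footage gained per pass, which is the trade‑off behind the “Number of Extend” and “Overlapping Frames” helpers.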

Audio to Video talking avatar

1 – Create audio. You can generate a voice‑cloned speech track with FB_Qwen3TTSVoiceClonePrompt (#416) and FB_Qwen3TTSVoiceClone (#412), or load any pre‑recorded voice with LoadAudio (#417). Qwen3ASRLoader (#414) plus Qwen3ASRTranscribe (#413) help you extract text from a reference clip to seed the TTS prompt if desired.
2 – Audio features. DownloadAndLoadWav2VecModel (#348) feeds MultiTalkWav2VecEmbeds (#350) to create lip‑motion embeddings from your speech; length is aligned to the audio and previewable with PreviewAudio (#422). Use Any Switch (rgthree) (#435) to pick TTS output or your imported file as the driving track.
3 – Input image. Load the talking face in the “3 - Input image” group and size it with ImageResizeKJv2 (#370). Clean, front‑facing portraits with consistent lighting work best.
Reference video gen. First, create a short visual anchor from the still using WanVideoImageToVideoEncode (#392). CLIP‑Vision features from CLIPVisionLoader (#352) and WanVideoClipVisionEncode (#351) stabilize identity across the next stage; a scheduler WanVideoSchedulerv2 (#385) is prepared in the Sampling Setting group.
Generate audio lip‑sync. WanVideoImageToVideoSkyreelsv3_audio (#383) combines the start image, optional reference frames, and CLIP‑Vision embeds into image conditioning. WanVideoSamplerv2 (#384) then denoises with the SkyReels A2V model while WanVideoSamplerExtraArgs (#386) injects the MultiTalk lip‑sync embeddings for accurate mouth shapes. WanVideoPassImagesFromSamples (#381) streams decoded frames to VHS_VideoCombine (#346) where the final video is muxed with your audio.
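Because the lip‑sync embeddings are aligned to the audio, the frame count you request should cover the speech track. The helper below is a small sketch for picking that number. It assumes a frame count of the form 4k + 1, a convention commonly used with Wan‑family video models, and a default of 25 fps; both are assumptions to verify against your sampler and output settings, and the name frames_for_audio is hypothetical.

```python
import math


def frames_for_audio(duration_s, fps=25.0):
    """Smallest frame count of the form 4k + 1 that covers the audio duration."""
    raw = math.ceil(duration_s * fps)      # frames needed to cover the audio
    k = math.ceil(max(raw - 1, 0) / 4)     # round up to the next 4k + 1 step
    return 4 * k + 1


# Example: a 3.2 second voice line rendered at 25 fps.
frame_budget = frames_for_audio(3.2, fps=25.0)
```

Rounding up rather than down avoids cutting off the last syllable; any extra frames fall after the audio ends.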

Video to Video next‑shot generation

Video frames preprocess. Import the previous shot with VHS_LoadVideo (#443) and resize it via ImageResizeKJv2 (#441). GetImageRangeFromBatch (#445) selects a context slice that WanVideoEncode (#440) turns into latents; WanVideoEmptyEmbeds (#442) prepares the conditioning window.
Auto video prompt. CreateVideo (#450) assembles a compact proxy clip from the context frames which AILab_MiniCPM_V_Advanced (#449) analyzes to draft a next‑shot prompt. Inspect or refine the draft in ShowText|pysssss (#447) and embed it with WanVideoTextEncodeCached (#444) before sampling.
Models and sampling. Load the V2V Shot model with WanVideoModelLoader (#436) and WanVideoVAELoader (#438); optional WanVideoBlockSwap (#439) handles VRAM. The WanVideoSampler (#451) generates the continuation, WanVideoDecode (#437) renders frames, and VHS_VideoCombine (#446) outputs the final shot. This SkyReels V3 ComfyUI path is ideal for storyboards and previz where each new cut should respect the last one.
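The context slice in the preprocess step is defined by a start index and a frame count. As an illustration only, the sketch below picks the last frames of the previous shot; choosing the tail is an assumption here, since the workflow only states that a slice is selected. The helper name tail_context_range is hypothetical.

```python
def tail_context_range(total_frames, context_frames):
    """Return (start_index, num_frames) covering the end of a clip.

    Clamps the window so short clips yield every available frame.
    """
    if total_frames <= 0 or context_frames <= 0:
        raise ValueError("frame counts must be positive")
    num_frames = min(context_frames, total_frames)
    return total_frames - num_frames, num_frames


# Example: use the final 25 frames of a 120-frame shot as context.
start_index, num_frames = tail_context_range(total_frames=120, context_frames=25)
```

The two returned values correspond to the kind of start and length settings a range‑selection node such as GetImageRangeFromBatch (#445) expects.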

Key nodes in the SkyReels V3 ComfyUI workflow

WanPhantomSubjectToVideo (#249). Builds an identity‑aware latent from your batched reference images plus text cues, which then drives the sampler. Adjust the number and diversity of references to balance likeness lock versus creative motion; keep the resize nodes feeding it consistent to avoid drift. Reference: the ComfyUI‑WanVideoWrapper repository on GitHub contains implementation notes and expected inputs.
WanVideoImageToVideoEncode (#392). Encodes a still into a stable shot seed and optionally blends CLIP‑Vision guidance for pose and framing. Use it to create anchor frames before the audio‑driven stage so identity and camera setup remain consistent across pipelines. Wrapper docs: ComfyUI‑WanVideoWrapper.
WanVideoImageToVideoSkyreelsv3_audio (#383). Prepares image embeds tailored for the A2V sampler and merges optional reference video frames. Ensure its width and height match the sampler path; pair it with WanVideoSamplerv2 and MultiTalkWav2VecEmbeds for precise lip‑sync.
WanVideoSamplerv2 (#384, #387). The main denoiser for SkyReels V3 that accepts image and text embeds plus scheduler settings. The WanVideoSamplerExtraArgs nodes (#386, #409) are where lip‑sync, loop, or context features are injected; keep these connected when switching between A2V and I2V models. Implementation details: ComfyUI‑WanVideoWrapper.
MultiTalkWav2VecEmbeds (#350). Converts speech into temporally aligned embeddings that drive mouth motion. Matching the intended frame budget and ensuring clean vocals significantly improves phoneme accuracy. Wav2Vec reference model: TencentGameMate/chinese-wav2vec2-base.
AILab_MiniCPM_V_Advanced (#449). Analyzes the previous shot and drafts a structured prompt for character, background, action, mood, and lighting. Use this to keep narrative continuity when using the V2V next‑shot path; the resulting text flows into WanVideoTextEncodeCached. Model family: OpenBMB/MiniCPM-V.

Optional extras

Keep image, video, and sampler resolutions consistent across connected nodes to avoid aspect warps and identity flicker.
For longer extensions, increase window overlap in the V2V extend loop to smooth transitions between segments.
If GPU memory is tight, leave the ReservedVRAMSetter nodes (#312, #448) enabled and run the compile settings blocks before sampling.
When talking avatars drift off‑beat, prioritize clean speech or separate vocals with MelBandRoFormer before creating MultiTalk embeddings.
Final delivery settings such as frame rate, pixel format, and CRF are controlled in the VHS_VideoCombine output nodes; match frame rate to your source for seamless edits.
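One way to keep resolutions consistent is to compute a single target size from the source and enter it in every resize node. The sketch below scales the long side to a target and snaps both dimensions to a fixed multiple. The target of 832 pixels and the multiple of 16 are illustrative assumptions, not values stated by this workflow, and the name snap_resolution is hypothetical.

```python
def snap_resolution(src_w, src_h, target_long_side=832, multiple=16):
    """Scale to the target long side, then snap each dimension to a multiple."""
    scale = target_long_side / max(src_w, src_h)

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(src_w), snap(src_h)


# Example: a 1920x1080 landscape source and its portrait counterpart.
landscape = snap_resolution(1920, 1080)
portrait = snap_resolution(1080, 1920)
```

Snapping changes the aspect ratio slightly (1920x1080 becomes 832x464 here), so reusing the same computed pair everywhere matters more than matching the source ratio exactly.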

This README covers the complete SkyReels V3 ComfyUI graph so you can choose the path that fits your project, combine them when needed, and render consistent, story‑ready video with minimal trial and error.

Acknowledgements

This workflow implements and builds upon the works and resources listed below. We gratefully acknowledge @Benji’s AI Playground and SkyReels for their contributions to and maintenance of the SkyReels V3 ComfyUI workflow. For authoritative details, please refer to the original documentation and repositories.

Resources

SkyReels/V3 ComfyUI Source
- Docs / Release Notes: SkyReels V3 ComfyUI Source from @Benji’s AI Playground

Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
