Wan 2.1 Ditto | Cinematic Video Restyle Generator

Workflow Name: RunComfy/Wan-2-1-Ditto

Workflow ID: 0000...1302

This workflow helps you transform existing or generated videos into new artistic styles while keeping motion stable and structure accurate. You can apply cinematic, painterly, or abstract visual effects directly in your video pipeline. It offers strong temporal coherence for smooth transitions between frames. With intuitive controls, it streamlines your creative process and ensures consistent, high-quality results. Perfect for editors and designers seeking refined, stylized video outputs.

Wan 2.1 Ditto video restyle workflow for ComfyUI

This workflow applies Wan 2.1 Ditto to restyle any input video while preserving scene structure and motion. It is designed for editors and creators who want cinematic, artistic, or experimental looks with strong temporal consistency. You load a clip, describe the target look, and Wan 2.1 Ditto produces a clean stylized render plus an optional side‑by‑side comparison for quick review.

The graph pairs the Wan 2.1 text‑to‑video backbone with Ditto’s style transfer at the model level, so changes happen coherently across frames rather than as frame‑by‑frame filters. Common use cases include anime conversions, pixel art, claymation, watercolor, steampunk, or sim‑to‑real edits. If you already generate content with Wan, this Wan 2.1 Ditto workflow slots directly into your pipeline for dependable, flicker‑free video styling.

Key models in the ComfyUI Wan 2.1 Ditto workflow

Wan2.1‑T2V‑14B text‑to‑video model. Serves as the generative backbone that synthesizes temporally consistent motion given text and visual conditioning.
Wan 2.1 VAE. Encodes and decodes video latents so the sampler can work in a compact space and then reconstruct full‑resolution frames reliably.
mT5‑XXL text encoder. Converts prompts to rich language embeddings that steer scene content and style. For background on mT5, see the paper by Xue et al., "mT5: A Massively Multilingual Pre‑trained Text‑to‑Text Transformer".
Ditto stylization model for Wan 2.1. Provides robust, global restyling with strong temporal coherence. The Ditto approach and model files are documented here: EzioBy/Ditto.
Optional LoRA for Wan 2.1 14B. Adds lightweight style or behavior shifts without retraining the base model, following the LoRA method described in Hu et al., 2021.

How to use the ComfyUI Wan 2.1 Ditto workflow

The workflow runs in four stages: load models, prepare the input video, encode text and visuals, then sample and export. Groups operate in sequence to produce both a stylized render and an optional side‑by‑side comparison.

Models

This group prepares everything Wan 2.1 Ditto needs. The base backbone is loaded with WanVideoModelLoader (#130) and paired with the WanVideoVAELoader (#60) and LoadWanVideoT5TextEncoder (#80). The Ditto component is selected with WanVideoVACEModelSelect (#128), which points the backbone to the dedicated Ditto stylization weights. If you need a stronger transformation, you can attach a LoRA with WanVideoLoraSelect (#122). WanVideoBlockSwap (#68) is available for memory management so larger models can run smoothly on limited VRAM.

Input parameters

Load your source clip with VHS_LoadVideo (#101). The frames are then resized for consistent geometry using LayerUtility: ImageScaleByAspectRatio V2 (#76), which preserves the aspect ratio while targeting a long‑side resolution set by a simple integer input, JWInteger (#89). GetImageSizeAndCount (#65) reads the prepared frames and forwards width, height, and frame count to downstream nodes so Wan 2.1 Ditto samples the correct spatial size and duration. A small prompt helper, CR Text (#104), is included if you prefer to author the prompt in its own field. The group titled “Maximum Variation Limit” reminds you to keep the long‑side pixel target in a practical range for consistent results and stable memory use.
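
The resize step can be sketched in plain Python. This is a minimal illustration of long‑side scaling, not the node's actual implementation: the function name fit_long_side and the rounding of both edges to a multiple of 16 are assumptions made here, not settings taken from the workflow.

```python
from typing import Tuple


def fit_long_side(width: int, height: int, long_side: int,
                  multiple: int = 16) -> Tuple[int, int]:
    """Scale (width, height) so the longer edge is about `long_side`.

    The aspect ratio is preserved up to rounding. Both edges are snapped
    to the nearest `multiple`, which is an assumption of this sketch.
    """
    if width <= 0 or height <= 0 or long_side <= 0:
        raise ValueError("width, height, and long_side must be positive")
    scale = long_side / max(width, height)

    def snap(value: float) -> int:
        # Round to the nearest multiple, never going below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width * scale), snap(height * scale)


# A 1920x1080 clip previewed at a long side of 832 pixels.
preview_size = fit_long_side(1920, 1080, 832)
print(preview_size)
```

Running the same function with a larger long_side gives the dimensions for a final render at the same aspect ratio.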

Sampling

Conditioning happens in two parallel lanes. WanVideoTextEncode (#111) turns your prompt into text embeddings that define the intent and style. WanVideoVACEEncode (#126) encodes the prepared video into visual embeddings that preserve structure and motion for editing. An optional guidance module WanVideoSLG (#129) controls how the model balances style and content through the denoising trajectory. WanVideoSampler (#119) then fuses the Wan 2.1 backbone with Ditto, the text embeddings, and the visual embeddings to generate stylized latents. Finally, WanVideoDecode (#87) reconstructs frames from latents to produce the stylized sequence with the temporal consistency that Wan 2.1 Ditto is known for.

Outputs and comparisons

The primary export uses VHS_VideoCombine (#95) to save the Wan 2.1 Ditto render at your selected frame rate. For quick review, the graph joins original and stylized frames using ImageConcatMulti (#94), sizes the comparison with ImageScaleToTotalPixels (#133), and writes a side‑by‑side movie via VHS_VideoCombine (#100). You will typically get two videos in the output folder: a clean stylized render and a comparison clip that helps stakeholders approve or iterate faster.
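
The sizing of the comparison clip comes down to simple arithmetic, sketched below. This assumes the original and stylized frames share the same size and are joined horizontally, and that the canvas is then rescaled to a target pixel budget. The helper name side_by_side_size and the rounding to even dimensions are assumptions of this sketch rather than documented node behavior.

```python
import math
from typing import Tuple


def side_by_side_size(width: int, height: int,
                      total_pixels: int) -> Tuple[int, int]:
    """Return the size of an original|stylized canvas scaled to ~total_pixels."""
    canvas_w, canvas_h = 2 * width, height  # two frames joined left to right
    # Uniform scale factor that brings the canvas area to the pixel budget.
    scale = math.sqrt(total_pixels / (canvas_w * canvas_h))

    def even(value: float) -> int:
        # Many video encoders prefer even dimensions (assumption of this sketch).
        return max(2, int(round(value * scale / 2)) * 2)

    return even(canvas_w), even(canvas_h)


# Two 832x464 frames side by side, scaled to roughly one megapixel.
comparison_size = side_by_side_size(832, 464, 1_000_000)
print(comparison_size)
```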

Prompt ideas

You can begin with short, clear prompts and iterate. Examples that work well with Wan 2.1 Ditto:

Make it a Japanese anime style, cel shading video.
Make it a Pixel Art video.
Make it a pencil sketch style video.
Make it a Claymation video.
Make it a watercolor drawing style video.
Make it Steampunk style with gears, pipes and brass details.
Make it Cyberpunk style with neon and futuristic implants.
Make it a Ukiyo‑e style video.
Make it a Renaissance art style video.
Make it a drawing by Van Gogh.
Turn it into the LEGO style.
Turn it into the Ghibli style.
Turn it into the 3D Chibi style.
Turn it into the Paper Cutting style.
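
To try several of these prompts in one session, a small script can prepare one job per prompt. The sketch below assumes the workflow has been exported in ComfyUI's API format, where each node ID maps to a class_type and an inputs dictionary. The stand‑in workflow, the field name positive_prompt, and the assumption that the prompt is a literal value rather than a link from another node are all illustrative; check them against your own export.

```python
import copy
import json

STYLE_PROMPTS = [
    "Make it a Japanese anime style, cel shading video.",
    "Make it a Pixel Art video.",
    "Make it a watercolor drawing style video.",
]


def build_payloads(workflow: dict, node_id: str, field: str, prompts: list) -> list:
    """Return one request body per prompt, leaving `workflow` untouched."""
    payloads = []
    for prompt in prompts:
        variant = copy.deepcopy(workflow)  # avoid mutating the shared template
        variant[node_id]["inputs"][field] = prompt
        payloads.append({"prompt": variant})
    return payloads


# Stand-in for an API-format export; a real export contains many more nodes.
workflow = {
    "111": {
        "class_type": "WanVideoTextEncode",
        "inputs": {"positive_prompt": "", "negative_prompt": ""},
    },
}

payloads = build_payloads(workflow, "111", "positive_prompt", STYLE_PROMPTS)
print(json.dumps(payloads[0], indent=2))
```

Each body could then be sent to a running ComfyUI server's /prompt endpoint as JSON. The script stops short of sending anything so that it runs without a server.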

Key nodes in the ComfyUI Wan 2.1 Ditto workflow

WanVideoVACEModelSelect (#128)
Choose which Ditto weights to use for stylization. The default global Ditto model is a balanced choice for most footage. If your goal is anime‑to‑real conversion, select the sim‑to‑real Ditto variant referenced in the node note. Switching Ditto variants changes the character of the restyle without touching other settings.

WanVideoVACEEncode (#126)
Builds the visual conditioning from your input frames. The key controls are width, height, and num_frames, which should match the prepared video for best results. Use strength to adjust how assertively Ditto’s style influences the edit, and vace_start_percent and vace_end_percent to limit when conditioning applies across the diffusion trajectory. Enable tiled_vae on very large resolutions to reduce memory pressure.
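
One way to picture vace_start_percent and vace_end_percent is as a window over the sampling steps. The sketch below assumes that progress at step i is i / steps and that conditioning applies while progress lies inside the window, bounds included. That mapping is an assumption made for illustration and may differ from how the node computes it.

```python
from typing import List


def active_steps(steps: int, start_percent: float, end_percent: float) -> List[int]:
    """Return the step indices whose progress falls inside the window."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    if not 0.0 <= start_percent <= end_percent <= 1.0:
        raise ValueError("need 0 <= start_percent <= end_percent <= 1")
    # Progress of step i is i / steps (assumed mapping, see the note above).
    return [i for i in range(steps) if start_percent <= i / steps <= end_percent]


# With 20 steps, conditioning limited to the first half of the trajectory.
first_half = active_steps(20, 0.0, 0.5)
print(first_half)
```

Under this assumption, a narrower window leaves the later steps free of visual conditioning, which gives the prompt more room to shape fine detail.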

WanVideoTextEncode (#111)
Encodes positive and negative prompts via the mT5‑XXL encoder to guide style and content. Keep positive prompts concise and descriptive, and use negatives to suppress artifacts such as flicker or over‑saturation. The force_offload and device options let you trade speed for memory if you are running large models.

WanVideoSampler (#119)
Runs the Wan 2.1 backbone with Ditto stylization to generate the final latents. The most impactful settings are steps, cfg, scheduler, and seed. Use denoise_strength when you want to preserve more of the original structure, and keep slg_args connected to balance content fidelity against style strength. Increasing steps may improve detail at the cost of render time, while a higher cfg pushes the result closer to the prompt.

ImageScaleByAspectRatio V2 (#76)
Sets a stable target size for all frames before conditioning. Drive the long‑side target with the standalone integer so you can test small, fast previews and then increase resolution for final renders. Keep the scale consistent between iterations to make A/B comparisons meaningful.

VHS_LoadVideo (#101) and VHS_VideoCombine (#95, #100)
These nodes handle decoding and encoding. Match frame rates to the source when you care about timing. The comparison writer is useful during exploration and can be disabled for final exports if you only want the stylized result.
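
The effect of a frame‑rate mismatch is easy to check with a little arithmetic. The sketch below assumes the frame count is unchanged between load and export; the function names and the 81‑frame, 16 fps example are illustrative values, not settings taken from the workflow.

```python
def duration_seconds(frame_count: int, fps: float) -> float:
    """Playback length of `frame_count` frames at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count / fps


def playback_speed(source_fps: float, output_fps: float) -> float:
    """Speed factor relative to the source when the same frames are written at output_fps."""
    if source_fps <= 0 or output_fps <= 0:
        raise ValueError("frame rates must be positive")
    return output_fps / source_fps


# The same 81 frames written at the source rate and at a faster rate.
matched = duration_seconds(81, 16)
mismatched = duration_seconds(81, 24)
speed = playback_speed(16, 24)
print(matched, mismatched, speed)
```

A speed factor other than 1.0 means the export will play faster or slower than the source, which matters when the clip has to stay in sync with audio.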

Optional extras

For anime‑to‑real edits, pick the sim‑to‑real Ditto variant in WanVideoVACEModelSelect before sampling.
Start with short prompts like “Make it watercolor drawing style” and refine with 1 or 2 descriptors. Long lists tend to dilute style strength.
Use negative prompts to reduce flicker, compression artifacts, and over‑bright highlights when pushing strong looks.
Keep your long‑side resolution consistent across iterations. A fixed seed reproduces a result only when the resolution and other settings stay the same.
When VRAM is tight, enable model offloading and tiling options, or preview at a smaller long‑side value before rendering at full size.

This Wan 2.1 Ditto workflow makes high‑quality video restyling predictable and fast, with clean prompts, coherent motion, and outputs ready for immediate review or delivery.

Acknowledgements

This workflow implements and builds upon the following works and resources. We gratefully acknowledge EzioBy for the Wan 2.1 Ditto source and for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories linked below.

Resources

EzioBy/Wan 2.1 Ditto Source
- GitHub: EzioBy/Ditto

Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
