Wan 2.6: High-Fidelity Video-to-Video Lip-Sync & Motion Transfer on Playground and API | RunComfy

wan-ai/wan-2-6/video-to-video

Wan 2.6 transforms reference videos and prompts into 1080p, 24fps clips with realistic lip-sync, multi-shot storytelling, and precise motion transfer for easy, production-ready video creation.

Prompt: Length should be less than 1500 characters. Refer to the reference videos, in order, as characterX. For example, with two reference videos the prompt should read: character1 singing on the side of the road and character2 dancing next to him. Even with a single reference, still write character1 (including the 1).
Reference Video 1: Video format must be mp4 or mov, duration between 2s and 30s, and file size less than 100 MB. When only this input is used, up to 5s of the reference is read.
Reference Video 2: Video format must be mp4 or mov, duration between 2s and 30s, and file size less than 100 MB. When this input is used together with Reference Video 1, up to 2.5s of each reference is read.
Reference Video 3: Video format must be mp4 or mov, duration between 2s and 30s, and file size less than 100 MB. When this input is used together with Reference Video 1 and Reference Video 2, up to 1.66s of each reference is read.
[CRAZY LOW PRICE]: $0.05 per second for 720p+, $0.08 per second for 1080p+

Introduction to Wan 2.6 Video

Wan AI's Wan 2.6 video-to-video converts reference videos and prompts into 1080p, 24fps clips up to 15s, with selectable 16:9, 9:16, or 1:1 aspect ratios and precise, native audio-visual lip-sync. In place of manual frame editing, shot-by-shot storyboarding, and separate dubbing, it offers multi-shot auto-story expansion and reference-accurate voice and motion transfer, streamlining production by eliminating complex masking and re-timing. It is built for marketing agencies, e-commerce teams, filmmakers, educators, and corporate communications. For developers, Wan 2.6 on RunComfy can be used both in the browser and via an HTTP API, so you don't need to host or scale the model yourself.
Ideal for: Multi-Shot Narrative Prototyping | Brand-True Product Videos with Lip-Sync | Reference-Accurate Character and Motion Transfer



Model Overview


  • Provider: Wan AI
  • Task: video-to-video (Multi-Reference)
  • Max Resolution/Duration: Up to 1920x1080; 5s or 10s clips
  • Summary: Wan 2.6 Video to Video is a specialized "Multi-Subject Identity Cloning" engine. Unlike standard style transfer tools, this model allows you to upload up to 3 distinct reference videos (e.g., a person, a pet, and an object) and direct them to perform entirely new actions in a generated scene. By using the specific keyword syntax (character1, character2, etc.), you can choreograph complex interactions between these reference subjects while locking their visual appearance and audio characteristics.

Key Capabilities


1. Multi-Subject Identity Locking

  • Up to 3 References: You can upload 1 to 3 reference videos (reference_video_url_1 to 3). The model extracts the identity from each and maps them to specific keywords in your prompt.
  • Mapping Logic:

- Reference Video 1 = character1

- Reference Video 2 = character2

- Reference Video 3 = character3

  • Interactions: You can write a script where character1 (e.g., a singer) interacts with character2 (e.g., a dancer) in a completely new environment.

2. Multi-Shot Narrative Control

  • Shot Type Control: Includes a shot_type parameter. Set it to multi to enable dynamic camera cuts (e.g., "Shot 1", "Shot 2") within a single video, or single for a continuous take.
  • Timeline Scripting: Supports accurate time-based prompting (e.g., "[0-4s] character1 walks, [4-7s] close up of character1") to direct the acting performance.

3. Audio & Visual Fidelity

  • 360° Understanding: The model works best when reference videos show the subject from multiple angles (close-ups, rotations). It can also draw on the reference video's audio to preserve the subject's voice character and on its visuals to keep color and lighting consistent.

Master the Prompting Syntax


CRITICAL RULE: You MUST refer to your subjects as character1, character2, or character3 in the prompt. Do not use generic names like "the man" or "the dog" if you want to lock the identity.


The Formula: [Overall Description] + [Shot #] [Timestamp] [Action involving characterX]


Example Prompts


Scenario: Single Subject (1 Reference Video)

> "Overall Description: A cinematic vintage movie scene, golden hour.

> Shot 1 [0-5s]: character1 is walking down a busy street, wearing a trench coat, looking around nervously.

> Shot 2 [5-10s]: Hard cut. Close up on character1's face as they spot something in the distance."


Scenario: Two Subjects Interaction (2 Reference Videos)

> "Overall Description: A sunny park scene.

> Shot 1 [0-4s]: character1 (the dog) is running through the grass chasing a ball.

> Shot 2 [4-10s]: character2 (the owner) laughs and claps their hands, calling out to character1."


Input Parameters


Core Configuration


Parameter | Type | Default | Description
prompt | string | Required | The script. Must use character1, character2 syntax. Includes shot breakdown.
shot_type | choice | multi | multi: enables dynamic cuts (Shot 1, Shot 2). single: one continuous camera take.
duration | choice | 5 | Length of the generated video (5 or 10 seconds).
size | choice | 1280*720 | Output dimensions. Includes 16:9 (1280x720, 1920x1080), 9:16 (720x1280, 1080x1920), and 1:1 (960x960, 1440x1440) options.

Reference Inputs


Parameter | Type | Description
reference_video_url_1 | video | Required. Source for character1. Best if 2s-30s long, showing multiple angles. (<30MB)
reference_video_url_2 | video | Optional. Source for character2. (<30MB)
reference_video_url_3 | video | Optional. Source for character3. (<30MB)

Note on References: Using more reference videos reduces the "effective context window" for each. With 1 video, the model reads up to 5s of context. With 3 videos, it reads about 1.66s from each.
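
For reference, here is what a complete request body might look like, assembled only from the parameters in the tables above. The field names come from this page; the exact value formats, the placeholder URLs, and any additional required fields are assumptions, so verify them against the API documentation.

```python
# Illustrative payload sketch built from the documented parameters; the URLs
# are placeholders and the exact wire format may differ from this example.
payload = {
    "prompt": (
        "Overall Description: A sunny park scene. "
        "Shot 1 [0-4s]: character1 runs through the grass chasing a ball. "
        "Shot 2 [4-10s]: character2 laughs and calls out to character1."
    ),
    "shot_type": "multi",        # "multi" for camera cuts, "single" for one take
    "duration": 10,              # 5 or 10 seconds
    "size": "1920*1080",         # any of the documented 16:9, 9:16, or 1:1 sizes
    "reference_video_url_1": "https://example.com/dog.mp4",    # becomes character1
    "reference_video_url_2": "https://example.com/owner.mp4",  # becomes character2
}
```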


How to use Wan 2.6 Video to Video

  1. Upload References: Upload the video of your main subject to Reference Video 1. If you have a second subject, upload to Reference Video 2.
  2. Select Shot Type: Choose multi if you want to describe multiple camera angles/cuts, or single for a steady shot.
  3. Select Resolution: Choose a size (e.g., 1920x1080 for landscape or 1080x1920 for mobile) that fits your target platform.
  4. Write Script: Describe the scene using the characterX keywords. Define the timeline (e.g., [0-5s]), making sure it matches your selected Duration; see the sketch below.
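
If you script this workflow, the timeline in step 4 has to add up to the duration you selected. The helper below is a minimal sketch of one way to compose such a prompt; the characterX convention and the 5s/10s durations come from this page, while the function itself is purely illustrative.

```python
# Minimal sketch: compose a multi-shot prompt whose shot timestamps sum to the
# selected duration. Only the characterX naming and the 5/10-second durations
# are documented behavior; this helper is just one way to build the string.
def build_prompt(overall: str, shots: list[str], duration: int) -> str:
    if duration not in (5, 10):
        raise ValueError("duration must be 5 or 10 seconds")
    span = duration / len(shots)  # split the clip evenly across shots
    parts = [f"Overall Description: {overall}"]
    for i, action in enumerate(shots):
        start, end = round(i * span), round((i + 1) * span)
        parts.append(f"Shot {i + 1} [{start}-{end}s]: {action}")
    return " ".join(parts)

prompt = build_prompt(
    "A cinematic vintage movie scene, golden hour.",
    [
        "character1 walks down a busy street in a trench coat, looking around nervously.",
        "Hard cut. Close up on character1's face as they spot something in the distance.",
    ],
    duration=10,
)
```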

How Wan 2.6 compares to other models


  • Vs Wan 2.5 Generation: Compared to Wan 2.5, Wan 2.6 delivers native audio-visual synchronization with precise lip-sync, stronger reference video adherence, more stable motion, and longer coherent multi-shot clips. Ideal Use Case: Choose Wan 2.6 when speech alignment and multi-scene continuity are critical.
  • Vs Wan 2.2 (open-source family): Compared to Wan 2.2, Wan 2.6 delivers higher resolution (1080p vs up to 720p), built-in audio sync/lip-sync, and improved temporal consistency for production-ready results. Ideal Use Case: Use Wan 2.6 for commercial projects needing polished audio-visual outputs without additional tooling.
  • Vs Seedance 1.0 Pro: Compared to Seedance 1.0 Pro, Wan 2.6 delivers native audio-video alignment in a single pass, reducing reliance on external audio editing workflows. Ideal Use Case: Select Wan 2.6 when you need immediate lip-synced dialogue or tightly timed visuals with music.
  • Vs Kling Video 2.6: Compared to Kling 2.6, Wan 2.6 delivers stronger reference-guided generation, narrative continuity, and multiple aspect ratio workflows, while matching on 1080p output and native A/V sync. Ideal Use Case: Pick Wan 2.6 for reference-driven storytelling and consistent brand visuals across formats.

API Integration


Developers can integrate Wan 2.6 via the RunComfy API using standard HTTP requests and JSON payloads. The endpoint accepts prompt text and optional media references, returning 1080p, 24fps MP4/MOV/WebM outputs suitable for production pipelines.
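
As a rough sketch, a request could look like the following. Only the parameter names are taken from this page; the endpoint URL, authentication header, and response shape are placeholders, so check the RunComfy API documentation for the actual values.

```python
import requests

# Placeholder endpoint and auth header for illustration only; the real values
# are documented in the RunComfy API reference. Parameter names match the
# Input Parameters tables above.
API_URL = "https://api.runcomfy.net/v1/wan-ai/wan-2-6/video-to-video"  # placeholder
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}                      # placeholder

payload = {
    "prompt": "Shot 1 [0-5s]: character1 waves at the camera and smiles.",
    "shot_type": "single",
    "duration": 5,
    "size": "1280*720",
    "reference_video_url_1": "https://example.com/reference.mp4",
}

resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # typically a job id or, once finished, the output video URL
```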



Frequently Asked Questions

What are the technical limitations of Wan 2.6 Video to Video regarding resolution, duration, and aspect ratios?

Wan 2.6 Video to Video currently outputs up to 1080p resolution at 24fps, supporting multiple aspect ratios including 16:9, 9:16, and 1:1. For video-to-video workflows, longer durations are achieved through multi-shot narrative chaining, and the system is optimized for short-to-moderate clips typically ranging from 5 to 15 seconds per shot.

How many reference inputs can I use with Wan 2.6 Video to Video when working with ControlNet or IP-Adapter?

Wan 2.6 Video to Video accepts up to three reference videos per generation request (reference_video_url_1 through reference_video_url_3), which the model maps to character1, character2, and character3 in the prompt. Reference conditioning happens through these video inputs rather than separate ControlNet or IP-Adapter channels, and the input count is limited to maintain model stability and performance across the video-to-video pipeline.

How do I transition from testing Wan 2.6 Video to Video in the RunComfy Playground to using the API in production?

To move from trial to production, prototype your Wan 2.6 Video to Video prompts in the RunComfy Playground, then use the same parameters within the RunComfy API by retrieving your API access key from your dashboard. The video-to-video endpoints mirror playground behavior, ensuring that production results match your web UI tests. You can find integration examples and rate-limit information in the API documentation.

How does Wan 2.6 Video to Video improve upon Wan 2.5 in terms of generation quality?

Compared to Wan 2.5, Wan 2.6 Video to Video offers enhanced motion stability, stronger character consistency across shots, better lip-sync accuracy, and native audio generation. The video-to-video pipeline is also more stable with improved temporal coherence and finer detail handling, resulting in smoother and more realistic visual output.

What makes Wan 2.6 Video to Video stand out from competitors like Kling Video 2.6 or Seedance 1.0 Pro?

Wan 2.6 Video to Video differentiates itself with integrated audio-visual sync, precise lip-sync, and reference video-guided motion transfer in video-to-video mode. Unlike Seedance, which may require separate audio production, Wan 2.6 produces synchronized audio in a single pass, making it a more integrated tool for production environments.

Does Wan 2.6 Video to Video support multilingual text prompts and audio generation?

Yes, Wan 2.6 Video to Video supports multilingual inputs and outputs, automatically generating lip-synced speech and localized audio content across supported languages. This multilingual capability extends through text-to-video and video-to-video modes, enabling globalized storytelling from a single interface.

Can Wan 2.6 Video to Video generate audio automatically in the video-to-video process?

Wan 2.6 Video to Video includes native audio generation for voiceovers, background music, and sound effects aligned with the visuals. The model’s multimodal architecture ensures precise audio-video synchronization without manual sound design intervention during the video-to-video conversion.

What kind of content is Wan 2.6 Video to Video best suited for?

Wan 2.6 Video to Video excels in short-form visual storytelling such as ads, social media clips, explainers, product spots, and training videos. Its video-to-video mode allows creators to build coherent multi-shot narratives with consistent characters and motion across scenes, ideal for marketing and e-learning use cases.

Is there a commercial license included with Wan 2.6 Video to Video?

Wan 2.6 Video to Video provides users with commercial-use rights for generated outputs via RunComfy, but you should review Wan AI’s official licensing terms on wan2-6.com for any jurisdictional or use-specific restrictions before deploying video-to-video productions commercially.

What hardware or execution environment is required to run Wan 2.6 Video to Video efficiently?

Wan 2.6 Video to Video runs entirely in the RunComfy cloud environment, so no local GPU is required. For developers using the video-to-video API, latency and throughput depend mainly on the selected resolution and duration, with 1080p and 10s clips taking longer to generate than 720p and 5s clips.
