Wan 2.6: High-Fidelity Video-to-Video Lip-Sync & Motion Transfer on Playground and API | RunComfy
Wan 2.6 transforms reference videos and prompts into 1080p, 24fps clips with realistic lip-sync, multi-shot storytelling, and precise motion transfer for easy, production-ready video creation.
Introduction to Wan 2.6 Video
Wan AI's Wan 2.6 video to video converts reference videos and prompts into 1080p, 24fps clips up to 15s, with selectable 16:9, 9:16, or 1:1 aspect ratios and precise, native audio-visual lip-sync. It replaces manual frame editing, shot-by-shot storyboarding, and separate dubbing with multi-shot auto-story expansion and reference-accurate voice and motion transfer, eliminating complex masking and re-timing. It is built for marketing agencies, e-commerce teams, filmmakers, educators, and corporate communications. For developers, Wan 2.6 on RunComfy can be used both in the browser and via an HTTP API, so you don't need to host or scale the model yourself.
Ideal for: Multishot Narrative Prototyping | Brand-True Product Videos with Lip-Sync | Reference-Accurate Character and Motion Transfer
Examples of Wan 2.6
Wan 2.6 Video to Video on X: Content Drops And Insights
Model Overview
- Provider: Wan AI
- Task: video-to-video (Multi-Reference)
- Max Resolution/Duration: Up to 1920x1080; 5s or 10s clips
- Summary: Wan 2.6 Video to Video is a specialized "Multi-Subject Identity Cloning" engine. Unlike standard style transfer tools, this model allows you to upload up to 3 distinct reference videos (e.g., a person, a pet, and an object) and direct them to perform entirely new actions in a generated scene. By using the specific keyword syntax (character1, character2, etc.), you can choreograph complex interactions between these reference subjects while locking their visual appearance and audio characteristics.
Key Capabilities
1. Multi-Subject Identity Locking
- Up to 3 References: You can upload 1 to 3 reference videos (reference_video_url_1 through reference_video_url_3). The model extracts the identity from each and maps it to a specific keyword in your prompt.
- Mapping Logic:
- Reference Video 1 = character1
- Reference Video 2 = character2
- Reference Video 3 = character3
- Interactions: You can write a script where character1 (e.g., a singer) interacts with character2 (e.g., a dancer) in a completely new environment.
2. Multi-Shot Narrative Control
- Shot Type Control: Includes a shot_type parameter. Set it to multi to enable dynamic camera cuts (e.g., "Shot 1", "Shot 2") within a single video, or single for a continuous take.
- Timeline Scripting: Supports accurate time-based prompting (e.g., "[0-4s] character1 walks, [4-7s] close up of character1") to direct the acting performance.
3. Audio & Visual Fidelity
- 360° Understanding: The model works best when reference videos show the subject from multiple angles (close-ups, rotations). It can also draw on the reference video's audio to preserve the subject's voice characteristics, and on its visuals to keep color and lighting consistent.
Master the Prompting Syntax
CRITICAL RULE: You MUST refer to your subjects as character1, character2, or character3 in the prompt. Do not use generic names like "the man" or "the dog" if you want to lock the identity.
The Formula: [Overall Description] + [Shot #] [Timestamp] [Action involving characterX]
Example Prompts
Scenario: Single Subject (1 Reference Video)
> "Overall Description: A cinematic vintage movie scene, golden hour.
> Shot 1 [0-5s]: character1 is walking down a busy street, wearing a trench coat, looking around nervously.
> Shot 2 [5-10s]: Hard cut. Close up on character1's face as they spot something in the distance."
Scenario: Two Subjects Interaction (2 Reference Videos)
> "Overall Description: A sunny park scene.
> Shot 1 [0-4s]: character1 (the dog) is running through the grass chasing a ball.
> Shot 2 [4-10s]: character2 (the owner) laughs and claps their hands, calling out to character1."
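The formula maps naturally onto a small string-building step. The sketch below is purely illustrative (build_prompt is a hypothetical helper, not part of any RunComfy SDK); it just assembles the [Overall Description] + [Shot #] [Timestamp] [Action] structure so the prompt stays consistent across shots.

```python
# Illustrative helper (hypothetical, not part of the RunComfy SDK):
# compose a Wan 2.6 prompt following the formula described above.
def build_prompt(overall: str, shots: list[tuple[str, str]]) -> str:
    """shots is a list of (timestamp, action) pairs, e.g. ("0-5s", "character1 walks ...")."""
    lines = [f"Overall Description: {overall}"]
    for i, (timestamp, action) in enumerate(shots, start=1):
        lines.append(f"Shot {i} [{timestamp}]: {action}")
    return "\n".join(lines)

prompt = build_prompt(
    "A sunny park scene.",
    [
        ("0-4s", "character1 (the dog) is running through the grass chasing a ball."),
        ("4-10s", "character2 (the owner) laughs and claps their hands, calling out to character1."),
    ],
)
print(prompt)
```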
Input Parameters
Core Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | Required | The script. Must use character1, character2 syntax. Includes shot breakdown. |
| shot_type | choice | multi | multi: Enables dynamic cuts (Shot 1, Shot 2). single: One continuous camera take. |
| duration | choice | 5 | Length of the generated video (5 or 10 seconds). |
| size | choice | 1280*720 | Output dimensions. Includes 16:9 (1280x720, 1920x1080), 9:16 (720x1280, 1080x1920), and 1:1 (960x960, 1440x1440) options. |
Reference Inputs
| Parameter | Type | Description |
|---|---|---|
| reference_video_url_1 | video | Required. Source for character1. Best if 2s-30s long, showing multiple angles. (<30MB) |
| reference_video_url_2 | video | Optional. Source for character2. (<30MB) |
| reference_video_url_3 | video | Optional. Source for character3. (<30MB) |
Note on References: Using more reference videos reduces the "effective context window" for each. With 1 video, the model reads up to 5s of context. With 3 videos, it reads about 1.66s from each.
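Putting the two tables together, a request body might look like the sketch below. The field names come from the tables above, but the exact JSON envelope the RunComfy endpoint expects may differ, so treat this as illustrative rather than a definitive payload.

```python
# Illustrative request payload built from the documented parameters.
# The exact envelope expected by the RunComfy endpoint may differ.
payload = {
    "prompt": (
        "Overall Description: A cinematic vintage movie scene, golden hour.\n"
        "Shot 1 [0-5s]: character1 is walking down a busy street, wearing a trench coat.\n"
        "Shot 2 [5-10s]: Hard cut. Close up on character1's face."
    ),
    "shot_type": "multi",          # "multi" for dynamic cuts, "single" for one continuous take
    "duration": 10,                # 5 or 10 seconds
    "size": "1920*1080",           # 16:9, 9:16, or 1:1 presets
    "reference_video_url_1": "https://example.com/my-subject.mp4",  # required, <30MB, maps to character1
    # "reference_video_url_2": "https://example.com/second-subject.mp4",  # optional, maps to character2
    # "reference_video_url_3": "https://example.com/third-subject.mp4",   # optional, maps to character3
}
```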
How to use Wan 2.6 Video to Video
- Upload References: Upload the video of your main subject to Reference Video 1. If you have a second subject, upload it to Reference Video 2.
- Select Shot Type: Choose multi if you want to describe multiple camera angles/cuts, or single for a steady shot.
- Select Resolution: Choose a size (e.g., 1920x1080 for landscape or 1080x1920 for mobile) that fits your target platform.
- Write Script: Describe the scene using the characterX keywords. Define the timeline (e.g., [0-5s]), ensuring it matches your selected Duration (a small pre-flight check for this is sketched below).
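Because the timeline written into the prompt has to stay within the selected Duration, a small pre-flight check can catch mismatches before you submit. The helper below (check_timeline) is hypothetical and only parses the [start-ends] tags used in the examples above.

```python
import re

# Hypothetical pre-flight check (not a RunComfy API call): verify that the
# timestamps written into the prompt do not run past the selected duration.
def check_timeline(prompt: str, duration: int) -> list[str]:
    problems = []
    for start, end in re.findall(r"\[(\d+)-(\d+)s\]", prompt):
        if int(end) > duration:
            problems.append(f"[{start}-{end}s] runs past the {duration}s duration")
    return problems

issues = check_timeline(
    "Shot 1 [0-5s]: character1 walks.\nShot 2 [5-12s]: close up of character1.",
    duration=10,
)
print(issues)  # -> ['[5-12s] runs past the 10s duration']
```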
How Wan 2.6 compares to other models
- Vs Wan 2.5 Generation: Compared to Wan 2.5, Wan 2.6 delivers native audio-visual synchronization with precise lip-sync, stronger reference video adherence, more stable motion, and longer coherent multi-shot clips. Ideal Use Case: Choose Wan 2.6 when speech alignment and multi-scene continuity are critical.
- Vs Wan 2.2 (open-source family): Compared to Wan 2.2, Wan 2.6 delivers higher resolution (1080p vs up to 720p), built-in audio sync/lip-sync, and improved temporal consistency for production-ready results. Ideal Use Case: Use Wan 2.6 for commercial projects needing polished audio-visual outputs without additional tooling.
- Vs Seedance 1.0 Pro: Compared to Seedance 1.0 Pro, Wan 2.6 delivers native audio-video alignment in a single pass, reducing reliance on external audio editing workflows. Ideal Use Case: Select Wan 2.6 when you need immediate lip-synced dialogue or tightly timed visuals with music.
- Vs Kling Video 2.6: Compared to Kling 2.6, Wan 2.6 delivers stronger reference video generation, narrative continuity, and multiple aspect ratio workflows, while matching on 1080p output and native A/V sync. Ideal Use Case: Pick Wan 2.6 for reference-driven storytelling and consistent brand visuals across formats.
API Integration
Developers can integrate Wan 2.6 via the RunComfy API using standard HTTP requests and JSON payloads. The endpoint accepts prompt text and optional media references, returning 1080p, 24fps MP4/MOV/WebM outputs suitable for production pipelines.
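A minimal sketch of such an integration is shown below, assuming a bearer-token header, a JSON task-creation endpoint, and a polling loop. The URL and the response field names (id, status, output_url) are placeholders, so consult the RunComfy API documentation for the actual endpoint and schema.

```python
import os
import time
import requests

# Minimal sketch: submit a Wan 2.6 video-to-video task and poll for the result.
# The endpoint URL and response fields below are placeholders, not the real schema.
API_KEY = os.environ["RUNCOMFY_API_KEY"]
ENDPOINT = "https://api.runcomfy.example/v1/wan-2-6/video-to-video"  # placeholder URL

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "prompt": "Shot 1 [0-5s]: character1 walks through a neon-lit alley.",
        "shot_type": "single",
        "duration": 5,
        "size": "1280*720",
        "reference_video_url_1": "https://example.com/reference.mp4",
    },
    timeout=60,
)
response.raise_for_status()
task = response.json()

# Poll until the task reports a downloadable result (field names are assumed).
while task.get("status") not in {"succeeded", "failed"}:
    time.sleep(10)
    task = requests.get(
        f"{ENDPOINT}/{task['id']}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=60,
    ).json()

print(task.get("output_url"))
```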
Related Tools
- If you don't have a reference video and want to generate from text only: Wan 2.6 Text-to-Video
- If you only have a static image of the character: Wan 2.6 Image-to-Video
Frequently Asked Questions
What are the technical limitations of Wan 2.6 Video to Video regarding resolution, duration, and aspect ratios?
Wan 2.6 Video to Video currently outputs up to 1080p resolution at 24fps, supporting multiple aspect ratios including 16:9, 9:16, and 1:1. For video-to-video workflows, longer durations are achieved through multi-shot narrative chaining, and the system is optimized for short-to-moderate clips typically ranging from 5 to 15 seconds per shot.
How many reference inputs can I use with Wan 2.6 Video to Video when working with ControlNet or IP-Adapter?
In Wan 2.6 Video to Video, you can include up to three reference videos per generation request (reference_video_url_1 through reference_video_url_3), each mapped to a characterX keyword in the prompt. ControlNet-style conditioning and IP-Adapter guidance are not among the documented parameters for this endpoint; reference inputs are limited to keep the model stable and performant across the video-to-video pipeline.
How do I transition from testing Wan 2.6 Video to Video in the RunComfy Playground to using the API in production?
To move from trial to production, prototype your Wan 2.6 Video to Video prompts in the RunComfy Playground, then use the same parameters within the RunComfy API by retrieving your API access key from your dashboard. The video-to-video endpoints mirror playground behavior, ensuring that production results match your web UI tests. You can find integration examples and rate-limit information in the API documentation.
How does Wan 2.6 Video to Video improve upon Wan 2.5 in terms of generation quality?
Compared to Wan 2.5, Wan 2.6 Video to Video offers enhanced motion stability, stronger character consistency across shots, better lip-sync accuracy, and native audio generation. The video-to-video pipeline is also more stable with improved temporal coherence and finer detail handling, resulting in smoother and more realistic visual output.
What makes Wan 2.6 Video to Video stand out from competitors like Kling Video 2.6 or Seedance 1.0 Pro?
Wan 2.6 Video to Video differentiates itself with integrated audio-visual sync, precise lip-sync, and reference video-guided motion transfer in video-to-video mode. Unlike Seedance, which may require separate audio production, Wan 2.6 produces synchronized audio in a single pass, making it a more integrated tool for production environments.
Does Wan 2.6 Video to Video support multilingual text prompts and audio generation?
Yes, Wan 2.6 Video to Video supports multilingual inputs and outputs, automatically generating lip-synced speech and localized audio content across supported languages. This multilingual capability extends through text-to-video and video-to-video modes, enabling globalized storytelling from a single interface.
Can Wan 2.6 Video to Video generate audio automatically in the video-to-video process?
Wan 2.6 Video to Video includes native audio generation for voiceovers, background music, and sound effects aligned with the visuals. The model’s multimodal architecture ensures precise audio-video synchronization without manual sound design intervention during the video-to-video conversion.
What kind of content is Wan 2.6 Video to Video best suited for?
Wan 2.6 Video to Video excels in short-form visual storytelling such as ads, social media clips, explainers, product spots, and training videos. Its video-to-video mode allows creators to build coherent multi-shot narratives with consistent characters and motion across scenes, ideal for marketing and e-learning use cases.
Is there a commercial license included with Wan 2.6 Video to Video?
Wan 2.6 Video to Video provides users with commercial-use rights for generated outputs via RunComfy, but you should review Wan AI’s official licensing terms on wan2-6.com for any jurisdictional or use-specific restrictions before deploying video-to-video productions commercially.
What hardware or execution environment is required to run Wan 2.6 Video to Video efficiently?
Wan 2.6 Video to Video runs entirely in the RunComfy cloud environment, so no local GPU is required. For developers using the video-to-video API, latency and throughput depend on your selected model size—5B for faster output or 14B for higher fidelity video generation.
RunComfy is the premier ComfyUI platform, offering ComfyUI online environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI Models, enabling artists to harness the latest AI tools to create incredible art.
