Wan 2.6: High-Fidelity Video-to-Video Lip-Sync & Motion Transfer on Playground and API | RunComfy

wan-ai/wan-2-6/video-to-video

Wan 2.6 transforms reference videos and prompts into 1080p, 24fps clips with realistic lip-sync, multi-shot storytelling, and precise motion transfer for easy, production-ready video creation.

Prompt: Length should be less than 1500 characters. Refer to the reference videos, in order, as characterX. For example, with two reference videos the prompt should read: character1 singing on the side of the road and character2 dancing next to him. Even with a single reference, still write character1 (including the 1).
Reference Video 1: Video format must be mp4 or mov, duration between 2s and 30s, and file size less than 100 MB. When only this input is used, up to 5s of the reference is read.
Reference Video 2: Video format must be mp4 or mov, duration between 2s and 30s, and file size less than 100 MB. When this input is used together with Reference Video 1, up to 2.5s of each reference is read.
Reference Video 3: Video format must be mp4 or mov, duration between 2s and 30s, and file size less than 100 MB. When this input is used together with Reference Video 1 and Reference Video 2, up to 1.66s of each reference is read.
[CRAZY LOW PRICE]: $0.05 per second for 720p+, $0.08 per second for 1080p+

Introduction to Wan 2.6 Video

Wan AI's Wan 2.6 video-to-video converts reference videos and prompts into 1080p, 24fps clips up to 15s, with selectable 16:9, 9:16, or 1:1 aspect ratios and precise, native audio-visual lip-sync. In place of manual frame editing, shot-by-shot storyboarding, and separate dubbing, it offers multi-shot auto-story expansion and reference-accurate voice and motion transfer, streamlining production by eliminating complex masking and re-timing. It is built for marketing agencies, e-commerce teams, filmmakers, educators, and corporate communications. For developers, Wan 2.6 on RunComfy can be used both in the browser and via an HTTP API, so you don't need to host or scale the model yourself.
Ideal for: Multi-Shot Narrative Prototyping | Brand-True Product Videos with Lip-Sync | Reference-Accurate Character and Motion Transfer



Model Overview


  • Provider: Wan AI
  • Task: video-to-video (Multi-Reference)
  • Max Resolution/Duration: Up to 1920x1080; 5s or 10s clips
  • Summary: Wan 2.6 Video to Video is a specialized "Multi-Subject Identity Cloning" engine. Unlike standard style transfer tools, this model allows you to upload up to 3 distinct reference videos (e.g., a person, a pet, and an object) and direct them to perform entirely new actions in a generated scene. By using the specific keyword syntax (character1, character2, etc.), you can choreograph complex interactions between these reference subjects while locking their visual appearance and audio characteristics.

Key Capabilities


1. Multi-Subject Identity Locking

  • Up to 3 References: You can upload 1 to 3 reference videos (reference_video_url_1 to 3). The model extracts the identity from each and maps them to specific keywords in your prompt.
  • Mapping Logic:

- Reference Video 1 = character1

- Reference Video 2 = character2

- Reference Video 3 = character3

  • Interactions: You can write a script where character1 (e.g., a singer) interacts with character2 (e.g., a dancer) in a completely new environment.

2. Multi-Shot Narrative Control

  • Shot Type Control: Includes a shot_type parameter. Set it to multi to enable dynamic camera cuts (e.g., "Shot 1", "Shot 2") within a single video, or single for a continuous take.
  • Timeline Scripting: Supports accurate time-based prompting (e.g., "[0-4s] character1 walks, [4-7s] close up of character1") to direct the acting performance.

3. Audio & Visual Fidelity

  • 360° Understanding: The model works best when reference videos show the subject from multiple angles (close-ups, rotations). It can also draw on the reference video's audio to preserve the subject's voice character and on its visuals to keep color and lighting consistent.

Master the Prompting Syntax


CRITICAL RULE: You MUST refer to your subjects as character1, character2, or character3 in the prompt. Do not use generic names like "the man" or "the dog" if you want to lock the identity.


The Formula: [Overall Description] + [Shot #] [Timestamp] [Action involving characterX]


Example Prompts


Scenario: Single Subject (1 Reference Video)

> "Overall Description: A cinematic vintage movie scene, golden hour.

> Shot 1 [0-5s]: character1 is walking down a busy street, wearing a trench coat, looking around nervously.

> Shot 2 [5-10s]: Hard cut. Close up on character1's face as they spot something in the distance."


Scenario: Two Subjects Interaction (2 Reference Videos)

> "Overall Description: A sunny park scene.

> Shot 1 [0-4s]: character1 (the dog) is running through the grass chasing a ball.

> Shot 2 [4-10s]: character2 (the owner) laughs and claps their hands, calling out to character1."


Input Parameters


Core Configuration


Parameter | Type | Default | Description
prompt | string | Required | The script. Must use character1, character2 syntax. Includes shot breakdown.
shot_type | choice | multi | multi: enables dynamic cuts (Shot 1, Shot 2). single: one continuous camera take.
duration | choice | 5 | Length of the generated video (5 or 10 seconds).
size | choice | 1280*720 | Output dimensions. Includes 16:9 (1280x720, 1920x1080), 9:16 (720x1280, 1080x1920), and 1:1 (960x960, 1440x1440) options.

Reference Inputs


Parameter | Type | Description
reference_video_url_1 | video | Required. Source for character1. Best if 2s-30s long, showing multiple angles. (<30MB)
reference_video_url_2 | video | Optional. Source for character2. (<30MB)
reference_video_url_3 | video | Optional. Source for character3. (<30MB)

Note on References: Using more reference videos reduces the "effective context window" for each. With 1 video, the model reads up to 5s of context. With 3 videos, it reads about 1.66s from each.
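
For reference, here is what a complete request body might look like, assembled only from the parameters in the tables above. The field names come from this page; the exact value formats, the placeholder URLs, and any additional required fields are assumptions, so verify them against the API documentation.

```python
# Illustrative payload sketch built from the documented parameters; the URLs
# are placeholders and the exact wire format may differ from this example.
payload = {
    "prompt": (
        "Overall Description: A sunny park scene. "
        "Shot 1 [0-4s]: character1 runs through the grass chasing a ball. "
        "Shot 2 [4-10s]: character2 laughs and calls out to character1."
    ),
    "shot_type": "multi",        # "multi" for camera cuts, "single" for one take
    "duration": 10,              # 5 or 10 seconds
    "size": "1920*1080",         # any of the documented 16:9, 9:16, or 1:1 sizes
    "reference_video_url_1": "https://example.com/dog.mp4",    # becomes character1
    "reference_video_url_2": "https://example.com/owner.mp4",  # becomes character2
}
```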


How to use Wan 2.6 Video to Video

  1. Upload References: Upload the video of your main subject to Reference Video 1. If you have a second subject, upload to Reference Video 2.
  2. Select Shot Type: Choose multi if you want to describe multiple camera angles/cuts, or single for a steady shot.
  3. Select Resolution: Choose a size (e.g., 1920x1080 for landscape or 1080x1920 for mobile) that fits your target platform.
  4. Write Script: Describe the scene using the characterX keywords. Define the timeline (e.g., [0-5s]), making sure it matches your selected Duration; see the sketch below.
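
If you script this workflow, the timeline in step 4 has to add up to the duration you selected. The helper below is a minimal sketch of one way to compose such a prompt; the characterX convention and the 5s/10s durations come from this page, while the function itself is purely illustrative.

```python
# Minimal sketch: compose a multi-shot prompt whose shot timestamps sum to the
# selected duration. Only the characterX naming and the 5/10-second durations
# are documented behavior; this helper is just one way to build the string.
def build_prompt(overall: str, shots: list[str], duration: int) -> str:
    if duration not in (5, 10):
        raise ValueError("duration must be 5 or 10 seconds")
    span = duration / len(shots)  # split the clip evenly across shots
    parts = [f"Overall Description: {overall}"]
    for i, action in enumerate(shots):
        start, end = round(i * span), round((i + 1) * span)
        parts.append(f"Shot {i + 1} [{start}-{end}s]: {action}")
    return " ".join(parts)

prompt = build_prompt(
    "A cinematic vintage movie scene, golden hour.",
    [
        "character1 walks down a busy street in a trench coat, looking around nervously.",
        "Hard cut. Close up on character1's face as they spot something in the distance.",
    ],
    duration=10,
)
```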

How Wan 2.6 compares to other models


  • Vs Wan 2.5 Generation: Compared to Wan 2.5, Wan 2.6 delivers native audio-visual synchronization with precise lip-sync, stronger reference video adherence, more stable motion, and longer coherent multi-shot clips. Ideal Use Case: Choose Wan 2.6 when speech alignment and multi-scene continuity are critical.
  • Vs Wan 2.2 (open-source family): Compared to Wan 2.2, Wan 2.6 delivers higher resolution (1080p vs up to 720p), built-in audio sync/lip-sync, and improved temporal consistency for production-ready results. Ideal Use Case: Use Wan 2.6 for commercial projects needing polished audio-visual outputs without additional tooling.
  • Vs Seedance 1.0 Pro: Compared to Seedance 1.0 Pro, Wan 2.6 delivers native audio-video alignment in a single pass, reducing reliance on external audio editing workflows. Ideal Use Case: Select Wan 2.6 when you need immediate lip-synced dialogue or tightly timed visuals with music.
  • Vs Kling Video 2.6: Compared to Kling 2.6, Wan 2.6 delivers stronger reference-guided generation, narrative continuity, and multiple aspect ratio workflows, while matching on 1080p output and native A/V sync. Ideal Use Case: Pick Wan 2.6 for reference-driven storytelling and consistent brand visuals across formats.

API Integration


Developers can integrate Wan 2.6 via the RunComfy API using standard HTTP requests and JSON payloads. The endpoint accepts prompt text and optional media references, returning 1080p, 24fps MP4/MOV/WebM outputs suitable for production pipelines.
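
As a rough sketch, a request could look like the following. Only the parameter names are taken from this page; the endpoint URL, authentication header, and response shape are placeholders, so check the RunComfy API documentation for the actual values.

```python
import requests

# Placeholder endpoint and auth header for illustration only; the real values
# are documented in the RunComfy API reference. Parameter names match the
# Input Parameters tables above.
API_URL = "https://api.runcomfy.net/v1/wan-ai/wan-2-6/video-to-video"  # placeholder
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}                      # placeholder

payload = {
    "prompt": "Shot 1 [0-5s]: character1 waves at the camera and smiles.",
    "shot_type": "single",
    "duration": 5,
    "size": "1280*720",
    "reference_video_url_1": "https://example.com/reference.mp4",
}

resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # typically a job id or, once finished, the output video URL
```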



Frequently Asked Questions

What are the technical limitations of Wan 2.6 Video to Video regarding resolution, duration, and aspect ratios?

Wan 2.6 Video to Video currently outputs up to 1080p resolution at 24fps, supporting multiple aspect ratios including 16:9, 9:16, and 1:1. For video-to-video workflows, longer durations are achieved through multi-shot narrative chaining, and the system is optimized for short-to-moderate clips typically ranging from 5 to 15 seconds per shot.

How many reference inputs can I use with Wan 2.6 Video to Video when working with ControlNet or IP-Adapter?

Wan 2.6 Video to Video accepts up to three reference videos per generation request (reference_video_url_1 through reference_video_url_3), which the model maps to character1, character2, and character3 in the prompt. Reference conditioning happens through these video inputs rather than separate ControlNet or IP-Adapter channels, and the input count is limited to maintain model stability and performance across the video-to-video pipeline.

How do I transition from testing Wan 2.6 Video to Video in the RunComfy Playground to using the API in production?

To move from trial to production, prototype your Wan 2.6 Video to Video prompts in the RunComfy Playground, then use the same parameters within the RunComfy API by retrieving your API access key from your dashboard. The video-to-video endpoints mirror playground behavior, ensuring that production results match your web UI tests. You can find integration examples and rate-limit information in the API documentation.

How does Wan 2.6 Video to Video improve upon Wan 2.5 in terms of generation quality?

Compared to Wan 2.5, Wan 2.6 Video to Video offers enhanced motion stability, stronger character consistency across shots, better lip-sync accuracy, and native audio generation. The video-to-video pipeline is also more stable with improved temporal coherence and finer detail handling, resulting in smoother and more realistic visual output.

What makes Wan 2.6 Video to Video stand out from competitors like Kling Video 2.6 or Seedance 1.0 Pro?

Wan 2.6 Video to Video differentiates itself with integrated audio-visual sync, precise lip-sync, and reference video-guided motion transfer in video-to-video mode. Unlike Seedance, which may require separate audio production, Wan 2.6 produces synchronized audio in a single pass, making it a more integrated tool for production environments.

Does Wan 2.6 Video to Video support multilingual text prompts and audio generation?

Yes, Wan 2.6 Video to Video supports multilingual inputs and outputs, automatically generating lip-synced speech and localized audio content across supported languages. This multilingual capability extends through text-to-video and video-to-video modes, enabling globalized storytelling from a single interface.

Can Wan 2.6 Video to Video generate audio automatically in the video-to-video process?

Wan 2.6 Video to Video includes native audio generation for voiceovers, background music, and sound effects aligned with the visuals. The model’s multimodal architecture ensures precise audio-video synchronization without manual sound design intervention during the video-to-video conversion.

What kind of content is Wan 2.6 Video to Video best suited for?

Wan 2.6 Video to Video excels in short-form visual storytelling such as ads, social media clips, explainers, product spots, and training videos. Its video-to-video mode allows creators to build coherent multi-shot narratives with consistent characters and motion across scenes, ideal for marketing and e-learning use cases.

Is there a commercial license included with Wan 2.6 Video to Video?

Wan 2.6 Video to Video provides users with commercial-use rights for generated outputs via RunComfy, but you should review Wan AI’s official licensing terms on wan2-6.com for any jurisdictional or use-specific restrictions before deploying video-to-video productions commercially.

What hardware or execution environment is required to run Wan 2.6 Video to Video efficiently?

Wan 2.6 Video to Video runs entirely in the RunComfy cloud environment, so no local GPU is required. For developers using the video-to-video API, latency and throughput depend mainly on the selected resolution and duration, with 1080p and 10s clips taking longer to generate than 720p and 5s clips.
