AI Toolkit LoRA Training Guides

Wan 2.2 I2V 14B Image-to-Video LoRA Training with Ostris AI Toolkit

This guide walks you through training Wan 2.2 I2V 14B image-to-video LoRAs using the Ostris AI Toolkit. You'll learn how Wan's dual high-noise and low-noise experts work, how to design motion, style and character clip datasets, and how to configure Multi-stage, Num Frames, resolution buckets and quantization so I2V LoRAs run reliably on 24GB local GPUs or H100/H200 machines.

(Screenshot: the Ostris AI Toolkit "New Training Job" UI, showing the Job, Model, Quantization, Target, Save, Training, Datasets and Sample panels.)
Wan 2.2 I2V 14B is the image‑to‑video model in the Wan 2.2 family, turning a single image into 5‑second clips with controllable motion, camera moves and temporal consistency. By the end of this guide, you’ll be able to:

  • Design Wan I2V LoRA datasets for motion, style and character use cases (and know how many clips you actually need).
  • Understand how Wan’s dual high-noise / low-noise experts, timestep settings, Num Frames and resolution interact during training.
  • Configure the AI Toolkit panels (JOB, MODEL, QUANTIZATION, MULTISTAGE, TARGET, TRAINING, DATASETS, SAMPLE) for stable 24GB runs and for larger H100/H200 cloud setups.
  • Export your Wan 2.2 I2V LoRA and plug it into ComfyUI or the Wan I2V LoRA playground for real-world projects.
This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this Wan 2.2 I2V guide.

1. What makes Wan 2.2 I2V 14B special?

Wan 2.2 I2V 14B ("A14B") is the image‑to‑video variant of Wan 2.2. Architecturally it is a dual‑stage Mixture‑of‑Experts (MoE) transformer. There are two separate 14B‑parameter transformers. The high‑noise transformer handles early, very noisy timesteps and is responsible for global composition, motion trajectory, and camera movement. The low‑noise transformer handles late, clean timesteps and is responsible for fine detail, identity, and texture.

At inference time, the pipeline splits timesteps around a boundary at roughly 875/1000 of the noise schedule and routes them to the high‑ or low‑noise transformer. In practice, each expert handles about half of the denoising process. Wan 2.2 I2V generates up to 81 frames at 16 FPS, which is about 5 seconds of video.
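
To make the expert split concrete, here is a minimal Python sketch (not AI Toolkit code) that routes a timestep to one expert or the other using the ~875/1000 boundary quoted above; the exact boundary value is taken from this section, everything else is illustrative.

```python
# Conceptual sketch only: route a diffusion timestep (1000 -> 0) to the
# high-noise or low-noise expert around the ~875 boundary described above.
BOUNDARY = 875  # approximate switch point on a 0-1000 noise schedule

def pick_expert(timestep: int) -> str:
    """Return which Wan 2.2 expert would handle this timestep."""
    if timestep >= BOUNDARY:
        return "high_noise"   # global composition, motion trajectory, camera
    return "low_noise"        # fine detail, identity, texture

if __name__ == "__main__":
    for t in (1000, 950, 875, 600, 100):
        print(t, "->", pick_expert(t))
```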

For LoRA training this has three key consequences. You can choose to train one or both stages. You can bias training toward composition and motion (high noise) or toward identity and detail (low noise). And because you process sequences of frames, frame count, resolution, VRAM and quantization/offloading settings matter much more than they do for an image‑only model.

AI Toolkit exposes these controls mainly through the MULTISTAGE, TRAINING, TARGET, and DATASETS panels.


2. Where to run Wan 2.2 I2V training

You can run this Wan 2.2 I2V LoRA training workflow either on the cloud AI Toolkit on RunComfy or on a local AI Toolkit install. The UI and panels are the same; only the hardware changes.

2.1 RunComfy cloud AI Toolkit (recommended for first runs)

If you don’t want to manage CUDA, drivers, or large model downloads, use the cloud AI Toolkit on RunComfy:

👉 RunComfy AI Toolkit trainer

On that page you get the AI Toolkit UI pre‑installed in the browser. You can upload datasets, configure jobs exactly as in this guide, and run training on an H100 (80 GB) or H200 (141 GB) GPU. This is the easiest way to reproduce the tutorial reliably without touching local setup.


2.2 Local AI Toolkit

If you prefer to run locally: Install the AI Toolkit repository following the README (Python + PyTorch for training and Node for the UI), then run the UI (npm run build_and_start in ui/). Open http://localhost:8675 and you’ll see the same panels as in the screenshots and descriptions here.


3. Dataset design for Wan I2V LoRAs

Wan 2.2 I2V is trained on video‑clip + caption pairs. Each training sample is a sequence of frames plus text. In AI Toolkit you do not need to manually cut every clip to the same length. Instead you configure Num Frames in the DATASETS panel and the data loader will evenly sample that many frames from each video, automatically handling clips of different durations.
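
The exact sampling logic lives inside AI Toolkit’s data loader, but conceptually "evenly sample Num Frames from each clip" looks like the small sketch below (frame indices only, no video decoding; this is for intuition, not the toolkit’s actual code).

```python
# Conceptual sketch: pick `num_frames` evenly spaced frame indices from a clip,
# regardless of how long the clip is.
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]

# A 10-second clip at 24 FPS has 240 frames; sampling 41 of them:
print(sample_frame_indices(240, 41)[:5])   # [0, 6, 12, 18, 24] ...
```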

3.1 Decide what kind of LoRA you’re training

How you set hyper‑parameters depends heavily on your goal:

  • Motion / camera LoRA focuses on patterns like "orbit 360 around subject", "slow dolly zoom", "hand‑held jitter", or specific action beats.
  • Style LoRA makes videos look like a particular film stock, anime style, or painterly look, while still keeping Wan’s base motion and scene composition.
  • Character LoRA tries to preserve a specific character or face consistently across many scenes and motions.

Wan 2.2 I2V can do all three. Motion LoRAs lean more heavily on the high‑noise stage, while style and character LoRAs lean more on the low‑noise stage plus very consistent visuals.


3.2 Video clips and cropping

Use real video clips (.mp4, .mov, etc.), not GIFs. Clip length can vary (for example 5–30 seconds). AI Toolkit will evenly sample training frames along each clip according to your Num Frames setting.

The one thing you should always do by hand is to crop and trim each clip so that the motion you care about starts quickly and there is not a lot of "standing around" at the beginning or end. For motion LoRAs in particular, you want the motion to occupy almost the entire clip — for example, the full orbit, the full dolly move, or the full gesture.


3.3 How many clips do you need?

As a rough rule of thumb:

  • A simple motion LoRA that teaches a single type of camera move usually trains well on 10–30 short clips (~3–8s) where the target motion is very clear and occupies most of the frame.
  • A style LoRA typically needs 10–40 images or clips that cover different scenes, lighting, and subjects, but all share the same look and colour treatment.
  • A character LoRA on I2V behaves more like an image LoRA. As a minimum, aim for 10–30 short clips of the same character, with varied poses, scales, angles, and backgrounds; if you can comfortably reach 20–40 clips, likeness and robustness usually improve.

3.4 Captions for I2V clips

Each video file can optionally have a .txt caption with the same base name (for example castle_orbit.mp4 and castle_orbit.txt). AI Toolkit also supports a Default Caption that is used whenever a clip has no per‑file caption.

Good caption patterns:

  • For a motion LoRA, encode the motion explicitly in the text, for example:

    orbit 360 around the subject, orbit 180 around the subject, or slow dolly in toward the character.

  • For a style LoRA, describe the look, not the scene content, for example:

    grainy 16mm film look, high contrast, warm tint.

  • For a character LoRA, include a trigger word plus a class, for example:

    frung, young woman, casual clothing (where frung is your trigger token).

You can also combine a Trigger Word set in the JOB panel with captions that contain [trigger]. AI Toolkit will replace [trigger] with your chosen trigger string when loading the dataset so you don’t have to hard‑code the trigger name in every caption.
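
As a quick sanity check before uploading, you can preview which caption each clip will receive under the same‑base‑name convention, the Default Caption fallback, and [trigger] substitution. This is a standalone helper written for this guide, not part of AI Toolkit; the folder name and trigger below are placeholders.

```python
from pathlib import Path

# Standalone helper (not AI Toolkit code): preview the caption each clip gets,
# using the same-base-name .txt convention, a default caption fallback, and
# [trigger] substitution.
def preview_captions(dataset_dir: str, trigger: str, default_caption: str) -> None:
    for clip in sorted(Path(dataset_dir).glob("*.mp4")):
        txt = clip.with_suffix(".txt")
        caption = txt.read_text().strip() if txt.exists() else default_caption
        caption = caption.replace("[trigger]", trigger)
        print(f"{clip.name}: {caption}")

# Example call with hypothetical values:
# preview_captions("wan_orbit_clips", trigger="frung",
#                  default_caption="orbit 360 around the subject")
```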


4. Wan 2.2 I2V specifics you need to understand

4.1 High‑noise vs low‑noise transformers

Wan I2V’s two transformers behave roughly like this:

The high‑noise transformer operates at timesteps near the start of the diffusion process (approximately 1000 down to ~875). It sets up global composition and coarse shapes and decides where objects go, how the camera moves, and what the motion trajectory will be. It is critical for motion and layout.

The low‑noise transformer runs at timesteps from about 875 down to 0. It refines details, textures, face likeness, and micro‑motions. It is critical for identity, texture, and sharpness.

In practice, training only the high‑noise stage can teach new kinds of movement and composition but tends to under‑train detail. Training only the low‑noise stage struggles to significantly change motion or layout at all. For most LoRAs you should train both stages and then steer emphasis using Timestep Bias in the TRAINING panel.


4.2 Frames, FPS and speed

Wan 2.2 I2V 14B can generate up to 81 frames at 16 FPS, which is about 5 seconds. In practice, valid video frame counts follow the "4n+1" rule (for example 9, 13, 17, 21, 33, 41, 81…). Choose training and sampling lengths from that family; 1 frame is also supported and effectively reduces I2V to a single‑frame, image‑like mode for LoRA training.
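
If you want to double‑check a frame count against the "4n+1" rule, a one‑line helper is enough; this is plain arithmetic, not an AI Toolkit API.

```python
# Snap an arbitrary frame count down to the nearest valid "4n+1" value
# (1, 5, 9, ..., 81). Pure arithmetic, not an AI Toolkit call.
def nearest_valid_frames(n: int, max_frames: int = 81) -> int:
    n = max(1, min(n, max_frames))
    return 4 * ((n - 1) // 4) + 1

print([nearest_valid_frames(x) for x in (16, 41, 60, 100)])  # [13, 41, 57, 81]
```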

In AI Toolkit there are two separate Num Frames knobs. Num Frames in the DATASETS panel controls how many frames per clip are sampled for training. Num Frames in the SAMPLE panel controls how long your preview videos are. They do not have to match exactly, but keeping them similar makes behaviour easier to reason about.

A good starting point for training is 41 frames (around 2.5 seconds). On 80–96 GB GPUs (H100‑class) you can go up to the full 81-frame configuration. Shorter lengths such as 21 or 33 frames can be used to reduce VRAM load and step time on small GPUs, at the cost of capturing less temporal context.


4.3 Resolution and pixel area

Wan’s official demos tend to keep the effective area around 480×832 ≈ 400k pixels, and the Hugging Face spaces snap dimensions to multiples of 16 or 32.

For LoRA training with AI Toolkit:

  • On a 24 GB GPU, use resolution buckets like 512 and 768. Avoid 1024×1024 unless you are very aggressively quantized and/or using layer offloading; video at 1024² plus 41–81 frames is heavy.
  • On 48 GB+ GPUs or H100/H200, you can safely add a 1024 bucket and even use cinematic widescreen resolutions centred around values like 1024×576, 1024×608, or 1024×640.

AI Toolkit will automatically bucket and downscale your videos into the selected resolutions when loading the dataset.
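
For intuition about what "around 400k pixels, snapped to multiples of 16" means in practice, here is a small sketch that derives a width/height pair from an aspect ratio and a target pixel area. AI Toolkit’s own bucketing logic may differ in the details; this is only a back‑of‑the‑envelope helper.

```python
import math

# Sketch: pick a width/height near a target pixel area, snapped to multiples
# of 16, for a given aspect ratio. Wan demos hover around ~400k pixels.
def snap_resolution(aspect_w: int, aspect_h: int, target_pixels: int = 400_000,
                    multiple: int = 16) -> tuple[int, int]:
    scale = math.sqrt(target_pixels / (aspect_w * aspect_h))
    w = round(aspect_w * scale / multiple) * multiple
    h = round(aspect_h * scale / multiple) * multiple
    return w, h

print(snap_resolution(9, 16))   # portrait, e.g. (480, 848)
print(snap_resolution(16, 9))   # landscape, e.g. (848, 480)
```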


5. Step‑by‑step: configure a Wan 2.2 I2V 14B LoRA in AI Toolkit

We assume you have at least a 24 GB‑class GPU, so the settings below are a safe baseline. If you’re on a larger card or using the cloud AI Toolkit on RunComfy, some panels also include a short note on how to scale the settings up.


5.1 JOB panel

In the JOB panel you set basic metadata and, optionally, a trigger token.

  • Training Name

    Use any descriptive name; it becomes the folder name for checkpoints and samples. Examples: wan_i2v_orbit_v1, wan_i2v_style_neon, wan_i2v_char_frung_v1.

  • GPU ID

    On a local install this points to your physical GPU. On the RunComfy cloud AI Toolkit you can leave this as default; the actual machine type (H100/H200) is chosen later in the Training Queue.

  • Trigger Word (optional)

    Use a trigger for character or style LoRAs where you want a dedicated token such as frung or wan_cam_orbit. If your dataset captions contain [trigger], AI Toolkit will substitute your Trigger Word value into those captions automatically at load time.

    For pure motion LoRAs, you often do not need a trigger word because the behaviour is already encoded in phrases like "orbit 360 around the subject". For characters and styles, it is strongly recommended to use a trigger so you have a clean on/off switch for your LoRA later.


5.2 MODEL and QUANTIZATION panels

These panels control which Wan model checkpoint is used and how aggressively it is quantized.

MODEL panel

  • Model Architecture

    Select Wan 2.2 I2V (14B).

  • Name or Path

    Name or Path overrides the default Hugging Face model path (ai-toolkit/Wan2.2-I2V-A14B-Diffusers-bf16). Leave it blank or at the default value and AI Toolkit will download the recommended base model from Hugging Face, or point it to a local path if you want to use a custom Wan 2.2 checkpoint.

  • Low VRAM

    Turn Low VRAM ON for 24 GB consumer GPUs or any card that is also driving your display. Turn it OFF on 48 GB+ cards or on H100/H200 in the cloud where you have more VRAM. When Low VRAM is on, AI Toolkit applies extra memory‑saving tricks (more quantization and checkpointing) to fit the model.

  • Layer Offloading

    This toggle streams parts of the model to CPU RAM instead of keeping all layers resident in VRAM. It is only necessary if you are trying to run Wan I2V on a very small GPU (around 10–12 GB VRAM) and have a lot of system RAM (64 GB+). It can roughly double step time but can bring peak VRAM below ~9 GB. For 24 GB GPUs, start with Layer Offloading OFF and only turn it on if you still hit out‑of‑memory errors.

On big GPUs / RunComfy:

On 48 GB+ or on H100/H200, set Low VRAM OFF and Layer Offloading OFF so all layers stay resident on the GPU and steps are as fast as possible.

QUANTIZATION panel

  • Transformer

    On 24–32 GB GPUs, set Transformer to 4bit with ARA. This uses a 4‑bit quantization together with an Accuracy Recovery Adapter so that VRAM usage is close to plain 4‑bit while quality stays much closer to bf16.

  • Text Encoder

    Set Text Encoder to float8 (or qfloat8). This reduces VRAM and compute for the text encoder with negligible impact on Wan 2.2 I2V LoRA quality.

This mirrors the official AI Toolkit example configs for Wan 2.2 video LoRAs and is the main reason training is practical on 24 GB cards. If you run into stability issues or severe slow‑downs with ARA on a particular setup, you can fall back to qfloat8 for the Transformer as well; it uses more VRAM but behaves very similarly in terms of quality.

On big GPUs / RunComfy:

On an H100/H200 or a 48–96 GB workstation card, you can either keep 4bit with ARA and spend the extra VRAM on higher resolution, more frames, or a higher LoRA rank, or switch the Transformer to a pure float8 / qfloat8 option for a simpler stack. Going all the way back to full bf16 everywhere is rarely necessary for LoRA training.


5.3 MULTISTAGE panel (high / low noise)

The MULTISTAGE panel lets you decide which Wan expert(s) to train and how often the trainer switches between them.

  • Stages to Train

    Keep both High Noise and Low Noise set to ON for most LoRAs. High noise controls composition and motion; low noise controls detail and identity.

  • Switch Every

    This value controls how many steps you run on one expert before swapping to the other. With High Noise = ON, Low Noise = ON, Switch Every = 10, and Steps = 3000, AI Toolkit trains:

    • Steps 1–10 on the high‑noise transformer,
    • Steps 11–20 on the low‑noise transformer,
    • and repeats this alternation until training is done.

On large GPUs without Low VRAM you can use Switch Every = 1, meaning you alternate every step. With Low VRAM ON, however, each switch forces AI Toolkit to unload one transformer and load the other; switching every step becomes very slow.

For a 24 GB GPU baseline, use:

  • High Noise = ON
  • Low Noise = ON
  • Switch Every = 10

On big GPUs / RunComfy:

With Low VRAM disabled and enough VRAM for both experts, you can set Switch Every = 1 for slightly smoother alternation.
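
The alternation pattern driven by Switch Every is easy to reason about with a tiny sketch; this is conceptual only and simply mirrors the 10‑step example above.

```python
# Conceptual sketch of the Switch Every alternation described above:
# which expert is being trained at a given (1-indexed) global step.
def active_expert(step: int, switch_every: int = 10) -> str:
    block = (step - 1) // switch_every
    return "high_noise" if block % 2 == 0 else "low_noise"

print([active_expert(s) for s in (1, 10, 11, 20, 21)])
# ['high_noise', 'high_noise', 'low_noise', 'low_noise', 'high_noise']
```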


5.4 TARGET panel (LoRA network settings)

In the TARGET panel you configure what kind of adapter you are training and how "wide" it is.

  • Target Type

    Set Target Type to LoRA.

  • Linear Rank

    Linear Rank controls LoRA capacity per block. Higher rank increases capacity but also VRAM usage and the risk of overfitting. For Wan 2.2 I2V, practical defaults are:

    • Motion and camera LoRAs: Rank 16 is usually enough because they modify behaviour more than tiny visual details.
    • Style LoRAs: Rank 16 or 32 depending on how complex and varied the style is.
    • Character LoRAs: Rank 16 on 24 GB cards, and 16–32 on high‑VRAM GPUs; use 32 when you have headroom and need extra capacity for close‑up, high‑res faces.

On very large GPUs, Rank 32 is a good default for rich styles and demanding character work, but it is not required just to get a LoRA running.


5.5 SAVE panel

The SAVE panel controls how often checkpoints are written and in what precision.

  • Data Type

    Use BF16 or FP16. Both are fine for LoRAs. BF16 is slightly more numerically stable on modern GPUs.

  • Save Every

    Set Save Every to around 250. This gives you a checkpoint every 250 steps.

  • Max Step Saves to Keep

    Set Max Step Saves to Keep between 4 and 6. This keeps disk usage under control while still leaving you some earlier checkpoints to fall back to.

You do not have to use the last checkpoint. Very often the best‑looking samples come from somewhere around 2000–4000 steps. The SAMPLE panel configuration below explains how to judge this.


5.6 TRAINING panel

The TRAINING panel holds most of the important knobs: batch size, learning rate, timesteps, loss, and text encoder handling.

Core hyper‑parameters

Configure the core training settings like this for a 24 GB Wan I2V video LoRA:

  • Batch Size

    Start with 1. Video models are heavy, and 1 is realistic even on 24 GB cards. On H100/H200 you can later experiment with batch sizes of 2–4.

  • Gradient Accumulation

    Leave Gradient Accumulation at 1 initially. Effective batch size is batch size times gradient accumulation (see the sketch after this list). You can raise it to 2 or 4 if VRAM is extremely tight and you want a slightly larger effective batch, but gains are modest for video.

  • Learning Rate

    Start with Learning Rate = 0.0001. This is the default in AI Toolkit examples and is stable for Wan LoRAs. If training looks noisy or the LoRA overshoots quickly, you can reduce to 0.00005 mid‑run and resume from the latest checkpoint.

  • Steps – typical ranges:
    • Small, focused motion LoRA with ~10–20 clips: 1500–2500 steps.
    • Character or style LoRA with 20–50 clips: 2000–3000 steps.
    • Very large datasets can go higher, but it is usually better to improve data quality (captions, diversity) than to push far beyond 3000–4000 steps.
  • Weight Decay

    Leave Weight Decay at 0.0001 unless you have a specific reason to change it; it provides mild regularization.

  • Loss Type

    Keep Loss Type as Mean Squared Error (MSE). Wan 2.2 uses a flow‑matching noise scheduler, and MSE is the standard loss for this setup.
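
To relate Steps, Batch Size and Gradient Accumulation to how often the trainer actually sees each clip, here is a back‑of‑the‑envelope helper; it is plain arithmetic, not an AI Toolkit API, and the example numbers are illustrative.

```python
# Back-of-the-envelope: roughly how many times is each clip visited over a run?
def approx_epochs(steps: int, batch_size: int, grad_accum: int, num_clips: int) -> float:
    effective_batch = batch_size * grad_accum
    samples_seen = steps * effective_batch
    return samples_seen / num_clips

# 3000 steps, batch 1, accumulation 1, 20 clips -> each clip seen ~150 times.
print(approx_epochs(3000, 1, 1, 20))  # 150.0
```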

Timesteps and scheduler

  • Timestep Type

    For Wan 2.2 I2V, Linear is the default Timestep Type and works well for most LoRA types. It spreads updates evenly along the flow‑matching schedule and plays nicely with the split between the high‑noise and low‑noise experts.

  • Timestep Bias

    Timestep Bias controls which part of the trajectory you emphasise:

    • Balanced – updates are spread across high‑ and low‑noise timesteps; this is the safe default for all LoRA types.
    • Favor High Noise – focuses more on early, noisy steps where Wan decides global layout, motion and colour.
    • Favor Low Noise – focuses more on late, clean steps where fine detail and identity live.

    Recommended combinations by LoRA type:

    • Motion / camera LoRAs – start with Timestep Type = Linear, Timestep Bias = Balanced. For very "pure" camera‑move LoRAs you can experiment with Favor High Noise to lean harder on the high‑noise expert.
    • Style LoRAs – use Timestep Type = Linear (or Shift) and Timestep Bias = Favor High Noise, so the LoRA rewrites global tone and colour while the base model still handles late‑stage details.
    • Character LoRAs – use Timestep Type = Sigmoid (or Linear) and Timestep Bias = Balanced. Identity and likeness depend more on low‑noise steps, but keeping the bias Balanced lets both experts contribute; only if you specifically want extra focus on micro‑detail should you try a slight low‑noise bias.

Under the hood, Wan 2.2 I2V uses a flow‑matching noise scheduler. AI Toolkit sets the scheduler and matching sampler automatically for the Wan 2.2 architecture, so you mainly steer behaviour via Timestep Type, Timestep Bias and the Multi‑stage settings above.
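
The following is a purely conceptual illustration of what biasing timestep sampling toward one end of the schedule does; it is not AI Toolkit’s actual sampler, and the skew exponents are arbitrary numbers chosen for the demo.

```python
import random

# Conceptual illustration only (not AI Toolkit's implementation): how biasing
# the timestep draw shifts training effort along the 0-1000 schedule compared
# with a uniform ("Balanced") draw.
def draw_timestep(bias: str = "balanced") -> int:
    t = random.random()                 # uniform in [0, 1)
    if bias == "favor_high_noise":
        t = t ** 0.5                    # skew toward 1.0 (early, noisy steps)
    elif bias == "favor_low_noise":
        t = t ** 2.0                    # skew toward 0.0 (late, clean steps)
    return int(t * 1000)

samples = [draw_timestep("favor_high_noise") for _ in range(10_000)]
# Fraction of draws landing in the high-noise expert's range (>= 875):
print(sum(s >= 875 for s in samples) / len(samples))  # well above the uniform 12.5%
```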

EMA (Exponential Moving Average)

  • Use EMA

    For LoRAs, EMA is optional and consumes extra VRAM and time. Most Wan LoRA users leave Use EMA OFF and it is rarely needed unless you are doing full‑model finetunes.

Text Encoder Optimizations

At the bottom of the TRAINING panel are the Text Encoder Optimizations settings. They control how aggressively the text encoder is offloaded or cached.

  • Unload TE

    This mode unloads the text encoder weights so they no longer consume VRAM between steps. For Wan 2.2 I2V LoRAs you almost always rely on rich per‑clip captions, so you should keep Unload TE OFF in normal caption‑based training. Only consider Unload TE if you are deliberately training a very narrow "trigger‑only / blank prompt" LoRA that does not use dataset captions at all.

  • Cache Text Embeddings

    This option pre‑computes caption embeddings once and reuses them, avoiding repeated text encoder passes. Turn Cache Text Embeddings ON only when your captions are static and you are not using features that modify or randomize the prompt each step, such as Differential Output Preservation, dynamic [trigger] rewriting in captions, or anything that heavily depends on caption dropout behaviour. In that case, AI Toolkit encodes all training captions once, caches the embeddings to disk, and can drop the text encoder out of VRAM.

If you plan to use DOP, Caption Dropout, or any other dynamic prompt tricks, keep Cache Text Embeddings OFF so the text encoder can re‑encode the real prompt every batch. The Differential Output Preservation and Datasets sections explain these interactions in more detail.

Regularization – Differential Output Preservation (DOP)

The Regularization section exposes Differential Output Preservation (DOP), which helps the LoRA behave like a residual edit instead of overwriting the base model.

DOP compares the base model’s output (without LoRA) to the LoRA‑enabled output and adds a penalty when the LoRA changes aspects unrelated to your target concept. It tries to teach "what changes when the trigger is present" rather than "re‑train the entire model".
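
To make the idea tangible, here is a highly simplified, conceptual sketch of a DOP‑style loss; it is not AI Toolkit’s implementation, and the tensor shapes are toy values just to show the call runs.

```python
import torch
import torch.nn.functional as F

# Highly simplified, conceptual sketch of the DOP idea (NOT AI Toolkit's code):
# alongside the normal denoising loss, penalize the LoRA for drifting from the
# frozen base model on a preservation prompt where the trigger word is swapped
# for a generic class word (e.g. "person"). In practice the base prediction
# would be produced with the LoRA disabled and detached from the graph.
def dop_style_loss(pred_lora, target, pred_lora_pres, pred_base_pres, multiplier=1.0):
    denoise_loss = F.mse_loss(pred_lora, target)                # learn the concept
    preserve_loss = F.mse_loss(pred_lora_pres, pred_base_pres)  # don't change everything else
    return denoise_loss + multiplier * preserve_loss

# Toy shapes just to demonstrate the call:
x = torch.randn(1, 4, 8, 8)
print(dop_style_loss(x, torch.randn_like(x), x.clone(), torch.randn_like(x)).item())
```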

For motion / camera LoRAs, you usually do not need DOP, because motion behaviour is already fairly localized. Enabling DOP roughly doubles compute by adding extra forward passes.

For style and character LoRAs, DOP is often very helpful for keeping Wan’s strong base realism intact. A good starting configuration is:

  • Differential Output Preservation: ON
  • DOP Loss Multiplier: 1
  • DOP Preservation Class: person for character LoRAs, or an appropriate class such as scene or landscape for style LoRAs if your build provides those options.

Important compatibility note: Differential Output Preservation rewrites or augments the prompt text each step (for example by swapping your trigger word for the preservation class word). Because of this, DOP is not compatible with Cache Text Embeddings. If you turn DOP ON, make sure Cache Text Embeddings is OFF so the text encoder sees the updated prompt every batch.


5.7 ADVANCED panel (Differential Guidance)

If your AI Toolkit build exposes the ADVANCED panel for this model, it may include Do Differential Guidance and Differential Guidance Scale.

Differential Guidance computes "with LoRA" vs "without LoRA" predictions and nudges training towards the difference between them, similar in spirit to DOP but implemented at the guidance level instead of as a separate loss term.

Practical recommendations:

  • Turn Do Differential Guidance ON with a Differential Guidance Scale around 3 for targeted edit‑style LoRAs (for example "make the camera orbit", "apply neon style") where you want the LoRA to behave like a cleaner modifier.
  • For very broad, heavy style LoRAs that rewrite the entire look, you can try lower scales (1–2) or leave it OFF if the LoRA feels too weak.

If you are tight on compute, you can safely leave Differential Guidance OFF for your first runs and experiment later.


5.8 DATASETS panel

Each dataset block in AI Toolkit maps to one entry in the datasets: list, but in the UI you simply configure one or more dataset cards.

A typical single Wan I2V dataset configuration looks like this:

  • Target Dataset

    Choose your uploaded Wan I2V video dataset folder, for example wan_orbit_clips.

  • Default Caption

    This caption is used when a clip has no .txt caption file. Examples:

    Motion LoRA: orbit 360 around the subject

    Style LoRA: cinematic neon cyberpunk style

    Character LoRA: frung, person, portrait (where frung is your trigger token).

  • Caption Dropout Rate

    This is the probability that the caption is dropped (replaced by an empty caption) for a training sample. For Wan I2V LoRAs, a small amount of dropout encourages the model to use both visual context and text. A typical starting range is 0.05–0.10 (5–10%) when the text encoder stays loaded. If you decide to enable Cache Text Embeddings in the TRAINING panel, it is often simpler to set Caption Dropout Rate = 0 so you avoid a subset of clips permanently having no caption.

  • LoRA Weight

    Usually set to 1. You only change this when mixing multiple datasets and you want one dataset to count more or less in training.

  • Settings → Cache Latents

    Turning Cache Latents ON encodes each clip’s frames into latents once and reuses them for training, which can significantly speed up training and remove the VAE from VRAM. For video this can consume disk space and I/O bandwidth, so use it when your storage and RAM budget allow.

  • Settings → Is Regularization

    Leave Is Regularization OFF for your main dataset. If you add a separate regularization dataset later, you would set that dataset’s Is Regularization to ON.

  • Flipping

    Flip X and Flip Y mirror frames horizontally or vertically. For most video tasks you should keep both OFF, especially for motion LoRAs where flipping can invert left/right motion semantics or for characters with asymmetric features. For purely style‑only LoRAs you can experiment with Flip X to increase variation.

  • Resolutions

    Choose one or more resolution buckets. On a 24 GB GPU you typically enable 512 and 768 and leave 1024 disabled. On 48 GB+ or H100/H200, you can enable 512, 768, and 1024. AI Toolkit will automatically assign clips to the nearest bucket and downscale as needed.

  • Num Frames

    Set Num Frames to the number of frames per clip you want to sample for training. A good starting point is 41. On very small GPUs (10–12 GB) with heavy quantization and offloading, you can reduce this to 21 or even 9 just to get training running, at the cost of shorter temporal context.

If you need multiple datasets (for example, a main motion dataset plus a small "style" dataset), you can add them all in the DATASETS panel and use LoRA Weight plus the Is Regularization flag to control their relative influence.


5.9 SAMPLE panel (training previews)

The SAMPLE panel does not influence training directly; it controls how AI Toolkit periodically generates preview videos so you can pick the best checkpoint.

Configure high‑level sampling settings like this:

  • Sample Every

    Set Sample Every to 250. This matches the Save Every setting so each checkpoint has a corresponding set of preview videos.

  • Sampler

    Use a sampler compatible with Wan’s flow‑matching scheduler, typically shown as FlowMatch or similar in your build.

  • Width / Height

    On 24 GB GPUs, use something like 768 × 768 or a vertical format such as 704 × 1280 for samples. Avoid 1024×1024 preview videos unless you are comfortable with slower sampling; training itself does not require 1024² previews.

  • Guidance Scale

    Start with a Guidance Scale around 3.5–4, which matches many Wan 2.2 demo configs.

  • Sample Steps

    Set Sample Steps to 25. More steps rarely change motion quality dramatically and mostly increase time.

  • Seed / Walk Seed

    Set a fixed Seed like 42. Turn Walk Seed ON if you want each preview to get a different seed while still being clustered near the original.

  • Num Frames

    Set Num Frames in the SAMPLE panel equal or close to your training value. If you trained with 41 frames, sample with 41 as well. Once the LoRA looks good, you can test generalisation by generating longer clips at 81 frames; training at 41 often generalises surprisingly well to 81‑frame inference.

  • FPS

    Usually keep FPS = 16. Changing FPS only affects playback speed, not the learned motion itself.

For prompts, add 2–4 prompt rows that mirror your training distribution. For each row, attach a control image similar to what you’ll use at inference.


6. Wan 2.2 I2V 14B LoRA Training Settings for Motion, Style and Character

Here are quick recipes for common Wan 2.2 I2V LoRA types. Treat these as starting points and adjust based on previews.

6.1 Motion / camera LoRA

Goal: teach Wan a new motion like orbit 360, orbit 180, or a specific camera swing.

Use 10–30 short clips (~3–8s) where the target motion is very clear and occupies most of the clip. Captions should explicitly describe the motion, for example orbit 180 around the subject or orbit 360 around a futuristic city.

Panel guidelines:

  • MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10.
  • TARGET: Linear Rank = 16.
  • TRAINING: Learning Rate = 0.0001, Steps ≈ 1500–2500, Timestep Type = Linear, Timestep Bias = Balanced, DOP OFF.
  • DATASETS: Resolutions at 512/768, Num Frames = 33–41 on 24 GB (up to 81 on H100/H200), Caption Dropout Rate ≈ 0.05–0.1.

Train with Save Every = 250 and Sample Every = 250. When inspecting samples, focus on whether the target motion is stable across different prompts and scenes; if it only works on near‑duplicates of your training clips, prefer improving data diversity or slightly increasing steps over pushing the bias away from Balanced.


6.2 Style LoRA (video look / grade)

Goal: change visual style while respecting Wan’s base motion and composition.

Use 10–40 images or clips that all share the same look but cover diverse scenes and subjects, for example grainy 16mm film look, high contrast, warm tint.

Panel guidelines:

  • MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10.
  • TARGET: Linear Rank = 16 for simple styles; 16–32 for complex or cinematic looks.
  • TRAINING: Learning Rate = 0.0001, Steps ≈ 1500–2500, Timestep Type = Linear (or Shift), Timestep Bias = Favor High Noise.
  • Regularization (DOP): Differential Output Preservation ON, DOP Loss Multiplier = 1, DOP Preservation Class matching your dominant subject (often person or scene), Cache Text Embeddings = OFF.
  • DATASETS: Resolutions 512/768 on 24 GB (and 768–1024 on big GPUs), Num Frames = 33–41 on 24 GB, Caption Dropout Rate around 0.05 if Cache Text Embeddings is OFF.

Watch for whether the style applies consistently across scenes and lighting. If it starts to overpower content or make everything look the same, try lowering the learning rate mid‑run, stepping back to an earlier checkpoint, or reducing the LoRA rank.


6.3 Character LoRA (video likeness)

Character LoRAs on I2V are more challenging than on text‑to‑image models, but they are feasible.

Use 10–30 short clips of the same character in varied poses, scales, angles, and backgrounds; captions should always include your Trigger Word plus a class, for example frung, young woman, casual clothing. If you can gather 20–40 clips, identity robustness usually improves, but it is not strictly required to get usable results.

Panel guidelines:

  • MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10.
  • TARGET: Linear Rank = 16 on 24 GB; 16–32 on high‑VRAM GPUs (use 32 when you have headroom and care about close‑up, high‑res faces).
  • TRAINING: Learning Rate = 0.0001, Steps ≈ 2000–3000, Timestep Type = Sigmoid (or Linear), Timestep Bias = Balanced.
  • Regularization (DOP): Differential Output Preservation ON, DOP Loss Multiplier = 1, DOP Preservation Class = person.
  • DATASETS: Resolutions as high as your hardware allows (512/768 on 24 GB; add 1024 on big GPUs), Num Frames = 33–41 on 24 GB, or 41–81 on H100/H200.

Community experience suggests that identity and likeness lean more on the low‑noise expert, but keeping Timestep Bias = Balanced and using a shaped Timestep Type (Sigmoid) usually gives a better trade‑off between likeness and overall video stability than hard‑biasing toward low noise.


7. Troubleshooting common Wan I2V LoRA issues

Motion too fast compared to source

This usually happens if you trained with fewer frames per clip than your inference setting. For example, you might have trained at 21 or 41 frames but you’re sampling at 81 frames with FPS fixed at 16. The same motion gets "stretched" differently.

You can fix this by lowering FPS in the SAMPLE panel (for playback only), or by training and sampling at a consistent Num Frames such as 41 so temporal behaviour is more predictable.
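
The underlying arithmetic is simply frames divided by FPS, which makes the mismatch easy to see:

```python
# Clip duration is num_frames / fps, so the same learned motion is compressed
# or stretched depending on how many frames you sample it over.
for frames in (21, 41, 81):
    print(f"{frames} frames @ 16 FPS = {frames / 16:.2f}s")
# 21 frames @ 16 FPS = 1.31s
# 41 frames @ 16 FPS = 2.56s
# 81 frames @ 16 FPS = 5.06s
```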


Camera doesn’t move or composition barely changes

If the camera barely moves or composition looks like the base model:

Check that you are actually training the high‑noise stage and that Timestep Bias is not pushed too strongly toward low timesteps. Make sure High Noise is ON in the MULTISTAGE panel and that Timestep Bias is Balanced or Favor High Noise for motion LoRAs. Also check that captions clearly describe the desired motion; Wan cannot learn motion that is neither visible nor named.


Details and faces look worse than base Wan

If your LoRA removes detail or worsens faces:

Try increasing Linear Rank slightly (for example from 16 to 32) and favouring low noise in the Timestep Bias so more training signal lands on late timesteps where identity and detail live. You can also lower the learning rate and resume from an earlier checkpoint.


LoRA overfits and only works on training‑like scenes

If the LoRA only looks correct on scenes very similar to the training data:

Reduce the total number of Steps (for example from 5000 down to 3000), increase dataset diversity, and consider enabling Differential Output Preservation if it is currently off. If DOP is already ON and the effect is still too narrow, slightly lower the LoRA rank and/or learning rate.


VRAM out‑of‑memory errors

If training frequently runs out of VRAM:

Reduce any combination of:

  • resolution buckets (drop 1024 and keep 512/768),
  • Num Frames (for example from 41 down to 21),
  • batch size (keep it at 1 if it isn’t already).

Turn Low VRAM ON, turn Layer Offloading ON if you only have 10–12 GB VRAM and plenty of system RAM, and make sure quantization is enabled in the QUANTIZATION panel (4bit with ARA or float8 for the Transformer, float8 for the Text Encoder). If local VRAM is still not enough, consider running the same AI Toolkit job on RunComfy’s cloud with an H100 or H200 GPU, where you can keep settings much simpler.

👉 RunComfy AI Toolkit trainer


8. Export and use your Wan I2V LoRA

Once training is complete, you can use your Wan 2.2 I2V 14B LoRA in two simple ways:

  • Model playground – open the Wan 2.2 I2V 14B LoRA playground and paste the URL of your trained LoRA to quickly see how it behaves on top of the base model.
  • ComfyUI workflows – start a ComfyUI instance, build a workflow, plug in your LoRA, and fine‑tune its weight and other settings for more detailed control.
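
If you also want to test the LoRA from a script, the sketch below uses Hugging Face diffusers. It assumes a recent diffusers release that ships WanImageToVideoPipeline, the Wan2.2-I2V-A14B-Diffusers checkpoint on the Hub, and enough VRAM for bf16 inference; the file paths and prompt are placeholders, and depending on your diffusers version you may need extra steps to load a LoRA into both experts.

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Sketch only: load the Wan 2.2 I2V base pipeline, apply your trained LoRA,
# and render a short clip from a single start image. The checkpoint id,
# LoRA path, image and prompt are assumptions -- adjust to your setup.
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("path/to/wan_i2v_orbit_v1.safetensors")

image = load_image("start_frame.png")
frames = pipe(
    image=image,
    prompt="orbit 360 around the subject",
    num_frames=81,
    guidance_scale=3.5,
    num_inference_steps=25,
).frames[0]
export_to_video(frames, "orbit_preview.mp4", fps=16)
```

Swap in your own start image, prompt and LoRA path; for finer control over LoRA weight, schedulers and multi‑LoRA stacking, the ComfyUI route above is the more flexible option.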
