AI Toolkit LoRA Training Guides

Wan 2.2 I2V 14B Image-to-Video LoRA Training with Ostris AI Toolkit

This guide walks you through training Wan 2.2 I2V 14B image-to-video LoRAs using the Ostris AI Toolkit. You'll learn how Wan's dual high-noise and low-noise experts work, how to design motion, style and character clip datasets, and how to configure Multi-stage, Num Frames, resolution buckets and quantization so I2V LoRAs run reliably on 24GB local GPUs or H100/H200 machines.


Wan 2.2 I2V 14B is the image‑to‑video model in the Wan 2.2 family, turning a single image into 5‑second clips with controllable motion, camera moves and temporal consistency. By the end of this guide, you’ll be able to:

  • Design Wan I2V LoRA datasets for motion, style and character use cases (and know how many clips you actually need).
  • Understand how Wan’s dual high-noise / low-noise experts, timestep settings, Num Frames and resolution interact during training.
  • Configure the AI Toolkit panels (JOB, MODEL, QUANTIZATION, MULTISTAGE, TARGET, TRAINING, DATASETS, SAMPLE) for stable 24GB runs and for larger H100/H200 cloud setups.
  • Export your Wan 2.2 I2V LoRA and plug it into ComfyUI or the Wan I2V LoRA playground for real-world projects.

This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this guide.

Important: read this before you hit “Start”.

In current AI Toolkit / diffusers builds, Wan 2.2 I2V LoRA training commonly fails for reasons that look unrelated or "random", but the failures cluster into a few predictable gotchas:

- Most common fatal issue: in‑training sampling crashes before training really starts (channel mismatch 36 vs 16). Many jobs die at step=0 on the very first preview with an error like:
RuntimeError: The size of tensor a (36) must match the size of tensor b (16) (often inside diffusers/schedulers/scheduling_unipc_multistep.py).
Workaround (recommended defaults in this guide): set Disable Sampling = ON (or Sample Every = 0) and enable Skip First Sample = ON if your UI exposes it. Validate checkpoints using a separate inference workflow after training.

- Multi‑frame datasets cannot use latent caching to disk. If you see:
caching latents is not supported for multi-frame datasets
set Cache Latents / Cache Latents to Disk = OFF for Wan I2V (when Num Frames > 1).

- OOM can still happen on H100 when buckets include 1024 and Num Frames is high. Bucketed training can occasionally sample the largest bucket and spike VRAM. Start with 512/768 only and add 1024 after your run is stable.

- Time expectation: Wan I2V video LoRA is slow. With 41 frames and mixed 512/768/1024 buckets, 3000 steps on H100 is typically “tens of hours” (often ~35–55 hours), and 81 frames is roughly ~2× that compute.

Table of contents

  1. What makes Wan 2.2 I2V 14B special?
  2. Where to run Wan 2.2 I2V training
  3. Dataset design for Wan I2V LoRAs
  4. Wan 2.2 I2V specifics you need to understand
  5. Step‑by‑step: configure a Wan 2.2 I2V 14B LoRA in AI Toolkit
  6. Wan 2.2 I2V 14B LoRA Training Settings for Motion, Style and Character
  7. Troubleshooting common Wan I2V LoRA issues
  8. Export and use your Wan I2V LoRA

1. What makes Wan 2.2 I2V 14B special?

Wan 2.2 I2V 14B ("A14B") is the image‑to‑video variant of Wan 2.2. Architecturally it is a dual‑stage Mixture‑of‑Experts (MoE) transformer. There are two separate 14B‑parameter transformers. The high‑noise transformer handles early, very noisy timesteps and is responsible for global composition, motion trajectory, and camera movement. The low‑noise transformer handles late, clean timesteps and is responsible for fine detail, identity, and texture.

At inference time, the pipeline splits timesteps around a boundary at roughly 875/1000 of the noise schedule and routes them to the high‑ or low‑noise transformer. In practice, each expert handles about half of the denoising process. Wan 2.2 I2V generates up to 81 frames at 16 FPS, which is about 5 seconds of video.
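
To make the routing concrete, here is a minimal sketch of the idea under the boundary described above. It is purely illustrative: the function and variable names (pick_expert, high_noise_dit, low_noise_dit) are placeholders, not the actual Wan 2.2 pipeline code.

```python
# Illustrative routing sketch -- not the actual Wan 2.2 pipeline code.
BOUNDARY = 0.875  # fraction of the 0-1000 noise schedule where the experts hand over

def pick_expert(timestep: float, high_noise_dit, low_noise_dit, num_train_timesteps: int = 1000):
    """Route a denoising timestep to the high-noise or low-noise transformer."""
    if timestep >= BOUNDARY * num_train_timesteps:
        return high_noise_dit   # early, noisy steps: layout, motion, camera
    return low_noise_dit        # late, cleaner steps: detail, identity, texture
```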

For LoRA training this has three key consequences. You can choose to train one or both stages. You can bias training toward composition and motion (high noise) or toward identity and detail (low noise). And because you process sequences of frames, frame count, resolution, VRAM and quantization/offloading settings matter much more than they do for an image‑only model.

AI Toolkit exposes these controls mainly through the MULTISTAGE, TRAINING, TARGET, and DATASETS panels.


2. Where to run Wan 2.2 I2V training

You can run this Wan 2.2 I2V LoRA training workflow either on the cloud AI Toolkit on RunComfy or on a local AI Toolkit install. The UI and panels are the same; only the hardware changes.

2.1 RunComfy cloud AI Toolkit (recommended for first runs)

If you don’t want to manage CUDA, drivers, or large model downloads, use the cloud AI Toolkit on RunComfy:

👉 RunComfy AI Toolkit trainer

On that page you get the AI Toolkit UI pre‑installed in the browser. You can upload datasets, configure jobs exactly as in this guide, and run training on an H100 (80 GB) or H200 (141 GB) GPU. This is the easiest way to reproduce the tutorial reliably without touching local setup.


2.2 Local AI Toolkit

If you prefer to run locally, install the AI Toolkit repository following its README (Python + PyTorch for training, Node for the UI), then launch the UI (npm run build_and_start in ui/). Open http://localhost:8675 and you’ll see the same panels as in the screenshots and descriptions here.


3. Dataset design for Wan I2V LoRAs

Wan 2.2 I2V is trained on video‑clip + caption pairs. Each training sample is a sequence of frames plus text. In AI Toolkit you do not need to manually cut every clip to the same length. Instead you configure Num Frames in the DATASETS panel and the data loader will evenly sample that many frames from each video, automatically handling clips of different durations.
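
To see what "evenly sample" means in practice, here is a small sketch of the index math under that assumption; sample_frame_indices is an illustrative helper, not AI Toolkit's actual data loader.

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int) -> list:
    """Evenly pick `num_frames` frame indices from a clip of `total_frames` frames."""
    if total_frames <= num_frames:
        return list(range(total_frames))  # short clip: use every frame
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int).tolist()

# A 10 s clip at 30 FPS (300 frames) reduced to 41 training frames:
print(sample_frame_indices(300, 41)[:5])  # [0, 7, 15, 22, 30]
```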

3.1 Decide what kind of LoRA you’re training

How you set hyper‑parameters depends heavily on your goal:

  • Motion / camera LoRA focuses on patterns like "orbit 360 around subject", "slow dolly zoom", "hand‑held jitter", or specific action beats.
  • Style LoRA makes videos look like a particular film stock, anime style, or painterly look, while still keeping Wan’s base motion and scene composition.
  • Character LoRA tries to preserve a specific character or face consistently across many scenes and motions.

Wan 2.2 I2V can do all three. Motion LoRAs lean more heavily on the high‑noise stage, while style and character LoRAs lean more on the low‑noise stage plus very consistent visuals.


3.2 Video clips and cropping

Use real video clips (.mp4, .mov, etc.), not GIFs. Clip length can vary (for example 5–30 seconds). AI Toolkit will evenly sample training frames along each clip according to your Num Frames setting.

The one thing you should always do by hand is to crop and trim each clip so that the motion you care about starts quickly and there is not a lot of "standing around" at the beginning or end. For motion LoRAs in particular, you want the motion to occupy almost the entire clip — for example, the full orbit, the full dolly move, or the full gesture.


3.3 How many clips do you need?

As a rough rule of thumb:

  • A simple motion LoRA that teaches a single type of camera move usually trains well on 10–30 short clips (~3–8s) where the target motion is very clear and occupies most of the frame.
  • A style LoRA typically needs 10–40 images or clips that cover different scenes, lighting, and subjects, but all share the same look and colour treatment.
  • A character LoRA on I2V behaves more like an image LoRA. As a minimum, aim for 10–30 short clips of the same character, with varied poses, scales, angles, and backgrounds; if you can comfortably reach 20–40 clips, likeness and robustness usually improve.

3.4 Captions for I2V clips

Each video file can optionally have a .txt caption with the same base name (for example castle_orbit.mp4 and castle_orbit.txt). AI Toolkit also supports a Default Caption that is used whenever a clip has no per‑file caption.

Good caption patterns:

  • For a motion LoRA, encode the motion explicitly in the text, for example:

    orbit 360 around the subject, orbit 180 around the subject, or slow dolly in toward the character.

  • For a style LoRA, describe the look, not the scene content, for example:

    grainy 16mm film look, high contrast, warm tint.

  • For a character LoRA, include a trigger word plus a class, for example:

    frung, young woman, casual clothing (where frung is your trigger token).

You can also combine a Trigger Word set in the JOB panel with captions that contain [trigger]. AI Toolkit will replace [trigger] with your chosen trigger string when loading the dataset so you don’t have to hard‑code the trigger name in every caption.
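
Putting the caption rules together, a minimal sketch of how a per‑file caption, the Default Caption and [trigger] substitution could resolve for one clip might look like this; load_caption is a hypothetical helper for illustration, not AI Toolkit code.

```python
from pathlib import Path

def load_caption(video_path: str, default_caption: str, trigger_word: str = "") -> str:
    """Use the clip's .txt caption if it exists, else the Default Caption,
    then substitute [trigger] with the configured Trigger Word."""
    txt = Path(video_path).with_suffix(".txt")
    caption = txt.read_text().strip() if txt.exists() else default_caption
    if trigger_word:
        caption = caption.replace("[trigger]", trigger_word)
    return caption

# castle_orbit.txt containing "[trigger] orbiting a castle" with Trigger Word
# "wan_cam_orbit" resolves to "wan_cam_orbit orbiting a castle".
```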


4. Wan 2.2 I2V specifics you need to understand

4.1 High‑noise vs low‑noise transformers

Wan I2V’s two transformers behave roughly like this:

The high‑noise transformer operates at timesteps near the start of the diffusion process (approximately 1000 down to ~875). It sets up global composition and coarse shapes and decides where objects go, how the camera moves, and what the motion trajectory will be. It is critical for motion and layout.

The low‑noise transformer runs at timesteps from about 875 down to 0. It refines details, textures, face likeness, and micro‑motions. It is critical for identity, texture, and sharpness.

In practice, training only the high‑noise stage can teach new kinds of movement and composition but tends to under‑train detail. Training only the low‑noise stage struggles to significantly change motion or layout at all. For most LoRAs you should train both stages and then steer emphasis using Timestep Bias in the TRAINING panel.


4.2 Frames, FPS and speed

Wan 2.2 I2V 14B can generate up to 81 frames at 16 FPS, which is 5 seconds. In practice, valid video frame counts follow the "4n+1" rule (for example 9, 13, 17, 21, 33, 41, 81…), so choose clip lengths from that family; 1 frame is also supported and effectively reduces I2V to a single‑frame, image‑like mode for LoRA training.
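
If you want a quick way to check or snap a frame count against the "4n+1" rule, a small helper along these lines works; it is illustrative only and not part of AI Toolkit.

```python
def is_valid_frame_count(n: int) -> bool:
    """True for 1 or any count of the form 4n + 1 (5, 9, 13, ..., 81)."""
    return n >= 1 and (n - 1) % 4 == 0

def snap_down_to_valid(n: int) -> int:
    """Round an arbitrary frame count down to the nearest valid value."""
    return max(1, ((n - 1) // 4) * 4 + 1)

print(is_valid_frame_count(41), is_valid_frame_count(40))  # True False
print(snap_down_to_valid(80))  # 77
```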

In AI Toolkit there are two separate Num Frames knobs. Num Frames in the DATASETS panel controls how many frames per clip are sampled for training. Num Frames in the SAMPLE panel controls how long your preview videos are. They do not have to match exactly, but keeping them similar makes behaviour easier to reason about.

A good starting point for training is 41 frames (around 2.5 seconds). On 80–96 GB GPUs (H100‑class) you can go up to the full 81-frame configuration. Shorter lengths such as 21 or 33 frames can be used to reduce VRAM load and step time on small GPUs, at the cost of capturing less temporal context.


4.3 Resolution and pixel area

Wan’s official demos tend to keep the effective area around 480×832 ≈ 400k pixels, and the Hugging Face spaces snap dimensions to multiples of 16 or 32.

For LoRA training with AI Toolkit:

  • On a 24 GB GPU, use resolution buckets like 512 and 768. Avoid 1024×1024 unless you are very aggressively quantized and/or using layer offloading; video at 1024² plus 41–81 frames is heavy.
  • On 48 GB+ GPUs or H100/H200, you can safely add a 1024 bucket and even use cinematic widescreen resolutions centred around values like 1024×576, 1024×608, or 1024×640.

AI Toolkit will automatically bucket and downscale your videos into the selected resolutions when loading the dataset.
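
For intuition about what "around 400k pixels, snapped to multiples of 16" means for your own footage, here is a back‑of‑envelope helper. snap_resolution is illustrative only and is not how AI Toolkit's bucketing is actually implemented.

```python
def snap_resolution(width: int, height: int, target_area: int = 480 * 832, multiple: int = 16):
    """Rescale (width, height) to roughly `target_area` pixels, keeping the aspect
    ratio, then snap both sides to a multiple of 16."""
    def snap(v: float) -> int:
        return max(multiple, int(round(v / multiple)) * multiple)

    aspect = width / height
    new_h = (target_area / aspect) ** 0.5
    return snap(new_h * aspect), snap(new_h)

print(snap_resolution(1920, 1080))  # (848, 480): ~407k pixels, roughly 16:9, multiples of 16
```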


5. Step‑by‑step: configure a Wan 2.2 I2V 14B LoRA in AI Toolkit

We assume you have at least a 24 GB‑class GPU, so the settings below are a safe baseline. If you’re on a larger card or using the cloud AI Toolkit on RunComfy, some panels also include a short note on how to scale the settings up.


5.1 JOB panel

In the JOB panel you set basic metadata and, optionally, a trigger token.

  • Training Name

    Use any descriptive name; it becomes the folder name for checkpoints and samples. Examples: wan_i2v_orbit_v1, wan_i2v_style_neon, wan_i2v_char_frung_v1.

  • GPU ID

    On a local install this points to your physical GPU. On the RunComfy cloud AI Toolkit you can leave this as default; the actual machine type (H100/H200) is chosen later in the Training Queue.

  • Trigger Word (optional)

    Use a trigger for character or style LoRAs where you want a dedicated token such as frung or wan_cam_orbit. If your dataset captions contain [trigger], AI Toolkit will substitute your Trigger Word value into those captions automatically at load time.

    For pure motion LoRAs, you often do not need a trigger word because the behaviour is already encoded in phrases like "orbit 360 around the subject". For characters and styles, it is strongly recommended to use a trigger so you have a clean on/off switch for your LoRA later.


5.2 MODEL and QUANTIZATION panels

These panels control which Wan model checkpoint is used and how aggressively it is quantized.

MODEL panel

  • Model Architecture

    Select Wan 2.2 I2V (14B).

  • Name or Path

    This field lets you override the default model hub path, ai-toolkit/Wan2.2-I2V-A14B-Diffusers-bf16. Leave it blank (or at the default value) and AI Toolkit will download the recommended base model from Hugging Face, or point it to a local path if you want to use a custom Wan 2.2 checkpoint.

  • Low VRAM

    Turn Low VRAM ON for 24 GB consumer GPUs or any card that is also driving your display. On 48 GB+ cards (including H100/H200), you can often set it OFF for speed as long as you keep your training load reasonable (for example 512/768 buckets and ~41 frames). If you see intermittent OOMs (often caused by the largest resolution bucket) or you want to push 1024 buckets and/or 81 frames, turn Low VRAM ON for stability.

  • Layer Offloading

    This toggle streams parts of the model to CPU RAM instead of keeping all layers resident in VRAM. It is only necessary if you are trying to run Wan I2V on a very small GPU (around 10–12 GB VRAM) and have a lot of system RAM (64 GB+). It can roughly double step time but can bring peak VRAM below ~9 GB. For 24 GB GPUs, start with Layer Offloading OFF and only turn it on if you still hit out‑of‑memory errors.

On big GPUs / RunComfy:

On 48 GB+ or on H100/H200, start with Layer Offloading OFF. Keep Low VRAM OFF if you want maximum speed, but pair it with conservative buckets (512/768) and frames (≈41) first. If you push 1024/81 and hit OOM spikes, flip Low VRAM ON (or drop 1024) to stabilize the run.

QUANTIZATION panel

  • Transformer

    On 24–32 GB GPUs, set Transformer to 4bit with ARA. This uses a 4‑bit quantization together with an Accuracy Recovery Adapter so that VRAM usage is close to plain 4‑bit while quality stays much closer to bf16.

  • Text Encoder

    Set Text Encoder to float8 (or qfloat8). This reduces VRAM and compute for the text encoder with negligible impact on Wan 2.2 I2V LoRA quality.

This mirrors the official AI Toolkit example configs for Wan 2.2 video LoRAs and is the main reason training is practical on 24 GB cards. If you run into stability issues or severe slow‑downs with ARA on a particular setup, you can fall back to qfloat8 for the Transformer as well; it uses more VRAM but behaves very similarly in terms of quality.
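
For a rough sense of why quantization matters here, the back‑of‑envelope arithmetic below estimates weight memory for a single 14B transformer at different precisions. It ignores the ARA adapter, activations, gradients, the text encoder and the VAE, so treat the numbers as weight‑only lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight memory only -- ignores activations, gradients, optimizer state and overhead."""
    return num_params * bits_per_param / 8 / 1024**3

params = 14e9  # one Wan 2.2 I2V 14B transformer (there are two experts)
for label, bits in [("bf16", 16), ("float8", 8), ("4bit", 4)]:
    print(f"{label:>6}: ~{weight_memory_gb(params, bits):.1f} GB of weights")
# bf16: ~26.1 GB   float8: ~13.0 GB   4bit: ~6.5 GB
```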

On big GPUs / RunComfy:

On an H100/H200 or a 48–96 GB workstation card, you can either keep 4bit with ARA and spend the extra VRAM on higher resolution, more frames, or a higher LoRA rank, or switch the Transformer to a pure float8 / qfloat8 option for a simpler stack. Going all the way back to full bf16 everywhere is rarely necessary for LoRA training.


5.3 MULTISTAGE panel (high / low noise)

The MULTISTAGE panel lets you decide which Wan expert(s) to train and how often the trainer switches between them.

  • Stages to Train

    Keep both High Noise and Low Noise set to ON for most LoRAs. High noise controls composition and motion; low noise controls detail and identity.

  • Switch Every

    This value controls how many steps you run on one expert before swapping to the other. With High Noise = ON, Low Noise = ON, Switch Every = 10, and Steps = 3000, AI Toolkit trains:

    • Steps 1–10 on the high‑noise transformer,
    • Steps 11–20 on the low‑noise transformer,
    • and repeats this alternation until training is done.

On large GPUs you can use Switch Every = 1 (alternate every step) only if both experts stay resident in VRAM (no Low VRAM/offload/swap). If Low VRAM or any offloading/swapping is involved, each switch can trigger expensive unload/load, and Switch Every = 1 becomes extremely slow. In that case, prefer Switch Every = 10–50 to reduce swap overhead.
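
The alternation itself is simple arithmetic; the sketch below reproduces the schedule described above (blocks of Switch Every steps, starting on the high‑noise expert) purely for illustration.

```python
def expert_for_step(step: int, switch_every: int = 10) -> str:
    """Return which expert a 1-based training step updates when both stages are ON."""
    block = (step - 1) // switch_every
    return "high_noise" if block % 2 == 0 else "low_noise"

# Steps 1-10 train the high-noise expert, 11-20 the low-noise expert, and so on.
print([expert_for_step(s) for s in (1, 10, 11, 20, 21)])
# ['high_noise', 'high_noise', 'low_noise', 'low_noise', 'high_noise']
```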

For a 24 GB GPU baseline, use:

  • High Noise = ON
  • Low Noise = ON
  • Switch Every = 10–50

On big GPUs / RunComfy:

If both experts stay resident (Low VRAM OFF, no offloading), you can set Switch Every = 1 for slightly smoother alternation. If you see slow step times or swapping, use 10–50 instead.


5.4 TARGET panel (LoRA network settings)

In the TARGET panel you configure what kind of adapter you are training and how "wide" it is.

  • Target Type

    Set Target Type to LoRA.

  • Linear Rank

    Linear Rank controls LoRA capacity per block. Higher rank increases capacity but also VRAM usage and the risk of overfitting. For Wan 2.2 I2V, practical defaults are:

    • Motion and camera LoRAs: Rank 16 is usually enough because they modify behaviour more than tiny visual details.
    • Style LoRAs: start with Rank 16; move to 32 only if the style is complex and you have VRAM headroom.
    • Character LoRAs: start with Rank 16 (even on large GPUs). Move to 32 only after you confirm your run is stable (no OOM spikes) and you specifically need more capacity for close‑up, high‑res faces.

On very large GPUs, Rank 32 can help for rich styles and demanding character work, but it is not required to get a good LoRA and it can make OOM spikes more likely when combined with large buckets and many frames.
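
For a feel of what Linear Rank means in adapter size, the standard LoRA parameter count for one linear layer is shown below; the 5120‑wide layer is a hypothetical example for illustration, not Wan's exact dimensions.

```python
def lora_params_per_linear(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter on a Linear(in, out) layer adds A (rank x in) and B (out x rank)."""
    return rank * in_features + out_features * rank

# Hypothetical 5120-wide layer: doubling the rank doubles adapter size per targeted layer.
print(lora_params_per_linear(5120, 5120, 16))  # 163840
print(lora_params_per_linear(5120, 5120, 32))  # 327680
```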


5.5 SAVE panel

The SAVE panel controls how often checkpoints are written and in what precision.

  • Data Type

    Use BF16 or FP16. Both are fine for LoRAs. BF16 is slightly more numerically stable on modern GPUs.

  • Save Every

    Set Save Every to around 250. This gives you a checkpoint every 250 steps.

  • Max Step Saves to Keep

    Set Max Step Saves to Keep between 4 and 6. This keeps disk usage under control while still leaving you some earlier checkpoints to fall back to.

You do not have to use the last checkpoint. Very often the best‑looking samples come from somewhere around 2000–4000 steps. The SAMPLE panel configuration below explains how to judge this.

If you disable in‑training sampling (recommended below for current Wan I2V builds), keep a few checkpoints (for example every 250 steps) and evaluate them later using a separate inference workflow.


5.6 TRAINING panel

The TRAINING panel holds most of the important knobs: batch size, learning rate, timesteps, loss, and text encoder handling.

Core hyper‑parameters

Configure the core training settings like this for a 24 GB Wan I2V video LoRA:

  • Batch Size

    Start with 1. Video models are heavy, and 1 is realistic even on 24 GB cards. On H100/H200 you can later experiment with batch sizes of 2–4.

  • Gradient Accumulation

    Leave Gradient Accumulation at 1 initially. Effective batch size is batch size times gradient accumulation. You can raise it to 2 or 4 if VRAM is extremely tight and you want a slightly larger effective batch, but gains are modest for video.

  • Learning Rate

    Start with Learning Rate = 0.0001. This is the default in AI Toolkit examples and is stable for Wan LoRAs. If training looks noisy or the LoRA overshoots quickly, you can reduce to 0.00005 mid‑run and resume from the latest checkpoint.

  • Steps – typical ranges:
    • Small, focused motion LoRA with ~10–20 clips: 1500–2500 steps.
    • Character or style LoRA with 20–50 clips: 2000–3000 steps.
    • Very large datasets can go higher, but it is usually better to improve data quality (captions, diversity) than to push far beyond 3000–4000 steps.

    Rough wall‑clock expectations on an H100 (41 frames, mixed 512/768/1024 buckets):

    • 1000 steps: ~12–18 hours
    • 1500 steps: ~18–27 hours
    • 2000 steps: ~24–36 hours
    • 3000 steps: ~35–55 hours

  • Weight Decay

    Leave Weight Decay at 0.0001 unless you have a specific reason to change it; it provides mild regularization.

  • Loss Type

    Keep Loss Type as Mean Squared Error (MSE). Wan 2.2 uses a flow‑matching noise scheduler, and MSE is the standard loss for this setup.


Timesteps and scheduler

  • Timestep Type

    For Wan 2.2 I2V, Linear is the default Timestep Type and works well for most LoRA types. It spreads updates evenly along the flow‑matching schedule and plays nicely with the split between the high‑noise and low‑noise experts.

  • Timestep Bias

    Timestep Bias controls which part of the trajectory you emphasise:

    • Balanced – updates are spread across high‑ and low‑noise timesteps; this is the safe default for all LoRA types.
    • Favor High Noise – focuses more on early, noisy steps where Wan decides global layout, motion and colour.
    • Favor Low Noise – focuses more on late, clean steps where fine detail and identity live.

    Recommended combinations by LoRA type:

    • Motion / camera LoRAs – start with Timestep Type = Linear, Timestep Bias = Balanced. For very "pure" camera‑move LoRAs you can experiment with Favor High Noise to lean harder on the high‑noise expert.
    • Style LoRAs – use Timestep Type = Linear (or Shift) and Timestep Bias = Favor High Noise, so the LoRA rewrites global tone and colour while the base model still handles late‑stage details.
    • Character LoRAs – use Timestep Type = Sigmoid (or Linear) and Timestep Bias = Balanced. Identity and likeness depend more on low‑noise steps, but keeping the bias Balanced lets both experts contribute; only if you specifically want extra focus on micro‑detail should you try a slight low‑noise bias.

Under the hood, Wan 2.2 I2V uses a flow‑matching noise scheduler. AI Toolkit sets the scheduler and matching sampler automatically for the Wan 2.2 architecture, so you mainly steer behaviour via Timestep Type, Timestep Bias and the Multi‑stage settings above.


EMA (Exponential Moving Average)

  • Use EMA

    For LoRAs, EMA is optional and consumes extra VRAM and time. Most Wan LoRA users leave Use EMA OFF and it is rarely needed unless you are doing full‑model finetunes.


Text Encoder Optimizations

At the bottom of the TRAINING panel are the Text Encoder Optimizations settings. They control how aggressively the text encoder is offloaded or cached.

  • Unload TE

    This mode unloads the text encoder weights so they no longer consume VRAM between steps. For Wan 2.2 I2V LoRAs you almost always rely on rich per‑clip captions, so you should keep Unload TE OFF in normal caption‑based training. Only consider Unload TE if you are deliberately training a very narrow "trigger‑only / blank prompt" LoRA that does not use dataset captions at all.

  • Cache Text Embeddings

    This option pre‑computes caption embeddings once and reuses them, avoiding repeated text encoder passes. Turn Cache Text Embeddings ON only when your captions are static and you are not using features that modify or randomize the prompt each step, such as Differential Output Preservation, dynamic [trigger] rewriting in captions, or anything that heavily depends on caption dropout behaviour. In that case, AI Toolkit encodes all training captions once, caches the embeddings to disk, and can drop the text encoder out of VRAM.

If you plan to use DOP, Caption Dropout, or any other dynamic prompt tricks, keep Cache Text Embeddings OFF so the text encoder can re‑encode the real prompt every batch. The Differential Output Preservation and Datasets sections explain these interactions in more detail.


Regularization – Differential Output Preservation (DOP)

The Regularization section exposes Differential Output Preservation (DOP), which helps the LoRA behave like a residual edit instead of overwriting the base model.

DOP compares the base model’s output (without LoRA) to the LoRA‑enabled output and adds a penalty when the LoRA changes aspects unrelated to your target concept. It tries to teach "what changes when the trigger is present" rather than "re‑train the entire model".

For motion / camera LoRAs, you usually do not need DOP, because motion behaviour is already fairly localized. Enabling DOP roughly doubles compute by adding extra forward passes.

For style and character LoRAs, DOP is often very helpful for keeping Wan’s strong base realism intact. A good starting configuration is:

  • Differential Output Preservation: ON
  • DOP Loss Multiplier: 1
  • DOP Preservation Class: person for character LoRAs, or an appropriate class such as scene or landscape for style LoRAs if your build provides those options.

Important compatibility note: Differential Output Preservation rewrites or augments the prompt text each step (for example by swapping your trigger word for the preservation class word). Because of this, DOP is not compatible with Cache Text Embeddings. If you turn DOP ON, make sure Cache Text Embeddings is OFF so the text encoder sees the updated prompt every batch.
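
If it helps to picture what DOP is doing, the sketch below captures the idea in pseudo‑training‑step form. It is a conceptual illustration only, not AI Toolkit's implementation; the model interface and the lora_enabled flag are placeholders.

```python
import torch
import torch.nn.functional as F

def dop_style_loss(model, latents, timesteps, trigger_emb, class_emb, target,
                   dop_multiplier: float = 1.0):
    """Conceptual DOP-style objective -- NOT AI Toolkit's implementation.
    `model(..., lora_enabled=...)` is a placeholder interface."""
    # Normal LoRA training term on the trigger-word prompt.
    pred = model(latents, timesteps, trigger_emb, lora_enabled=True)
    train_loss = F.mse_loss(pred, target)

    # Preservation term: on the class prompt (e.g. "person"), the LoRA-enabled
    # output should stay close to the frozen base model's output.
    with torch.no_grad():
        base_pred = model(latents, timesteps, class_emb, lora_enabled=False)
    lora_pred = model(latents, timesteps, class_emb, lora_enabled=True)
    preservation_loss = F.mse_loss(lora_pred, base_pred)

    # The extra forward passes are why DOP costs roughly 2x compute, and because the
    # prompt text differs per step (trigger vs class), cached caption embeddings can't be reused.
    return train_loss + dop_multiplier * preservation_loss
```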


5.7 ADVANCED panel (Differential Guidance)

If your AI Toolkit build exposes the ADVANCED panel for this model, it may include Do Differential Guidance and Differential Guidance Scale.

Differential Guidance computes "with LoRA" vs "without LoRA" predictions and nudges training towards the difference between them, similar in spirit to DOP but implemented at the guidance level instead of as a separate loss term.

Practical recommendations:

  • Turn Do Differential Guidance ON with a Differential Guidance Scale around 3 for targeted edit‑style LoRAs (for example "make the camera orbit", "apply neon style") where you want the LoRA to behave like a cleaner modifier.
  • For very broad, heavy style LoRAs that rewrite the entire look, you can try lower scales (1–2) or leave it OFF if the LoRA feels too weak.

If you are tight on compute, you can safely leave Differential Guidance OFF for your first runs and experiment later.


5.8 DATASETS panel

Each dataset block in AI Toolkit maps to one entry in the datasets: list, but in the UI you simply configure one or more dataset cards.

A typical single Wan I2V dataset configuration looks like this:

  • Target Dataset

    Choose your uploaded Wan I2V video dataset folder, for example wan_orbit_clips.

  • Default Caption

    This caption is used when a clip has no .txt caption file. Examples:

    Motion LoRA: orbit 360 around the subject

    Style LoRA: cinematic neon cyberpunk style

    Character LoRA: frung, person, portrait (where frung is your trigger token).

  • Caption Dropout Rate

    This is the probability that the caption is dropped (replaced by an empty caption) for a training sample. For Wan I2V LoRAs, a small amount of dropout encourages the model to use both visual context and text. A typical starting range is 0.05–0.10 (5–10%) when the text encoder stays loaded. If you decide to enable Cache Text Embeddings in the TRAINING panel, it is often simpler to set Caption Dropout Rate = 0 so you avoid a subset of clips permanently having no caption.

  • LoRA Weight

    Usually set to 1. You only change this when mixing multiple datasets and you want one dataset to count more or less in training.

  • Settings → Cache Latents

    Keep this OFF for Wan I2V video datasets (Num Frames > 1). Many current AI Toolkit builds do not support caching latents for multi‑frame datasets and will fail during dataloader init with an error like:

    caching latents is not supported for multi-frame datasets

    If you intentionally set Num Frames = 1 (image‑like training), latent caching may work and can speed things up.

  • Settings → Is Regularization

    Leave Is Regularization OFF for your main dataset. If you add a separate regularization dataset later, you would set that dataset’s Is Regularization to ON.

  • Flipping

    Flip X and Flip Y mirror frames horizontally or vertically. For most video tasks you should keep both OFF, especially for motion LoRAs where flipping can invert left/right motion semantics or for characters with asymmetric features. For purely style‑only LoRAs you can experiment with Flip X to increase variation.

  • Resolutions

    Choose one or more resolution buckets. On a 24 GB GPU you typically enable 512 and leave 768 and 1024 disabled. On 48 GB+ or H100/H200, start with 512 and 768 for stability, then add 1024 only if you have clear VRAM headroom and your run is stable (bucketed training can spike VRAM when it hits the largest bucket). AI Toolkit will automatically assign clips to the nearest bucket and downscale as needed.

  • Num Frames

    Set Num Frames to the number of frames per clip you want to sample for training. A good starting point is 41. On very small GPUs (10–12 GB) with heavy quantization and offloading, you can reduce this to 21 or even 9 just to get training running, at the cost of shorter temporal context.

If you need multiple datasets (for example, a main motion dataset plus a small "style" dataset), you can add them all in the DATASETS panel and use LoRA Weight plus the Is Regularization flag to control their relative influence.


5.9 SAMPLE panel (training previews)

The SAMPLE panel does not influence training directly; it controls how AI Toolkit periodically generates preview videos so you can pick the best checkpoint.

Important (known issue): on some current AI Toolkit builds, Wan 2.2 I2V preview sampling can crash with a tensor channel mismatch during the scheduler step (often 36 vs 16). This most commonly happens on the very first preview at step=0, making it look like “training failed” even though the crash is in the preview pipeline.

Recommended (stable) settings: disable in‑training sampling

  • Disable Sampling

    Set Disable Sampling = ON (preferred). If your build doesn’t expose this toggle, set Sample Every = 0 (or a very large number) to effectively disable previews.

  • Skip First Sample

    Set Skip First Sample = ON if available, to avoid the common step=0 crash even if you later re‑enable sampling.

Then generate your validation videos after training (or in a separate inference job) using the checkpoints saved by the SAVE panel.

If sampling works in your environment

If your platform has patched the issue and sampling works, use these settings:

  • Sample Every

    Set Sample Every to 250. This matches the Save Every setting so each checkpoint has a corresponding set of preview videos.

  • Sampler

    Use a sampler compatible with Wan’s flow‑matching scheduler, typically shown as FlowMatch or similar in your build.

  • Width / Height

    On 24 GB GPUs, use something like 768 × 768 or a vertical format such as 704 × 1280 for samples. Avoid 1024×1024 preview videos unless you are comfortable with slower sampling; training itself does not require 1024² previews.

  • Guidance Scale

    Start with a Guidance Scale around 3.5–4, which matches many Wan 2.2 demo configs.

  • Sample Steps

    Set Sample Steps to 25. More steps rarely change motion quality dramatically and mostly increase time.

  • Seed / Walk Seed

    Set a fixed Seed like 42. Turn Walk Seed ON if you want each preview to get a different seed while still being clustered near the original.

  • Num Frames

    Set Num Frames in the SAMPLE panel equal or close to your training value. If you trained with 41 frames, sample with 41 as well. Once the LoRA looks good, you can test generalisation by generating longer clips at 81 frames; training at 41 often generalises surprisingly well to 81‑frame inference.

  • FPS

    Usually keep FPS = 16. Changing FPS only affects playback speed, not the learned motion itself.

For prompts, add 2–4 prompt rows that mirror your training distribution. For each row, attach a control image similar to what you’ll use at inference.


6. Wan 2.2 I2V 14B LoRA Training Settings for Motion, Style and Character

Here are quick recipes for common Wan 2.2 I2V LoRA types. Treat these as starting points and adjust based on checkpoint evaluation (in‑training previews may be disabled; see the SAMPLE panel).

6.1 Motion / camera LoRA

Goal: teach Wan a new motion like orbit 360, orbit 180, or a specific camera swing.

Use 10–30 short clips (~3–8s) where the target motion is very clear and occupies most of the clip. Captions should explicitly describe the motion, for example orbit 180 around the subject or orbit 360 around a futuristic city.

Panel guidelines:

  • MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10 (or 20–50 if Low VRAM/offloading causes slow swapping).
  • TARGET: Linear Rank = 16.
  • TRAINING: Learning Rate = 0.0001, Steps ≈ 1500–2500, Timestep Type = Linear, Timestep Bias = Balanced, DOP OFF.
  • DATASETS: Resolutions at 512/768, Num Frames = 33–41 (start at 41; 81 is possible on H100/H200 but expect ~2× time and higher VRAM), Caption Dropout Rate ≈ 0.05–0.1. Keep latent caching OFF for multi‑frame datasets.

Train with Save Every = 250. Sampling previews: due to a known Wan I2V sampling bug in many current builds (channel mismatch 36 vs 16), this guide recommends disabling in‑training sampling and evaluating checkpoints in a separate inference workflow. If sampling works in your environment, set Sample Every = 250.

When evaluating checkpoints, focus on whether the target motion is stable across different prompts and scenes; if it only works on near‑duplicates of your training clips, prefer improving data diversity or slightly increasing steps over pushing the bias away from Balanced.


6.2 Style LoRA (video look / grade)

Goal: change visual style while respecting Wan’s base motion and composition.

Use 10–40 images or clips that all share the same look but cover diverse scenes and subjects, for example grainy 16mm film look, high contrast, warm tint.

Panel guidelines:

  • MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10 (or 20–50 if Low VRAM/offloading causes slow swapping).
  • TARGET: Linear Rank = 16 for simple styles; 16–32 for complex or cinematic looks.
  • TRAINING: Learning Rate = 0.0001, Steps ≈ 1500–2500, Timestep Type = Linear (or Shift), Timestep Bias = Favor High Noise.
  • Regularization (DOP): Differential Output Preservation ON, DOP Loss Multiplier = 1, DOP Preservation Class matching your dominant subject (often person or scene), Cache Text Embeddings = OFF.
  • DATASETS: Resolutions 512/768 on 24 GB (and 512/768 on big GPUs, with optional 1024 only after stability), Num Frames = 33–41 on 24 GB (41–81 on H100/H200 if you can afford the time), Caption Dropout Rate around 0.05 if Cache Text Embeddings is OFF. Keep latent caching OFF for multi‑frame datasets.

Watch for whether the style applies consistently across scenes and lighting. If it starts to overpower content or make everything look the same, try lowering the learning rate mid‑run, stepping back to an earlier checkpoint, or reducing the LoRA rank.


6.3 Character LoRA (video likeness)

Character LoRAs on I2V are more challenging than on text‑to‑image models, but they are feasible.

Use 10–30 short clips of the same character in varied poses, scales, angles, and backgrounds; captions should always include your Trigger Word plus a class, for example frung, young woman, casual clothing. If you can gather 20–40 clips, identity robustness usually improves, but it is not strictly required to get usable results.

Panel guidelines:

  • MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10 (or 20–50 if Low VRAM/offloading causes slow swapping).
  • TARGET: Linear Rank = 16 on 24 GB; 16–32 on high‑VRAM GPUs (use 32 when you have headroom and care about close‑up, high‑res faces).
  • TRAINING: Learning Rate = 0.0001, Steps ≈ 2000–3000, Timestep Type = Sigmoid (or Linear), Timestep Bias = Balanced.
  • Regularization (DOP): Differential Output Preservation ON, DOP Loss Multiplier = 1, DOP Preservation Class = person.
  • DATASETS: Start with 512/768 (add 1024 only after stability), Num Frames = 33–41 on 24 GB, or 41–81 on H100/H200 (81 is significantly slower). Keep latent caching OFF for multi‑frame datasets.

Community experience suggests that identity and likeness lean more on the low‑noise expert, but keeping Timestep Bias = Balanced and using a shaped Timestep Type (Sigmoid) usually gives a better trade‑off between likeness and overall video stability than hard‑biasing toward low noise.


7. Troubleshooting common Wan I2V LoRA issues

Training crashes at step=0 during the first preview (tensor channel mismatch: 36 vs 16)

If your job “fails instantly” (or the run timeline is only a few seconds) and you see an error like:

RuntimeError: The size of tensor a (36) must match the size of tensor b (16)

(often inside diffusers/schedulers/scheduling_unipc_multistep.py), the crash is happening in the preview sampling pipeline, not in the actual training step.

Fix / workaround:

  • Disable in‑training sampling: set Disable Sampling = ON (or Sample Every = 0).
  • Enable Skip First Sample if your UI provides it.
  • Use the saved checkpoints from the SAVE panel and generate validation videos using a separate inference workflow after training.

Error: caching latents is not supported for multi-frame datasets

This happens when latent caching is enabled on a video dataset (Num Frames > 1).

Fix:

  • In the DATASETS panel, set Cache Latents / Cache Latents to Disk = OFF for Wan I2V video datasets.

Motion too fast compared to source

This usually happens if you trained with fewer frames per clip than your inference setting. For example, you might have trained at 21 or 41 frames but you’re sampling at 81 frames with FPS fixed at 16. The same motion gets "stretched" differently.

You can fix this by lowering FPS in the SAMPLE panel (for playback only), or by training and sampling at a consistent Num Frames such as 41 so temporal behaviour is more predictable.
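
The underlying arithmetic is just frames divided by FPS, as the small example below shows; matching the training and sampling Num Frames keeps that playback window, and therefore the perceived speed, consistent.

```python
def playback_seconds(num_frames: int, fps: int = 16) -> float:
    """How long a generated clip plays back at a fixed frame rate."""
    return num_frames / fps

# 41 frames cover ~2.6 s of playback at 16 FPS, while an 81-frame
# inference run stretches the same learned motion arc over ~5.1 s.
print(playback_seconds(41), playback_seconds(81))  # 2.5625 5.0625
```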


Camera doesn’t move or composition barely changes

If the camera barely moves or composition looks like the base model:

Check that you are actually training the high‑noise stage and that Timestep Bias is not set too strongly toward low timesteps. Make sure High Noise is ON in the MULTISTAGE panel; for motion LoRAs that still barely move with Balanced, try Timestep Bias = Favor High Noise. Also check that captions clearly describe the desired motion; Wan cannot learn motion that is neither visible nor named.


Details and faces look worse than base Wan

If your LoRA removes detail or worsens faces:

Try increasing Linear Rank slightly (for example from 16 to 32) and favouring low noise in the Timestep Bias so more training signal lands on late timesteps where identity and detail live. You can also lower the learning rate and resume from an earlier checkpoint.


LoRA overfits and only works on training‑like scenes

If the LoRA only looks correct on scenes very similar to the training data:

Reduce the total number of Steps (for example from 5000 down to 3000), increase dataset diversity, and consider enabling Differential Output Preservation if it is currently off. If DOP is already ON and the effect is still too narrow, slightly lower the LoRA rank and/or learning rate.


VRAM out‑of‑memory errors

If training frequently runs out of VRAM:

Reduce any combination of:

  • resolution buckets (drop 1024 and keep 512/768),
  • Num Frames (for example from 41 down to 21),
  • batch size (keep it at 1 if it isn’t already).

Turn Low VRAM ON, turn Layer Offloading ON if you only have 10–12 GB VRAM and plenty of system RAM, and make sure quantization in the QUANTIZATION panel is set aggressively (Transformer 4bit with ARA, or qfloat8 as a fallback, and Text Encoder float8/qfloat8). If local VRAM is still not enough, consider running the same AI Toolkit job on RunComfy’s cloud with an H100 or H200 GPU, where you can keep settings much simpler.

If you are seeing OOM even on large GPUs (for example H100), it is usually a bucket spike problem:

  • Drop the 1024 bucket until the run is stable, then re‑add it later.
  • Reduce Num Frames (41 → 33 → 21).
  • Keep Layer Offloading OFF unless you truly need it (it can make runs slower and more swap-heavy).
  • If swapping is involved, increase MULTISTAGE Switch Every (10–50) to avoid per‑step unload/load overhead.
  • Prefer more aggressive quantization for memory: Transformer 4bit with ARA (or qfloat8 if ARA is unstable) and Text Encoder float8/qfloat8.

Training is much slower than expected (tens of seconds per step)

Wan 2.2 I2V LoRA training is slow by nature: each step processes many frames, and training both experts means you often need more total steps to give each stage enough updates.

If it feels unreasonably slow or keeps getting slower over time:

  • Reduce Num Frames (41 → 33 → 21).
  • Drop the 1024 bucket (stick to 512/768).
  • Avoid Layer Offloading unless you truly need it.
  • If Low VRAM/offload/swapping is enabled, don’t use Switch Every = 1; use 10–50.

👉 RunComfy AI Toolkit trainer


8. Export and use your Wan I2V LoRA

Once training is complete, you can use your Wan 2.2 I2V 14B LoRA in two simple ways:

  • Model playground – open the Wan 2.2 I2V 14B LoRA playground and paste the URL of your trained LoRA to quickly see how it behaves on top of the base model.
  • ComfyUI workflows – start a ComfyUI instance, build a workflow, plug in your LoRA, and fine‑tune its weight and other settings for more detailed control.
