Wan 2.2 T2V 14B LoRA Training with AI Toolkit

Wan 2.2 T2V 14B is the text‑to‑video model in the Wan 2.2 series, generating rich 5‑second clips with strong motion, detail and camera control from plain prompts. By the end of this guide, you’ll be able to:

Train Wan 2.2 T2V 14B LoRAs with AI Toolkit for consistent characters, strong styles, and motion/camera behaviours.
Choose between local training on a 24GB+ NVIDIA GPU (with 4‑bit ARA quantization) and cloud training on H100/H200 GPUs, and understand what each tier can realistically handle.
Understand how Wan’s high‑noise and low‑noise experts interact with Multi‑stage, Timestep Type/Bias, Num Frames, and resolution, so you can control where the LoRA injects changes.
Configure AI Toolkit panel‑by‑panel (JOB, MODEL, QUANTIZATION, MULTISTAGE, TARGET, SAVE, TRAINING, DATASETS, SAMPLE) so you can adapt the same recipe to different LoRA goals and hardware.

This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this Wan 2.2 T2V 14B guide.

1. Wan 2.2 T2V 14B overview for LoRA training
2. Where to train Wan 2.2 T2V LoRAs (local vs cloud)
3. Hardware & VRAM expectations for Wan 2.2 T2V LoRAs
4. Building a Wan 2.2 T2V LoRA dataset
5. Step‑by‑step: train a Wan 2.2 T2V 14B LoRA in AI Toolkit
6. Wan 2.2 T2V 14B LoRA Training Settings
7. Export and use your Wan T2V LoRA

1. Wan 2.2 T2V 14B overview for LoRA training

Wan 2.2 is a family of open text/video models with three main variants: a 5B text/image‑to‑video model and two 14B models, T2V and I2V (see the Wan 2.2 GitHub repository). This guide targets the 14B text‑to‑video model, Wan2.2‑T2V‑A14B.

Dual‑transformer "high noise / low noise" design

Under the hood, Wan 2.2 14B uses a Mixture‑of‑Experts text‑to‑video backbone:

High‑noise: ~14B‑parameter transformer that handles the very noisy early part of denoising (rough composition, global motion, camera).
Low‑noise: ~14B‑parameter transformer that refines relatively clean frames near the end (details, texture, identity).

Together the model has about 27B parameters, but at each diffusion step only one expert (≈14B parameters) is active. Timesteps are split around t ≈ 875 of 1000 in the noise schedule: roughly 1000→875 go to the high‑noise expert and 875→0 go to the low‑noise expert, with internal shifting to keep coverage roughly balanced over the trajectory.
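The routing rule described above can be sketched in a few lines. This is an illustration only: the function name and the treatment of the exact boundary value are assumptions, and the sketch ignores the internal shifting mentioned above.

```python
# Illustrative sketch of the high-noise / low-noise split, not Wan's actual code.
BOUNDARY_TIMESTEP = 875  # approximate split point on a 0-1000 schedule (from the text)


def expert_for_timestep(t: float, boundary: float = BOUNDARY_TIMESTEP) -> str:
    """Return which expert would handle timestep t (1000 = pure noise, 0 = clean)."""
    if not 0 <= t <= 1000:
        raise ValueError("timestep must be in [0, 1000]")
    # Assumption: the boundary value itself goes to the high-noise expert.
    return "high_noise" if t >= boundary else "low_noise"


for t in (1000, 900, 874, 500, 0):
    print(t, expert_for_timestep(t))
```

Only one branch is taken per step, which is why roughly 14B parameters are active at any time even though both experts exist.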

For LoRA training this means:

You generally want to train both experts so your LoRA works across the whole denoising chain – both composition/motion and details/identity.
On smaller GPUs it’s expensive to keep both transformers in VRAM and swap them every step, which is why AI Toolkit exposes a Multi‑stage panel and Low VRAM + ARA quantization + "Switch Every N steps" options to trade speed vs VRAM.

2. Where to train Wan 2.2 T2V LoRAs (local vs cloud)

You can follow this tutorial in two environments; the AI Toolkit UI is the same.

Option A – Local AI Toolkit (your own GPU)

Install AI Toolkit from the AI Toolkit GitHub repository and run the web UI. This is best if you are comfortable with CUDA/drivers and already have a 24GB+ NVIDIA GPU (RTX 4090 / 5090 / A6000, etc.).
Because Wan 2.2 14B is heavy, Macs and GPUs under 24GB are generally only suitable for tiny image‑only LoRAs at 512 resolution (Num Frames = 1). For serious video LoRAs you really want 24GB+ and aggressive quantization.

Option B – Cloud AI Toolkit on RunComfy (H100 / H200)

Open the cloud AI Toolkit on RunComfy and log in. You get dropped straight into the AI Toolkit interface with all dependencies pre‑installed.
For Wan 2.2 T2V LoRA training, choose an H100 (80GB) or H200 (141GB) machine when you start the job so you can train long videos at higher resolution.

Benefits of using the cloud:

Zero setup – CUDA, drivers, and model weights are already configured.
Huge VRAM – you can run 33–81 frame LoRAs at 768–1024 resolution with sane batch sizes and without fighting OOM errors.
Persistent workspace – your datasets, jobs, and LoRA checkpoints live in your RunComfy account, so you can resume or iterate later.

3. Hardware & VRAM expectations for Wan 2.2 T2V LoRAs

Wan 2.2 14B is much heavier than image models or Wan 2.1:

Official T2V workflows at 1024×1024 and 81 frames can OOM even on high‑end consumer GPUs if you don’t quantize.
Long‑sequence LoRA training at 1024² / 81 frames can take many hours even on 48–96GB server cards, especially at 2–4k steps.
The official AI Toolkit example configuration for this model (train_lora_wan22_14b_24gb.yaml) is tuned for 24GB GPUs and uses 4‑bit ARA quantization with Num Frames = 1 (image‑only) as the safe default.

A reasonable mental model by VRAM tier:

Tier	Example GPUs	What’s comfortable
24–32GB "consumer"	4090 / 5090	Image‑only LoRAs (Num Frames = 1) at 512–768 px, using 4‑bit ARA and Low VRAM = ON. Short video LoRAs (33–41 frames @ 512) are possible but slow and VRAM‑tight.
48–64GB "prosumer"	A6000 (48GB), dual 4090, some server GPUs	33–41 frame video LoRAs at 768–1024 px with 4‑bit ARA and minimal offloading. Good balance of speed, capacity, and quality.
80–141GB "cloud"	H100 / H200 on RunComfy	81‑frame training at 1024², Batch Size 1–2, little or no offloading, using float8 or 4‑bit ARA. Ideal for serious, long‑sequence video LoRAs.
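As a rough aid, the tiers above can be written as a small lookup. The function and dictionary keys below are hypothetical names chosen for illustration, not AI Toolkit config fields, and the values are just the conservative end of each row.

```python
# Lookup sketch of the VRAM tiers in the table above.
# Key names are hypothetical; they are not AI Toolkit settings.

def suggest_baseline(vram_gb: int) -> dict:
    """Map available VRAM (in GB) to a conservative starting point."""
    if vram_gb >= 80:
        # Cloud tier: full-length sequences at 1024, little or no offloading.
        return {"num_frames": 81, "resolution": 1024, "low_vram": False}
    if vram_gb >= 48:
        # Prosumer tier: short video LoRAs at 768 and up.
        return {"num_frames": 33, "resolution": 768, "low_vram": False}
    # 24-32GB (and below): image-only is the safe default;
    # 33 frames at 512 is possible on 24GB+ but slow and VRAM-tight.
    return {"num_frames": 1, "resolution": 512, "low_vram": True}


print(suggest_baseline(24))
print(suggest_baseline(80))
```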

4. Building a Wan 2.2 T2V LoRA dataset

Wan T2V LoRAs can be trained on:

Images – treated as 1‑frame "videos" (Num Frames = 1).
Video clips – the real strength of the T2V model; you’ll usually work with short 3–8s clips.

4.1 Decide what kind of LoRA you are training

Think in terms of three broad families and design your dataset accordingly:

Character LoRA (face / body / outfit)
Goal: keep Wan’s general abilities but inject a new person, avatar, or outfit that you can address via a trigger. Use 10–30 high‑quality images or short clips of the same person, with varied poses, backgrounds, and lighting. Avoid heavy filters or stylization that fight the base model. Include a unique trigger token in captions (e.g. "zxqperson"), plus a rich description of clothing, lighting, and framing so the LoRA learns the concept cleanly.
Style LoRA (look & feel)
Goal: keep content flexible but impose a visual style (film stock, anime look, painterly, etc.). Use 10–40 images or clips that share the same look – consistent colours, contrast, camera feeling – but with diverse subjects and scenes. Captions should lean hard into style words, e.g. "oil painting, thick impasto, warm orange lighting, high contrast" rather than enumerating exact objects.
Motion / camera LoRA
Goal: teach Wan temporal behaviours (orbits, pans, dollies, sprite‑like loops, etc.). Use 10–30 short clips (~5s) that show the target motion, ideally the same kind of motion across different subjects and environments. Captions must explicitly mention the motion keyword, such as "orbit 180 around the subject", "side‑scrolling attack animation", or "slow dolly zoom in on the character" so the model knows what behaviour you care about.

4.2 Resolution and aspect ratio

Wan 2.2 14B T2V is built for square‑ish 1024×1024‑class frames. Official examples use 1024² or close variants, with internal bucketing for lower resolutions.

For training:

On 24GB GPUs, prefer 512 or 768 resolution buckets and uncheck 1024 in the DATASETS panel to save VRAM.
On 48GB+ GPUs or H100/H200, you can include both 768 and 1024 buckets to get sharper results, especially for character and style LoRAs.

AI Toolkit will downscale and bucket your videos into the selected resolutions; you mainly need to ensure your source clips are high quality and not letterboxed with huge black bars.

4.3 Video clip length and Num Frames

Wan 2.2 was pre‑trained on roughly 5‑second clips at 16 FPS, giving around 81 frames per training sequence (5 × 16 + 1, following the 4n+1 pattern).

AI Toolkit’s Num Frames field in the DATASETS panel controls how many frames are sampled from each video:

For images, set Num Frames = 1 – each image is treated as a 1‑frame video.
For videos, good choices are:

81 – "full fidelity"; matches pre‑training but is very VRAM‑hungry.
41 – around half the frames and roughly half the VRAM/time; a strong mid‑ground for bigger GPUs.
33 – an aggressive, VRAM‑friendly option for 24GB local training when combined with 512 px resolution.

Frames are sampled evenly across each clip, so you do not need every video to be exactly 5 seconds long. What matters is that the useful motion occupies the clip: trim away long static intros/outros so almost every sampled frame contains meaningful motion or identity signal.

Frame counts are typically chosen to follow the Wan‑specific "4n+1" pattern (e.g. 9, 13, 17, 21, 33, 41, 81). Sticking to these values tends to produce more stable temporal behaviour because it matches the model’s internal windowing.
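A quick way to check or pick a frame count is sketched below. The helper names are made up for this example; the only facts used are the 4n+1 pattern and the 16 FPS figure from the text.

```python
# Helpers for the 4n+1 frame-count pattern (names are illustrative).

def is_valid_frame_count(n: int) -> bool:
    """True if n has the form 4k+1 (1, 5, 9, ..., 33, 41, 81, ...)."""
    return n >= 1 and (n - 1) % 4 == 0


def floor_to_valid_frame_count(n: int) -> int:
    """Largest 4k+1 value that does not exceed n."""
    if n < 1:
        raise ValueError("need at least one frame")
    return n - (n - 1) % 4


def frames_for_clip(seconds: float, fps: int = 16) -> int:
    """Frame count covering a clip of the given length at the given FPS."""
    return floor_to_valid_frame_count(int(seconds * fps) + 1)


print(frames_for_clip(5))               # 5 s at 16 FPS
print(floor_to_valid_frame_count(40))   # snap an arbitrary budget down
```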

4.4 Captioning strategy

Per‑clip captions matter more for video LoRAs than for simple image LoRAs, especially for motion and style.

For image / character LoRAs, aim for 10–30 images or short clips, each with a caption that includes your trigger plus a description, for example:
"portrait of [trigger], medium shot, studio lighting, wearing a leather jacket, 35mm lens".
At training time AI Toolkit will replace [trigger] with the actual Trigger Word from the JOB panel if you use that pattern.

For motion LoRAs, make sure the motion word appears and is consistent across clips, e.g.:
"orbit 180 around a medieval castle",
"side‑scrolling attack animation of a teddy bear swinging a sword".

For now, simply ensure that every image or clip either has a good per‑file .txt caption or that you will set a useful Default Caption in the DATASETS panel. In the TRAINING section we’ll decide whether to run in caption‑based mode (using these captions directly) or Trigger Word‑only mode on high‑VRAM setups.
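The caption lookup described in this section can be approximated as follows. This is a sketch of the behaviour as the text describes it (per‑file .txt first, Default Caption otherwise, [trigger] substituted), not AI Toolkit's own loader; the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_caption(media_path: Path, default_caption: str,
                    trigger_word: Optional[str] = None) -> str:
    """Pick the caption for one image or clip.

    Uses a sibling .txt file when it exists (clip01.mp4 -> clip01.txt),
    otherwise falls back to the default caption.
    """
    caption_file = media_path.with_suffix(".txt")
    if caption_file.is_file():
        caption = caption_file.read_text(encoding="utf-8").strip()
    else:
        caption = default_caption
    if trigger_word:
        # Substitute the placeholder with the job's trigger word.
        caption = caption.replace("[trigger]", trigger_word)
    return caption


print(resolve_caption(Path("missing_clip.mp4"),
                      "orbit 180 around [trigger]", "zxqperson"))
```

Running a check like this over the dataset folder before training is a cheap way to spot clips that would silently fall back to the default caption.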

5. Step‑by‑step: train a Wan 2.2 T2V 14B LoRA in AI Toolkit

In this section we walk panel‑by‑panel through the AI Toolkit UI for a video LoRA on Wan 2.2 T2V 14B.

Baseline assumptions for this walkthrough:

You are training a video LoRA (Num Frames = 33) at 512 or 768 resolution.
You are on a 24–32GB GPU or running an equivalent setup on RunComfy with Low VRAM tricks.
Your dataset is one Wan T2V dataset folder with videos + captions.

Later we’ll add notes for H100/H200 and heavier VRAM tiers.

5.1 JOB panel – basic job metadata

Set the high‑level metadata so you can find your job later:

Job Name – a concise name such as wan22_t2v_char_zxq_v1 or wan22_t2v_style_neon_v1. Include model, task, and a short identifier.
Output Directory – where AI Toolkit will write checkpoints and logs, e.g. ./output/wan22_t2v_char_zxq_v1.
GPU ID – on a local install this points to your physical GPU. On the RunComfy cloud AI Toolkit you can leave this as default; the actual machine type (H100/H200) is chosen later in the Training Queue.
Trigger Word (optional) – if you plan to use a trigger‑word workflow, set this to your token (for example zxqperson). In captions you can write [trigger] and AI Toolkit will replace it with your Trigger Word at load time. Keep it short and unique so it doesn’t collide with existing tokens.

5.2 MODEL panel – Wan 2.2 T2V base model

Configure the base model and VRAM‑related options:

Model Architecture – choose Wan 2.2 T2V 14B (or equivalent label in your build).

Name or Path – lets you override the default Hugging Face / model hub path for Wan 2.2 T2V 14B. Leave it blank or at the default and AI Toolkit will download the recommended base model from Hugging Face, or point it to a local path if you want to use a custom Wan 2.2 checkpoint.

Low VRAM – on 24–32GB GPUs, set Low VRAM = ON so AI Toolkit can use extra checkpointing/offload strategies that make training possible. On H100/H200 or 48GB+ you can set Low VRAM = OFF for maximum speed.
Layer Offloading – if your build exposes this, you can leave it OFF on 24GB+ unless you are still hitting OOM. On extremely tight setups it can stream some layers to CPU RAM, at the cost of noticeably slower steps.

5.3 QUANTIZATION panel – 4‑bit ARA + float8 text encoder

Quantization is what makes Wan 2.2 T2V LoRA training practical on consumer hardware.

Transformer – set to 4bit with ARA. This is a 4‑bit quantization with an Accuracy Recovery Adapter; VRAM usage is close to plain 4‑bit, but quality is much closer to bf16.
Text Encoder – set to float8 (or qfloat8). This reduces VRAM and compute for the text encoder with negligible impact on LoRA training quality.

On 24–32GB GPUs, this combination is the main reason video LoRA training is feasible at all.

On H100/H200 / 48GB+ GPUs:

You can still keep 4bit with ARA and spend extra VRAM on higher resolution, more frames, or higher LoRA rank, which often gives a better return.
If you prefer a simpler stack, you can switch the Transformer to a pure float8 option while leaving the Text Encoder at float8. Going all the way back to full bf16 everywhere is usually not necessary.

5.4 MULTISTAGE panel – train high‑ and low‑noise experts

This panel exposes the dual‑expert architecture (high‑noise vs low‑noise transformer) and how training steps are split between them.

Stages to Train – for most LoRAs, set High Noise = ON and Low Noise = ON. This means both experts are updated during training so the LoRA affects both early composition/motion and late details/identity.
Switch Every – on 24–32GB GPUs with Low VRAM = ON, set Switch Every = 10. This tells AI Toolkit how many steps to spend on one expert before swapping to the other. For example, with Steps = 3000:

Steps 1–10 → High‑noise expert
Steps 11–20 → Low‑noise expert
…repeat until the end of training.

Why this matters:

With Low VRAM = ON, AI Toolkit typically keeps only one expert in GPU memory at a time. When it switches, it unloads one ~14B‑parameter transformer and loads the other.
If you set Switch Every = 1, you force a load/unload of huge weights every step, which is extremely slow.
With Switch Every = 10, you still get roughly 50/50 coverage of high/low noise, but only swap every 10 steps instead of every step, which is far more efficient.
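The schedule and its swap cost can be worked out with a short sketch. It assumes, as in the example above, that training starts on the high‑noise expert and alternates in fixed blocks; the function names are illustrative.

```python
# Sketch of the "Switch Every N steps" alternation (1-based step numbers).

def expert_for_step(step: int, switch_every: int) -> str:
    """Which expert is trained at a given step."""
    if step < 1 or switch_every < 1:
        raise ValueError("step and switch_every must be >= 1")
    block = (step - 1) // switch_every
    # Assumption: even-numbered blocks train high noise, odd blocks low noise.
    return "high_noise" if block % 2 == 0 else "low_noise"


def count_swaps(total_steps: int, switch_every: int) -> int:
    """How many times the active expert changes over the whole run."""
    return (total_steps - 1) // switch_every


print(count_swaps(3000, 10))  # swaps with Switch Every = 10
print(count_swaps(3000, 1))   # swaps with Switch Every = 1
```

For a 3000‑step run this gives 299 swaps at Switch Every = 10 versus 2999 at Switch Every = 1, which is where the speed difference under Low VRAM comes from.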

LoRA‑type hints:

For character or style video LoRAs, keep both High Noise and Low Noise ON; both composition and details matter.
For motion / camera LoRAs, high noise is crucial for global motion. Start with both stages ON and then experiment later with high‑noise‑only training if you want very targeted behaviour.

On H100/H200:

You can set Switch Every = 1, since both experts can stay resident in VRAM and the overhead of swapping is negligible.

5.5 TARGET panel – LoRA rank and capacity

This panel controls what kind of adapter you train and how much capacity it has.

Target Type – set to LoRA.
Linear Rank – a good default is 16 for Wan 2.2 T2V:

Rank 16 keeps the LoRA small and fast to train.
It is usually enough for character, style, and motion LoRAs at 512–768 resolution.

If you have a very diverse dataset (many subjects, styles, or motions) and enough VRAM:

You can increase Linear Rank to 32 to give the LoRA more expressive power.
Avoid going beyond 64 unless you know you need that much capacity; very high ranks can overfit and make the LoRA harder to control.

On H100/H200, starting at Rank 16 and going up to 32 for complex all‑in‑one LoRAs is a reasonable range.
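To see how rank translates into adapter size, recall that a LoRA on one linear layer adds two small matrices with rank × (inputs + outputs) parameters in total. The layer width used below (5120) is an assumed, illustrative figure rather than a value taken from this guide.

```python
# Parameter count of a LoRA adapter on a single linear layer.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Down-projection (rank x d_in) plus up-projection (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 5120  # assumed layer width, for illustration only

for rank in (16, 32, 64):
    print(rank, lora_params(HIDDEN, HIDDEN, rank))
```

Adapter size grows linearly with rank, so moving from 16 to 32 doubles both the LoRA file size and the number of trainable weights.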

5.6 SAVE panel – checkpoint schedule

Configure how often to save LoRA checkpoints during training:

Data Type – set to BF16. This matches how Wan 2.2 is usually run and is stable for LoRA weights.
Save Every – set to 250 steps. For a 3000‑step run this yields 12 checkpoints spread across training.
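Max Step Saves to Keep – set to 4 or 6 so several recent checkpoints remain available for comparison. Older saves are deleted as new ones are written, so raise this value if you want to keep earlier checkpoints that might actually look better than the final one.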

In practice you rarely end up using the very last checkpoint; many users prefer something in the 2000–3000 step range after comparing samples.
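The arithmetic behind the checkpoint schedule is easy to check. The sketch assumes that Max Step Saves to Keep retains only the most recent saves; the helper names are made up for this example.

```python
# Which checkpoints get written, and which survive a retention limit.

def checkpoint_steps(total_steps: int, save_every: int) -> list:
    """All steps at which a checkpoint is written."""
    return list(range(save_every, total_steps + 1, save_every))


def retained_checkpoints(total_steps: int, save_every: int, max_keep: int) -> list:
    """Assumption: only the newest max_keep checkpoints stay on disk."""
    if max_keep < 1:
        raise ValueError("max_keep must be >= 1")
    return checkpoint_steps(total_steps, save_every)[-max_keep:]


print(len(checkpoint_steps(3000, 250)))
print(retained_checkpoints(3000, 250, 4))
print(retained_checkpoints(3000, 250, 6))
```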

On H100/H200:

If you run very long (e.g. 5000–6000 steps for a large dataset), either keep Save Every = 250 and increase Max Step Saves to Keep, or set Save Every = 500 to limit the number of checkpoints.

5.7 TRAINING panel – core hyperparameters and text encoder mode

We now set the core training hyperparameters, then choose how to handle the text encoder and optional regularization.

5.7.1 Core training settings

For a general‑purpose video LoRA on Wan 2.2 T2V:

Batch Size – on 24–32GB, set Batch Size = 1. For T2V this already consumes a lot of VRAM. On H100/H200 you may push to 2 if you have enough headroom.
Gradient Accumulation – start with 1. If VRAM is tight but you want a larger effective batch, you can set it to 2–4; effective batch size is Batch Size × Gradient Accumulation.
Steps – typical ranges:

Small, focused motion LoRA with ~10–20 clips: 1500–2500 steps.
Character or style LoRA with 20–50 clips: 2000–3000 steps.
Very large datasets may go higher, but it is often better to improve data quality than to just add more steps.

Optimizer – set Optimizer = AdamW8Bit. 8‑bit Adam reduces VRAM significantly while behaving similarly to standard AdamW.
Learning Rate – set Learning Rate = 0.0001 as a strong default. If training looks unstable or samples oscillate wildly between steps, lower it to 0.00005. If training seems to plateau early, consider increasing steps rather than pushing Learning Rate higher.
Loss Type – keep Mean Squared Error (MSE). This matches Wan’s original training loss and is the standard choice.

Wan 2.2 uses a flow‑matching noise scheduler, which AI Toolkit handles internally. In the SAMPLE panel you should also use a FlowMatch‑compatible sampler so previews match the training setup.
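To sanity‑check a step budget against dataset size, it helps to estimate how often each clip is seen. The sketch below assumes that one step consumes one effective batch, which may not match how your build counts steps.

```python
# Effective batch size and approximate number of passes over the dataset.

def effective_batch_size(batch_size: int, grad_accum: int) -> int:
    return batch_size * grad_accum


def approx_passes(steps: int, batch_size: int, grad_accum: int, num_items: int) -> float:
    """Roughly how many times each clip is seen during training.

    Assumption: every step consumes one effective batch of samples.
    """
    if num_items < 1:
        raise ValueError("dataset must contain at least one item")
    return steps * effective_batch_size(batch_size, grad_accum) / num_items


print(effective_batch_size(1, 4))
print(approx_passes(steps=2000, batch_size=1, grad_accum=1, num_items=20))
```

If the number of passes is very high for a small dataset, overfitting becomes more likely, which is one reason to compare intermediate checkpoints rather than only the last one.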

5.7.2 Timestep Type and Timestep Bias – where the LoRA focuses

These two fields control which timesteps are emphasized during training and how updates are distributed across the diffusion chain.

Timestep Type – controls the distribution of timesteps:

Linear – samples timesteps uniformly across the schedule; a neutral, safe default.
Sigmoid / other shaped patterns – bias training toward mid/low noise; sometimes helpful for characters and detailed styles.
Shift / Weighted – further emphasize specific regions of the noise schedule, often combined with Timestep Bias.

Timestep Bias – tells AI Toolkit which part of the trajectory to emphasize:

Balanced – updates spread roughly evenly across high and low noise.
Favor High Noise – biases toward early, noisy steps, emphasising composition, layout, and global motion.
Favor Low Noise – biases toward later, clean steps, emphasising identity, texture, and micro‑details.

Recommended combinations for Wan 2.2 T2V:

Motion / camera LoRA – set Timestep Type = Linear and Timestep Bias = Balanced as a safe default.
If you want a pure motion LoRA that really locks in camera paths, you can push this further to Timestep Bias = Favor High Noise, since the high‑noise expert is where Wan 2.2 decides layout and motion.
Style LoRA – set Timestep Type = Linear or Shift and Timestep Bias = Favor High Noise.
Style, colour grading and "film stock" live mostly in the high‑noise / early part of the trajectory, so favouring high noise lets the LoRA rewrite global tone while leaving late‑stage details mostly to the base model.
Character LoRA – set Timestep Type = Sigmoid (or Linear) and Timestep Bias = Balanced.
Identity and likeness lean more on the low‑noise expert, but you still want some influence on composition and lighting. For very identity‑focused LoRAs you can experiment with slightly favouring low‑noise steps, but Balanced is the safest default.
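To build intuition for how a timestep distribution changes where training effort goes, the sketch below compares uniform sampling with one common shaped scheme, a sigmoid applied to normal noise. Whether AI Toolkit's Sigmoid option is implemented exactly this way is an assumption; the code only illustrates the general idea.

```python
import math
import random


def sample_linear(rng: random.Random) -> float:
    """Uniform timestep in [0, 1)."""
    return rng.random()


def sample_sigmoid(rng: random.Random) -> float:
    """Sigmoid of a standard normal draw; concentrates mass away from the extremes."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(0.0, 1.0)))


def fraction_in_middle(samples, lo: float = 0.25, hi: float = 0.75) -> float:
    """Share of samples that fall in the middle band of the schedule."""
    return sum(lo <= s <= hi for s in samples) / len(samples)


rng = random.Random(0)
linear = [sample_linear(rng) for _ in range(10_000)]
shaped = [sample_sigmoid(rng) for _ in range(10_000)]
print(round(fraction_in_middle(linear), 2), round(fraction_in_middle(shaped), 2))
```

The shaped sampler spends a larger share of steps in the middle of the schedule, so fewer updates land on the extremes of the trajectory.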

5.7.3 EMA (Exponential Moving Average)

Use EMA – for LoRAs, EMA is optional and adds extra overhead. Most users leave this OFF for Wan 2.2 LoRAs and reserve EMA for full‑model training. It is safe to ignore EMA unless you specifically want the smoother, averaged weights it produces.

5.7.4 Text Encoder Optimizations – caption vs trigger‑word mode

These toggles control whether the text encoder stays loaded and whether embeddings are cached.

Unload TE – if set ON, AI Toolkit will remove the text encoder from VRAM between steps and rely on static embeddings (e.g. a Trigger Word), effectively turning off dynamic captioning during training. This saves VRAM but means captions will not be re‑encoded each step.
Cache Text Embeddings – when set ON, AI Toolkit runs the text encoder once per caption, caches the embeddings, and then safely frees the text encoder from VRAM. This is highly recommended for caption‑based training on constrained VRAM, as it avoids re‑encoding every step but still uses your per‑clip captions.

Typical patterns:

For 24–32GB caption‑based training, set Cache Text Embeddings = ON and leave Unload TE = OFF. This gives you efficient training with full caption information.
For Trigger Word‑only training on very high VRAM (H100/H200), you can set Unload TE = ON and rely on a single trigger token instead of full captions.

5.7.5 Differential Output Preservation (DOP)

Differential Output Preservation is an optional regularization that encourages the LoRA to behave like a pure residual edit of the base model:

AI Toolkit renders two predictions:

one with the base model (no LoRA), and
one with the LoRA enabled.

It penalizes differences between these outputs except where you explicitly want change (via your Trigger Word and captions).

Key fields:

Differential Output Preservation – main toggle.
DOP Loss Multiplier – strength of the regularization loss.
DOP Preservation Class – a class token like person, scene, or landscape that describes what should be preserved.

Usage:

For style and character LoRAs, DOP can help keep Wan’s excellent base realism intact while the LoRA adds a controlled modification. A simple recipe:

Differential Output Preservation = ON
DOP Loss Multiplier = 1
DOP Preservation Class = person for character LoRAs, or scene / landscape for broad style LoRAs if available.

For motion / camera LoRAs, you usually do not need DOP; the behaviour change is already localized, and DOP roughly doubles compute.

Important compatibility note:

DOP works by rewriting prompts each step (swapping your Trigger Word with the Preservation Class in one of the branches). Because of this, DOP requires the text encoder to re‑encode prompts every step, and it is not compatible with Cache Text Embeddings.
If you turn DOP ON:

you must set a Trigger Word in the JOB panel,
and you must keep Cache Text Embeddings = OFF so the text encoder stays active and can re‑encode the modified prompts every step.

On H100/H200, the extra compute cost of DOP is usually acceptable for high‑quality character and style LoRAs.
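The compatibility rules above are easy to get wrong when switching between recipes, so a small pre‑flight check can help. The dictionary keys here are hypothetical names for the UI fields, not AI Toolkit's actual config schema.

```python
# Pre-flight check for the DOP rules described above.
# Key names are hypothetical stand-ins for the UI fields.

def check_dop_settings(settings: dict) -> list:
    """Return a list of human-readable problems (empty if the combination is fine)."""
    problems = []
    if settings.get("differential_output_preservation", False):
        if not settings.get("trigger_word"):
            problems.append("DOP is ON but no Trigger Word is set in the JOB panel")
        if settings.get("cache_text_embeddings", False):
            problems.append("DOP is ON but Cache Text Embeddings is also ON")
    return problems


good = {"differential_output_preservation": True,
        "trigger_word": "zxqperson",
        "cache_text_embeddings": False}
bad = {"differential_output_preservation": True,
       "trigger_word": "",
       "cache_text_embeddings": True}

print(check_dop_settings(good))
print(check_dop_settings(bad))
```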

5.8 ADVANCED panel – Differential Guidance (optional)

If your build exposes an ADVANCED panel with:

Do Differential Guidance
Differential Guidance Scale

you can treat it as an additional, AI‑Toolkit‑specific trick:

Turning Do Differential Guidance = ON with Scale = 3 tells the model to focus more on the difference between base and LoRA‑modified predictions, similar in spirit to DOP but implemented as a guidance term.
This can make targeted edits (e.g. "neon outline style" or "orbit camera behaviour") converge faster without raising Learning Rate.
If samples look unstable or too sharp early in training, you can lower the scale to 2. If learning feels very slow, you can experiment with 4.

Most users can safely leave this OFF for their first Wan 2.2 LoRAs and experiment once they are comfortable.

5.9 DATASETS panel – wiring your Wan T2V dataset

Each Dataset block corresponds to one entry in the internal datasets: list.

For a single Wan T2V dataset:

Target Dataset – select your Wan T2V dataset folder (e.g. wan_orbit_clips or wan_char_zxq_clips) containing your videos and captions.
LoRA Weight – set to 1 unless you mix multiple datasets and want to rebalance them.
Default Caption – used only when individual clips have no .txt caption. For example:

Character/style: "portrait of zxqperson, zxqstyle, cinematic lighting".
Motion: "orbit 360 around the subject, zxq_orbit".

Caption Dropout Rate – a value like 0.05 drops captions for 5% of samples so the model also pays attention to visuals instead of overfitting phrasing.
If you rely heavily on Cache Text Embeddings, be conservative here; caption dropout is most effective when the text encoder is active and captions can vary.
Settings → Cache Latents – for video LoRAs this is usually OFF because caching VAE latents for many frames is heavy on disk and RAM. Keep your source videos high quality instead.
Settings → Is Regularization – leave OFF unless you have a dedicated regularization dataset.
Flipping (Flip X / Flip Y) – for most video LoRAs keep both OFF:

horizontal flips can break left/right motion semantics and character asymmetry,
vertical flips are rarely appropriate for real‑world footage.

Resolutions – enable the resolutions you want AI Toolkit to bucket into:

On 24–32GB, enable 512, optionally 768 if VRAM allows, and disable 1024+.
On H100/H200, you can enable 768 and 1024 to match the model’s preferred operating point.

Num Frames – set Num Frames = 33 for the baseline 24–32GB video LoRA recipe.
33 follows the 4n+1 rule (4·8+1) and cuts the frame count to roughly 40% of full 81‑frame training while still giving a clear temporal pattern.

AI Toolkit will sample 33 frames evenly along each clip’s duration; you just need to trim clips so the motion you care about spans most of the clip.

On H100/H200, you can push Num Frames to 41 or 81, and combine that with 768–1024 px buckets and Rank 16–32 for very strong, long‑sequence LoRAs.
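Even sampling along a clip can be pictured with the sketch below, which picks frame indices spread from the first to the last frame. It is one plausible way to do it, written for illustration; AI Toolkit's own sampler may choose slightly different indices.

```python
# Evenly spaced frame indices across a clip (illustrative, not AI Toolkit's code).

def evenly_spaced_indices(total_frames: int, num_frames: int) -> list:
    """Pick num_frames indices from 0..total_frames-1, including both ends."""
    if num_frames < 1 or total_frames < num_frames:
        raise ValueError("clip is shorter than the requested number of frames")
    if num_frames == 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic keeps indices strictly increasing and in range.
    return [i * last // (num_frames - 1) for i in range(num_frames)]


indices = evenly_spaced_indices(total_frames=150, num_frames=33)
print(indices[:5], indices[-1])
```

Because the indices always span the whole clip, a long static intro or outro takes up sampled frames that carry no motion, which is why trimming matters.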

5.10 SAMPLE panel – previewing your LoRA

The SAMPLE panel is for generating preview videos during or after training.

Useful settings:

Num Frames – match this roughly to the training value (e.g. 33 or 41) so behaviour is predictable.
Sampler / Scheduler – use a FlowMatch‑compatible sampler that aligns with the model’s noise schedule.
Prompt / Negative Prompt – use the same Trigger Word and concepts you trained on so you can quickly judge whether the LoRA is doing the right thing.
Guidance Scale – during training previews, moderate values (e.g. 2–4) are fine; remember that you might use different values in your normal inference workflows later.

Generate samples at multiple checkpoints (e.g. every 250–500 steps) and keep the ones that visually balance strength and stability.

6. Wan 2.2 T2V 14B LoRA Training Settings

This section summarizes practical recipes for the three main LoRA types.

6.1 Character video LoRA (identity / avatar)

Goal: preserve a character’s face, body, and general identity across many prompts and scenes.

Dataset:

10–30 short clips or images of the character, with varied poses, backgrounds, and lighting.
Captions include a Trigger Word and class, for example:
"portrait of [trigger], young woman, casual clothing, studio lighting".

Key settings:

Num Frames – 33 on 24GB; 41 or 81 on H100/H200.
Resolutions – 512 or 768; add 1024 on high VRAM.
Multi‑stage – High Noise = ON, Low Noise = ON, Switch Every = 10 (local) or 1 (cloud).
Timestep Type / Bias – Linear (or Sigmoid) with Balanced bias, so you capture both composition and low‑noise identity detail.
Linear Rank – 16 (24GB) or 16–32 (H100/H200) for more nuanced identity.
DOP – optionally enable for character LoRAs when you want to preserve base realism:

Differential Output Preservation = ON
DOP Loss Multiplier = 1
DOP Preservation Class = person
Cache Text Embeddings = OFF (required for DOP to work)

Steps – 2000–3000, checking samples every 250–500 steps.

6.2 Style video LoRA (film look / anime / colour grade)

Goal: impose a strong visual style while keeping content flexible.

Dataset:

10–40 images or clips that share the same style across different subjects and scenes.
Captions describe the look (e.g. film stock, brushwork, palette) rather than the exact objects.

Key settings:

Num Frames – 33–41 for most use cases; 81 on big GPUs for 5s clips.
Resolutions – 512–768 on 24GB; 768–1024 on high VRAM.
Multi‑stage – High Noise = ON, Low Noise = ON, Switch Every = 10 (local) or 1 (cloud).
Timestep Type / Bias – Linear or Shift with Timestep Bias = Favor High Noise, so the LoRA can rewrite global colour and contrast where composition is still fluid.
Linear Rank – 16 for simple styles; 16–32 for complex, cinematic looks.
DOP – recommended for style LoRAs when you want to preserve base realism:

Differential Output Preservation = ON
DOP Loss Multiplier = 1
DOP Preservation Class = scene / landscape or similar
Cache Text Embeddings = OFF

Steps – 1500–2500, stopping when style looks strong but not overbaked.

6.3 Motion / camera LoRA (orbits, pans, dolly moves)

Goal: learn new camera moves or motion patterns that you can apply to many subjects.

Dataset:

10–30 3–8s clips, each showing the target motion.
Keep motion consistent (e.g. all are orbit 180 or all are side‑scrolling), but vary subjects and scenes.
Captions explicitly state the motion keyword ("orbit 180 around the subject", "side‑scrolling attack animation").

Key settings:

Num Frames – 33 on 24GB, 41–81 on bigger GPUs.
Resolutions – 512 (and 768 if VRAM allows).
Multi‑stage – High Noise = ON, Low Noise = ON, Switch Every = 10 (local) or 1 (cloud).
Timestep Type / Bias – Linear with Timestep Bias = Balanced, so both early composition and later refinement see updates; motion inherently leans on high noise.
Linear Rank – Rank 16 is usually enough; motion is more about behaviour than tiny details.
DOP – usually keep OFF; motion is already localized and DOP doubles forward passes.
Steps – 1500–2500; watch previews to ensure motion generalizes beyond your training clips.

7. Export and use your Wan T2V LoRA

Once training is complete, you can use your Wan 2.2 T2V 14B LoRA in two simple ways:

Model playground – open the Wan 2.2 T2V 14B LoRA playground and paste the URL of your trained LoRA to quickly see how it behaves on top of the base model.
ComfyUI workflows – start a ComfyUI instance, build your own workflow, then add your LoRA and adjust the LoRA weight and other settings for more detailed control.
