Wan 2.2 I2V 14B is the image‑to‑video model in the Wan 2.2 family, turning a single image into 5‑second clips with controllable motion, camera moves and temporal consistency. By the end of this guide, you’ll be able to:
- Design Wan I2V LoRA datasets for motion, style and character use cases (and know how many clips you actually need).
- Understand how Wan’s dual high-noise / low-noise experts, timestep settings, Num Frames and resolution interact during training.
- Configure the AI Toolkit panels (JOB, MODEL, QUANTIZATION, MULTISTAGE, TARGET, TRAINING, DATASETS, SAMPLE) for stable 24GB runs and for larger H100/H200 cloud setups.
- Export your Wan 2.2 I2V LoRA and plug it into ComfyUI or the Wan I2V LoRA playground for real-world projects.
This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this Wan 2.2 I2V 14B guide.
Table of contents
- 1. What makes Wan 2.2 I2V 14B special?
- 2. Where to run Wan 2.2 I2V training
- 3. Dataset design for Wan I2V LoRAs
- 3.3 How many clips do you need?
- 4. Wan 2.2 I2V specifics you need to understand
- 5. Step‑by‑step: configure a Wan 2.2 I2V 14B LoRA in AI Toolkit
- 6. Wan 2.2 I2V 14B LoRA Training Settings for Motion, Style and Character
- 7. Troubleshooting common Wan I2V LoRA issues
- 8. Export and use your Wan I2V LoRA
1. What makes Wan 2.2 I2V 14B special?
Wan 2.2 I2V 14B ("A14B") is the image‑to‑video variant of Wan 2.2. Architecturally it is a dual‑stage Mixture‑of‑Experts (MoE) transformer. There are two separate 14B‑parameter transformers. The high‑noise transformer handles early, very noisy timesteps and is responsible for global composition, motion trajectory, and camera movement. The low‑noise transformer handles late, clean timesteps and is responsible for fine detail, identity, and texture.
At inference time, the pipeline splits timesteps around a boundary at roughly 875/1000 of the noise schedule and routes them to the high‑ or low‑noise transformer. In practice, each expert handles about half of the denoising process. Wan 2.2 I2V generates up to 81 frames at 16 FPS, which is about 5 seconds of video.
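If it helps to picture the routing, here is a tiny Python sketch of the boundary logic described above; the function and the sampling schedule are illustrative, not Wan's actual pipeline code.

```python
# Conceptual sketch of Wan 2.2's dual-expert routing (illustrative, not the real pipeline code).
# Timesteps run from 1000 (pure noise) down to 0 (clean); the boundary sits at roughly 875.

BOUNDARY = 875  # approximate switch point on the 0-1000 noise schedule

def pick_expert(timestep: int) -> str:
    """Route a denoising timestep to the high-noise or low-noise transformer."""
    return "high_noise" if timestep >= BOUNDARY else "low_noise"

# Example: a simple 40-step schedule from 1000 down to 0
schedule = [round(1000 * (1 - i / 39)) for i in range(40)]
routed = [(t, pick_expert(t)) for t in schedule]
print(routed[:4])   # earliest, noisiest steps go to the high-noise expert
print(routed[-4:])  # latest, cleanest steps go to the low-noise expert
```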
For LoRA training this has three key consequences. You can choose to train one or both stages. You can bias training toward composition and motion (high noise) or toward identity and detail (low noise). And because you process sequences of frames, frame count, resolution, VRAM and quantization/offloading settings matter much more than they do for an image‑only model.
AI Toolkit exposes these controls mainly through the MULTISTAGE, TRAINING, TARGET, and DATASETS panels.
2. Where to run Wan 2.2 I2V training
You can run this Wan 2.2 I2V LoRA training workflow either on the cloud AI Toolkit on RunComfy or on a local AI Toolkit install. The UI and panels are the same; only the hardware changes.
2.1 RunComfy cloud AI Toolkit (recommended for first runs)
If you don’t want to manage CUDA, drivers, or large model downloads, use the cloud AI Toolkit on RunComfy:
On that page you get the AI Toolkit UI pre‑installed in the browser. You can upload datasets, configure jobs exactly as in this guide, and run training on an H100 (80 GB) or H200 (141 GB) GPU. This is the easiest way to reproduce the tutorial reliably without touching local setup.
2.2 Local AI Toolkit
If you prefer to run locally: Install the AI Toolkit repository following the README (Python + PyTorch for training and Node for the UI), then run the UI (npm run build_and_start in ui/). Open http://localhost:8675 and you’ll see the same panels as in the screenshots and descriptions here.
3. Dataset design for Wan I2V LoRAs
Wan 2.2 I2V is trained on video‑clip + caption pairs. Each training sample is a sequence of frames plus text. In AI Toolkit you do not need to manually cut every clip to the same length. Instead you configure Num Frames in the DATASETS panel and the data loader will evenly sample that many frames from each video, automatically handling clips of different durations.
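As a rough illustration of what "evenly sample that many frames" means, the snippet below spreads Num Frames indices across a clip of arbitrary length; it mirrors the behaviour described above but is not AI Toolkit's actual loader code.

```python
# Minimal sketch of even frame sampling from a variable-length clip
# (illustrative of the behaviour described above, not AI Toolkit's loader).
import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return `num_frames` indices spread evenly across a clip with `total_frames` frames."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int).tolist()

# A 10-second clip at 30 FPS (300 frames), sampled down to 41 training frames:
print(sample_frame_indices(300, 41)[:5])   # first few sampled indices
print(len(sample_frame_indices(300, 41)))  # -> 41
```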
3.1 Decide what kind of LoRA you’re training
How you set hyper‑parameters depends heavily on your goal:
- Motion / camera LoRA focuses on patterns like "orbit 360 around subject", "slow dolly zoom", "hand‑held jitter", or specific action beats.
- Style LoRA makes videos look like a particular film stock, anime style, or painterly look, while still keeping Wan’s base motion and scene composition.
- Character LoRA tries to preserve a specific character or face consistently across many scenes and motions.
Wan 2.2 I2V can do all three. Motion LoRAs lean more heavily on the high‑noise stage, while style and character LoRAs lean more on the low‑noise stage plus very consistent visuals.
3.2 Video clips and cropping
Use real video clips (.mp4, .mov, etc.), not GIFs. Clip length can vary (for example 5–30 seconds). AI Toolkit will evenly sample training frames along each clip according to your Num Frames setting.
The one thing you should always do by hand is to crop and trim each clip so that the motion you care about starts quickly and there is not a lot of "standing around" at the beginning or end. For motion LoRAs in particular, you want the motion to occupy almost the entire clip — for example, the full orbit, the full dolly move, or the full gesture.
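If you prefer to script the trimming, a minimal sketch using plain ffmpeg could look like this; it assumes ffmpeg is installed, and the paths and timestamps are just examples.

```python
# Hypothetical helper to trim a clip so the motion fills (almost) the whole video.
# Calls plain ffmpeg via subprocess; paths and timestamps below are only examples.
import subprocess

def trim_clip(src: str, dst: str, start_s: float, end_s: float) -> None:
    """Re-encode `src` between start_s and end_s (seconds) into `dst`.

    Output-side seeking (-ss/-to after -i) is frame-accurate, at the cost of re-encoding.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ss", str(start_s), "-to", str(end_s),
         "-c:v", "libx264", "-an", dst],  # -an drops audio, which training does not use
        check=True,
    )

# Example: keep only the 2s-9s window where the orbit actually happens.
# trim_clip("raw/castle_orbit_raw.mp4", "dataset/castle_orbit.mp4", 2.0, 9.0)
```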
3.3 How many clips do you need?
As a rough rule of thumb:
- A simple motion LoRA that teaches a single type of camera move usually trains well on 10–30 short clips (~3–8s) where the target motion is very clear and occupies most of the frame.
- A style LoRA typically needs 10–40 images or clips that cover different scenes, lighting, and subjects, but all share the same look and colour treatment.
- A character LoRA on I2V behaves more like an image LoRA. As a minimum, aim for 10–30 short clips of the same character, with varied poses, scales, angles, and backgrounds; if you can comfortably reach 20–40 clips, likeness and robustness usually improve.
3.4 Captions for I2V clips
Each video file can optionally have a .txt caption with the same base name (for example castle_orbit.mp4 and castle_orbit.txt). AI Toolkit also supports a Default Caption that is used whenever a clip has no per‑file caption.
Good caption patterns:
- For a motion LoRA, encode the motion explicitly in the text, for example:
orbit 360 around the subject, orbit 180 around the subject, or slow dolly in toward the character.
- For a style LoRA, describe the look, not the scene content, for example: grainy 16mm film look, high contrast, warm tint.
- For a character LoRA, include a trigger word plus a class, for example: frung, young woman, casual clothing (where frung is your trigger token).
You can also combine a Trigger Word set in the JOB panel with captions that contain [trigger]. AI Toolkit will replace [trigger] with your chosen trigger string when loading the dataset so you don’t have to hard‑code the trigger name in every caption.
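Putting the caption rules together, here is a small illustrative sketch of how per‑clip captions, the Default Caption, and [trigger] substitution interact; AI Toolkit handles this for you at load time, so the function below is only for intuition.

```python
# Illustrative sketch of caption resolution for one clip (not AI Toolkit's code).
from pathlib import Path

def load_caption(video_path: str, default_caption: str, trigger_word: str | None) -> str:
    """Use the matching .txt caption if present, else the Default Caption; then substitute [trigger]."""
    txt = Path(video_path).with_suffix(".txt")
    caption = txt.read_text().strip() if txt.exists() else default_caption
    if trigger_word:
        caption = caption.replace("[trigger]", trigger_word)
    return caption

# Example: castle_orbit.txt containing "[trigger], orbit 360 around the subject"
# with Trigger Word "wan_cam_orbit" loads as:
#   "wan_cam_orbit, orbit 360 around the subject"
```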
4. Wan 2.2 I2V specifics you need to understand
4.1 High‑noise vs low‑noise transformers
Wan I2V’s two transformers behave roughly like this:
The high‑noise transformer operates at timesteps near the start of the diffusion process (approximately 1000 down to ~875). It sets up global composition and coarse shapes and decides where objects go, how the camera moves, and what the motion trajectory will be. It is critical for motion and layout.
The low‑noise transformer runs at timesteps from about 875 down to 0. It refines details, textures, face likeness, and micro‑motions. It is critical for identity, texture, and sharpness.
In practice, training only the high‑noise stage can teach new kinds of movement and composition but tends to under‑train detail, while training only the low‑noise stage struggles to meaningfully change motion or layout. For most LoRAs you should train both stages and then steer emphasis using Timestep Bias in the TRAINING panel.
4.2 Frames, FPS and speed
Wan 2.2 I2V 14B can generate up to 81 frames at 16 FPS, which is 5 seconds. In practice, valid video frame counts follow the "4n+1" rule (for example 9, 13, 17, 21, 33, 41, 81…). You can think of video lengths in that family; 1 frame is also supported and effectively reduces I2V to a single‑frame image‑like mode for LoRA training.
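If you want to sanity‑check a frame count against the 4n+1 family, a tiny helper like this (purely illustrative) does the job:

```python
# Helper sketch: snap an arbitrary frame count to the nearest valid "4n+1" value
# (the family 9, 13, 17, ..., 81 mentioned above). Purely illustrative.
def snap_to_4n_plus_1(frames: int, max_frames: int = 81) -> int:
    frames = max(1, min(frames, max_frames))
    n = round((frames - 1) / 4)
    return 4 * n + 1

print(snap_to_4n_plus_1(40))  # -> 41
print(snap_to_4n_plus_1(80))  # -> 81
print(snap_to_4n_plus_1(1))   # -> 1 (single-frame, image-like mode)
```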
In AI Toolkit there are two separate Num Frames knobs. Num Frames in the DATASETS panel controls how many frames per clip are sampled for training. Num Frames in the SAMPLE panel controls how long your preview videos are. They do not have to match exactly, but keeping them similar makes behaviour easier to reason about.
A good starting point for training is 41 frames (around 2.5 seconds). On 80–96 GB GPUs (H100‑class) you can go up to the full 81-frame configuration. Shorter lengths such as 21 or 33 frames can be used to reduce VRAM load and step time on small GPUs, at the cost of capturing less temporal context.
4.3 Resolution and pixel area
Wan’s official demos tend to keep the effective area around 480×832 ≈ 400k pixels, and the Hugging Face spaces snap dimensions to multiples of 16 or 32.
For LoRA training with AI Toolkit:
- On a 24 GB GPU, use resolution buckets like 512 and 768. Avoid 1024×1024 unless you are very aggressively quantized and/or using layer offloading; video at 1024² plus 41–81 frames is heavy.
- On 48 GB+ GPUs or H100/H200, you can safely add a 1024 bucket and even use cinematic widescreen resolutions centred around values like 1024×576, 1024×608, or 1024×640.
AI Toolkit will automatically bucket and downscale your videos into the selected resolutions when loading the dataset.
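To get a feel for the "multiples of 16 around ~400k pixels" guideline, here is an illustrative helper that snaps an arbitrary source resolution toward that target area; it is not AI Toolkit's bucketing code.

```python
# Sketch: pick dimensions near a target pixel area while snapping to multiples of 16,
# mirroring the ~480x832 (~400k px) guideline above. Illustrative only.
def snap_resolution(width: int, height: int, target_area: int = 480 * 832, multiple: int = 16):
    """Scale (width, height) toward target_area and round both sides to `multiple`."""
    scale = (target_area / (width * height)) ** 0.5
    w = max(multiple, round(width * scale / multiple) * multiple)
    h = max(multiple, round(height * scale / multiple) * multiple)
    return w, h

print(snap_resolution(1920, 1080))  # landscape 1080p source -> (848, 480), ~407k px
print(snap_resolution(1080, 1920))  # vertical source        -> (480, 848), ~407k px
```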
5. Step‑by‑step: configure a Wan 2.2 I2V 14B LoRA in AI Toolkit
We assume you have at least a 24 GB‑class GPU, so the settings below are a safe baseline. If you’re on a larger card or using the cloud AI Toolkit on RunComfy, some panels also include a short note on how to scale the settings up.
5.1 JOB panel
In the JOB panel you set basic metadata and, optionally, a trigger token.
- Training Name
Use any descriptive name; it becomes the folder name for checkpoints and samples. Examples: wan_i2v_orbit_v1, wan_i2v_style_neon, wan_i2v_char_frung_v1.
- GPU ID
On a local install this points to your physical GPU. On the RunComfy cloud AI Toolkit you can leave this as default; the actual machine type (H100/H200) is chosen later in the Training Queue.
- Trigger Word (optional)
Use a trigger for character or style LoRAs where you want a dedicated token such as frung or wan_cam_orbit. If your dataset captions contain [trigger], AI Toolkit will substitute your Trigger Word value into those captions automatically at load time. For pure motion LoRAs, you often do not need a trigger word because the behaviour is already encoded in phrases like "orbit 360 around the subject". For characters and styles, it is strongly recommended to use a trigger so you have a clean on/off switch for your LoRA later.
5.2 MODEL and QUANTIZATION panels
These panels control which Wan model checkpoint is used and how aggressively it is quantized.
MODEL panel
- Model Architecture
Select Wan 2.2 I2V (14B).
- Name or Path
This lets you override the default Hugging Face / model hub path, ai-toolkit/Wan2.2-I2V-A14B-Diffusers-bf16. Leave it blank or at the default value and AI Toolkit will download the recommended base model from Hugging Face, or point it to a local path if you want to use a custom Wan 2.2 checkpoint.
- Low VRAM
Turn Low VRAM ON for 24 GB consumer GPUs or any card that is also driving your display. Turn it OFF on 48 GB+ cards or on H100/H200 in the cloud where you have more VRAM. When Low VRAM is on, AI Toolkit applies extra memory‑saving tricks (more quantization and checkpointing) to fit the model.
- Layer Offloading
This toggle streams parts of the model to CPU RAM instead of keeping all layers resident in VRAM. It is only necessary if you are trying to run Wan I2V on a very small GPU (around 10–12 GB VRAM) and have a lot of system RAM (64 GB+). It can roughly double step time but can bring peak VRAM below ~9 GB. For 24 GB GPUs, start with Layer Offloading OFF and only turn it on if you still hit out‑of‑memory errors.
On big GPUs / RunComfy:
On 48 GB+ or on H100/H200, set Low VRAM OFF and Layer Offloading OFF so all layers stay resident on the GPU and steps are as fast as possible.
QUANTIZATION panel
- Transformer
On 24–32 GB GPUs, set Transformer to 4bit with ARA. This uses 4‑bit quantization together with an Accuracy Recovery Adapter so that VRAM usage stays close to plain 4‑bit while quality stays much closer to bf16.
- Text Encoder
Set Text Encoder to float8 (or qfloat8). This reduces VRAM and compute for the text encoder with negligible impact on Wan 2.2 I2V LoRA quality.
This mirrors the official AI Toolkit example configs for Wan 2.2 video LoRAs and is the main reason training is practical on 24 GB cards. If you run into stability issues or severe slow‑downs with ARA on a particular setup, you can fall back to qfloat8 for the Transformer as well; it uses more VRAM but behaves very similarly in terms of quality.
On big GPUs / RunComfy:
On an H100/H200 or a 48–96 GB workstation card, you can either keep 4bit with ARA and spend the extra VRAM on higher resolution, more frames, or a higher LoRA rank, or switch the Transformer to a pure float8 / qfloat8 option for a simpler stack. Going all the way back to full bf16 everywhere is rarely necessary for LoRA training.
5.3 MULTISTAGE panel (high / low noise)
The MULTISTAGE panel lets you decide which Wan expert(s) to train and how often the trainer switches between them.
- Stages to Train
Keep both High Noise and Low Noise set to ON for most LoRAs. High noise controls composition and motion; low noise controls detail and identity.
- Switch Every
This value controls how many steps you run on one expert before swapping to the other. With High Noise = ON, Low Noise = ON, Switch Every = 10, and Steps = 3000, AI Toolkit trains:
- Steps 1–10 on the high‑noise transformer,
- Steps 11–20 on the low‑noise transformer,
- and repeats this alternation until training is done.
On large GPUs without Low VRAM you can use Switch Every = 1, meaning you alternate every step. With Low VRAM ON, however, each switch forces AI Toolkit to unload one transformer and load the other; switching every step becomes very slow.
For a 24 GB GPU baseline, use:
- High Noise = ON
- Low Noise = ON
- Switch Every = 10
On big GPUs / RunComfy:
With Low VRAM disabled and enough VRAM for both experts, you can set Switch Every = 1 for slightly smoother alternation.
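To make the alternation concrete, here is a minimal sketch of the Switch Every logic described above (illustrative only, not AI Toolkit's internal scheduler):

```python
# Sketch of the "Switch Every" alternation. With switch_every=10, steps 1-10 train the
# high-noise expert, steps 11-20 the low-noise expert, and so on until training ends.
def active_expert(step: int, switch_every: int = 10) -> str:
    block = (step - 1) // switch_every  # 0-based index of the current block of steps
    return "high_noise" if block % 2 == 0 else "low_noise"

assert active_expert(1) == "high_noise"
assert active_expert(10) == "high_noise"
assert active_expert(11) == "low_noise"
assert active_expert(21) == "high_noise"
```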
5.4 TARGET panel (LoRA network settings)
In the TARGET panel you configure what kind of adapter you are training and how "wide" it is.
- Target Type
Set Target Type to LoRA.
- Linear Rank
Linear Rank controls LoRA capacity per block. Higher rank increases capacity but also VRAM usage and the risk of overfitting. For Wan 2.2 I2V, practical defaults are:
- Motion and camera LoRAs: Rank 16 is usually enough because they modify behaviour more than tiny visual details.
- Style LoRAs: Rank 16 or 32 depending on how complex and varied the style is.
- Character LoRAs: Rank 16 on 24 GB cards, and 16–32 on high‑VRAM GPUs; use 32 when you have headroom and need extra capacity for close‑up, high‑res faces.
On very large GPUs, Rank 32 is a good default for rich styles and demanding character work, but it is not required just to get a LoRA running.
5.5 SAVE panel
The SAVE panel controls how often checkpoints are written and in what precision.
- Data Type
Use BF16 or FP16. Both are fine for LoRAs; BF16 is slightly more numerically stable on modern GPUs.
- Save Every
Set Save Every to around 250. This gives you a checkpoint every 250 steps.
- Max Step Saves to Keep
Set Max Step Saves to Keep between 4 and 6. This keeps disk usage under control while still leaving you some earlier checkpoints to fall back to.
You do not have to use the last checkpoint. Very often the best‑looking samples come from somewhere around 2000–4000 steps. The SAMPLE panel configuration below explains how to judge this.
5.6 TRAINING panel
The TRAINING panel holds most of the important knobs: batch size, learning rate, timesteps, loss, and text encoder handling.
Core hyper‑parameters
Configure the core training settings like this for a 24 GB Wan I2V video LoRA:
- Batch Size
Start with 1. Video models are heavy, and 1 is realistic even on 24 GB cards. On H100/H200 you can later experiment with batch sizes of 2–4.
- Gradient Accumulation
Leave Gradient Accumulation at 1 initially. Effective batch size is batch size times gradient accumulation. You can raise it to 2 or 4 if VRAM is too tight for a larger batch size and you want a slightly larger effective batch, but gains are modest for video.
- Learning Rate
Start with Learning Rate = 0.0001. This is the default in AI Toolkit examples and is stable for Wan LoRAs. If training looks noisy or the LoRA overshoots quickly, you can reduce it to 0.00005 mid‑run and resume from the latest checkpoint.
- Steps – typical ranges:
- Small, focused motion LoRA with ~10–20 clips: 1500–2500 steps.
- Character or style LoRA with 20–50 clips: 2000–3000 steps.
- Very large datasets can go higher, but it is usually better to improve data quality (captions, diversity) than to push far beyond 3000–4000 steps.
- Weight Decay
Leave Weight Decay at 0.0001 unless you have a specific reason to change it; it provides mild regularization.
- Loss Type
Keep Loss Type as Mean Squared Error (MSE). Wan 2.2 uses a flow‑matching noise scheduler, and MSE is the standard loss for this setup.
Timesteps and scheduler
- Timestep Type
For Wan 2.2 I2V, Linear is the default Timestep Type and works well for most LoRA types. It spreads updates evenly along the flow‑matching schedule and plays nicely with the split between the high‑noise and low‑noise experts.
- Timestep Bias
Timestep Bias controls which part of the trajectory you emphasise:
- Balanced – updates are spread across high‑ and low‑noise timesteps; this is the safe default for all LoRA types.
- Favor High Noise – focuses more on early, noisy steps where Wan decides global layout, motion and colour.
- Favor Low Noise – focuses more on late, clean steps where fine detail and identity live.
- Motion / camera LoRAs – start with Timestep Type = Linear, Timestep Bias = Balanced. For very "pure" camera‑move LoRAs you can experiment with Favor High Noise to lean harder on the high‑noise expert.
- Style LoRAs – use Timestep Type = Linear (or Shift) and Timestep Bias = Favor High Noise, so the LoRA rewrites global tone and colour while the base model still handles late‑stage details.
- Character LoRAs – use Timestep Type = Sigmoid (or Linear) and Timestep Bias = Balanced. Identity and likeness depend more on low‑noise steps, but keeping the bias Balanced lets both experts contribute; only if you specifically want extra focus on micro‑detail should you try a slight low‑noise bias.
Under the hood, Wan 2.2 I2V uses a flow‑matching noise scheduler. AI Toolkit sets the scheduler and matching sampler automatically for the Wan 2.2 architecture, so you mainly steer behaviour via Timestep Type, Timestep Bias and the Multi‑stage settings above.
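For intuition only, here is a conceptual sketch of what a timestep bias does: it skews which part of the schedule receives more training updates. The weighting curves are made up for illustration and are not AI Toolkit's actual sampler.

```python
# Conceptual sketch of timestep bias (NOT AI Toolkit's implementation): skew which part
# of the 0-1000 schedule gets sampled more often during training.
import random

def sample_timestep(bias: str = "balanced", max_t: int = 1000) -> int:
    u = random.random()
    if bias == "favor_high_noise":
        u = u ** 0.5        # pushes samples toward 1.0 -> high (noisy) timesteps
    elif bias == "favor_low_noise":
        u = u ** 2.0        # pushes samples toward 0.0 -> low (clean) timesteps
    return int(u * max_t)

hist = [sample_timestep("favor_high_noise") for _ in range(10_000)]
print(sum(t >= 875 for t in hist) / len(hist))  # noticeably more than the ~12.5% an unbiased draw gives
```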
EMA (Exponential Moving Average)
- Use EMA
For LoRAs, EMA is optional and consumes extra VRAM and time. Most Wan LoRA users leave Use EMA OFF; it is rarely needed unless you are doing full‑model finetunes.
Text Encoder Optimizations
At the bottom of the TRAINING panel are the Text Encoder Optimizations settings. They control how aggressively the text encoder is offloaded or cached.
- Unload TE
This mode unloads the text encoder weights so they no longer consume VRAM between steps. For Wan 2.2 I2V LoRAs you almost always rely on rich per‑clip captions, so you should keep Unload TE OFF in normal caption‑based training. Only consider Unload TE if you are deliberately training a very narrow "trigger‑only / blank prompt" LoRA that does not use dataset captions at all.
- Cache Text Embeddings
This option pre‑computes caption embeddings once and reuses them, avoiding repeated text encoder passes. Turn Cache Text Embeddings ON only when your captions are static and you are not using features that modify or randomize the prompt each step, such as Differential Output Preservation, dynamic
[trigger] rewriting in captions, or anything that heavily depends on caption dropout behaviour. In that case, AI Toolkit encodes all training captions once, caches the embeddings to disk, and can drop the text encoder out of VRAM.
If you plan to use DOP, Caption Dropout, or any other dynamic prompt tricks, keep Cache Text Embeddings OFF so the text encoder can re‑encode the real prompt every batch. The Differential Output Preservation and Datasets sections explain these interactions in more detail.
Regularization – Differential Output Preservation (DOP)
The Regularization section exposes Differential Output Preservation (DOP), which helps the LoRA behave like a residual edit instead of overwriting the base model.
DOP compares the base model’s output (without LoRA) to the LoRA‑enabled output and adds a penalty when the LoRA changes aspects unrelated to your target concept. It tries to teach "what changes when the trigger is present" rather than "re‑train the entire model".
For motion / camera LoRAs, you usually do not need DOP, because motion behaviour is already fairly localized. Enabling DOP roughly doubles compute by adding extra forward passes.
For style and character LoRAs, DOP is often very helpful for keeping Wan’s strong base realism intact. A good starting configuration is:
- Differential Output Preservation: ON
- DOP Loss Multiplier: 1
- DOP Preservation Class: person for character LoRAs, or an appropriate class such as scene or landscape for style LoRAs if your build provides those options.
Important compatibility note: Differential Output Preservation rewrites or augments the prompt text each step (for example by swapping your trigger word for the preservation class word). Because of this, DOP is not compatible with Cache Text Embeddings. If you turn DOP ON, make sure Cache Text Embeddings is OFF so the text encoder sees the updated prompt every batch.
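For intuition, here is a toy sketch of a DOP‑style penalty: the LoRA‑enabled prediction on a preservation‑class prompt is pulled toward the frozen base model's prediction. The tensors are random stand‑ins; this is not AI Toolkit's implementation.

```python
# Toy sketch of a Differential-Output-Preservation-style penalty (illustrative only).
import torch
import torch.nn.functional as F

def dop_penalty(pred_with_lora: torch.Tensor, pred_base: torch.Tensor, multiplier: float = 1.0) -> torch.Tensor:
    """Penalize the LoRA for drifting away from the base model on preservation-class prompts."""
    return multiplier * F.mse_loss(pred_with_lora, pred_base.detach())

# Fake predictions for a "person" (preservation class) prompt:
base_pred = torch.randn(1, 4, 8, 8)                          # frozen base model output (no LoRA)
lora_pred = base_pred + 0.05 * torch.randn_like(base_pred)   # LoRA-enabled output
print(dop_penalty(lora_pred, base_pred).item())  # small value -> LoRA barely disturbs non-trigger content
```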
5.7 ADVANCED panel (Differential Guidance)
If your AI Toolkit build exposes the ADVANCED panel for this model, it may include Do Differential Guidance and Differential Guidance Scale.
Differential Guidance computes "with LoRA" vs "without LoRA" predictions and nudges training towards the difference between them, similar in spirit to DOP but implemented at the guidance level instead of as a separate loss term.
Practical recommendations:
- Turn Do Differential Guidance ON with a Differential Guidance Scale around 3 for targeted edit‑style LoRAs (for example "make the camera orbit", "apply neon style") where you want the LoRA to behave like a cleaner modifier.
- For very broad, heavy style LoRAs that rewrite the entire look, you can try lower scales (1–2) or leave it OFF if the LoRA feels too weak.
If you are tight on compute, you can safely leave Differential Guidance OFF for your first runs and experiment later.
5.8 DATASETS panel
Each dataset block in AI Toolkit maps to one entry in the datasets: list, but in the UI you simply configure one or more dataset cards.
A typical single Wan I2V dataset configuration looks like this:
- Target Dataset
Choose your uploaded Wan I2V video dataset folder, for example wan_orbit_clips.
- Default Caption
This caption is used when a clip has no .txt caption file. Examples:
Motion LoRA: orbit 360 around the subject
Style LoRA: cinematic neon cyberpunk style
Character LoRA: frung, person, portrait (where frung is your trigger token)
- Caption Dropout Rate
This is the probability that the caption is dropped (replaced by an empty caption) for a training sample. For Wan I2V LoRAs, a small amount of dropout encourages the model to use both visual context and text. A typical starting range is 0.05–0.10 (5–10%) when the text encoder stays loaded. If you decide to enable Cache Text Embeddings in the TRAINING panel, it is often simpler to set Caption Dropout Rate = 0 so you avoid a subset of clips permanently having no caption.
- LoRA Weight
Usually set to 1. You only change this when mixing multiple datasets and you want one dataset to count more or less in training.
- Settings → Cache Latents
Turning Cache Latents ON encodes each clip’s frames into latents once and reuses them for training, which can significantly speed up training and remove the VAE from VRAM. For video this can consume disk space and I/O bandwidth, so use it when your storage and RAM budget allow.
- Settings → Is Regularization
Leave Is Regularization OFF for your main dataset. If you add a separate regularization dataset later, you would set that dataset’s Is Regularization to ON.
- Flipping
Flip X and Flip Y mirror frames horizontally or vertically. For most video tasks you should keep both OFF, especially for motion LoRAs, where flipping can invert left/right motion semantics, and for characters with asymmetric features. For pure style LoRAs you can experiment with Flip X to increase variation.
- Resolutions
Choose one or more resolution buckets. On a 24 GB GPU you typically enable 512 and 768 and leave 1024 disabled. On 48 GB+ or H100/H200, you can enable 512, 768, and 1024. AI Toolkit will automatically assign clips to the nearest bucket and downscale as needed.
- Num Frames
Set Num Frames to the number of frames per clip you want to sample for training. A good starting point is 41. On very small GPUs (10–12 GB) with heavy quantization and offloading, you can reduce this to 21 or even 9 just to get training running, at the cost of shorter temporal context.
If you need multiple datasets (for example, a main motion dataset plus a small "style" dataset), you can add them all in the DATASETS panel and use LoRA Weight plus the Is Regularization flag to control their relative influence.
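For reference, one dataset card from this section could be summarized as a plain Python dict like the one below; the field names mirror the UI labels above rather than AI Toolkit's actual config keys, so treat them as illustrative.

```python
# Illustrative summary of one dataset card (field names follow the UI labels, not real config keys).
orbit_dataset = {
    "target_dataset": "wan_orbit_clips",
    "default_caption": "orbit 360 around the subject",
    "caption_dropout_rate": 0.05,    # 5% of samples see an empty caption
    "lora_weight": 1,
    "cache_latents": True,
    "is_regularization": False,
    "flip_x": False,
    "flip_y": False,
    "resolutions": [512, 768],       # add 1024 only on 48 GB+ / H100-class GPUs
    "num_frames": 41,
}
```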
5.9 SAMPLE panel (training previews)
The SAMPLE panel does not influence training directly; it controls how AI Toolkit periodically generates preview videos so you can pick the best checkpoint.
Configure high‑level sampling settings like this:
- Sample Every
Set Sample Every to 250. This matches the Save Every setting so each checkpoint has a corresponding set of preview videos.
- Sampler
Use a sampler compatible with Wan’s flow‑matching scheduler, typically shown as FlowMatch or similar in your build.
- Width / Height
On 24 GB GPUs, use something like 768 × 768 or a vertical format such as 704 × 1280 for samples. Avoid 1024 × 1024 preview videos unless you are comfortable with slower sampling; training itself does not require 1024² previews.
- Guidance Scale
Start with a Guidance Scale around 3.5–4, which matches many Wan 2.2 demo configs.
- Sample Steps
Set Sample Steps to 25. More steps rarely change motion quality dramatically and mostly increase time.
- Seed / Walk Seed
Set a fixed Seed like 42. Turn Walk Seed ON if you want each preview to get a different seed while still being clustered near the original.
- Num Frames
Set Num Frames in the SAMPLE panel equal to or close to your training value. If you trained with 41 frames, sample with 41 as well. Once the LoRA looks good, you can test generalisation by generating longer clips at 81 frames; training at 41 often generalises surprisingly well to 81‑frame inference.
- FPS
Usually keep FPS = 16. Changing FPS only affects playback speed, not the learned motion itself.
For prompts, add 2–4 prompt rows that mirror your training distribution. For each row, attach a control image similar to what you’ll use at inference.
6. Wan 2.2 I2V 14B LoRA Training Settings for Motion, Style and Character
Here are quick recipes for common Wan 2.2 I2V LoRA types. Treat these as starting points and adjust based on previews.
6.1 Motion / camera LoRA
Goal: teach Wan a new motion like orbit 360, orbit 180, or a specific camera swing.
Use 10–30 short clips (~3–8s) where the target motion is very clear and occupies most of the clip. Captions should explicitly describe the motion, for example orbit 180 around the subject or orbit 360 around a futuristic city.
Panel guidelines:
- MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10.
- TARGET: Linear Rank = 16.
- TRAINING: Learning Rate = 0.0001, Steps ≈ 1500–2500, Timestep Type = Linear, Timestep Bias = Balanced, DOP OFF.
- DATASETS: Resolutions at 512/768, Num Frames = 33–41 on 24 GB (up to 81 on H100/H200), Caption Dropout Rate ≈ 0.05–0.1.
Train with Save Every = 250 and Sample Every = 250. When inspecting samples, focus on whether the target motion is stable across different prompts and scenes; if it only works on near‑duplicates of your training clips, prefer improving data diversity or slightly increasing steps over pushing the bias away from Balanced.
6.2 Style LoRA (video look / grade)
Goal: change visual style while respecting Wan’s base motion and composition.
Use 10–40 images or clips that all share the same look but cover diverse scenes and subjects, for example grainy 16mm film look, high contrast, warm tint.
Panel guidelines:
- MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10.
- TARGET: Linear Rank = 16 for simple styles; 16–32 for complex or cinematic looks.
- TRAINING: Learning Rate = 0.0001, Steps ≈ 1500–2500, Timestep Type = Linear (or Shift), Timestep Bias = Favor High Noise.
- Regularization (DOP): Differential Output Preservation ON, DOP Loss Multiplier = 1, DOP Preservation Class matching your dominant subject (often person or scene), Cache Text Embeddings = OFF.
- DATASETS: Resolutions 512/768 on 24 GB (and 768–1024 on big GPUs), Num Frames = 33–41 on 24 GB, Caption Dropout Rate around 0.05 if Cache Text Embeddings is OFF.
Watch for whether the style applies consistently across scenes and lighting. If it starts to overpower content or make everything look the same, try lowering the learning rate mid‑run, stepping back to an earlier checkpoint, or reducing the LoRA rank.
6.3 Character LoRA (video likeness)
Character LoRAs on I2V are more challenging than on text‑to‑image models, but they are feasible.
Use 10–30 short clips of the same character in varied poses, scales, angles, and backgrounds; captions should always include your Trigger Word plus a class, for example frung, young woman, casual clothing. If you can gather 20–40 clips, identity robustness usually improves, but it is not strictly required to get usable results.
Panel guidelines:
- MULTISTAGE: High Noise = ON, Low Noise = ON, Switch Every = 10.
- TARGET: Linear Rank = 16 on 24 GB; 16–32 on high‑VRAM GPUs (use 32 when you have headroom and care about close‑up, high‑res faces).
- TRAINING: Learning Rate = 0.0001, Steps ≈ 2000–3000, Timestep Type = Sigmoid (or Linear), Timestep Bias = Balanced.
- Regularization (DOP): Differential Output Preservation ON, DOP Loss Multiplier = 1, DOP Preservation Class = person.
- DATASETS: Resolutions as high as your hardware allows (512/768 on 24 GB; add 1024 on big GPUs), Num Frames = 33–41 on 24 GB, or 41–81 on H100/H200.
Community experience suggests that identity and likeness lean more on the low‑noise expert, but keeping Timestep Bias = Balanced and using a shaped Timestep Type (Sigmoid) usually gives a better trade‑off between likeness and overall video stability than hard‑biasing toward low noise.
7. Troubleshooting common Wan I2V LoRA issues
Motion too fast compared to source
This usually happens if you trained with fewer frames per clip than your inference setting. For example, you might have trained at 21 or 41 frames but you’re sampling at 81 frames with FPS fixed at 16. The same motion gets "stretched" differently.
You can fix this by lowering FPS in the SAMPLE panel (for playback only), or by training and sampling at a consistent Num Frames such as 41 so temporal behaviour is more predictable.
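The underlying arithmetic is simple: playback duration is frame count divided by FPS, so the same learned motion gets stretched or compressed when the counts differ.

```python
# Quick check of how frame count and FPS interact (plain arithmetic):
for num_frames in (21, 41, 81):
    print(f"{num_frames} frames @ 16 FPS -> {num_frames / 16:.2f} s of video")
# 21 -> 1.31 s, 41 -> 2.56 s, 81 -> 5.06 s: the same motion spans very different durations.
```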
Camera doesn’t move or composition barely changes
If the camera barely moves or composition looks like the base model:
Check that you are actually training the high‑noise stage and that Timestep Bias is not set too strongly toward low timesteps. Make sure High Noise is ON in the MULTISTAGE panel and, for motion LoRAs, consider setting Timestep Bias to Favor High Noise. Also check that captions clearly describe the desired motion; Wan cannot learn motion that is neither visible nor named.
Details and faces look worse than base Wan
If your LoRA removes detail or worsens faces:
Try increasing Linear Rank slightly (for example from 16 to 32) and favouring low noise in the Timestep Bias so more training signal lands on late timesteps where identity and detail live. You can also lower the learning rate and resume from an earlier checkpoint.
LoRA overfits and only works on training‑like scenes
If the LoRA only looks correct on scenes very similar to the training data:
Reduce the total number of Steps (for example from 5000 down to 3000), increase dataset diversity, and consider enabling Differential Output Preservation if it is currently off. If DOP is already ON and the effect is still too narrow, slightly lower the LoRA rank and/or learning rate.
VRAM out‑of‑memory errors
If training frequently runs out of VRAM:
Reduce any combination of:
- resolution buckets (drop 1024 and keep 512/768),
- Num Frames (for example from 41 down to 21),
- batch size (keep it at 1 if it isn’t already).
Turn Low VRAM ON, turn Layer Offloading ON if you only have 10–12 GB VRAM and plenty of system RAM, and make sure the QUANTIZATION panel is at least as aggressive as the baseline above (Transformer = 4bit with ARA, Text Encoder = float8). If local VRAM is still not enough, consider running the same AI Toolkit job on RunComfy’s cloud with an H100 or H200 GPU, where you can keep settings much simpler.
8. Export and use your Wan I2V LoRA
Once training is complete, you can use your Wan 2.2 I2V 14B LoRA in two simple ways:
- Model playground – open the Wan 2.2 I2V 14B LoRA playground and paste the URL of your trained LoRA to quickly see how it behaves on top of the base model.
- ComfyUI workflows – start a ComfyUI instance, build a workflow, plug in your LoRA, and fine‑tune its weight and other settings for more detailed control.