This page is the overview of LoRA fine-tuning with the Ostris AI Toolkit. For a model-specific recipe, jump to one of these guides:
- FLUX.2 Dev LoRA training with AI Toolkit
- Z‑Image Turbo LoRA training with AI Toolkit
- Qwen‑Image‑Edit‑2509 LoRA training with AI Toolkit
- Wan 2.2 I2V 14B LoRA training with AI Toolkit
- Wan 2.2 T2V 14B LoRA training with AI Toolkit
By the end of this guide, you should:
- Understand the core ideas behind LoRA training (what’s really happening when you fine‑tune a model).
- Know how the AI Toolkit is organized and what each panel controls.
- Understand what key parameters (learning rate, rank, steps, noise schedule, DOP, etc.) do so you can tune them deliberately.
- Be able to train LoRAs either on your own machine or with the RunComfy Cloud AI Toolkit, and then reuse those LoRAs in your normal generation workflows.
Table of contents
- 1. What is Ostris AI Toolkit? (LoRA trainer for diffusion models)
- 2. Supported models in Ostris AI Toolkit (Flux, Wan, Z‑Image, Qwen‑Image, SDXL)
- 3. Installing Ostris AI Toolkit locally and on RunComfy Cloud AI Toolkit
- 4. Ostris AI Toolkit Web UI overview (Dashboard, Datasets, New LoRA Job)
- 5. LoRA training basics and core hyperparameters for AI Toolkit
- 6. Mapping LoRA concepts to AI Toolkit parameters
- 7. Step‑by‑step example: training a LoRA with Ostris AI Toolkit
- 8. Troubleshooting AI Toolkit LoRA training: common errors and fixes
1. What is Ostris AI Toolkit? (LoRA trainer for diffusion models)
Ostris AI Toolkit is a training suite focused on diffusion models for images and video. It does not handle language or audio models; everything it supports is either a classic DDPM‑style diffusion model (such as SD 1.5 or SDXL) or a modern diffusion‑transformer model such as Flux, Wan, Qwen‑Image, Z‑Image or OmniGen2. It is built around LoRA‑style adapters: in practice, when you fine‑tune a model with AI Toolkit you are not retraining the entire network; you are training small LoRA adapters (or similar lightweight modules) on top of a frozen base model.
Key features of Ostris AI Toolkit for LoRA training
AI Toolkit provides a common training engine and configuration system for all supported model families. Each model (Flux, Z‑Image Turbo, Wan 2.2, Qwen‑Image, SDXL, etc.) has its own preset, but they all plug into the same structure: model loading, quantization settings, LoRA/LoKr adapter definition, training hyper‑parameters, dataset handling and sampling rules. That’s why the Web UI looks familiar whether you are training an AI Toolkit Flux LoRA, a Z‑Image Turbo LoRA or a Wan 2.2 video LoRA.
On top of this engine, AI Toolkit ships with both a CLI and a full Web UI. The CLI runs jobs directly from YAML configs; the Web UI is a graphical layer over those configs. In the UI, "AI Toolkit" usually means the New Job screen where you pick a model family, choose a LoRA type and rank, set learning rate and steps, attach one or more datasets and define how often to generate sample images or videos. You get dedicated panels for Job, Model, Quantization, Target, Training, Regularization, Datasets and Sample, so you rarely need to touch raw YAML unless you want to. Whether you run it locally or via a cloud setup such as the RunComfy Cloud AI Toolkit, this workflow is the same.
Built‑in LoRA training tools in Ostris AI Toolkit
AI Toolkit bakes in a number of "batteries‑included" features that you would otherwise need to script or glue together by hand:
- Quantization and low‑VRAM modes – configurable 8‑bit / 6‑bit / 4‑bit (and 3‑bit with recovery adapters) transformer quantization plus layer offloading, so large models like Flux or Wan can be trained on 24–48 GB GPUs with controllable quality/speed trade‑offs.
- LoRA / LoKr adapters – support for standard LoRA as well as LoKr (a more compact but less universally supported variant), selectable via `Target Type`, so you can choose between maximum compatibility and smaller, higher‑capacity adapters.
- Differential Output Preservation (DOP) – a regularization loss that compares base‑model vs LoRA outputs on "regularization" images and penalizes unwanted changes, helping to reduce LoRA "bleeding", where every output starts to look like your subject.
- Differential Guidance for turbo‑style models – an optional training‑time guidance term (used heavily for Z‑Image Turbo) that focuses the update on "what should change" relative to the base model, improving adaptation on few‑step / turbo models without destroying their speed benefits.
- Multi‑stage noise training – separate high‑noise and low‑noise training stages so you can balance coarse structure learning (composition, pose) with fine detail sharpening (textures, edges).
- Latent and text‑embedding caching – `Cache Latents` and `Cache Text Embeddings` trade disk space for speed and lower VRAM, which is particularly helpful on smaller GPUs or in cloud sessions where you want to iterate quickly.
- EMA (Exponential Moving Average) – an optional smoothed copy of the LoRA weights that can make convergence more stable, especially on small datasets.
The Web UI exposes all of these features through clear controls, and because the layout is consistent across models, once you understand how AI Toolkit trains a LoRA for one base (for example, Flux), it is straightforward to apply the same reasoning to Z‑Image Turbo, Wan, Qwen‑Image and other supported diffusion models.
2. Supported models in Ostris AI Toolkit (Flux, Wan, Z‑Image, Qwen‑Image, SDXL)
The AI Toolkit currently supports the following model families:
- IMAGE models – single images (Flux, Z‑Image Turbo, Qwen‑Image, SD, etc.).
- INSTRUCTION / EDIT models – image editing / instruction following models (Qwen‑Image‑Edit, Flux Kontext, HiDream E1).
- VIDEO models – text‑to‑video and image‑to‑video (Wan 2.x series).
| Category | Model family in AI Toolkit UI | Typical purpose |
|---|---|---|
| IMAGE | FLUX.1 / FLUX.2 | Flagship FlowMatch image models; high‑quality style/character LoRAs at 1024+ resolution. |
| INSTRUCTION | FLUX.1‑Kontext‑dev | Paired/conditional image training (before/after, 360°, multi‑view, turnarounds). |
| IMAGE | Qwen‑Image | Strong bilingual text‑to‑image model; LoRAs for style/character control |
| INSTRUCTION | Qwen‑Image‑Edit, Qwen‑Image‑Edit‑2509 | Image editing / instruction‑following models; LoRAs for specific edit styles or effects |
| IMAGE | Z‑Image Turbo (w/ Training Adapter) | Distilled image model with a dedicated training adapter for LoRA fine‑tuning. |
| VIDEO | Wan 2.2 (14B) | Newer Wan video base; high‑quality text‑to‑video / image‑to‑video generation. |
| VIDEO | Wan 2.2 T2V (14B) | Wan 2.2 text‑to‑video base for cinematic, prompt‑driven video LoRAs. |
| VIDEO | Wan 2.2 I2V (14B) | Wan 2.2 image‑to‑video model for animating stills into motion. |
| VIDEO | Wan 2.2 TI2V (5B) | Efficient Wan 2.2 hybrid model; lighter 5B version for text‑ and image‑to‑video. |
| VIDEO | Wan 2.1 (1.3B / 14B) | Earlier Wan video models; smaller and larger variants for T2V. |
| VIDEO | Wan 2.1 I2V (14B‑480P / 14B‑720P) | Wan 2.1 image‑to‑video at different base resolutions. |
| IMAGE | SD 1.5, SDXL | "Classic" Stable Diffusion models; backward‑compatible LoRAs and legacy pipelines. |
| IMAGE | OmniGen2 | All‑round modern image base; general‑purpose LoRAs. |
| IMAGE | Chroma | High‑quality image model for cinematic / photoreal styles. |
| IMAGE | Lumina2 | Modern image model; good for general LoRA training. |
| IMAGE | HiDream | Image generation model related to HiDream video; style and character LoRAs. |
| INSTRUCTION | HiDream E1 | Instruction‑style / frame‑conditioned image or video training. |
| IMAGE | Flex.1 / Flex.2 | Lightweight general‑purpose image models. |
More models are sometimes added or revised, and the same Web UI structure applies across them.
3. Installing Ostris AI Toolkit locally and on RunComfy Cloud AI Toolkit
3.1 Install Ostris AI Toolkit locally on Linux and Windows
The official README on GitHub gives straightforward installation instructions for Linux and Windows.
On Linux:
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
python3 -m venv venv
source venv/bin/activate
# install PyTorch with CUDA (adjust version if needed)
pip3 install --no-cache-dir torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
--index-url https://download.pytorch.org/whl/cu126
pip3 install -r requirements.txt
On Windows, you can either follow the same pattern with `python -m venv venv` and `.\venv\Scripts\activate`, or use the community AI‑Toolkit Easy Install batch script, which wraps the whole process into a single click and automatically opens the UI in your browser.
To start the Web UI once dependencies are installed:
cd ui
npm run build_and_start
The interface will be available at `http://localhost:8675`. If you run it on a remote machine, set `AI_TOOLKIT_AUTH` to a password first so only you can access the UI (see the AI Toolkit GitHub repository for security notes).
3.2 Use RunComfy Cloud AI Toolkit for LoRA training (no local setup)
If you don’t want to deal with GPU drivers, CUDA, or local installs at all, you can use the RunComfy Cloud AI Toolkit. In this mode:
- AI Toolkit runs entirely in the cloud – you just open a browser and you’re in the UI.
- You have access to powerful GPUs (80 GB and 141 GB VRAM), ideal for heavy FLUX, Qwen‑Image, Z‑Image Turbo, or Wan LoRA training.
- Your datasets, configs, checkpoints, and past jobs live in a persistent workspace tied to your RunComfy account.
- Training, playground for model testing, and ComfyUI workflows all live in one place.
Open it directly here: Cloud AI Toolkit on RunComfy
4. Ostris AI Toolkit Web UI overview (Dashboard, Datasets, New LoRA Job)
When you open the Web UI (local or on RunComfy), the left sidebar has a small but important set of pages:
4.1 Dashboard and Training Queue
The Dashboard shows active and recent jobs at a glance. It’s mainly a quick status page.
The Training Queue page is where you:
- see each job’s state (queued, running, finished, failed),
- open logs to debug issues,
- stop or delete jobs,
- download output checkpoints and sample images.
Think of it as the "job control center". Every LoRA you train will show up here.
4.2 Dataset manager
The Datasets page lets you define named datasets that you can attach to jobs:
- You select or upload image folders or video clips.
- The UI scans them and shows resolutions, counts, and how many captions / metadata entries exist.
- Each dataset gets an internal name that later appears in the job’s `Target Dataset` dropdown.
This is where you create:
- main training datasets (your character, style, product shots),
- optional regularization datasets (other people, other trucks, generic backgrounds, etc.) for DOP or classic regularization.
4.3 New Job: the core LoRA configuration screen
The New Job page is the heart of AI Toolkit. A job is essentially:
Train a LoRA of type X on model Y, using dataset Z, with these hyperparameters.
The screen is divided into panels:
- JOB – naming and GPU selection.
- MODEL – which base model to fine‑tune.
- QUANTIZATION – how aggressively the base model is compressed.
- TARGET – LoRA vs LoKr and rank.
- SAVE – checkpoint precision and frequency.
- TRAINING – learning rate, steps, optimizer, timestep schedule.
- ADVANCED / Regularization – EMA, Differential Guidance, DOP.
- DATASETS – which dataset(s) to train on, and how.
- SAMPLE – how often to generate reference images or videos while training.
The rest of this guide is mostly about helping you understand how these panels relate back to the core LoRA concepts.
5. LoRA training basics and core hyperparameters for AI Toolkit
Before touching any AI Toolkit controls, it helps to have a mental model of what LoRA training is doing behind the scenes.
5.1 How LoRA works inside diffusion models
A modern diffusion model is mostly a stack of transformer blocks with large weight matrices. In vanilla fine‑tuning, you would update all these weights directly, which is expensive and easy to overfit.
In all supported models (such as Flux, Z‑Image Turbo, Wan, Qwen‑Image), the backbone is a large diffusion transformer. LoRA does not replace the original weight matrix W; instead, it adds a small low‑rank update built from two learned matrices A and B. You can think of it as: W_new = W + alpha A B, where W is the frozen original weight matrix, A and B are small trainable matrices, and alpha is a scaling factor that controls how strong the LoRA update is at inference time.
The rank determines the width of matrices A and B, and therefore how complex the LoRA update can be. A higher rank makes the LoRA more expressive but also heavier in terms of parameters and compute. A lower rank gives you a smaller, more focused adapter that is lighter and generally harder to overfit.
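If you prefer to see the idea in code, here is a minimal PyTorch sketch of a LoRA‑wrapped linear layer; the shapes, names and scaling convention are illustrative, not AI Toolkit internals.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / rank) * (x A) B, with W frozen."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the original weight matrix W stays frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))  # zero init: LoRA starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A) @ self.B

layer = LoRALinear(nn.Linear(3072, 3072), rank=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the small A and B matrices are trained (2 * 3072 * 16 parameters here)
```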
5.2 Key LoRA hyperparameters explained
These names appear in every trainer; AI Toolkit just exposes them clearly.
Learning Rate (Learning Rate)
- Controls how large a step we take in parameter space each time the optimizer updates the LoRA.
- Too low: training is slow and might not fit your dataset well.
- Too high: the loss bounces or explodes, and the LoRA becomes noisy, unstable, or wildly overfitted.
For diffusion LoRAs, 0.0001 is a very sensible default. Many published Wan and Flux configs fall in the 0.0001 – 0.0002 range.
Batch Size and Gradient Accumulation
- `Batch Size` is how many images/clips the model sees in parallel for each gradient computation.
- `Gradient Accumulation` means "keep accumulating gradients for N batches before actually applying an update", which simulates a larger batch without needing more VRAM.
Effective batch size is: Batch Size × Gradient Accumulation
Higher effective batch gives smoother gradients and better generalization, but costs more compute. Many people run with Batch Size = 1 and Gradient Accumulation = 2–4 on 24 GB GPUs.
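As a quick illustration (plain PyTorch, not AI Toolkit’s actual training loop), this is what gradient accumulation amounts to: several small forward/backward passes whose gradients are averaged before a single optimizer update.

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
grad_accum = 4                                   # effective batch = batch_size (1) * grad_accum (4)

optimizer.zero_grad()
for _ in range(grad_accum):
    x, y = torch.randn(1, 8), torch.randn(1, 1)  # one batch_size = 1 micro-batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / grad_accum).backward()               # divide so gradients average rather than sum
optimizer.step()                                 # one optimizer update per grad_accum micro-batches
```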
Steps (Steps)
This is how many optimizer updates you will do. It’s the main knob for "how long do we train".
- Too few steps → underfitting: the LoRA barely changes the base model.
- Too many steps → overfitting: the LoRA memorizes training images and bleeds into everything.
The right number depends on: dataset size, image/video variety, rank, learning rate.
For typical 20–50 image character LoRAs on modern models, 2000–3000 steps is a good starting range.
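To sanity‑check a step count against your dataset size, a quick back‑of‑the‑envelope calculation helps; the numbers below are just the example defaults used elsewhere in this guide.

```python
steps = 3000
batch_size = 1
grad_accum = 4
dataset_size = 30               # e.g. a 30-image character dataset

effective_batch = batch_size * grad_accum
images_seen = steps * effective_batch
passes_over_dataset = images_seen / dataset_size
print(effective_batch, images_seen, passes_over_dataset)   # 4 12000 400.0
```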
Rank (Linear Rank)
- Rank determines how many degrees of freedom your LoRA has.
- Doubling the rank roughly doubles the LoRA’s capacity and parameter count.
Practical intuition:
- Rank 16–32 is enough for most characters and styles on large models like Flux or Wan.
- Higher ranks make it easier to overfit small datasets; lower ranks force the LoRA to generalize.
Weight Decay (Weight Decay)
Weight decay is a standard regularization trick: it gently pulls weights toward zero at each step.
- It reduces the chance that the LoRA will "snap" to extreme values that perfectly recreate training images but don’t generalize.
- Values like `0.0001` are common and usually safe. You rarely need to touch it until you see obvious overfitting.
Timestep schedule
Diffusion models learn to denoise across a range of noise levels. You choose which timesteps to sample more often:
- High noise: model learns coarse structure, composition, big shapes.
- Low noise: model learns fine textures and details.
- Mid noise: where structure and detail meet; great for faces and characters.
The Timestep Type and Timestep Bias parameters in AI Toolkit are just UI handles for this scheduling, which we’ll unpack in the parameter section.
Dataset composition and captions
Even with perfect hyperparameters, bad data gives a bad LoRA:
- Use clean, varied images that all match the concept (same person, same brand, same style) but with different poses, lighting, and backgrounds.
- Captions should clearly tie a unique trigger word to the concept so you can activate the LoRA later without breaking the base model’s vocabulary.
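If you caption with plain `.txt` files stored next to each image (a common dataset convention; AI Toolkit also accepts JSONL metadata, as mentioned later in this guide), a small helper like this sketch can stamp a trigger word into any missing captions. The folder path and trigger token are placeholders.

```python
from pathlib import Path

dataset_dir = Path("datasets/sks_char_neo")   # placeholder path to your image folder
trigger = "sks_char_neo"                      # placeholder trigger word

for image_path in sorted(dataset_dir.glob("*.png")):
    caption_path = image_path.with_suffix(".txt")
    if not caption_path.exists():
        # Minimal fallback caption: trigger word plus a short description you then refine by hand.
        caption_path.write_text(f"photo of {trigger}", encoding="utf-8")
        print(f"wrote {caption_path.name}")
```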
On video LoRAs (Wan, HiDream E1), you have the same logic but with short clips instead of individual images, and frame sampling becomes part of the dataset design.
6. Mapping LoRA concepts to AI Toolkit parameters
Now we’ll walk through the New Job screen panel by panel and connect each parameter to the concepts above.
6.1 JOB panel: project, GPU, and trigger word
The JOB panel is simple but important:
Training Name - This is just the job’s label and becomes part of the output folder and file names. Many people include both the model and trigger word, e.g. flux_dev_skschar_v1.
GPU ID - On a local install this selects your physical GPU. On the cloud AI Toolkit on RunComfy, leave this at the default; the actual GPU type (H100 / H200, etc.) is chosen later when you start the job from the Training Queue.
Trigger Word - If you put a word here, AI Toolkit will prepend it to all captions in your dataset at training time (without permanently editing your files). This is handy if your captions don’t already have a consistent trigger. Use a nonsense token that the base model doesn’t already know (e.g. sks_char_neo), so the LoRA doesn’t compete with existing meanings.
6.2 MODEL panel: choosing and loading the base model
Model Architecture is where you pick from the model list (Flux, Z‑Image Turbo, Wan 2.2, Qwen‑Image, etc.). When you choose one:
- AI Toolkit loads a preset configuration tailored to that model: sampling type, noise schedule defaults, sometimes adapter paths.
Name or Path lets you override the default Hugging Face / model hub path:
- Leave it blank or default → AI Toolkit downloads the default base model.
- Point it to a local path → AI Toolkit uses your custom checkpoint (e.g. a Flux finetune you like).
If the model is gated (Flux.1‑dev, Flux.2‑dev, some Wan variants, etc.), you must accept the license and set `HF_TOKEN` in a `.env` file so AI Toolkit can download it.
Depending on the model, you’ll also see extra flags like Low VRAM or Layer Offloading here or in closely related panels:
- `Low VRAM` compresses and offloads parts of the model so it fits on smaller GPUs, at the cost of speed.
- `Layer Offloading` aggressively shuffles parts of the model between CPU and GPU; only use it if standard Low VRAM isn’t enough, as it can be slower and occasionally less stable.
These switches don’t change what the LoRA learns; they just change how AI Toolkit packs the base model into memory, mainly trading speed and stability for the ability to fit the model on your hardware.
6.3 QUANTIZATION panel: precision vs VRAM
The QUANTIZATION panel usually has:
- `Transformer` (e.g. `float8`, `6-bit`, `4-bit`, `3-bit ARA`),
- `Text Encoder` (typically `float8`, the default).
What they mean:
- The transformer is the big, heavy part of the model that processes image latents and cross‑attention with text.
- The text encoder turns prompts into token embeddings.
Quantizing the transformer:
- `float8` is the safest and most precise; it uses more VRAM but has minimal quality loss.
- `6-bit` is a strong compromise for 24 GB GPUs; small quality hit for decent savings.
- `4-bit` and `3-bit ARA` are more aggressive; `3-bit ARA` combines 3‑bit weights with an accuracy recovery adapter that partially restores precision.
Quantizing the text encoder:
- Text encoders are much smaller, so they’re usually kept at `float8`.
- Some advanced setups freeze or unload the text encoder entirely (see `Unload TE` and `Cache Text Embeddings` later); in that case, its quantization matters less.
Practically:
- On a 24 GB GPU fine‑tuning Flux or Wan, `Transformer = 6-bit`, `Text Encoder = float8` is a very workable starting point.
- If you have 48 GB+, stick to `float8` everywhere unless you need the extra memory for very high resolutions or video frame counts.
6.4 TARGET panel: LoRA type and rank
The TARGET panel describes the adapter you’re training:
- `Target Type` – Usually `LoRA`. Some builds also show `LoKr` (Low‑Rank Kronecker), a slightly different scheme that can be more parameter‑efficient but is not universally supported by every inference tool. For maximum compatibility, especially if you plan to use your LoRA in many different ComfyUI or Automatic1111 setups, `LoRA` is the safe default.
- `Linear Rank` – This is the LoRA rank we discussed earlier: higher rank means more capacity, a larger LoRA file, more VRAM usage, and a higher risk of overfitting on small datasets. Intuition for modern diffusion transformers (Flux, Z‑Image Turbo, Wan 2.x, Qwen‑Image, OmniGen2, Chroma, Lumina2, etc.):
  - 8–16: compact and generalizing. This is a good starting range for strong bases like Z‑Image Turbo and many SDXL / SD 1.5 setups, especially when your dataset is small (5–40 images or a few short clips).
  - 16–32: typical range for larger‑capacity style/character LoRAs on models like Flux, Wan 2.x, Qwen and other big image/video backbones. In practice you usually start at 16 and only push to 32 if you have enough data and the LoRA still feels too weak.
  - 64+: rarely necessary. Only consider ranks this high if you have a large, diverse dataset, you intentionally want a very strong style or domain shift, and you have plenty of VRAM; most published AI Toolkit recipes never need to go this high.
On SD 1.5 / SDXL you might also see a Conv Rank (convolution rank), which focuses more on texture and style layers. Higher Conv Rank emphasizes how the image is rendered (brush strokes, noise pattern), while Linear Rank leans more on what is in the image.
6.5 SAVE panel: checkpoint precision and save frequency
SAVE controls how your LoRA checkpoints are written:
- `Data Type` – `BF16` (bfloat16) is a great default: numerically stable and efficient. `FP16` is slightly more precise but not noticeably different for typical LoRAs. `FP32` is very precise and very heavy; use only if you know you need it.
- `Save Every` – The number of steps between checkpoints. If you set `Save Every = 250` and `Steps = 3000`, you’ll potentially get 12 checkpoints (but see the next field). You’ll usually want `Save Every` to match `Sample Every` in the SAMPLE panel so that each checkpoint has matching previews.
- `Max Step Saves to Keep` – How many of those checkpoints to keep on disk. If this is `4`, only the 4 most recent ones are preserved; older ones are deleted to save space.
6.6 TRAINING panel: optimizer, steps, and noise schedule
Batch Size and Gradient Accumulation
As mentioned earlier:
- `Batch Size` = images/clips per forward pass.
- `Gradient Accumulation` = how many such passes you stack before one optimizer update.
If VRAM is tight, you might do:
- `Batch Size = 1`, `Gradient Accumulation = 4` → behaves like batch size 4 but takes four times as many passes.
Always ensure your effective batch size is no larger than your dataset size; you never want to ask for 16 images per step when you only have 10 total.
Steps
This is total optimizer steps, not "epochs".
- 2000–3000 steps for Flux / Qwen / Z‑Image Turbo / OmniGen2 / Chroma (and many Wan 2.x LoRAs) is a common baseline for 20–50 image or small‑clip datasets.
It’s often better to train a bit less and keep a mid‑run checkpoint than to push to absurd step counts and hope the last one is best.
Optimizer (Optimizer)
You’ll typically see:
- `AdamW8Bit` – AdamW with 8‑bit optimizer states. This saves memory and works very well for small‑to‑medium datasets.
- `Adafactor` – more memory‑efficient, scales to massive datasets, but can be trickier to tune.
For most LoRAs in AI Toolkit, `AdamW8Bit` is the right choice unless you’re hitting optimizer‑state OOM errors.
Learning Rate
A good default is 0.0001. If:
- the LoRA barely seems to learn, you can try `0.00015–0.0002`,
- you see rapid overfitting or noisy samples, try `0.00005–0.00008`.
Avoid jumping straight to high rates like 0.0005 unless a model‑specific guide tells you to (e.g. some experimental Turbo configs).
Weight Decay
As described before, 0.0001 is a nice "gentle regularization" default. If your LoRA is clearly memorizing training images even at modest steps, nudging this higher is one of the tools you have.
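For reference, here is roughly what these settings map to in code; this is illustrative only (AI Toolkit constructs the optimizer for you), and it assumes the bitsandbytes package, which provides the 8‑bit AdamW variant.

```python
import torch
import bitsandbytes as bnb

# Stand-in for the trainable LoRA parameters (the A/B matrices); shapes here are arbitrary.
lora_params = [
    torch.nn.Parameter(torch.zeros(16, 3072)),
    torch.nn.Parameter(torch.zeros(3072, 16)),
]

optimizer = bnb.optim.AdamW8bit(
    lora_params,
    lr=1e-4,            # Learning Rate = 0.0001
    weight_decay=1e-4,  # Weight Decay = 0.0001
)
```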
Timestep Type and Timestep Bias
These two parameters shape which diffusion timesteps your training batches focus on.
- `Timestep Type` can be:
  - `Linear` – sample timesteps evenly across the whole noise range.
  - `Sigmoid` – concentrate on mid‑range timesteps (good for faces/characters).
  - `Weighted` or other presets – model‑specific schedules.
- `Timestep Bias` can be:
  - `Balanced` – no extra bias; matches the `Timestep Type` distribution.
  - `High Noise` – skew toward early timesteps (very noisy latents); emphasizes global structure and composition.
  - `Low Noise` – skew toward later timesteps (almost clean images); emphasizes fine textures.
For character LoRAs on FlowMatch models, Weighted + Balanced is a very solid starting point: the LoRA learns the concept where the model is "halfway" through denoising, which tends to match what you see at inference.
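To make the `Sigmoid` option above more concrete, here is a rough sketch of sigmoid‑weighted (logit‑normal style) timestep sampling; AI Toolkit’s exact weighting may differ, but the effect is the same: mid‑range noise levels are drawn far more often than the extremes.

```python
import torch

def sample_timesteps(batch_size: int) -> torch.Tensor:
    # sigmoid of a standard normal concentrates samples around t = 0.5 (mid noise)
    return torch.sigmoid(torch.randn(batch_size))

t = sample_timesteps(100_000)
print(round(((t >= 0.4) & (t <= 0.6)).float().mean().item(), 2))  # roughly 0.31: lots of mid-noise steps
print(round((t < 0.1).float().mean().item(), 2))                  # roughly 0.01: few near-clean extremes
```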
Sampler / Noise Type in training
On older SD models, AI Toolkit uses DDPM‑style samplers; for FlowMatch models like Flux, Z‑Image Turbo, Wan 2.x, it uses FlowMatch samplers by default. You normally don’t need to change this—the model preset sets the appropriate Timestep Type and sampler internally.
EMA (Exponential Moving Average)
- `Use EMA` toggles whether AI Toolkit keeps a smoothed copy of the LoRA weights over time.
- If enabled, `EMA Decay` (e.g. 0.99) controls how quickly the EMA forgets old updates:
  - 0.9 = reacts quickly, less smooth.
  - 0.99 = smoother.
  - 0.999+ = very smooth but slow to adapt.
EMA can improve stability on small datasets but consumes extra memory. On tight VRAM budgets, it’s reasonable to keep Use EMA off unless a specific guide recommends it.
Text Encoder Optimizations
- `Unload TE` – unloads the text encoder from VRAM between steps. Saves memory but forces frequent re‑loading from disk, which can be slow on HDDs.
- `Cache Text Embeddings` – runs the text encoder once per caption, then stores the embeddings; later steps reuse those embeddings without re‑running the encoder. This trades disk space for speed/VRAM.
For most workflows:
- If you have enough VRAM: leave both off.
- If you’re tight on VRAM but have fast SSD storage and your captions are effectively static (no Differential Output Preservation, no on‑the‑fly `[trigger]` rewriting, no heavy caption dropout that depends on per‑step text changes), turn on `Cache Text Embeddings` so AI Toolkit can encode each caption once and free the text encoder.
- If you are using features that modify prompts each step (for example Differential Output Preservation (DOP), dynamic trigger substitution in captions, or any setup that relies on per‑step caption dropout behaviour), keep `Cache Text Embeddings` = OFF even when VRAM is tight, so the text encoder can re‑encode the real prompt every batch.
- Only use `Unload TE` when absolutely necessary (for very narrow trigger‑only LoRAs where dataset captions are ignored), since it completely disables caption‑based training.
6.7 ADVANCED / Regularization panel: DOP and Differential Guidance
Differential Output Preservation (Differential Output Preservation)
When you toggle this on, you’re asking AI Toolkit to:
- Run both the base model and the LoRA‑augmented model on a set of "regularization" images.
- Add a loss term that penalizes the LoRA for changing outputs that should remain unchanged.
Controls:
- `DOP Loss Multiplier` – how strong this preservation loss is; 0.1–1.0 is typical. Think of 1.0 as "take this preservation very seriously".
- `DOP Preservation Class` – a text label describing what you’re trying to protect, like `"person"` or `"truck"`. This helps the text encoder understand the regularization captions.
To use DOP effectively you must:
- Have at least one dataset marked as `Is Regularization` in the DATASETS panel.
- Caption those images without your LoRA trigger word (these are "generic" examples).
Good scenarios for DOP:
- Your character LoRA makes every person look like your subject.
- Your product LoRA turns all logos into your brand, even when you don’t use the trigger word.
Blank Prompt Preservation is a variant where the regularization runs with empty prompts, encouraging the LoRA not to disturb basic "unprompted" behavior.
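Conceptually, the extra preservation term looks something like the sketch below (not AI Toolkit’s actual implementation); the LoRA is pulled toward reproducing the frozen base model’s prediction on regularization images, scaled by the multiplier.

```python
import torch
import torch.nn.functional as F

def dop_term(base_pred: torch.Tensor, lora_pred: torch.Tensor, multiplier: float = 0.5) -> torch.Tensor:
    # base_pred: denoising prediction on a regularization image with the LoRA disabled
    # lora_pred: the same prediction with the LoRA active
    # Penalize the LoRA for drifting away from the base model's behavior on these images.
    return multiplier * F.mse_loss(lora_pred, base_pred.detach())

# total_loss = denoising_loss_on_training_images + dop_term(base_pred, lora_pred, multiplier=0.5)
```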
Do Differential Guidance (Do Differential Guidance)
Primarily used for Z‑Image Turbo LoRAs:
- AI Toolkit compares base and adapted outputs and uses a difference signal to sharpen what the LoRA should change.
- `Differential Guidance Scale` controls how strongly this difference influences the training updates; the Hugging Face Z‑Image Turbo LoRA guide uses example values that work well in practice.
Enabling Differential Guidance:
- Helps Z‑Image Turbo LoRAs adapt deeply despite the underlying few‑step distillation.
- Works best when combined with cached text embeddings and carefully tuned learning rates and steps.
For non‑turbo models (Flux, Qwen, SDXL), you usually leave Do Differential Guidance off unless a model‑specific tutorial says otherwise.
6.8 DATASETS panel: what you actually train on
Each dataset block in the DATASETS panel corresponds to one dataset from the Datasets page.
Key fields:
- `Target Dataset` – which dataset this block refers to.
- `LoRA Weight` – relative importance of this dataset compared to others in the same job.
- `Default Caption` – fallback caption applied when an image has no caption file.
- `Caption Dropout Rate`
- `Num Frames` (for video models)
- `Cache Latents`
- `Is Regularization`
- `Flip X`, `Flip Y`
- `Resolutions` (256–1536 buckets)
What they mean in practice:
- Combining datasets with `LoRA Weight`
  If you have multiple datasets (e.g. "character close‑ups" and "full‑body shots"), you can balance them by giving one a higher `LoRA Weight`. A dataset with weight 2 will be sampled roughly twice as often as one with weight 1.
- `Default Caption` and `Caption Dropout Rate`
  `Default Caption` is useful if you forgot to caption some images and want to give them at least a minimal description (including the trigger word). `Caption Dropout Rate` randomly removes or blanks captions for some training examples:
  - Near 0 → the LoRA learns a strong dependency on the caption.
  - Near 1 → the LoRA behaves more like a "style always on" modifier.
- `Is Regularization`
  Mark this when the dataset should be used for DOP / regularization, not as main training data. These images should not contain your trigger word and usually cover generic examples (other people, trucks, etc.).
- `Cache Latents`
  When enabled, AI Toolkit pre‑computes latent encodings of your images and saves them, so later training steps don’t have to re‑encode each image. Training speeds up, but your disk usage jumps: hundreds or thousands of images at high resolution can consume tens of gigabytes. You’ll need to manually clean these latents if you don’t want them persisting forever.
- `Num Frames` (video only)
  For Wan/HiDream LoRAs, this decides how many frames are sampled from each clip during training. More frames → better motion learning but higher VRAM; presets generally choose sensible defaults per model.
- `Flip X` and `Flip Y`
  Automatic data augmentation: `Flip X` (horizontal flip) doubles your dataset but mirrors everything, including asymmetrical features and text. `Flip Y` (vertical flip) rarely makes sense for realistic images.
- `Resolutions`
  These define which image sizes AI Toolkit will "bucket" your images into. It only shrinks images to fit the nearest bucket; it never upscales. If you enable, say, 768 and 1024:
  - 900×900 images → shrunk to 768×768.
  - 1200×1200 images → shrunk to 1024×1024.
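A tiny sketch of that bucketing rule (square buckets only, for clarity; real bucketing also handles aspect ratios, and the fallback for images smaller than every bucket is an assumption here):

```python
def pick_bucket(image_size: int, buckets=(768, 1024)) -> int:
    # Choose the largest enabled bucket that does not exceed the image: shrink only, never upscale.
    candidates = [b for b in sorted(buckets) if b <= image_size]
    return candidates[-1] if candidates else min(buckets)  # assumed fallback for very small images

print(pick_bucket(900))   # 768  (matches the 900x900 example above)
print(pick_bucket(1200))  # 1024 (matches the 1200x1200 example above)
```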
6.9 SAMPLE panel: seeing your LoRA learn in real time
The SAMPLE panel defines how AI Toolkit generates preview images or videos during training.
Top‑level fields:
- `Sample Every` – how many steps between previews.
- `Sampler` – `FlowMatch` or `DDPM`, depending on model.
- `Width` / `Height` – preview resolution.
- `Seed` and `Walk Seed`.
- `Guidance Scale`.
- `Num Frames` and `FPS` (for video previews).
- `Sample Steps`.
- Advanced toggles: `Skip First Sample`, `Force First Sample`, `Disable Sampling`.
Below that, you can add multiple Sample Prompts, each with its own prompt text, optional per‑prompt resolution/seed, LoRA Scale, and an optional control image.
How this ties back to training:
- `Sample Every` vs `Save Every`: It’s best if these two match so that every saved checkpoint has a corresponding set of preview images. If you change one, change the other.
- `Sampler`: Stick to the sampler recommended by the model preset: `FlowMatch` for Flux, Z‑Image Turbo, Wan, OmniGen2, etc.; `DDPM` for SD 1.5 / SDXL.
- Preview resolution and steps:
  - 1024×1024 with `Sample Steps = 20–25` gives clear previews without being too slow for most image models.
Num FramesandFPSproduce more realistic previews but are heavy; presets are usually tuned per model. - Seeds and
Walk Seed - A fixed
SeedwithWalk Seedoff means every checkpoint uses exactly the same random noise, so you can directly compare how the LoRA’s outputs evolve. - Enabling
Walk Seedincrements the seed per prompt, adding variety. Nice for browsing, but slightly harder to compare step‑by‑step.
In practice, many users:
- keep
Sample Every = Save Every = 250, - set 3–6 sample prompts covering typical use cases,
- keep at least one prompt that is identical across all checkpoints so they can visually track convergence.
7. Step‑by‑step example: training a LoRA with Ostris AI Toolkit
To make this concrete, here is an end‑to‑end example you can adapt to any supported image model (Flux, Omnigen2, Z‑Image Turbo, Qwen‑Image, etc.). I’ll keep numbers in safe ranges rather than hyper‑optimized for any one model.
Step 1 – Prepare your dataset
- Collect 25–40 high‑quality images of your concept (a person, a product, a style).
- Resize or crop them so the main subject is visible and not tiny in the frame.
- Caption each image with:
- a unique trigger word (e.g.
sks_char_neo), - a concise description:
"portrait photo of sks_char_neo, studio lighting, 35mm lens".
Step 2 – Create a dataset in AI Toolkit
- Go to Datasets → New Dataset in the UI.
- Upload your images (and caption files or JSONL if you have them).
- Confirm that the dataset shows the correct number of images and a reasonable resolution distribution (most near 768–1024 on modern models).
Optionally:
- Create a second dataset of generic people or objects (similar class but not your subject) if you think you’ll need DOP later; leave `Is Regularization` off for now and enable it when you decide to use it.
Step 3 – Configure a new LoRA job
On the New Job page:
- JOB
  - `Training Name`: `flux_sks_char_neo_v1` (or similar).
  - `GPU ID`: leave at the default unless you know you need another.
  - `Trigger Word`: `sks_char_neo` (only if your captions don’t already include it).
- MODEL
  - `Model Architecture`: your chosen base (e.g. FLUX.1, Z‑Image Turbo, Qwen‑Image).
  - `Name or Path`: leave default unless you have a specific checkpoint.
  - Enable `Low VRAM` only if VRAM is tight.
- QUANTIZATION
  - `Transformer`: `6-bit` on 24 GB GPUs, `float8` if you have headroom.
  - `Text Encoder`: `float8` (default).
- TARGET
  - `Target Type`: `LoRA`.
  - `Linear Rank`: 32 for most models; 16 if VRAM is tight or the base is extremely strong.
- SAVE
  - `Data Type`: `BF16`.
  - `Save Every`: 250.
  - `Max Step Saves to Keep`: 4.
- TRAINING
  - `Batch Size`: 1.
  - `Gradient Accumulation`: 4.
  - `Steps`: 3000.
  - `Optimizer`: `AdamW8Bit`.
  - `Learning Rate`: `0.0001`.
  - `Weight Decay`: `0.0001`.
  - `Timestep Type`: `Sigmoid` or the model’s recommended default.
  - `Timestep Bias`: `Balanced`.
  - `Use EMA`: off unless you have plenty of memory.
- ADVANCED / Regularization
  - Leave `Differential Output Preservation` and `Do Differential Guidance` off for your first run unless your model requires it (Z‑Image Turbo is the main one that benefits from Differential Guidance out of the box).
- DATASETS
  - `Target Dataset`: your main dataset.
  - `LoRA Weight`: 1.
  - `Default Caption`: leave empty if all images already have captions.
  - `Caption Dropout Rate`: `0.0–0.1` so the LoRA strongly relies on your trigger word.
  - `Cache Latents`: optional; turn on if you’re fine with extra disk usage and want faster training.
  - `Is Regularization`: off for this main dataset.
  - `Resolutions`: enable 768 and 1024 (or as your GPU allows).
- SAMPLE
  - `Sample Every`: 250 (match Save).
  - `Sampler`: use the default (`FlowMatch` or `DDPM` depending on model).
  - `Width` / `Height`: 1024×1024.
  - `Seed`: any fixed number (42 is fine); set `Walk Seed` to off if you want directly comparable previews.
  - `Guidance Scale`: use the model’s suggested default.
  - `Sample Steps`: 20–25.
  - Add 3–5 `Sample Prompts` covering your typical use cases.
Click Create Job. The job appears in Training Queue; open its logs to confirm it starts correctly.
Step 4 – Monitor samples and adjust
Each time you hit a multiple of 250 steps, AI Toolkit will:
- save a new checkpoint,
- generate sample images for your prompts.
Watch for:
- Underfitting – early checkpoints look identical to the base model; the trigger word barely changes anything.
  → Consider increasing `Steps` slightly (restart training with 4000) or bumping `Learning Rate` a bit (e.g. 0.0001 → 0.00015).
- Overfitting / bleeding – outputs become almost photocopies of your training images, or your trigger word starts hijacking generic prompts.
  → Try a lower `Linear Rank`, fewer `Steps`, slightly higher `Weight Decay`, or enable DOP with a carefully prepared regularization dataset.
Once you see a checkpoint that consistently looks good across several prompts, note its step number.
Step 5 – Export and use your LoRA
From the Training Queue or from your AI Toolkit output folder:
- Download the best checkpoint (a `.safetensors` LoRA file).
- If you’re using the RunComfy Cloud AI Toolkit, these LoRA files will also be stored on your Custom Models page, so you can copy the model link, download them, and test them in the model playground or ComfyUI.
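For a quick sanity check outside ComfyUI, you can load the file with diffusers; the base model id, file path and prompt below are placeholders you would swap for your own (and gated bases still need your Hugging Face token).

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",            # placeholder: use the same base you trained on
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("path/to/flux_sks_char_neo_v1.safetensors")  # the checkpoint you downloaded

image = pipe("portrait photo of sks_char_neo, studio lighting", num_inference_steps=28).images[0]
image.save("lora_test.png")
```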
8. Troubleshooting AI Toolkit LoRA training: common errors and fixes
Dataset not found or empty
Symptoms:
- Job exits immediately.
- Logs mention "no images found" or similar.
Checks:
- In Datasets, confirm the dataset shows the expected image count.
- Ensure `Target Dataset` in the job matches the correct dataset.
- If using JSONL metadata, verify the file is present and correctly formatted.
Base model download / Hugging Face errors
Symptoms:
- 403 / 404 errors when downloading the model.
- Log messages about missing access.
Fixes:
- Accept the model’s license on Hugging Face if it’s gated (Flux dev, some Wan variants), as described in the MODEL panel section above.
- Add `HF_TOKEN=your_read_token` to a `.env` file in the AI Toolkit root.
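If you want to confirm access before launching a job, a small check like this (an optional extra, not an AI Toolkit feature) will fail with a gated‑repo / 401–403 error until the license is accepted and a valid token is available:

```python
from huggingface_hub import model_info

# Uses the HF_TOKEN environment variable or your cached `huggingface-cli login` token.
info = model_info("black-forest-labs/FLUX.1-dev")   # placeholder: the gated repo you plan to train on
print(info.id)                                       # raises an access error if the license is not accepted
```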
CUDA out‑of‑memory during training or sampling
Symptoms:
- "CUDA out of memory" errors when the job starts, or when generating samples.
Options:
- In DATASETS:
  - Disable high resolutions (1280, 1536) and stick to 768/1024.
- In TARGET:
  - Lower `Linear Rank` (32 → 16).
- In QUANTIZATION / MODEL:
  - Turn on `Low VRAM`.
  - Use a more aggressive transformer quantization (float8 → 6‑bit).
- In TRAINING:
  - Reduce `Batch Size` or `Gradient Accumulation`.
- In SAMPLE:
  - Lower preview resolution and `Sample Steps`.
  - Reduce `Num Frames` for video previews.
If you’re running in RunComfy Cloud AI Toolkit, the easy escape hatch is to bump the job to a higher‑VRAM GPU tier and re‑run it, often dropping some of the aggressive quantization / Low VRAM settings and using a simpler, faster config. With more VRAM and fewer memory‑saving hacks, each step runs quicker and you can iterate through more checkpoints instead of spending time micromanaging VRAM.
LoRA overfits and hijacks the base model
Symptoms:
- Every person looks like your subject.
- All trucks look like your specific product, even without trigger word.
Mitigations:
- Lower `Linear Rank`.
- Use an earlier checkpoint (e.g. 2000 steps instead of 3000).
- Slightly increase `Weight Decay`.
- Add a regularization dataset of similar‑class examples (`Is Regularization` = on).
- Enable `Differential Output Preservation` with a reasonable `DOP Loss Multiplier` (e.g. 0.2–0.5) and a suitable `DOP Preservation Class` (`"person"`, `"truck"`, etc.).
Ready to start training?

