How to Avoid OOM in AI Toolkit: Safe Starter Settings for First Successful Runs

A preflight guide for AI Toolkit jobs: check batch size, resolutions, frames, gradient checkpointing, and preview sampling before you create a job, so your first run is more likely to succeed.

This page is not the “max speed” setup.

It is the first successful run setup.

If your goal is to stop wasting retries, reduce OOMs, and get to a usable training run faster, start here.

The rule is simple:

Optimize for proof of stability first. Optimize for speed later.

What this guide is for

Use this page if:

  • you are about to create a new AI Toolkit job
  • you want safer starting settings
  • you would rather get a stable first run than spend hours debugging OOM
  • you want a practical “don’t-start-with-dangerous-settings” checklist

If you are already seeing a CUDA out of memory error, go to:


60-Second OOM Preflight Checklist

Before you click Create Job:

  • ✅ Keep Batch Size conservative
  • ✅ In Datasets, start with conservative Resolutions
  • ✅ In Sample, keep previews cheaper than your final ambition
  • ✅ Click Show Advanced and make sure gradient_checkpointing: true
  • ✅ For video, start with conservative Num Frames
  • ✅ Use model-specific low-memory features only if the model guide recommends them
  • ✅ Do not try multiple risky changes in your first run

RunComfy also helps with this at the product level. When you save a training job, RunComfy checks whether your current settings may include high-risk combinations — for example, overly aggressive batch size, frames, resolution, or turning off memory-saving defaults too early. The goal is simple: help you catch risky configs before they burn GPU time, cost money, or send you into hours of avoidable trial and error.

That does not replace model-specific judgment, but it gives you a safer starting point and makes it easier to begin training efficiently.


1) The most important mindset shift

Most failed first runs are not caused by “bad learning rate.”

They are caused by:

  • too much resolution
  • too many frames
  • too large a batch size
  • too expensive preview sampling
  • turning off memory-saving defaults too early

So your first successful run should look intentionally boring.

That is a good thing.


2) Safe starter settings for image models

FLUX-dev / Flex-like large image models

Good first run

  • Batch Size: 1
  • Gradient Checkpointing: ON
  • Datasets > Resolutions: start with 512 + 768
  • add 1024 only after stability
  • Sample: keep preview moderate, or temporarily disable sampling if you are just validating the run
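As a rough sketch, these starter settings map onto the advanced YAML (using the key names from the Show Advanced section of this guide; exact fields can vary by model) like this:

```yaml
# Conservative FLUX-dev / Flex-style first run.
# Values mirror the starter settings above, not tuned recommendations.
train:
  batch_size: 1
  gradient_checkpointing: true
datasets:
  - resolution: [512, 768]   # add 1024 only after a stable run
```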

Do not start here

  • GC OFF
  • Batch Size ≥ 8
  • aggressive multi-bucket high-res setup on run 1
  • heavy previews every short interval

Z-Image

Good first run

  • Batch Size: conservative first
  • Gradient Checkpointing: ON
  • Resolutions: 768 + 1024 is a safer first target than jumping straight to the biggest bucket
  • keep previews reasonable
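In YAML terms (same key names as the Show Advanced section; values are illustrative), a Z-Image first run might look like:

```yaml
# Z-Image starter sketch: mid-size buckets before the biggest one
train:
  batch_size: 1              # conservative first
  gradient_checkpointing: true
datasets:
  - resolution: [768, 1024]  # hold off on the largest bucket
```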

Do not start here

  • GC OFF with bigger batch
  • starting directly at the largest bucket
  • mixing a high batch with high resolution before stability is proven

Qwen Image Edit

Good first run

  • Batch Size: 1
  • Gradient Checkpointing: ON
  • start with a smaller or simpler bucket mix
  • keep preview cost controlled
  • use the model’s intended memory-saving path if the guide recommends it
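A hedged sketch of the equivalent YAML, assuming the memory-saving path is the model.low_vram switch shown in the Show Advanced section (check the model guide before flipping it):

```yaml
# Qwen Image Edit starter sketch; illustrative values
train:
  batch_size: 1
  gradient_checkpointing: true
datasets:
  - resolution: [512, 768]   # smaller / simpler bucket mix first
# model:
#   low_vram: true           # only if the model guide recommends it
```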

Do not start here

  • GC OFF
  • bigger batch on the first run
  • expensive 1024 previews plus heavy conditioning plus frequent sample generation
  • random text-encoder experiments before the basic pipeline is stable

3) Safe starter settings for video models

Wan 2.2 14B

Good first run

  • Batch Size: 1
  • Datasets > Num Frames: 21 or 41
  • Datasets > Resolutions: start with 512
  • add 768 only after a stable run
  • keep preview videos conservative
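Sketched in the advanced YAML (key names as in the Show Advanced section; values are the conservative defaults above):

```yaml
# Wan 2.2 14B video starter sketch
train:
  batch_size: 1
  gradient_checkpointing: true
datasets:
  - resolution: [512]   # add 768 only after a stable run
    num_frames: 21      # or 41; avoid jumping to 81 on run 1
```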

Do not start here

  • 81 frames + Batch Size 2
  • long video previews during training
  • large buckets plus long clips before stability is proven

LTX-2

Good first run

  • Batch Size: 1
  • Num Frames: 49 or 81
  • Resolution: 512
  • keep preview cost under control
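The same idea for LTX-2, as an illustrative YAML fragment:

```yaml
# LTX-2 video starter sketch
train:
  batch_size: 1
  gradient_checkpointing: true
datasets:
  - resolution: [512]
    num_frames: 49      # or 81; avoid 121 on run 1
```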

Do not start here

  • 121 frames + Batch Size 4
  • bigger buckets before a proven stable run
  • assuming image-model batch habits carry over to video

4) Safer preview settings than most users start with

A lot of “training OOM” is actually preview OOM.

So for your first run, use cheaper sampling than you think you need.

In the Sample panel

Prefer:

  • lower Width / Height
  • lower Sample Steps
  • less frequent Sample Every
  • Disable Sampling ON if your only goal is to prove training stability

Once the run is stable, you can make previews richer again.
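In YAML terms, a cheap first-run preview might look like this (keys as in the Show Advanced section; the exact numbers are illustrative, and Sample Every is set in the UI panel):

```yaml
# Cheap preview sketch for a stability-proving run
train:
  disable_sampling: false   # flip to true to skip previews entirely
sample:
  width: 512
  height: 512
  sample_steps: 15
```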


5) What to verify in Show Advanced

The standard UI covers many important knobs, but your safest preflight check is still the advanced YAML.

Look for these first:

train:
  batch_size: 1
  gradient_checkpointing: true
  disable_sampling: false

model:
  low_vram: false

sample:
  width: 1024
  height: 1024
  sample_steps: 25
  guidance_scale: 4
  num_frames: 1

datasets:
  - resolution: [512, 768, 1024]
    num_frames: 1

For a safer first run, the things you most commonly reduce are:

  • batch_size
  • resolution
  • num_frames
  • sample.width
  • sample.height
  • sample.sample_steps

And the thing you most commonly make sure is still enabled is:

  • gradient_checkpointing: true
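Put together, a first-run version of the YAML above with those reductions applied might look like this (values are illustrative, not tuned recommendations):

```yaml
train:
  batch_size: 1
  gradient_checkpointing: true
  disable_sampling: false
sample:
  width: 512
  height: 512
  sample_steps: 15
datasets:
  - resolution: [512, 768]
    num_frames: 1            # images; raise cautiously for video
```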

6) “Do not start here” combinations

These are exactly the kinds of first-run choices that create avoidable OOMs:

Risky combo, and why it is risky:

  • Gradient Checkpointing = OFF on large image models: an easy way to lose VRAM headroom immediately
  • FLUX-like image model + Batch Size 8+: high-risk first run, especially with richer buckets
  • Wan 2.2 + 81 frames + Batch Size 2: classic video memory spike territory
  • LTX-2 + 121 frames + Batch Size 4: extremely heavy first-run combination
  • Expensive 1024 previews every short interval: preview OOM even if training almost fits
  • Multiple risky changes at once: you won’t know what actually caused the failure

7) A very practical first-run recipe

If you only want one rule:

For image models

  1. Batch Size = 1
  2. gradient_checkpointing: true
  3. keep only the smaller / medium buckets first
  4. cheap preview or no preview
  5. prove the job runs

For video models

  1. Batch Size = 1
  2. conservative Num Frames
  3. 512 first
  4. cheap preview
  5. prove the job runs

That is the fastest path to a real successful run.


8) When to scale up

Only scale up after one stable run.

Good order:

  1. keep the same memory settings
  2. increase Steps
  3. improve preview quality
  4. add a larger bucket
  5. add more frames (video)
  6. only then test a larger batch

One variable at a time.


9) If your job still OOMs anyway

Go directly to the runtime fix guide:

That page is for jobs that have already failed.

This page is for avoiding the failure in the first place.


One-line summary

The best first-run AI Toolkit preset is the one that is slightly conservative, clearly stable, and easy to scale up later.

Start safe.

Get one successful run.

Then optimize.

