How to Avoid OOM in AI Toolkit: Safe Starter Settings for First Successful Runs
This page is not the “max speed” setup.
It is the first successful run setup.
If your goal is to stop wasting retries, reduce OOMs, and get to a usable training run faster, start here.
The rule is simple:
Optimize for proof of stability first. Optimize for speed later.
What this guide is for
Use this page if:
- you are about to create a new AI Toolkit job
- you want safer starting settings
- you would rather get a stable first run than spend hours debugging OOM
- you want a practical “don’t-start-with-dangerous-settings” checklist
If you are already seeing a CUDA out of memory error, go to:
60-Second OOM Preflight Checklist
Before you click Create Job:
- ✅ Keep Batch Size conservative
- ✅ In Datasets, start with conservative Resolutions
- ✅ In Sample, keep preview cheaper than your final ambition
- ✅ Click Show Advanced and make sure gradient_checkpointing: true
- ✅ For video, start with conservative Num Frames
- ✅ Use model-specific low-memory features only if the model guide recommends them
- ✅ Do not try multiple risky changes in your first run
RunComfy also helps with this at the product level. When you save a training job, RunComfy checks whether your current settings may include high-risk combinations — for example, overly aggressive batch size, frames, resolution, or turning off memory-saving defaults too early. The goal is simple: help you catch risky configs before they burn GPU time, cost money, or send you into hours of avoidable trial and error.
That does not replace model-specific judgment, but it gives you a safer starting point and makes it easier to begin training efficiently.
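The kind of product-level check described above can be sketched in a few lines. This is an illustrative script, not RunComfy's or AI Toolkit's actual validator: the key names follow the advanced YAML shown later in this guide, but the function name and thresholds are assumptions based on the risky combinations this page lists.

```python
# Hypothetical preflight check mirroring this guide's "do not start here" combos.
# Key names follow the advanced YAML (train, datasets); thresholds are illustrative.

def preflight_warnings(config):
    """Return a list of warnings for high-risk first-run settings."""
    warnings = []
    train = config.get("train", {})

    # Memory-saving default turned off too early
    if not train.get("gradient_checkpointing", True):
        warnings.append("gradient_checkpointing is OFF: lost VRAM headroom immediately")

    # Overly aggressive batch size for a first run
    batch = train.get("batch_size", 1)
    if batch >= 8:
        warnings.append(f"batch_size={batch} is a high-risk first run")

    # Classic video memory spike: long clips combined with batch >= 2
    for ds in config.get("datasets", []):
        frames = ds.get("num_frames", 1)
        if frames >= 81 and batch >= 2:
            warnings.append(f"{frames} frames with batch_size={batch}: video memory spike territory")

    return warnings

safe = {"train": {"batch_size": 1, "gradient_checkpointing": True},
        "datasets": [{"resolution": [512, 768], "num_frames": 1}]}
risky = {"train": {"batch_size": 2, "gradient_checkpointing": False},
         "datasets": [{"resolution": [512], "num_frames": 81}]}

print(preflight_warnings(safe))   # []
print(preflight_warnings(risky))  # two warnings
```

Running a check like this before submitting a job catches the same failures this checklist targets, without spending any GPU time.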
1) The most important mindset shift
Most failed first runs are not caused by “bad learning rate.”
They are caused by:
- too much resolution
- too many frames
- too much batch
- too expensive preview sampling
- turning off memory-saving defaults too early
So your first successful run should look intentionally boring.
That is a good thing.
2) Safe starter settings for image models
FLUX-dev / Flex-like large image models
Good first run
- Batch Size: 1
- Gradient Checkpointing: ON
- Datasets > Resolutions: start with 512 + 768; add 1024 only after stability
- Sample: keep preview moderate, or temporarily disable sampling if you are just validating the run
Do not start here
- GC OFF
- Batch Size ≥ 8
- aggressive multi-bucket high-res setup on run 1
- heavy previews every short interval
Z-Image
Good first run
- Batch Size: conservative first
- Gradient Checkpointing: ON
- Resolutions: 768 + 1024 is a safer first target than jumping straight to the biggest bucket
- keep previews reasonable
Do not start here
- GC OFF with bigger batch
- starting directly at the largest bucket
- mixing a high batch with high resolution before stability is proven
Qwen Image Edit
Good first run
- Batch Size: 1
- Gradient Checkpointing: ON
- start with a smaller or simpler bucket mix
- keep preview cost controlled
- use the model’s intended memory-saving path if the guide recommends it
Do not start here
- GC OFF
- bigger batch on the first run
- expensive 1024 previews plus heavy conditioning plus frequent sample generation
- random text-encoder experiments before the basic pipeline is stable
3) Safe starter settings for video models
Wan 2.2 14B
Good first run
- Batch Size: 1
- Datasets > Num Frames: 21 or 41
- Datasets > Resolutions: start with 512; add 768 only after a stable run
- keep preview videos conservative
Do not start here
- 81 frames + Batch Size 2
- long video previews during training
- large buckets plus long clips before stability is proven
LTX-2
Good first run
- Batch Size: 1
- Num Frames: 49 or 81
- Resolution: 512
- keep preview cost under control
Do not start here
- 121 frames + Batch Size 4
- bigger buckets before a proven stable run
- assuming image-model batch habits carry over to video
4) Safer preview settings than most users start with
A lot of “training OOM” is actually preview OOM.
So for your first run, use cheaper sampling than you think you need.
In the Sample panel
Prefer:
- lower Width / Height
- lower Sample Steps
- less frequent Sample Every
- Disable Sampling: ON if your only goal is to prove training stability
Once the run is stable, you can make previews richer again.
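In YAML terms, a conservative first-run Sample block might look like the fragment below. The field names follow the advanced YAML shown in the next section, except sample_every, which is an assumed name for the Sample Every interval; the values are a starting point, not a recommendation for every model.

```yaml
# Illustrative cheap-preview settings for a first run
sample:
  width: 512          # lower than your final target resolution
  height: 512
  sample_steps: 15    # fewer steps than a "pretty" preview
  sample_every: 500   # assumed field name for "Sample Every"; preview less often
  num_frames: 1       # keep video previews short (or disable sampling entirely)
```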
5) What to verify in Show Advanced
The standard UI covers many important knobs, but your safest preflight check is still the advanced YAML.
Look for these first:
```yaml
train:
  batch_size: 1
  gradient_checkpointing: true
  disable_sampling: false
model:
  low_vram: false
sample:
  width: 1024
  height: 1024
  sample_steps: 25
  guidance_scale: 4
  num_frames: 1
datasets:
  - resolution: [512, 768, 1024]
    num_frames: 1
```
For a safer first run, the settings you most commonly reduce are:
- batch_size
- resolution
- num_frames
- sample.width
- sample.height
- sample.sample_steps
And the thing you most commonly make sure is still enabled is:
gradient_checkpointing: true
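As an illustration of that reduction order, here is a hypothetical helper that takes a full config like the one above and clamps it to safer starter values. The key names match the YAML above; the function itself is a sketch, not part of AI Toolkit.

```python
# Illustrative helper: clamp a config to this guide's safe starter values.
# Key names follow the advanced YAML above; the function is hypothetical.
import copy

def safe_starter(config):
    """Reduce the common OOM knobs and keep gradient checkpointing on."""
    cfg = copy.deepcopy(config)  # don't mutate the original config
    cfg["train"]["batch_size"] = 1
    cfg["train"]["gradient_checkpointing"] = True
    # Cheaper previews: cap resolution and steps
    cfg["sample"]["width"] = min(cfg["sample"]["width"], 768)
    cfg["sample"]["height"] = min(cfg["sample"]["height"], 768)
    cfg["sample"]["sample_steps"] = min(cfg["sample"]["sample_steps"], 20)
    # Drop the largest buckets until stability is proven
    for ds in cfg["datasets"]:
        ds["resolution"] = [r for r in ds["resolution"] if r <= 768]
    return cfg

full = {
    "train": {"batch_size": 4, "gradient_checkpointing": False},
    "sample": {"width": 1024, "height": 1024, "sample_steps": 25},
    "datasets": [{"resolution": [512, 768, 1024], "num_frames": 1}],
}
starter = safe_starter(full)
print(starter["datasets"][0]["resolution"])  # [512, 768]
```

Once a run with the reduced config finishes cleanly, you restore the original values one at a time, in the scale-up order described in section 8.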
6) “Do not start here” combinations
These are exactly the kinds of first-run choices that create avoidable OOMs:
| Risky combo | Why it is risky |
|---|---|
| Gradient Checkpointing = OFF on large image models | easy way to lose VRAM headroom immediately |
| FLUX-like image model + Batch Size 8+ | high-risk first run, especially with richer buckets |
| Wan 2.2 + 81 frames + Batch Size 2 | classic video memory spike territory |
| LTX-2 + 121 frames + Batch Size 4 | extremely heavy first-run combination |
| expensive 1024 previews every short interval | preview OOM even if training almost fits |
| adding multiple risky changes at once | you won’t know what actually caused failure |
7) A very practical first-run recipe
If you only want one rule:
For image models
- Batch Size = 1
- gradient_checkpointing: true
- keep only the smaller / medium buckets first
- cheap preview or no preview
- prove the job runs
For video models
- Batch Size = 1
- conservative Num Frames
- 512 first
- cheap preview
- prove the job runs
That is the fastest path to a real successful run.
8) When to scale up
Only scale up after one stable run.
Good order:
- keep the same memory settings
- increase Steps
- improve preview quality
- add a larger bucket
- add more frames (video)
- only then test a larger batch
One variable at a time.
9) If your job still OOMs anyway
Go directly to the runtime fix guide.
That page is for jobs that have already failed.
This page is for avoiding the failure in the first place.
One-line summary
The best first-run AI Toolkit preset is the one that is slightly conservative, clearly stable, and easy to scale up later.
Start safe.
Get one successful run.
Then optimize.
Related guides
