
Fix CUDA Out of Memory Errors in AI Toolkit

Quick troubleshooting guide for AI Toolkit OOM failures: identify whether the crash happens during model load, training, or sampling, then adjust batch size, gradient checkpointing, resolutions, frames, or preview settings before retrying.


AI Toolkit CUDA Out of Memory? Fix OOM During Model Load, Early Training, and Sampling

If your AI Toolkit job fails with "CUDA out of memory", "OOM during training step 3 times in a row", or "0 bytes is free", do not keep re-running the same job unchanged.

In practice, most AI Toolkit OOM failures come from one of four places:

  1. model load (before real training starts)
  2. the first few training steps
  3. preview sampling / baseline sample generation
  4. video-specific spikes from too many frames, too-large buckets, or both

This guide is the fast recovery path: identify which OOM you have, change the right settings in RunComfy AI Toolkit, and get to a successful retry faster.


Quick Fix Checklist (start here)

  • ✅ In Training, reduce Batch Size
  • ✅ In Datasets, disable the largest Resolution bucket first
  • ✅ In Sample, reduce Width / Height / Sample Steps, or temporarily turn Disable Sampling ON
  • ✅ Click Show Advanced and make sure gradient_checkpointing: true
  • ✅ For video models, reduce Num Frames before you touch learning rate
  • ✅ If the error happens even with a very conservative config, treat it as a possible worker / GPU state issue, not just a config issue

1) Confirm this is the same issue

You are in the right place if your logs include messages like:

CUDA out of memory
torch.OutOfMemoryError
OOM during training step 3 times in a row
Tried to allocate ...
0 bytes is free
CUBLAS_STATUS_ALLOC_FAILED

Common situations:

  • the job fails before step 1
  • the job reaches step 2–10, then repeatedly OOMs
  • training seems fine, but the crash happens while generating samples
  • the same config sometimes works, sometimes fails

2) First: what kind of OOM is it?

A. OOM during model load or before training starts

This usually means one of these:

  • the model itself is too heavy for the current memory-saving setup
  • preview / baseline sample generation is already too expensive
  • the worker / GPU is in a bad state and not actually starting from clean memory

Typical signs:

  • failure before meaningful training steps begin
  • the error happens immediately after model loading or during the first sample
  • logs mention almost no free VRAM, or throw a CUBLAS allocation error

B. OOM in the first few training steps

This is the most common config-driven case.

Typical causes:

  • gradient_checkpointing is off
  • Batch Size is too high
  • the largest dataset bucket is too ambitious
  • for video, Num Frames is the real memory spike

C. OOM during sampling / preview generation

This is a very common trap.

Your training config may be almost okay, but your preview is too expensive:

  • Sample Width / Height too large
  • Sample Steps too high
  • Sample Every too frequent
  • preview video uses too many frames

D. OOM only sometimes

This is usually a borderline config, not a mystery.

Examples:

  • the run survives smaller buckets, then crashes when it hits the largest bucket
  • video runs fail only on the heaviest clips
  • the training core fits, but sample generation pushes it over the edge

3) Fastest fixes inside RunComfy AI Toolkit

Fix A — Turn gradient checkpointing back on

This is the first thing to check on image-model OOMs.

Where to change it

  1. Open the failed job
  2. Click Show Advanced
  3. Under train:, make sure this is set:
gradient_checkpointing: true

If you are not sure what to do, leave it on.
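In the advanced editor this lives under the train: section of the job config. A minimal sketch (your config will contain many more keys; names follow common AI Toolkit configs and may vary by version):

```yaml
train:
  gradient_checkpointing: true   # recompute activations in backward pass: slower, but much lower peak VRAM
```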


Fix B — Lower Batch Size, use Gradient Accumulation for stability

Where to change it

  • Open the job editor
  • In the Training panel:
    • set Batch Size lower
    • keep or raise Gradient Accumulation if you want a slightly larger effective batch without increasing peak VRAM

Safe retry rule

  • Image models: if you OOM, drop to Batch Size = 1 first
  • Video models: assume Batch Size = 1 is your default unless you have already proven the config is stable

Do not treat learning rate as your first memory lever. It changes how fast the model learns, not how much VRAM a step needs.
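In config form, the combination above looks something like this (a sketch; the accumulation key name may differ slightly between AI Toolkit versions):

```yaml
train:
  batch_size: 1                  # lowest peak VRAM per step
  gradient_accumulation_steps: 4 # effective batch of 4 without 4x activation memory
```

Gradient accumulation runs 4 small steps before each optimizer update, so you keep the statistical benefit of a larger batch while peak VRAM stays at the batch-size-1 level.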


Fix C — Drop the highest dataset bucket first

Where to change it

  • Go to the Datasets panel
  • Under Resolutions, disable the highest bucket first

Safe rollback order

  • 1024 / 1536 → remove first
  • keep 512 / 768 while you verify stability
  • once the job is stable, add larger buckets back one at a time

This is one of the fastest ways to turn a borderline run into a repeatable run.
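In the underlying dataset config, dropping the top bucket looks roughly like this (the folder path is a placeholder; resolution lists are per-dataset in AI Toolkit configs):

```yaml
datasets:
  - folder_path: /path/to/dataset  # placeholder path
    resolution: [512, 768]         # 1024 bucket removed while you verify stability
```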


Fix D — Make preview sampling cheap, or disable it temporarily

If the crash happens before training really gets going, or every time sampling runs, fix the preview first.

Where to change it

  • Open the Sample panel

Then do one or more of these:

  • reduce Width
  • reduce Height
  • reduce Sample Steps
  • increase Sample Every
  • toggle Disable Sampling ON for a validation run

Good first retry

If your goal is “prove the job can train,” a temporary no-sampling run is fine.

Once the job is stable, turn previews back on with smaller settings.
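A cheap-preview setup corresponds to something like this in the sample: section of the config (illustrative values; pick sizes that match your model's native range):

```yaml
sample:
  sample_every: 500  # preview less often
  width: 512         # smaller preview images
  height: 512
  sample_steps: 10   # fewer denoising steps per preview
```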


Fix E — Video models: reduce Num Frames before anything else

For video models, frames are usually the biggest memory lever.

Where to change it

  • Datasets panel → Num Frames
  • Sample panel → Num Frames

If you are training video and seeing OOM, reduce frames first, then batch size, then resolution.

Do not start by changing optimizer or LR.
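Frames appear in both places, and both matter: the dataset value drives training memory, the sample value drives preview memory. A hedged sketch (key names follow common AI Toolkit video configs and may vary by model):

```yaml
datasets:
  - folder_path: /path/to/clips  # placeholder path
    num_frames: 33               # fewer frames is the biggest VRAM reduction for video
sample:
  num_frames: 33                 # keep previews just as short as training clips
```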


Fix F — Use the model’s low-memory path

Some architectures are meant to be trained with memory-saving settings when VRAM is tight.

Where to change it

  • Click Show Advanced
  • Under model:, look for:
low_vram: true

For some models, the correct low-memory path also includes model-specific quantization or text-encoder handling. Follow the relevant model guide instead of guessing.
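As a config sketch (whether these keys exist, and what they do, depends on the specific model; check its guide before relying on them):

```yaml
model:
  low_vram: true   # offload / stream weights to fit tighter VRAM
  quantize: true   # some models also support quantized training
```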


4) Quick diagnosis by failure timing

When it crashes, the most likely cause, and the first change to make:

  • Before step 1 / during initial sample: preview too heavy, model-load pressure, or a dirty worker. First change: disable sampling or shrink the preview.
  • Step 1–10: gradient checkpointing off, batch too high, or bucket too large. First change: turn gradient checkpointing on, drop Batch Size to 1, remove the largest bucket.
  • Only when sampling: preview settings too expensive. First change: lower Width / Height / Sample Steps, or disable sampling.
  • Sometimes yes, sometimes no: borderline config. First change: drop the largest bucket / frames and stabilize.
  • Even a conservative config fails instantly: possible GPU / worker state issue. First change: recreate on a fresh worker or contact support.

5) How to tell config OOM from environment / GPU-state problems

Treat it as more than just a config problem when all of these are true:

  • Batch Size = 1
  • gradient_checkpointing: true
  • conservative resolution / frames
  • the job still fails before training meaningfully starts
  • logs show things like 0 bytes is free or CUBLAS allocation failure

In that situation:

  1. stop repeating the exact same retry
  2. create a fresh job attempt on a fresh worker if possible
  3. if the same conservative config used to work and now fails immediately, escalate to support

This matters because repeated retries can waste both time and GPU budget.


6) The safest rollback order

For image models

  1. gradient_checkpointing: true
  2. Batch Size → 1
  3. drop the largest Resolution bucket
  4. shrink or disable Sample
  5. turn on low_vram or the model’s low-memory path

For video models

  1. reduce Num Frames
  2. Batch Size → 1
  3. drop the largest Resolution bucket
  4. shrink or disable Sample
  5. turn on low_vram or model-specific offload / quantization
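Put together, a conservative "prove it can train" retry looks roughly like this. This is a sketch combining the rollback steps above, not a complete job config; key names follow common AI Toolkit configs and may vary by model and version:

```yaml
train:
  batch_size: 1
  gradient_checkpointing: true
datasets:
  - folder_path: /path/to/dataset  # placeholder path
    resolution: [512, 768]         # largest bucket removed
    num_frames: 33                 # video models only
sample:
  sample_every: 500
  width: 512
  height: 512
  sample_steps: 10
model:
  low_vram: true
```

Once this runs cleanly, reintroduce heavier settings one at a time, as described in the next section.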

7) After your first successful retry

Once you have a stable run:

  • add back only one heavier setting at a time
  • keep notes on what changed
  • do not reintroduce multiple high-risk settings together

Good order for scaling up:

  1. keep the same stable memory settings
  2. increase Steps
  3. re-enable a larger bucket
  4. re-enable richer sampling
  5. only then test larger batch or longer video

One-line summary

If AI Toolkit throws OOM, stop random trial-and-error.

Turn gradient_checkpointing on, lower Batch Size, drop the largest Resolution bucket, and make preview sampling cheaper before you retry.

