Why Turning Off Gradient Checkpointing Causes OOM in AI Toolkit
If you are not completely sure why Gradient Checkpointing is off, the safest default is simple:
Turn it back on.
For large AI Toolkit image models, turning gradient_checkpointing off is one of the fastest ways to convert a stable run into an OOM run.
This guide explains:
- what Gradient Checkpointing is doing
- why turning it off can suddenly blow up VRAM
- which model families are the most sensitive
- when it is actually reasonable to test GC OFF
- what to change instead if your real goal is speed
Quick Answer
Gradient checkpointing is mainly a memory vs speed trade.
- ON = lower peak VRAM, usually safer
- OFF = higher peak VRAM, sometimes faster, much easier to OOM
For most users, especially on large image models, it is not a quality knob. It is a stability knob.
Quick Fix Checklist
- ✅ Click Show Advanced
- ✅ Under `train:`, make sure `gradient_checkpointing: true`
- ✅ Retry with the same Batch Size and Resolutions first
- ✅ If you still OOM, lower Batch Size and drop the highest Resolution bucket
- ✅ If your real problem is preview sampling, reduce Sample Width / Height / Sample Steps instead of turning GC off
1) What Gradient Checkpointing is actually doing
You do not need the deep PyTorch theory to use it correctly.
A practical way to think about it:
- with GC ON, AI Toolkit keeps less intermediate activation data in memory and recomputes some of it when needed
- with GC OFF, more of that data stays resident in VRAM, so training can move faster if memory is available
That sounds fine until you stack it with:
- high resolution
- multiple resolution buckets
- larger batch size
- expensive sampling
- heavy architectures like Qwen Edit, Z-Image, FLUX / Flex-class models
Then the extra memory headroom disappears very quickly.
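The store-vs-recompute trade can be made concrete with a toy sketch. This is not AI Toolkit's implementation, just an illustration: with GC OFF every intermediate activation stays resident; with GC ON only every k-th activation is kept, and the rest are recomputed during the backward pass.

```python
# Toy illustration of the store-vs-recompute trade behind gradient
# checkpointing. Not AI Toolkit code -- just the idea, in miniature.

def run(num_layers, checkpoint_every=None):
    """Return (peak_stored_activations, total_layer_computes).

    checkpoint_every=None models GC OFF: every activation is kept.
    checkpoint_every=k models GC ON: only every k-th activation is
    kept, and skipped segments are recomputed in the backward pass.
    """
    stored = 1              # the input
    computes = num_layers   # one forward pass through every layer
    if checkpoint_every is None:
        stored += num_layers                       # keep everything
    else:
        stored += num_layers // checkpoint_every   # keep checkpoints only
        computes += num_layers                     # recompute in backward (upper bound)
    return stored, computes

off = run(32)                      # GC OFF: high memory, fewer computes
on = run(32, checkpoint_every=8)   # GC ON: low memory, extra recompute
```

Same model, same batch: GC ON stores a fraction of the activations at the cost of roughly one extra forward pass.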
2) Why turning it off causes “sudden” OOM
Users often assume:
“I only changed one toggle.”
But that one toggle changes how much activation memory is retained during training.
So a config that was merely “heavy but okay” with GC ON becomes “borderline or impossible” with GC OFF.
This is why the failure can feel abrupt:
- same dataset
- same model
- same preview prompts
- same batch size
- only GC changed
- now it OOMs at step 2, step 3, or during the first preview
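A rough back-of-envelope shows why the headroom vanishes so fast. Activation memory grows roughly linearly with batch size and roughly quadratically with image side length (more latent tokens per image); exact peaks depend on the architecture, so treat this only as a relative-growth sketch.

```python
# Back-of-envelope activation scaling. Real peaks are architecture-
# dependent; only the relative growth matters here.

def relative_activation_cost(batch_size, resolution, base_res=512):
    # ~linear in batch, ~quadratic in image side length
    return batch_size * (resolution / base_res) ** 2

a = relative_activation_cost(1, 512)    # baseline
b = relative_activation_cost(4, 1024)   # 4x batch, 2x side length
```

Moving from batch 1 at 512 to batch 4 at 1024 is roughly a 16x jump in activation cost — which is exactly the memory that GC ON was quietly absorbing.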
3) High-risk model families when GC is OFF
Based on repeated real-world AI Toolkit OOM patterns, these are the combinations to treat as dangerous by default:
| Model family | High-risk when GC is OFF | Safer first retry |
|---|---|---|
| Z-Image | 1024–1536 buckets with bigger batches | GC ON, start conservative, then scale up |
| Qwen-Edit | 1024 workflows with larger batch / multiple heavy conditions | GC ON, Batch Size 1, reduce preview burden |
| FLUX-dev / Flex-like large image models | bigger batches or high-res multi-bucket training | GC ON, Batch Size 1–4 depending on headroom |
A useful mental model:
- Z-Image gets dangerous quickly when you combine high resolution + larger batch + GC OFF
- Qwen Edit gets dangerous quickly when you combine 1024 + heavy conditioning + GC OFF
- FLUX / Flex-like large image models get dangerous quickly when you combine larger batch + GC OFF
4) Where to change it in RunComfy AI Toolkit
In the current RunComfy UI, users can directly see things like:
- Batch Size
- Gradient Accumulation
- Steps
- Unload TE
- Cache Text Embeddings
But gradient_checkpointing is often easiest to verify through the advanced config.
Step by step
- Open your job
- Click Show Advanced
- Find the `train:` block
- Set:
train:
gradient_checkpointing: true
- Save the job
- Retry before changing anything else
If the same config now runs, you have confirmed GC was the problem.
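Assembled, the relevant part of the advanced config might look like the sketch below. Only `gradient_checkpointing` is the key this guide is about; the other key names are typical examples and may differ between AI Toolkit versions, so check your own config.

```yaml
# Sketch of the relevant advanced-config section. Key names other
# than gradient_checkpointing are examples and may vary by version.
train:
  gradient_checkpointing: true   # the toggle this guide is about
  batch_size: 1                  # keep conservative while verifying
  steps: 2000
```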
5) What to do instead if your real goal is speed
A lot of users turn GC off because they want faster training.
That is understandable — but it is usually the wrong first speed experiment.
Try these before GC OFF:
A. Make preview sampling cheaper
In the Sample panel:
- reduce Width / Height
- reduce Sample Steps
- increase Sample Every
- temporarily toggle Disable Sampling ON to verify training stability
This often gives you more practical speed than chasing a risky GC-OFF config.
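As a sketch, cheaper preview settings in the config might look like this. The key names follow common AI Toolkit sample blocks but may differ in your version; the point is the direction of each change.

```yaml
# Hypothetical sample settings for cheaper previews; verify the
# exact key names against your AI Toolkit version.
sample:
  sample_every: 500   # preview less often
  width: 512          # smaller preview images
  height: 512
  sample_steps: 10    # fewer denoising steps per preview
```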
B. Keep batch small, scale with accumulation if needed
In the Training panel:
- keep Batch Size conservative
- increase Gradient Accumulation only if you want a slightly larger effective batch without a large peak VRAM jump
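The reason accumulation is safer than a bigger batch is worth spelling out: the effective batch is batch size times accumulation steps, but only one micro-batch of activations is alive at a time, so peak activation memory tracks the per-step batch, not the effective batch. A minimal sketch:

```python
# Sketch of why gradient accumulation raises effective batch size
# without the peak-memory jump of a genuinely larger batch.

def training_step_plan(batch_size, grad_accum):
    effective_batch = batch_size * grad_accum
    # Activations for only `batch_size` samples are alive at once;
    # gradients are summed across the accumulation micro-steps.
    peak_activation_batches = batch_size
    return effective_batch, peak_activation_batches

eff, peak = training_step_plan(batch_size=1, grad_accum=4)
```

Accumulation is not free — it takes more wall-clock time per optimizer step — but it trades time instead of VRAM.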
C. Drop the highest bucket first
In Datasets:
- keep 512 / 768
- reintroduce 1024 / 1536 only after a stable run
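In config terms, a conservative dataset entry might look like the fragment below. The path is a placeholder and the key names follow common AI Toolkit dataset blocks; confirm them against your version.

```yaml
# Hypothetical dataset entry: start with the lower buckets only,
# then add the top bucket back once the run is stable.
datasets:
  - folder_path: /path/to/images   # placeholder path
    resolution: [512, 768]         # reintroduce 1024 / 1536 later
```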
D. Use the model’s intended low-memory path
For some architectures, the correct route is not “GC OFF,” but:
- `low_vram: true`
- model-specific quantization
- model-specific text-encoder optimization
Use the model guide, not a generic guess.
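For illustration, such options usually live in the model block of the config. Whether these keys exist, and what they do, is architecture-specific — treat this as a shape to look for, not a recipe.

```yaml
# Hypothetical model block: low-memory options live here on some
# architectures; check the specific model guide for exact keys.
model:
  low_vram: true
  quantize: true   # model-specific quantization, if supported
```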
6) When is it actually okay to test GC OFF?
Treat GC OFF as an advanced experiment, not a default.
Reasonable conditions:
- you already have a stable run
- Batch Size is still conservative
- you are not already on the edge with the largest bucket
- previews are cheap or disabled
- you have enough VRAM headroom
- you are changing one variable, not many
A good test flow:
- stabilize the job with GC ON
- keep the rest of the config identical
- turn GC OFF
- watch for:
- early-step OOM
- sampling OOM
- intermittent bucket spikes
If any of those happen, turn it back on and stop there.
7) What GC OFF is not
Gradient Checkpointing is not:
- a magic quality booster
- a likeness improvement toggle
- the right first fix for bad samples
- a good first experiment when you are already OOM
If your samples are weak, look at:
- dataset quality
- captions
- rank
- steps
- preview parity
Do not assume GC OFF is the answer.
8) FAQ
Does turning GC ON hurt output quality?
Not in any meaningful way. Checkpointing recomputes the same activations rather than changing the math, so the practical trade is memory vs speed, not quality.
I turned GC ON and I still OOM. Now what?
Then your next levers are:
- Batch Size
- largest Resolution bucket
- preview sampling cost
- for video, Num Frames
Should I ever start a first run with GC OFF?
For most users: no.
Prove stability first, then experiment.
One-line summary
If you are getting AI Toolkit OOMs and gradient_checkpointing is off, fix that first.
GC ON is the safe default. GC OFF is an advanced speed experiment.
Related guides
Ready to start training?
