Fix: “DataLoader worker is killed by signal: Bus error” (shared memory /dev/shm) — LTX-2 LoRA in AI Toolkit

This troubleshooting guide explains why LTX-2 LoRA training hits /dev/shm limits and how to fix it: lower DataLoader num_workers/prefetch, increase Docker shm-size, and reduce per-sample memory if needed.

If your LTX-2 LoRA training crashes with a Bus error, it’s almost always shared memory (/dev/shm) pressure: PyTorch DataLoader workers plus prefetching hold several large video batches (many frames each) in shared memory at once.


Quick Fix Checklist (start here)

  • ✅ Apply Fix A: lower num_workers and prefetch_factor in your dataset config
  • ✅ If self-hosting in Docker: apply Fix B and increase /dev/shm with --shm-size
  • ✅ Re-run training and confirm the log no longer shows “out of shared memory” / “Bus error”
  • ✅ If you’re training 121 frames, be ready to also address GPU OOM (see notes below)

1) Confirm this is the same issue

You’re in the right place if your logs include something like:

DataLoader worker (pid XXX) is killed by signal: Bus error
It is possible that dataloader's workers are out of shared memory
Please try to raise your shared memory limit

Common keywords:

  • Bus error
  • out of shared memory
  • mentions of DataLoader worker, shared memory limit, or /dev/shm

2) What’s happening

  • LTX-2 training uses video batches (many frames per sample).
  • With multiple DataLoader workers + prefetching, AI Toolkit can queue multiple large batches in /dev/shm.
  • When /dev/shm is too small, a worker process crashes → Bus error → training stops.

RunComfy already provides increased shared memory, but some datasets/settings (especially high num_frames like 121) can still exceed it.
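A rough back-of-envelope estimate makes the pressure concrete. The numbers below are assumptions (raw float32 frames at 512×768, no latent caching or compression), not AI Toolkit’s actual internal layout:

# Rough /dev/shm pressure estimate for a video DataLoader (illustrative only).
num_workers = 2          # worker processes
prefetch_factor = 2      # batches prefetched per worker
batch_size = 1
num_frames = 121
height, width = 512, 768
channels, bytes_per_value = 3, 4   # RGB, float32

bytes_per_sample = num_frames * height * width * channels * bytes_per_value
in_flight = num_workers * prefetch_factor * batch_size
print(f"~{bytes_per_sample / 2**30:.2f} GiB per sample")          # ~0.53 GiB
print(f"~{in_flight * bytes_per_sample / 2**30:.2f} GiB queued")  # ~2.13 GiB
# For comparison, Docker's default /dev/shm is only 64 MB.

Anything that multiplies these numbers (more workers, a larger prefetch, longer clips, higher resolution) multiplies the /dev/shm footprint with it.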


3) Fixes

Tip: Change one variable at a time. Start with Fix A — it solves this for most runs.

Fix A (recommended): reduce DataLoader workers + prefetch

This reduces shared-memory pressure without changing your training data.

Where to change it (RunComfy UI):

  1. Open Your Training Job Panel
  2. Click Show Advanced (top-right)
  3. In the YAML config, find the dataset item under datasets: (the block that contains your folder_path)
  4. Add/adjust these keys inside that dataset item

Try this first (usually enough):

num_workers: 1
prefetch_factor: 1

If it still crashes (most stable, but slower):

num_workers: 0
prefetch_factor: null

⚠️ Important:

  • If you set num_workers: 0, set prefetch_factor: null (exactly null).
  • Lowering workers/prefetch affects throughput, not quality; it only changes how data is loaded (see the sketch below).
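For context, these keys behave like the PyTorch DataLoader arguments of the same names. The sketch below is a minimal illustration at the PyTorch level (DummyVideoDataset is hypothetical, and the assumption that AI Toolkit passes these keys straight to torch.utils.data.DataLoader is mine); the constraint on prefetch_factor itself comes from PyTorch:

import torch
from torch.utils.data import DataLoader, Dataset

# Stand-in dataset: each item is one raw "video" tensor (frames, C, H, W).
# AI Toolkit's real dataset class is different; this only mirrors the shape.
class DummyVideoDataset(Dataset):
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return torch.zeros(121, 3, 512, 768)

ds = DummyVideoDataset()

# Fix A, first attempt: one worker, one prefetched batch per worker.
loader = DataLoader(ds, batch_size=1, num_workers=1, prefetch_factor=1)

# Most stable fallback: load in the main process, no worker prefetch.
# Recent PyTorch rejects a numeric prefetch_factor when num_workers=0,
# which is why the YAML needs prefetch_factor: null in that case.
loader = DataLoader(ds, batch_size=1, num_workers=0)

With num_workers: 0 everything is loaded in the training process, so nothing is staged in /dev/shm at all; that is why it is the most stable (and slowest) setting.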

Fix B (self-hosted / Docker): increase /dev/shm

If you run AI Toolkit in Docker yourself, increase the container shared memory:

docker run --shm-size=32g ...
# or safer:
docker run --shm-size=64g ...

You can still apply Fix A as well for stability.


Fix C (if you still hit limits): reduce per-sample memory

If the dataset is extremely heavy, also reduce one or more of the following (a rough comparison follows the list):

  • num_frames
  • resolution
  • batch_size
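Each of these knobs scales the per-sample footprint roughly linearly. A quick comparison using the same assumed raw-float32 formula as in section 2 (the exact savings in AI Toolkit will differ):

# Per-sample footprint scales linearly with frames, pixels, and batch size.
def sample_gib(num_frames, height, width, channels=3, bytes_per_value=4):
    return num_frames * height * width * channels * bytes_per_value / 2**30

print(f"{sample_gib(121, 512, 768):.2f} GiB")  # ~0.53 GiB, current settings
print(f"{sample_gib(61, 512, 768):.2f} GiB")   # ~0.27 GiB, about half the frames
print(f"{sample_gib(121, 384, 576):.2f} GiB")  # ~0.30 GiB, smaller resolution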

4) Verify the fix

A successful fix looks like:

  • Training proceeds past the DataLoader stage and continues stepping.
  • The crash log no longer repeats “out of shared memory” / “Bus error”.

If you’re self-hosting, also ensure /dev/shm is actually large inside the container.
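One quick check from inside the container (equivalently, run df -h /dev/shm in a shell); this small sketch uses only the Python standard library:

# Report how much shared memory this container actually has.
import shutil

total, _used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")
# Docker's default is only 64 MB unless the container was started with --shm-size.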


Notes (common follow-up issue: GPU OOM)

  • 121 frames is heavy. After fixing /dev/shm, you may still hit GPU OOM.
    • Recommended: H200 for 121-frame training.
    • Otherwise reduce batch_size / resolution / num_frames.

Copy-paste example (dataset block)

datasets:
  - folder_path: /app/ai-toolkit/datasets/your_dataset
    num_frames: 121
    resolution: [512, 768]
    caption_ext: "txt"

    # Fix shared memory Bus error (start here):
    num_workers: 1
    prefetch_factor: 1

    # If still crashing, use this instead (slowest but most stable):
    # num_workers: 0
    # prefetch_factor: null

FAQ

Does this mean my dataset is broken?

Usually no. This is a DataLoader shared-memory limit, not a bad dataset.

Why is prefetch_factor: null required when num_workers: 0?

With num_workers: 0, data is loaded in the main training process and worker prefetching does not apply; recent PyTorch versions reject a numeric prefetch_factor when there are no workers, so the config must use null.

Should I only do Fix B (increase /dev/shm)?

If you’re on RunComfy, start with Fix A. If you self-host in Docker, Fix B is often necessary, and Fix A still helps.

I fixed the Bus error, but now I get GPU OOM. What next?

Lower batch_size, resolution, or num_frames, or use a larger GPU (H200 recommended for 121 frames).
