LTX-2 LoRA Training with Ostris AI Toolkit

This guide shows you how to train LTX-2 LoRAs with the Ostris AI Toolkit. It covers what makes LTX-2 different (audio-video native DiT, large checkpoint), how to prepare image or video datasets (including 8n+1 frame counts), safe starter settings for rank/steps/LR, how to monitor training with FlowMatch samples, and common fixes for VRAM and overfitting issues.


LTX-2 is an open-weights Diffusion Transformer (DiT) foundation model designed to generate synchronized video and audio within a single model. Unlike "silent" video models, it’s built as a joint audio-video system so motion and sound can line up in time. In the official release, the primary checkpoint family is 19B-class (with a trainable "dev" variant, multiple quantized variants, and an accelerated distilled variant).

This guide focuses on training LTX-2 LoRAs using Ostris AI Toolkit. The goal is to make you productive fast: what LTX-2 is good at, what data to prepare, which AI Toolkit knobs matter, and which settings are usually "safe" for a first run.

If you don’t want to install AI Toolkit locally, you can run it in the browser on RunComfy’s cloud GPUs (H100 / H200).

▶ Start here: RunComfy cloud AI Toolkit


1. Why LTX-2 behaves differently from other video LoRA targets

A few LTX-2 specifics directly shape how you should train:

  • Audio-video is native: LTX-2 is built to generate synchronized audio and visuals in one model (not a bolt-on). That’s great for "finished shots" (dialogue, ambience, foley), but it also means that audio-aware finetuning depends on whether your trainer actually updates the audio pathway and cross-modal components (many third-party training stacks start by finetuning video-only).
  • It’s big (19B-class checkpoints): You’ll feel this in VRAM, step time, and the fact that "tiny ranks" often underfit. The official checkpoint list includes:
    • ltx-2-19b-dev (trainable in bf16),
    • dev quantized variants (fp8 / nvfp4),
    • and ltx-2-19b-distilled (accelerated inference, 8 steps, CFG=1).
  • Hard shape constraints: Width/height must be divisible by 32, and the frame count must be of the form 8n + 1 (1, 9, 17, 25, …, 121, …). If your input doesn’t match, you typically need to pad (commonly with -1) and then crop back to the target size/frame count; some pipelines handle this automatically, others will error during preprocessing. The short sketch below shows how to compute valid sizes and frame counts.
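
To make the constraint concrete, here is a minimal Python sketch (not part of AI Toolkit; the function names are illustrative) that rounds a resolution up to the nearest multiple of 32 and a frame count up to the nearest 8n + 1:

  def pad_to_multiple_of_32(x: int) -> int:
      # Width and height must be divisible by 32.
      return ((x + 31) // 32) * 32

  def pad_to_8n_plus_1(frames: int) -> int:
      # Smallest valid frame count >= frames of the form 8n + 1.
      return ((frames - 1 + 7) // 8) * 8 + 1

  print(pad_to_multiple_of_32(720))  # 736
  print(pad_to_8n_plus_1(120))       # 121

If you would rather drop frames than pad, trimming down to ((frames - 1) // 8) * 8 + 1 gives the largest valid count that is not longer than the clip.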

2. What LTX-2 LoRAs are best for

In practice, LTX-2 LoRAs tend to be most valuable in these directions:

  • Character / identity LoRAs: consistent face, costume, props, "brand character" look, and stable identity across camera moves.
  • Style LoRAs: art direction (lighting language, rendering style, lensing, film stock vibe), while leaving subjects flexible.
  • Motion / choreography LoRAs: a specific movement pattern (walk cycle style, dance flavor, creature locomotion), or "how the world moves" (handheld shake, animation timing).
  • Camera behavior LoRAs: dolly-in/out, crane/jib feel, orbital camera language, stabilized vs handheld.
  • (Advanced) Audio LoRAs: consistent ambience palette, foley style, or voice-like characteristics—only if your training stack supports finetuning the audio branch.

If you only have images (not video), you can still train identity/style effectively, but you should not expect it to learn temporal motion patterns from single frames.


3. Dataset prep for LTX-2 LoRA Training

3.1 Pick the right clip length + resolution "budget"

LTX-2 training cost scales with both spatial size and frame count. For a first LoRA, keep it simple:

  • Identity / style starter:
    • Resolution: 512–768-ish (depending on your GPU)
    • Frames: 49 or 81 (shorter clips train faster; still enough for temporal consistency)
  • Motion / camera starter:
    • Resolution: 512 (or 768 if you have headroom)
    • Frames: 121 (good for motion learning; ~5 seconds at 24 fps)

Remember the constraint: frames must be 8n+1.

3.2 Video vs image datasets (both are valid)

Many people assume LTX-2 requires video-only datasets. In reality, most practical training stacks can work with either:

  • Image-only datasets (treat each sample as a "1-frame clip"), or
  • Video datasets (short coherent clips).

If you’re using AI Toolkit, it’s usually simplest to keep each dataset entry homogeneous (all images or all videos) and use separate dataset entries if you need to mix modalities.

  • For images: frames = 1 satisfies 8n+1.
  • For videos: use short, coherent clips; avoid long multi-scene segments.

This is a big deal for character work: you can bootstrap identity from images, then refine motion later with short clips.
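
If you want to catch a mixed folder before AI Toolkit does, a minimal sketch like this flags dataset folders that mix images and videos (the extension lists and flat folder layout are assumptions; adjust for your setup):

  from pathlib import Path

  IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
  VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv"}

  def dataset_kind(folder: str) -> str:
      # Returns "image", "video", or "empty"; raises if the folder mixes both.
      files = [p for p in Path(folder).iterdir()
               if p.suffix.lower() in IMAGE_EXTS | VIDEO_EXTS]
      kinds = {"image" if p.suffix.lower() in IMAGE_EXTS else "video" for p in files}
      if len(kinds) > 1:
          raise ValueError(f"{folder} mixes images and videos; split it into separate dataset entries")
      return kinds.pop() if kinds else "empty"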

3.3 How much data do you need (realistic scale)?

There isn’t a single "official minimum," but these ranges are realistic starting points:

  • Image-based LoRAs (identity / props / style): start around ~20–50 clean, varied images. If you want stronger robustness across lighting, lenses, and compositions, ~50–150 curated images usually helps more than repeating near-duplicates.
  • Video-based LoRAs (motion / camera / temporal consistency): aim for ~20–60 short, coherent clips (single-action shots) rather than a couple of long videos. For broader or more motion-heavy goals, scaling toward ~50–150 short clips (or roughly 10–30 minutes of "good" footage) tends to produce noticeably more stable results.

3.4 Caption quality matters more than you think

LTX-2 responds well to longer, more descriptive captions, especially if you want controllable results. If your clips include speech or key sound cues, include them in captions (or transcript excerpts) when your training stack supports it.

Practical caption tips:

  • For identity LoRAs, include consistent identity tokens (and vary everything else: lighting, wardrobe, background, lens).
  • For style LoRAs, keep style descriptors consistent and vary subjects/actions.
  • For motion LoRAs, describe the action precisely (tempo, body mechanics, camera move).
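
Most AI Toolkit-style datasets pair each image or clip with a same-named .txt caption file (confirm the convention in your build). A small audit like this, a sketch with an arbitrary length threshold, catches missing or one-line captions before they cost you a run:

  from pathlib import Path

  MEDIA_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".mp4", ".mov"}

  def audit_captions(folder: str, min_words: int = 15) -> None:
      # Flags media files with no caption sidecar or suspiciously short captions.
      for p in sorted(Path(folder).iterdir()):
          if p.suffix.lower() not in MEDIA_EXTS:
              continue
          cap = p.with_suffix(".txt")
          if not cap.exists():
              print(f"missing caption: {p.name}")
          else:
              words = len(cap.read_text(encoding="utf-8").split())
              if words < min_words:
                  print(f"short caption ({words} words): {p.name}")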

3.5 Regularization is your "anti-bleed" tool (use it when the LoRA is narrow)

If you’re training a tight concept (one character, one product), it’s easy to overfit and get "everything looks like my dataset". In AI Toolkit, Differential Output Preservation (DOP) is designed to reduce that kind of drift, and it pairs naturally with a "regularization" dataset.

A simple reg set:

  • Generic clips/images in similar framing to your main dataset
  • Captions that match the general domain (but not your unique identity token)

4. How Ostris AI Toolkit thinks about training

AI Toolkit is essentially a consistent training engine wrapped in a UI: you pick a model family, attach datasets, define a LoRA target + rank, and tune optimization + sampling. The UI panels map cleanly to the underlying training config: Job, Model, Quantization, Target, Training, Regularization, Datasets, Sample.

What this means for you: you don’t need model-specific scripts for the basics; the same mental model (rank/steps/LR/caching/regularization) applies. But LTX-2’s size and video nature make a few settings more "sensitive" (rank, VRAM optimizations, frames).
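
The exact config schema varies across AI Toolkit versions, so treat the following as an illustrative sketch of how the UI panels map onto one config object, not the tool's real field names:

  # Illustrative only: keys are placeholders that mirror the UI panels,
  # not AI Toolkit's actual config schema.
  job_config = {
      "job": {"name": "ltx2_identity_lora"},
      "model": {"arch": "ltx2", "name_or_path": "Lightricks/LTX-2"},
      "quantization": {"enabled": True},
      "target": {"type": "lora", "linear_rank": 32},
      "training": {"steps": 2500, "batch_size": 1, "lr": 1e-4, "optimizer": "adamw8bit"},
      "datasets": [{"path": "datasets/my_character", "num_frames": 49, "caption_dropout": 0.05}],
      "sample": {"every": 250, "prompts": ["..."]},
  }

Whatever the real key names are in your build, the point is that every panel you touch in the UI is editing one of these blocks.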

If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview so the UI panels and core parameters make sense before you tune LTX-2 specifics:

AI Toolkit LoRA training overview

If you’re deciding where to run:

  • Local AI Toolkit: best if you already have a compatible GPU and want full control over your environment.
  • RunComfy cloud AI Toolkit: best if you want to skip setup, train on high VRAM GPUs, and iterate faster with fewer "it doesn’t run on my machine" issues—especially helpful for LTX-2’s larger checkpoints and video workloads. ▶ Open RunComfy cloud AI Toolkit

5. Step-by-step: LTX-2 LoRA Training in AI Toolkit

5.1 Create your dataset in AI Toolkit

In the Datasets panel / Dataset section of the job:

  • Target Dataset: your uploaded dataset
  • Default Caption: leave blank unless you need a global suffix
  • Caption Dropout Rate: start around 0.05 (helps generalization)
  • Cache Latents: ON if you can spare disk space (big speed win on repeats, but video latent caches get big fast)
  • Num Frames:
    • 1 for image-only datasets
    • 49 / 81 / 121 for video, depending on your goal
  • Resolutions: start with 512 + 768 enabled; avoid 1024+ until you’ve proven your setup (see the pre-flight scan sketch below)

If you’re doing a narrow identity LoRA, add a second dataset entry and mark it Is Regularization (and keep its weight lower or equal, depending on how aggressive you want preservation to be).
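
Before you cache latents for a video dataset, it can save a failed run to scan the clips for shape-constraint violations. A minimal sketch using OpenCV (assumes opencv-python is installed; many pipelines bucket/resize automatically, so treat this as a pre-flight check rather than a hard requirement):

  from pathlib import Path
  import cv2

  def scan_clips(folder: str) -> None:
      # Flags clips whose frame count is not 8n + 1 or whose size is not divisible by 32.
      for p in sorted(Path(folder).glob("*.mp4")):
          cap = cv2.VideoCapture(str(p))
          frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
          w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
          h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
          cap.release()
          problems = []
          if frames % 8 != 1:
              problems.append(f"frames={frames} is not 8n+1")
          if w % 32 or h % 32:
              problems.append(f"{w}x{h} is not divisible by 32")
          if problems:
              print(f"{p.name}: " + "; ".join(problems))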

5.2 New Training Job → Model

In the Model section:

  • Model Architecture: LTX-2 (if available in your build)
  • Name or Path: the Hugging Face model id for the base model (e.g. Lightricks/LTX-2)
  • Checkpoint selection: pick the dev checkpoint for training:
    • ltx-2-19b-dev is the full model and is trainable in bf16.
    • The distilled checkpoint is primarily for fast inference (8 steps, CFG=1) and is not the default starting point for LoRA training unless you specifically want to adapt the distilled behavior.

5.3 Quantization + VRAM options

LTX-2 is large, so you’ll often use quantization/offload:

  • If you’re on H100/H200-class VRAM, you can often run bf16 more comfortably.
  • If you’re on 24–48 GB GPUs, quantization and "Low VRAM" modes become essential.

Two practical notes:

  • LTX-2 itself ships with official quantized variants (fp8 / nvfp4) of the full model; whether you can train from those weights depends on your trainer implementation.
  • Separately, 8-bit optimizers (e.g. AdamW8bit) are commonly used to make training practical on consumer hardware; the sketch just below shows what that optimizer looks like outside the UI.
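
The AdamW8bit option in most trainers is backed by the bitsandbytes library (an assumption worth confirming for your AI Toolkit build). As a standalone sketch, with a toy module standing in for the LoRA parameters of a much larger model:

  import torch
  import bitsandbytes as bnb

  # Toy stand-in for the trainable LoRA parameters.
  lora_module = torch.nn.Linear(4096, 32)

  # 8-bit optimizer states cut Adam's optimizer memory roughly 4x vs fp32 states.
  optimizer = bnb.optim.AdamW8bit(lora_module.parameters(), lr=1e-4, weight_decay=1e-4)

The weights themselves stay in their training precision; only the optimizer's moment buffers are quantized, which is why this saves memory without changing the model format.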

5.4 Target = LoRA + Rank

This is where LTX-2 differs from smaller models.

  • Target Type: LoRA
  • Linear Rank: start at 32
    • Many LTX-2 LoRA trainers report that rank 32 is a practical minimum for solid results.
    • If you have VRAM headroom and want more capacity (complex style, multi-concept), test 64 (the parameter-count sketch after this list shows roughly what each rank buys you).
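
To see why rank matters on a 19B-class model, recall what a LoRA actually adds: for each targeted linear layer with a d_out × d_in weight, it learns A (r × d_in) and B (d_out × r) so the update is ΔW = B·A, i.e. r·(d_in + d_out) extra parameters per layer. A back-of-the-envelope sketch with made-up layer shapes (not LTX-2's real dimensions):

  def lora_params_per_layer(d_in: int, d_out: int, rank: int) -> int:
      # A: rank x d_in, B: d_out x rank
      return rank * (d_in + d_out)

  # Hypothetical 4096-wide projections across 300 targeted layers:
  for r in (8, 32, 64):
      total = 300 * lora_params_per_layer(4096, 4096, r)
      print(f"rank {r}: ~{total / 1e6:.1f}M trainable params")

A rank-8 adapter is a tiny sliver of capacity relative to a model this size, which is one intuition for why very low ranks often "train but nothing changes" here.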

5.5 Training hyperparameters (a solid first run)

Start with values that won’t explode:

  • Batch Size: 1 (video almost always ends here)
  • Gradient Accumulation: 2–4 if you want a steadier effective batch (and can afford time); a quick pass-count sanity check follows this list
  • Steps:
    • 2,000–3,000 for a first pass
    • go longer if you have a larger dataset or subtle style
  • Optimizer: AdamW8bit (common choice for VRAM efficiency)
  • Learning Rate: 0.0001 to start, 0.00005 if you see overfitting or identity "burn-in" too fast
  • Weight Decay: ~0.0001
  • Timestep Type / Bias: keep defaults unless you know why you’re changing them
  • DOP / Blank Prompt Preservation: enable DOP if you see style bleed or loss of base versatility.
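
A quick sanity check on the batch size, accumulation, and step counts above, assuming "Steps" counts optimizer updates (check how your build counts them) and using a hypothetical 40-clip dataset:

  batch_size = 1
  grad_accum = 2
  steps = 2500
  dataset_size = 40  # hypothetical clip count

  effective_batch = batch_size * grad_accum
  samples_seen = steps * effective_batch
  passes = samples_seen / dataset_size
  print(f"effective batch {effective_batch}, ~{samples_seen} samples seen, ~{passes:.0f} passes over the data")

If the pass count looks very high for a small dataset, that is usually a cue to add data, lower the learning rate, or shorten the run rather than push steps higher.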

5.6 Sampling during training (don’t skip this)

Sampling is your early warning system. Use it.

  • Sample Every: 250 steps (good cadence)
  • Sampler / Scheduler: start with whatever your LTX-2 preset defaults to, and only experiment after you have a baseline.
  • Guidance + steps depend on which checkpoint you’re sampling:
    • For dev runs, a common starting point is guidance ~4 with 25–30 sampling steps.
    • For distilled, the published behavior is 8 steps, CFG=1, so sample with guidance = 1 and steps = 8 (or you’ll get "why does this look worse?" confusion).
  • Width/Height/Frames: match your training bucket (or a representative target)

Write sample prompts that match your real use:

  • Include your trigger word (for identity LoRAs).
  • Include camera/motion descriptors if those matter.
  • Keep one "boring" prompt that reveals overfitting (plain lighting, simple action).

6. LTX-2 LoRA Training time expectations

There isn’t one universal number; treat runtime as a practical estimate that can swing with frames/resolution, offload/quantization choices, and how often you sample.

A realistic mental model:

  • Frames are often the biggest lever: 121 → 81 → 49 can be the difference between "this trains" and "this crawls / OOMs."
  • Sampling overhead can rival training time if you sample big videos frequently.

As a rough reference point: on an H100, with a small video dataset (~20 clips, 3–5s each), batch=1, rank=32, and gradient checkpointing enabled, it’s common to see single-digit seconds per training step at a 768-ish resolution bucket with a mid-length frame bucket (e.g., 49–81 frames). Your exact step time will vary heavily with I/O, caching, and whether you’re doing any audio-aware preprocessing.

Also budget for sampling: a "3 prompts × 25 steps × 121 frames @ 1024×768" preview can easily take minutes each time it runs. If you sample every 250 steps, that overhead can add up quickly across a 2,000-step run.
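
A back-of-the-envelope way to budget that overhead before launching (the per-preview time is a placeholder you would measure from a single preview on your own hardware):

  steps = 2000
  sample_every = 250
  prompts = 3
  minutes_per_preview = 2.0  # placeholder: time one preview on your setup

  sampling_rounds = steps // sample_every
  overhead_minutes = sampling_rounds * prompts * minutes_per_preview
  print(f"{sampling_rounds} sampling rounds, ~{overhead_minutes:.0f} minutes spent on previews")

At those placeholder numbers that is roughly 48 minutes of previews on top of the training itself, which is why trimming the prompt list or preview size is often the cheapest speedup.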


7. Common gotchas in LTX-2 LoRA Training (and how to fix them)

  • Wrong frame counts: if your dataset uses 120 frames instead of 121, you’ll hit errors or silent mismatch. Stick to 8n+1 frame counts (1, 9, 17, 25, …, 49, 81, 121, …).
  • Wrong sizes: width/height must be divisible by 32. If you’re using a pipeline that doesn’t auto-pad, resize/bucket accordingly.
  • Rank too low: symptoms are "it trains but nothing changes," or weak identity/style strength even at LoRA scale 1.0. Try rank 32.
  • Overfitting / LoRA bleed: your subject appears in unrelated prompts. Enable DOP and add a reg dataset.
  • Captions too short: prompt adherence collapses. Expand captions (what, where, camera, motion, mood; plus audio cues/transcript if relevant).
  • Distilled sampling confusion: if you’re sampling the distilled checkpoint with 25+ steps or CFG>1, you’re not testing it the way it’s intended. Use 8 steps, CFG=1 for distilled previews.
  • VRAM OOM: reduce frames first (121 → 81 → 49), then reduce resolution (768 → 512), then turn on offload/quantization/caching.

8. LTX-2 LoRA Training: Quick FAQ

Can I train an LTX-2 LoRA from images only?

Yes, use an image-only dataset and set frame count to 1. Great for identity and style. Not great for learning motion.

Dev vs distilled checkpoint for LoRA training?

Start with ltx-2-19b-dev for training; it’s explicitly described as flexible/trainable in bf16. Distilled checkpoints are primarily about fast inference (8 steps, CFG=1).

What rank should I use?

Start at 32. That’s where many early LTX-2 trainers are landing for "it actually learns."

Why do my samples look jittery or inconsistent?

Usually a mix of: too-long clips for your VRAM (forcing aggressive offload), captions not describing motion/camera, or sampling settings that don’t match the checkpoint (especially sampling distilled like it’s dev). Reduce frames, tighten captions, and align guidance/steps to the checkpoint you’re sampling.


9. Learn more: other AI Toolkit LoRA training guides

If you want to compare workflows, datasets, and parameter tradeoffs across model families, these guides are good reference points:

Ready to start training?