FLUX.2 [dev] LoRA Training Guide with Ostris AI Toolkit

This article shows you how to fine-tune FLUX.2 [dev] with LoRA using the Ostris AI Toolkit, step by step. You'll learn what makes FLUX.2 unique, how its dual transformer and text encoder affect LoRA rank and VRAM usage, and how to design datasets and training configs that work on everything from 24GB cards to H100/H200 GPUs.


Training a LoRA on FLUX.2 [dev] is very different from training older SD‑style models. FLUX.2 [dev] combines a huge rectified‑flow transformer, a 24B Mistral text encoder and a high‑quality autoencoder, and it handles text‑to‑image and image editing in one checkpoint. This guide walks through:

  • What makes FLUX.2 [dev] special
  • How those design choices affect LoRA training
  • How to configure AI Toolkit for different hardware tiers
  • How to set up datasets, triggers and parameters so you actually get the style / character / edit behavior you want

This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this FLUX.2 [dev] guide.

1. Understanding FLUX.2 [dev] for LoRA training

Before touching sliders, it helps to understand what you're fine‑tuning.

1.1 High‑level architecture

From the official FLUX.2‑dev model card and Black Forest Labs’ announcement:

  • Base model

    FLUX.2 [dev] is a 32B‑parameter rectified flow transformer (a DiT‑style latent flow model) trained from scratch, not a continuation of FLUX.1. It combines text‑to‑image generation and image editing (single‑image and multi‑reference) in a single checkpoint.

  • Text encoder

    FLUX.2 [dev] uses Mistral Small 3.1 / 3.2 – 24B as a vision‑language text encoder. That is another 24B parameters on top of the 32B DiT. Under normal precision this alone eats a huge chunk of VRAM and plays a central role in how the model behaves.

  • Autoencoder (VAE)

    The model uses a new AutoencoderKLFlux2 with 32 latent channels (FLUX.1 used 16). It is designed for high‑resolution editing and fine texture preservation, which is why FLUX.2 can do sharp 1024×1024 edits.

  • Unified generation + editing

    The same architecture handles pure text‑to‑image, single‑image editing and multi‑reference editing (up to around 10 reference images). There is no separate "edit‑only" branch; it is all one network.

  • Guidance‑distilled

    FLUX.2 [dev] is a guidance‑distilled model: there is no classic classifier‑free guidance with separate "conditional" and "unconditional" passes. The "guidance" is baked into a single forward pass. At inference you still see a guidance_scale parameter, but it just scales an internal guidance embedding instead of running extra passes.

What this means for LoRA:

  1. The core transformer is enormous.

    Because FLUX.2 [dev] uses a fused, high‑capacity transformer, LoRA rank has to be chosen carefully. Very low ranks (4–8) may barely move the needle. VRAM pressure is dominated by the transformer and the text encoder.

  2. The text encoder is heavy and central to behavior.

    The ~24B Mistral VLM is responsible for how prompts are understood, how instructions are followed, and how editing instructions are interpreted. It is usually frozen during LoRA training, but whether you cache text embeddings or keep the encoder in memory has huge VRAM and quality implications.

  3. The same weights handle T2I and editing.

    If you push a LoRA too hard, you risk changing both text‑to‑image and image editing behavior. Differential Output Preservation (DOP) and careful captioning are what keep the LoRA "tied" to a trigger phrase so that non‑trigger prompts stay close to the base model.

  4. Guidance is special because the model is guidance‑distilled.

    You usually train with guidance_scale = 1. During normal inference you will use guidance_scale around 2–4, but for training previews you keep it at 1 so the LoRA learns the base guidance behavior instead of accidentally changing how guidance itself works.


2. What kind of FLUX.2 LoRA are you actually training?

With FLUX.2 [dev], you should decide first what you want the adapter to do. The base model is already strong at multi‑reference editing and complex prompt following, so you only need a LoRA when you want something persistent that the base model cannot reliably do on its own.

Common goals:

  1. Style LoRA (T2I + editing)

    Teach FLUX.2 to use a specific painting style, color grade or render look when a trigger is present. You typically want it to work for new text prompts and also for "turn this photo into that style" edits.

  2. Character / identity LoRA

    Model a specific person, avatar, mascot or product line, where you care about consistent faces / features across many images. FLUX.2 already does multi‑reference character editing without LoRA, so you usually train a LoRA when you want to avoid uploading references repeatedly or you need very strict identity consistency.

  3. Object / prop / brand LoRA

    Capture specific products, logos, props or shapes with strict geometry or brand constraints, so that invoking the trigger yields the exact object regardless of the rest of the prompt.

  4. Instruction / editing LoRA

    Change behavior instead of style: for example, "turn any portrait into a watercolor sketch", "make a blueprint version", "comic line‑art mode", or structured edit instructions using before/after pairs. These are often image‑edit datasets with (source, target, instruction) triples.

Knowing which of these you are aiming for helps you decide dataset, rank, Differential Output Preservation and guidance settings.


3. FLUX.2 specific details that change LoRA settings

3.1 LoRA on a fused transformer (rank scaling)

FLUX.2 [dev] fuses attention and MLP projections into very wide matrices compared to classic SD1.5/SDXL. That means:

  • Very low ranks (4–8) often feel too weak because they can only carve out a tiny subspace in those huge layers.
  • For style or character LoRAs on FLUX.2 [dev], rank 32 is a good default when VRAM allows it.
  • If VRAM is tight or the style is simple, you can use rank 8–16 and fewer steps.
  • For extremely complex brands or identities with many views and strict structure, rank 32–64 can help, but overfitting becomes more likely, so Differential Output Preservation and careful step counts are important.

In short: FLUX.2 generally benefits from somewhat higher ranks than older models, but you pay for it in VRAM and overfitting risk.
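
To get an intuition for how rank scales on these wide fused layers, here is a quick back‑of‑envelope in Python. The layer width below is a hypothetical example, not FLUX.2’s real projection shape; the point is only that LoRA adds rank × (d_in + d_out) parameters per adapted matrix, so doubling the rank doubles the adapter’s capacity (and size).

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical fused projection width, purely to illustrate scaling.
d_in, d_out = 6144, 18432

for rank in (4, 8, 16, 32, 64):
    extra = lora_param_count(d_in, d_out, rank)
    frac = extra / (d_in * d_out)
    print(f"rank {rank:>2}: {extra / 1e6:5.2f}M extra params per layer "
          f"({frac:.3%} of the frozen matrix)")
```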


3.2 Guidance‑distilled model: training at guidance_scale = 1

Because FLUX.2 [dev] is guidance‑distilled, the usual Stable Diffusion intuition of "CFG 7–8" does not apply.

  • Training: set guidance_scale = 1.
  • Inference: a guidance_scale in the range 2–4 works well.
    • Lower values (≈2) produce looser, more creative outputs.
    • Higher values (≈3.5–4) are more literal and closely follow the prompt.

3.3 The text encoder is huge (and why caching encodings matters)

The Mistral 24B VLM used as FLUX.2’s text encoder is not a small side module: its 24B parameters are roughly 48 GB of weights in bf16, and still about 24 GB even in FP8, before you load the transformer at all.

Diffusers’ own examples and early FLUX.2 benchmarks show that naive FLUX.2 [dev] inference with both the DiT and text encoder in bf16 can require >80GB VRAM, even with some CPU offload. This is why AI Toolkit exposes several Text Encoder Optimizations.

In AI Toolkit, the usual patterns are:

  • If your training setup uses a fixed caption per image and you do not use Differential Output Preservation or any on‑the‑fly prompt rewriting: Turn on Cache Text Embeddings so the toolkit encodes all captions once, caches the embeddings, and then does not need to keep re‑encoding them every step. This reduces VRAM and compute pressure, because the huge text encoder is not hit every batch.
  • If you do use Differential Output Preservation (DOP) or anything else that modifies prompts each step (e.g. replacing [trigger] with a trigger plus a preservation class inside the loop, or heavy caption dropout): You cannot cache text embeddings. Once the prompt becomes dynamic, cached embeddings no longer match the real text, and the training signal is wrong. In that case you keep the text encoder resident in memory (usually in 8‑bit / FP8) and accept the extra VRAM cost.

The trade‑off is simple: caching text embeddings is a big win for static captions, but as soon as your training relies on changing prompts at runtime, you must leave caching off and use a larger GPU if you need more throughput.
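
A minimal sketch of why this trade‑off exists, with a stand‑in for the real text encoder (the encode_text function below is hypothetical, not AI Toolkit’s implementation):

```python
from functools import lru_cache

def encode_text(caption: str):
    """Stand-in for the expensive 24B text-encoder forward pass."""
    return hash(caption)  # placeholder for real embeddings

# Static captions: encode each unique caption once, reuse it every step.
cached_encode = lru_cache(maxsize=None)(encode_text)

caption = "midnight_tarot a woman standing in a market"
emb_first = cached_encode(caption)   # encoder runs once
emb_again = cached_encode(caption)   # cache hit, encoder is never touched again

# Dynamic prompts (DOP, caption dropout) rewrite the text at every step, so
# embeddings precomputed for the original captions no longer match what is
# actually being trained -- the encoder has to stay loaded and re-encode.
for dynamic_prompt in (caption.replace("midnight_tarot", "photo"), ""):
    _ = encode_text(dynamic_prompt)
```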


3.4 Autoencoder and resolution

FLUX.2 uses a dedicated AutoencoderKLFlux2 that is designed for 1024×1024+ work:

  • It uses 32 latent channels, which gives better detail and editing fidelity than older 16‑channel VAEs at the cost of more VRAM.
  • In practice, training FLUX.2 LoRAs at 768–1024 resolution captures most of the benefit. Going much higher pushes VRAM and training time very hard.

AI Toolkit’s resolution buckets let you list multiple resolutions (for example [768, 896, 1024]). Images are automatically resized and bucketed into the closest resolution. You can safely:

  • Use 768 as a default bucket on 24GB GPUs.
  • Add 896 or 1024 on 32–48GB or cloud GPUs (H100 / H200).
  • Train at a lower bucket (e.g. 768) and still use the LoRA at higher inference resolution later; you simply get less ultra‑fine detail than if you had trained at 1024.
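
Conceptually, bucketing just assigns each image to the enabled resolution whose long side is closest to its own, then resizes it to fit. A rough sketch follows (AI Toolkit’s real implementation also handles aspect‑ratio buckets and cropping, so treat this as an illustration only):

```python
from PIL import Image

BUCKETS = [768, 896, 1024]  # long-side targets enabled in the DATASETS panel

def pick_bucket(width: int, height: int, buckets=BUCKETS) -> int:
    """Choose the bucket whose target is closest to the image's long side."""
    long_side = max(width, height)
    return min(buckets, key=lambda b: abs(b - long_side))

def resize_to_bucket(img: Image.Image) -> Image.Image:
    """Scale the image so its long side matches the chosen bucket."""
    bucket = pick_bucket(*img.size)
    scale = bucket / max(img.size)
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)
```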

4. Hardware & VRAM requirements for FLUX.2 LoRA training

FLUX.2 [dev] is very memory‑hungry. Diffusers’ reference configs report that running the full DiT + text encoder in bf16 still takes around 62 GB VRAM on an H100, and even heavily quantized 4‑bit inference is still in the ~20 GB range. LoRA training is heavier than inference because you also need memory for gradients and optimizer states, so you must lean on quantization, offloading, small batch sizes and latent/text caching.
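
As a weights‑only back‑of‑envelope (using the parameter counts from section 1 and ignoring activations, gradients, optimizer state and any CPU offloading, which is why the reported figures above are lower), you can see why quantization is not optional for most setups:

```python
# Parameter counts from the FLUX.2 [dev] model card: 32B transformer + 24B text encoder.
PARAMS = {"transformer": 32e9, "text_encoder": 24e9}
BYTES_PER_PARAM = {"bf16": 2, "float8": 1, "4-bit": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    total_gb = sum(PARAMS.values()) * nbytes / 1e9
    print(f"{precision:>6}: ~{total_gb:.0f} GB just to hold the weights")
# bf16: ~112 GB, float8: ~56 GB, 4-bit: ~28 GB -- before gradients and optimizer state.
```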

4.1 Recommended settings by VRAM tier

Tier A — 16–24 GB GPU (e.g. 4070 Ti, 4080, 4090)

  • What’s realistic

    On this tier, FLUX.2 LoRA training is possible but tight. You’re mostly limited to small style or simple character LoRAs at around 896–1024 px on the long side, with Batch Size = 1 and aggressive memory‑saving settings. Expect slower steps and occasional tuning to avoid CUDA OOM.

  • Key UI settings

    In the MODEL panel, turn Low VRAM ON and Layer Offloading ON so some layers are streamed from CPU RAM instead of living on the GPU the whole time.

    In the QUANTIZATION panel, set Transformer to float8 (default) or a 4‑bit option if your build exposes it, and set Text Encoder to float8 (default).

    In the TRAINING panel, keep Batch Size = 1 and use Gradient Accumulation if you need a larger effective batch.

    In the DATASETS panel, prefer a main resolution of 896–1024 and avoid going above 1024². If you’re not using Differential Output Preservation, consider enabling Cache Text Embeddings later in the TRAINING panel to free the text encoder from VRAM once captions are encoded.

Tier B — 32–48 GB GPU (e.g. RTX 6000 Ada, A6000, some A100)

  • What’s realistic

    This is the first tier where FLUX.2 LoRA training feels comfortable. You can train production‑quality style and character LoRAs at 1024×1024 as your default resolution, with 20–60+ images and 1000–3000 steps. Differential Output Preservation is usable on well‑tuned configs.

  • Key UI settings

    In MODEL, keep Low VRAM ON by default; set Layer Offloading OFF unless you still hit OOM.

    In QUANTIZATION, set both Transformer and Text Encoder to float8 (default); this keeps quality high while comfortably fitting the model.

    In TRAINING, use Batch Size = 1–2, Steps ≈ 1000–3000, Learning Rate = 0.0001, and Linear Rank = 32 in the TARGET panel as a strong default for FLUX.2’s fused transformer.

    In DATASETS, enable 1024 as your main bucket (optionally 768 as a secondary bucket). If you enable Differential Output Preservation, plan to keep the text encoder loaded in float8 instead of relying on cached embeddings.

Tier C — 64–96+ GB GPU (e.g. H100, H200 on RunComfy)

  • What’s realistic

    Here you finally have room to breathe: 1024×1024 with Batch Size = 2–4, larger or multiple resolution buckets, and Differential Output Preservation ON by default are all straightforward. You can experiment with higher ranks (32–64), more steps, and slightly larger resolutions (for example 1152–1408 on the long side) without constantly fighting VRAM.

  • Key UI settings

    In MODEL, you can leave Low VRAM OFF and Layer Offloading OFF; everything lives on the GPU.

    In QUANTIZATION, it’s still efficient to keep Transformer and Text Encoder in float8 (default), but you can selectively disable quantization for experiments if you want to test full‑precision behaviour.

    In TRAINING, use Batch Size = 2–4, Linear Rank = 32–64 for rich styles or complex identities, and turn Differential Output Preservation ON in the Regularization panel for most real projects. You can keep the text encoder resident (no caching) and rely on DOP to preserve base behaviour outside your trigger.


4.2 Local AI Toolkit vs cloud AI Toolkit on RunComfy

You can run this FLUX.2 LoRA workflow in two ways:

  • Locally with AI Toolkit – install AI Toolkit from the AI Toolkit GitHub repository and use your own GPU. Your realistic tier is set by your card’s VRAM; use the tier above that matches your hardware and start from those settings.
  • Cloud AI Toolkit on RunComfy – open the cloud AI Toolkit on RunComfy and train on H100 (80 GB) or H200 (141 GB) GPUs without any local install. On RunComfy your workspace (datasets, configs, checkpoints) persists between sessions, so you can just dial up higher‑tier settings, iterate faster, and spend your time on results instead of infrastructure.

5. Designing datasets for FLUX.2 LoRA

5.1 How many images?

From available FLUX examples and similar LoRA trainings:

  • Simple style LoRA – about 15–30 curated images with a consistent style usually works well.
  • Character / identity LoRA – about 20–60 images with clear views, varied angles and lighting.
  • Editing / instruction LoRA – often pair datasets with 50–200 (source, target, instruction) triples.

Official FLUX LoRA examples on Hugging Face often use hundreds to around a thousand training examples at 1024 resolution, but for most personal style or character LoRAs you can achieve good results with a few dozen well‑chosen images.


5.2 Captioning strategy: what you do not write matters

Whatever you do not describe in the caption is "free" for the LoRA to attach to your trigger.

For a style LoRA, you usually want:

  • Captions that describe what is in the image (person, pose, scene, objects).
  • Captions that do not describe brushwork, colors, medium or composition style.

This way, the LoRA can attach the style to the trigger word instead of to generic tokens like "watercolor" or "oil painting".

For a character LoRA:

  • Use a short, unique trigger (e.g. midnight_tarot) and a class word (person, woman, man, character, etc.).
  • Captions can be things like [trigger] a woman standing in a market, [trigger] a close‑up portrait of a woman in a red jacket, and so on.

That keeps the class word ("woman", "person") available to the base model, while the LoRA learns to attach the identity to the trigger.


5.3 Differential Output Preservation (DOP)

Differential Output Preservation is a regularization strategy used in AI Toolkit that compares:

  • The base model output with no LoRA, and
  • The output with the LoRA active,

and penalizes the LoRA for changing things when a trigger is not present.

In practice:

  • You choose a trigger word (for example midnight_tarot) and a preservation class (for example photo).
  • Captions are written using a placeholder [trigger], such as: [trigger] a woman sitting on a park bench playing a board game with a young girl

At training time, AI Toolkit internally generates two versions of each caption:

  • midnight_tarot a woman sitting on a park bench... – this path trains the LoRA (trigger active).
  • photo a woman sitting on a park bench... – this path teaches the model what to do when the trigger is absent (base behavior preserved).

Whenever the trigger is absent in a prompt at inference time, DOP encourages FLUX.2 to behave close to its original state. It's important when:

  • Your dataset is small or skewed.
  • You are teaching a strong, stylized look that could otherwise leak into non‑trigger prompts.
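
The prompt handling behind this is easy to picture. The sketch below shows only the caption expansion described above, with illustrative names; the actual preservation loss inside AI Toolkit is not reproduced here:

```python
TRIGGER = "midnight_tarot"
PRESERVATION_CLASS = "photo"

def expand_caption(raw_caption: str) -> tuple[str, str]:
    """Expand a dataset caption containing [trigger] into the LoRA-training
    prompt and the preservation prompt used by DOP."""
    trigger_prompt = raw_caption.replace("[trigger]", TRIGGER)
    preserve_prompt = raw_caption.replace("[trigger]", PRESERVATION_CLASS)
    return trigger_prompt, preserve_prompt

raw = "[trigger] a woman sitting on a park bench playing a board game with a young girl"
trigger_prompt, preserve_prompt = expand_caption(raw)
# trigger_prompt  -> "midnight_tarot a woman sitting on a park bench ..."
#   this version trains the LoRA on your concept.
# preserve_prompt -> "photo a woman sitting on a park bench ..."
#   this version is compared against the base model's output and the LoRA is
#   penalized for drifting when the trigger is absent.
```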

6. Step‑by‑step: configuring FLUX.2 [dev] LoRA training in AI Toolkit

6.1 One‑time setup

Install AI Toolkit locally from the AI Toolkit GitHub repository, or open the cloud AI Toolkit on RunComfy (see section 4.2 above). For FLUX.2 [dev] you also need to accept the model’s license on Hugging Face and put your HF_TOKEN in a .env file so AI Toolkit can download the gated base model.

6.2 Prepare your dataset in the Toolkit

  • Gather images for your chosen LoRA type (style, character, object, instruction).
  • Place them in a folder inside AI Toolkit’s datasets directory, for example: /ai-toolkit/datasets/flux2_midnight_tarot/
  • Add .txt caption files with the same base name as each image when you want custom captions, for example image_0001.png + image_0001.txt.
  • Use [trigger] in captions where your trigger word should appear. AI Toolkit will replace [trigger] with the actual value from the JOB panel at load time.
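
If you want to stub out caption files quickly, a small helper like the one below creates a [trigger] caption for every image that does not already have one. The folder path is the example from above and the default caption is only a placeholder you should refine per image:

```python
from pathlib import Path

DATASET_DIR = Path("/ai-toolkit/datasets/flux2_midnight_tarot")
DEFAULT_CAPTION = "[trigger] a portrait of a person"  # placeholder -- refine per image
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

for image_path in sorted(DATASET_DIR.iterdir()):
    if image_path.suffix.lower() not in IMAGE_EXTS:
        continue
    caption_path = image_path.with_suffix(".txt")
    if not caption_path.exists():
        # AI Toolkit replaces [trigger] with the Trigger Word from the JOB panel.
        caption_path.write_text(DEFAULT_CAPTION + "\n", encoding="utf-8")
        print(f"wrote stub caption for {image_path.name}")
```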

6.3 Create a new training job

In the AI Toolkit UI, create a new job and configure each panel as follows.

6.3.1 JOB panel – name, GPU and trigger word

In the JOB panel:

  • Training Name

    Choose any descriptive name, for example flux2_midnight_tarot_v1. This will become the folder name for checkpoints and samples.

  • GPU ID

    On a local install this selects your physical GPU (typically 0 for a single‑GPU machine).

    On the cloud AI Toolkit on RunComfy, leave this as default; the GPU is chosen when you start the job in the training queue.

  • Trigger Word

    Set this to the actual token you want to type in prompts, for example midnight_tarot. AI Toolkit will replace [trigger] placeholders in your dataset captions with this string when it loads the dataset. Use a short, unique trigger so it does not collide with existing concepts.


6.3.2 MODEL & QUANTIZATION panels – base FLUX.2 model and precision

In the MODEL panel:

  • Model Architecture

    Choose the FLUX.2 architecture if it is listed, or the specific FLUX.2 dev model entry provided in your AI Toolkit build.

  • Name or Path

    It lets you override the default Hugging Face / model hub path for FLUX.2 [dev]. Leave it blank or at the default value and AI Toolkit will download the recommended base model from Hugging Face. Or point it to a local path if you’ve downloaded a local copy or want to use a custom FLUX.2 checkpoint.

    FLUX.2 [dev] is a gated Hugging Face model, so you must accept its license and set HF_TOKEN in a .env file before AI Toolkit can download it.

  • Low VRAM

    Turn Low VRAM ON on Tier A and often Tier B so FLUX.2 fits comfortably on 16–24 GB GPUs via internal memory optimizations.

    You can leave it OFF on Tier C (H100/H200) where you have plenty of VRAM and do not need these trade‑offs.

  • Layer Offloading

    Enable this on Tier A so layers are streamed in and out of GPU memory as needed.

    On Tier B and C you can usually leave it OFF.

In the QUANTIZATION panel:

  • Transformer

    Set Transformer to float8 (default) on Tier B and C.

    On very tight VRAM (Tier A, 16–24 GB), you can experiment with a 4‑bit option if you understand the quality trade‑offs.

  • Text Encoder

    Set Text Encoder to float8 (default) so the 24B Mistral text encoder runs in FP8.

    You will decide later, in the Training panel, whether to keep it loaded or rely on cached embeddings depending on Differential Output Preservation.


6.3.3 TARGET panel – LoRA network settings

In the TARGET panel:

  • Target Type

    Set Target Type to LoRA.

  • Linear Rank

    Use Linear Rank 32 as a strong default for FLUX.2, as the fused transformer benefits from somewhat higher ranks.


6.3.4 TRAINING & SAVE panels – core hyperparameters and text encoder handling

In the Training panel:

  • Batch Size

    Use 1 on 24–48GB GPUs.

    Use 2 on 64GB+ GPUs such as H100/H200.

    This is the number of images processed per optimizer step.

  • Gradient Accumulation

    Start with 1.

    If VRAM is tight but you want a larger effective batch, increase to 2–4.

    Effective batch size is Batch Size × Gradient Accumulation.

  • Steps

    As a baseline:

    • Style LoRA, 15–30 images: 800–2000 steps.
    • Character LoRA, 30–60 images: 1000–2500 steps.
    • Instruction / edit LoRA with 100+ examples: 1500–3000 steps.
  • Optimizer

    Use an 8‑bit optimizer such as AdamW8Bit unless you have a specific reason to use another optimizer.

  • Learning Rate

    Start with 0.0001.

    If you see overshooting or unstable samples, lower it to 0.00005 and resume from the last good checkpoint.

  • Weight Decay

    Keep 0.0001 unless you are debugging a specific overfitting issue.

  • Timestep Type

    It decides which noise levels are sampled more often during training (a conceptual sketch of the difference appears after this list). For FLUX.2,

    • weighted uses AI Toolkit’s FLUX‑tuned schedule and is the recommended default for most style and character LoRAs, giving balanced coverage while slightly favouring the useful mid‑range steps.
    • sigmoid concentrates even more strongly on the middle of the schedule and is mainly worth using when you deliberately want to push the LoRA’s capacity into mid‑range detail (for example very small, detail‑heavy character datasets).
  • Timestep Bias

    It is a second control that tilts training toward early noisy steps (coarse layout) or late clean steps (fine detail).

    • Balanced keeps both regions represented and is the safest choice for most FLUX.2 LoRAs.
    • Biasing toward high noise makes the LoRA more willing to change global composition or structure.
    • Biasing toward low noise makes it behave more like a detail filter that preserves the base model’s layout but strongly edits textures, faces and micro‑style.
  • Loss Type

    It controls how the gap between the model’s prediction and the training target is measured. FLUX.2 is trained with a squared‑error objective, so

    • keeping Mean Squared Error means your LoRA is optimising the same quantity as the base model and will behave predictably across noise levels.
    • Alternative losses (L1, Huber, etc.) effectively reweight which errors matter and are only recommended if you are explicitly experimenting with non‑standard objectives.
  • EMA (Exponential Moving Average)

    Leave Use EMA OFF for LoRAs. EMA is more useful when training full models and requires extra VRAM.

  • Text Encoder Optimizations

    This is where you combine Cache Text Embeddings and Unload TE correctly:

    • If you are not using Differential Output Preservation and your captions are static (no caption dropout, no on‑the‑fly prompt rewriting), use:
      • Cache Text Embeddings: ON – AI Toolkit encodes every caption once and reuses those embeddings throughout training. This reduces compute and VRAM pressure from the text encoder.
      • Unload TE: OFF – the dedicated "Unload TE" mode is for trigger‑word / blank‑prompt training where dataset captions are ignored. For normal caption‑based training with static captions you only need cache.
    • If you are using DOP or anything that changes prompts each batch:
      • Keep both Cache Text Embeddings: OFF and Unload TE: OFF so the text encoder stays loaded and can re‑encode each modified prompt correctly.
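
To make the Timestep Type options above more concrete, the sketch below contrasts uniform sampling with a sigmoid‑style schedule, where a normal sample squashed through a sigmoid lands mostly in the middle of the (0, 1) range, i.e. mid‑range noise levels. The function names are illustrative and this is not AI Toolkit’s actual sampler:

```python
import math
import random

def sample_timestep_uniform() -> float:
    """Every noise level in (0, 1) is equally likely."""
    return random.random()

def sample_timestep_sigmoid(scale: float = 1.0) -> float:
    """Squash a normal sample through a sigmoid: mass concentrates around t = 0.5
    (mid-range noise), while very noisy and nearly-clean steps are sampled less often."""
    u = random.gauss(0.0, 1.0) * scale
    return 1.0 / (1.0 + math.exp(-u))

trials = 100_000
mid = sum(0.25 < sample_timestep_sigmoid() < 0.75 for _ in range(trials)) / trials
print(f"~{mid:.0%} of sigmoid-sampled steps land in the middle half of the schedule")
# Uniform sampling would put only ~50% there.
```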

In the SAVE panel:

  • Data Type

    Set Data Type to BF16. This matches how FLUX.2 is usually run and keeps LoRA checkpoints compact.

  • Save Every and Max Step Saves to Keep

    Use defaults like Save Every = 250 steps and Max Step Saves = 4.

    This means you get a checkpoint every 250 steps and only the four most recent are retained, keeping disk usage under control while still giving you multiple options to choose from.

About guidance during training (nothing to set)

FLUX.2 is guidance‑distilled: the LoRA trainer effectively trains with an internal guidance of 1, and you do not need to set classifier‑free guidance yourself.


6.3.5 Regularization & Advanced – Differential Output Preservation and Differential Guidance

Regularization panel – Differential Output Preservation (DOP)

If you want to preserve the base model’s behavior when your trigger is not present, enable Differential Output Preservation.

In the Regularization panel:

  • Turn Differential Output Preservation ON.
  • Set Trigger (or Trigger Word) to the same trigger you used in the JOB panel, for example midnight_tarot.
  • Set Preservation Class to a neutral word like photo.

In your dataset captions, you should already be using [trigger] placeholders as described earlier. At training time, AI Toolkit expands them to:

  • midnight_tarot ... – this path trains the LoRA.
  • photo ... – this path tells the model what to do when the trigger is not present.

In the Training → Text Encoder Optimizations section, remember:

  • With DOP ON, keep Cache Text Embeddings OFF and Unload TE OFF because prompts are dynamically rewritten every batch and must be freshly encoded.

This setup teaches the LoRA only the difference between base FLUX.2 and "FLUX.2 plus trigger", so non‑trigger prompts stay very close to the original model.

Advanced panel – Differential Guidance

In the Advanced panel:

  • Do Differential Guidance – enables an experimental training target that exaggerates the gap between the model’s current prediction and the ground‑truth noise/image. After each forward pass, AI Toolkit measures the difference between prediction and target and then asks the LoRA to aim slightly past the true target along that direction. In practice this acts like a per‑sample, per‑pixel boost to the effective learning rate exactly where the model is wrong, so it usually makes fine details lock in faster without changing anything else in your config.
  • Differential Guidance Scale – controls how strong that "overshoot" is. Higher values push the model harder toward (and slightly beyond) the target; lower values keep the effect mild. Too high a scale, combined with a large learning rate or a very strong dataset, can make training look noisy or oversharpened because the model is being pushed too aggressively.
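
As a toy illustration of that “overshoot” idea on plain tensors, one plausible formulation is sketched below. The exact math inside AI Toolkit may differ, so treat the formula (and the scale convention) as an assumption, not the real training code:

```python
import torch

def differential_guidance_target(pred: torch.Tensor,
                                 target: torch.Tensor,
                                 scale: float = 3.0) -> torch.Tensor:
    """Push the training target slightly past the true target, along the direction
    from the current prediction toward that target (assumed formulation)."""
    error = target - pred.detach()          # where the model is currently wrong
    return target + (scale - 1.0) * error   # scale = 1 recovers the ordinary target

pred, target = torch.randn(4, 8), torch.randn(4, 8)
boosted = differential_guidance_target(pred, target, scale=3.0)
loss = torch.nn.functional.mse_loss(pred, boosted)  # gradients grow where the error is large
```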

Practical recommendations

  • For most FLUX.2 LoRAs (characters, styles, clean edits), it is safe to turn Do Differential Guidance ON and start with a Differential Guidance Scale of 3, which is what the author of AI Toolkit uses in their own examples.
  • If early samples look unstable, overly sharp, or "ringy", first lower the Differential Guidance Scale to 2 or 1, or slightly reduce the global Learning Rate, instead of disabling the feature outright.
  • You can leave Do Differential Guidance OFF if you want the most conservative, textbook training behaviour or are debugging other issues. It does not meaningfully reduce VRAM usage or step time; it mainly changes how far each training step moves the LoRA toward the target.

6.3.6 DATASETS panel – attaching datasets and caching latents

In the DATASETS panel, click Add Dataset if one is not already configured.

For a simple style or character LoRA:

  • Target Dataset

    Choose the dataset you created earlier, for example flux2_midnight_tarot.

  • Default Caption

    If you did not create per‑image .txt files, enter a default like: [trigger] a portrait of a person

    This default caption is used whenever an image has no explicit caption file.

  • Caption Dropout Rate

    A value around 0.05 is a good default when you are not caching text embeddings (that is, when Cache Text Embeddings is OFF). Caption dropout randomly removes text conditioning for some samples and encourages robustness, but it requires recomputing text embeddings every step and does not work with cached embeddings; a minimal sketch of the idea follows this list.

    If you turn Cache Text Embeddings ON, set Caption Dropout Rate to 0, because captions must remain static for caching to be correct.

  • Settings → Cache Latents

    Turn Cache Latents ON. The VAE will encode each training image once to a latent file on disk, and training will then operate purely in latent space, saving VRAM and compute.

  • Settings → Is Regularization

    Leave this OFF for your main dataset. Use it only when adding a separate regularization dataset.

  • Resolutions

    Enable resolution buckets appropriate for your VRAM tier:

    • On 16–24GB: start with 768 and optionally 896 as buckets.
    • On 32–48GB: use [768, 896, 1024].
    • On 64GB+: you can add a slightly higher bucket if your images justify it.
  • Augmentations (X/Y flip)

    Horizontal flip can be useful for some style LoRAs but is often questionable for faces (which are asymmetric). Vertical flip is rarely useful for photographic styles. Use flips only if you know why you need them.
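
Caption dropout, referenced in the Caption Dropout Rate item above, is simple to picture; a minimal illustrative helper (not AI Toolkit’s code) looks like this:

```python
import random

def maybe_drop_caption(caption: str, dropout_rate: float = 0.05) -> str:
    """With probability dropout_rate, drop the caption so this sample trains
    unconditionally; otherwise keep it. Because the text changes from step to
    step, this is incompatible with cached text embeddings."""
    return "" if random.random() < dropout_rate else caption
```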


6.4 Preview sampling configuration

Sampling during training does not affect the training process itself, but it is how you decide which checkpoint is best.

In the SAMPLE panel:

  • Sample Every

    Set Sample Every = 250 steps so each saved checkpoint has a corresponding set of preview images.

  • Sampler

    Use the sampler recommended by your FLUX.2 template (typically a rectified‑flow / flow‑match sampler).

  • Width / Height

    Choose a resolution that matches your training buckets, for example 768×768 or 768×1024.

    You do not need 1024×1024 previews if they are too slow; use 768 for quick inspection.

  • Guidance Scale

    Set guidance_scale = 1 for training previews, in line with the guidance‑distilled design.

  • Sample Steps

    Around 25 steps is usually enough for monitoring; you can use more steps later during final inference if desired.

  • Seed / Walk Seed

    Fix a seed (for example 42) so you can compare checkpoints consistently.

    You can enable "walk seed" if you want each preview to vary slightly while remaining comparable.

  • Prompts

    Add 2–4 representative prompts that match your training distribution, for example midnight_tarot a close‑up portrait of a woman in a red jacket and midnight_tarot a woman standing in a market, plus at least one prompt without the trigger so you can confirm that non‑trigger behavior stays close to the base model.

Every 250 steps, the sampler will create preview images for those prompts, letting you see if the LoRA is converging or overfitting.


7. Debugging FLUX.2 LoRA results and improving quality

7.1 "Nothing changes after 1000+ steps"

Checklist:

  1. Is the LoRA actually applied in sampling?

    Make sure the LoRA is attached to the correct FLUX.2 base model, the LoRA scale / weight is non‑zero in the preview config, and the sample uses the trigger word (for style / character LoRAs).

  2. Linear Rank too low for the fused transformer

    If you set Linear Rank to only 4–8, the effect on FLUX.2’s fused attention / MLP blocks can be very small. In practice, try Linear Rank = 16–32 with the same training settings (Steps and Learning Rate).

  3. Learning Rate too low

    If you set the Learning Rate significantly below 0.0001 while also using heavy quantization and offloading, updates can become so small that the LoRA barely changes anything. Start with 0.0001. If you see overshooting or noisy results, lower to 0.00005 instead of starting too low.

  4. Captions describing the style instead of the content

    If every caption says something like "watercolor, soft pastel strokes, loose brushwork, blue tones…", there is nothing left for the trigger to represent. Remove stylistic descriptors from captions and let the trigger be the style.


7.2 "My LoRA overwrote the base model"

Symptoms:

  • Even with no trigger, outputs already look like your LoRA style.
  • The model feels biased toward your subject or clothing from training.

Fixes:

  1. Turn on Differential Output Preservation

    Configure trigger and preservation class as described above. Expect a VRAM hit because the text encoder stays active and extra passes are done.

  2. Reduce training steps

    For many style LoRAs at rank 32, 800–1500 steps are enough. Stop early if you see non‑trigger images drifting strongly toward your style.

  3. Lower rank or Learning Rate

    Try Linear Rank = 16 and Learning Rate = 0.000075 while keeping DOP on. This gives a weaker, more controlled adapter.


7.3 "CUDA out of memory" or training stuck

Usual survival plan:

  1. Lower resolution

    Drop from 1024 → 896 or 768 on the long side.

  2. Enable / increase gradient checkpointing and accumulation

    Turn on gradient checkpointing and increase gradient accumulation so each individual step uses less VRAM.

  3. Aggressive quantization

    Use FP8 or even 4‑bit for the transformer and 8‑bit / FP8 for the text encoder.

  4. Use latent caching

    Enable Cache Latents so the VAE is unloaded and training runs purely in latent space.

  5. On very tight VRAM, avoid DOP

    Instead of using Differential Output Preservation on a small card, prefer:

    • A small, balanced dataset.
    • Fewer training steps with early stopping.
    • Possibly running a more demanding configuration on a cloud H100 / H200 later.
  6. Move the job to a larger GPU if needed

    If your local GPU still runs out of memory, migrate the same AI Toolkit job to RunComfy’s H100/H200 templates, where the same settings will typically no longer trigger OOM errors.


8. Using your FLUX.2 LoRA in inference

Once training is complete, you can use your FLUX.2 LoRA in two simple ways:

  • Model playground – open the FLUX.2 LoRA playground and paste the URL for your trained LoRA to quickly test its effect on top of the base model.
  • ComfyUI workflows – start a ComfyUI instance and either build your own workflow or load one like Flux 2 Dev, add your LoRA in the LoRA loader node, and fine‑tune the LoRA weight and other settings for more detailed control.
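
If you prefer scripting, a minimal diffusers‑style sketch is shown below. It assumes the gated Hugging Face repo id is black-forest-labs/FLUX.2-dev, that your diffusers version ships a FLUX.2 pipeline that DiffusionPipeline.from_pretrained can resolve and that supports load_lora_weights, and that the LoRA file path is illustrative; check the current diffusers documentation for the exact class names and arguments:

```python
import torch
from diffusers import DiffusionPipeline

# Gated model: requires an accepted license on Hugging Face and a valid HF token.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",   # assumed repo id
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps on GPUs that cannot hold everything at once

# Illustrative path to the LoRA checkpoint produced by AI Toolkit.
pipe.load_lora_weights("output/flux2_midnight_tarot_v1/flux2_midnight_tarot_v1.safetensors")

image = pipe(
    prompt="midnight_tarot a close-up portrait of a woman in a red jacket",
    guidance_scale=2.5,        # inference range ~2-4; training previews use 1
    num_inference_steps=28,
).images[0]
image.save("midnight_tarot_test.png")
```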
