AI Toolkit LoRA Training Guides

AI Toolkit LoRA previews/samples look great, but ComfyUI/Diffusers inference looks different?

A practical troubleshooting guide to eliminate preview/samples vs inference drift after training a LoRA with Ostris AI Toolkit.

Train Diffusion Models with Ostris AI Toolkit

If you trained a LoRA with Ostris AI Toolkit, saw great results in the Training Samples, and then got very different outputs in ComfyUI or Diffusers, you're probably wondering why everything suddenly looks different. You may have searched for things like:

  • “AI Toolkit samples look good but ComfyUI looks bad”
  • “AI Toolkit LoRA not working in ComfyUI”
  • “Same prompt/seed, but Diffusers doesn’t match AI Toolkit preview”
  • “Why does my LoRA look different after training?”

Here’s the blunt (and usually relieving) answer:

In ~99% of cases, your LoRA is fine. Your inference setup isn’t the same as your training preview.


AI Toolkit Samples are not “random Diffusers runs”. They’re produced by a specific combination of:

  • the exact base model variant
  • how the LoRA is injected (adapter vs merged/fused)
  • step/scheduler/guidance semantics
  • resolution handling (snapping/cropping rules)
  • seed/RNG behavior
  • (sometimes) extra conditioning inputs (edit/control/I2V wiring)

If any of those differ, outputs drift.

Here’s a quick side-by-side example of what you’re seeing: AI Toolkit Training Sample vs your current inference toolchain (ComfyUI/Diffusers/etc.).

(Image comparison: AI Toolkit training sample — set 1 vs. inference result — set 1)

If you just want inference to match your AI Toolkit training samples (start here)


60‑second sanity check: lock these 6 fields (don’t touch training yet)

Pick one AI Toolkit Training Sample you want to reproduce (Sample #1 is ideal). Copy or export these exactly:

1) Base model (exact variant/revision — not just “FLUX”)

2) Prompt (including the trigger word, in the same position)

3) Negative prompt (if training used none, keep it empty)

4) Width/height

5) Seed

6) Steps + guidance (and anything sampler/scheduler‑related)

If you match those 6 fields and results still diverge a lot, you’re looking at one of the parity issues below.
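
To make this concrete, here's a minimal Diffusers sketch that locks all six fields in one place. It assumes a FLUX-family base model; the model ID, LoRA path, prompt, and numbers are placeholders, so copy yours from the training config and the sample's metadata.

import torch
from diffusers import FluxPipeline  # swap in the pipeline class for your base model family

# 1) Exact base model variant from your training config (placeholder ID)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("output/my_lora/my_lora.safetensors")  # your AI Toolkit LoRA file

image = pipe(
    prompt="tr1gger_word portrait, soft window light",  # 2) exact prompt, trigger word in the same position
    # 3) negative prompt: omitted here since this FLUX example doesn't use one;
    #    for families that do, pass the training value (often empty)
    width=1024, height=1024,                             # 4) width/height from the sample
    num_inference_steps=20, guidance_scale=4.0,          # 6) steps + guidance from the sample
    generator=torch.Generator("cpu").manual_seed(42),    # 5) seed, via an explicit generator
).images[0]
image.save("parity_check.png")

If this run matches the training sample, port the same values into your preferred stack one field at a time.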


10‑minute parity checklist: 7 common “preview ≠ inference” causes (in priority order)

Rule of thumb: change one variable at a time. If you change five things at once, you’ll never know what fixed it.

1) Base model mismatch (most common, most destructive)

2) Resolution handling mismatch (snapping to multiples / hidden resize)

3) Steps, scheduler, guidance semantics differ (few‑step models are ultra sensitive)

4) Seed/RNG semantics differ (CPU vs GPU generator, global seeding)

5) LoRA application differs (adapter vs fuse/merge; wrong loader)

6) Prompt/negative/trigger word not identical (one token can break it)

7) Wrong pipeline family / missing conditioning inputs (edit/control/I2V)

Now, how to diagnose each quickly.


1) Base model mismatch: “looks close enough” is not close enough

Symptoms

  • The whole look drifts: faces, texture, style, detail quality.
  • It can look like the LoRA “doesn’t apply” at all.

Why it happens

LoRAs are highly base‑model‑specific. Training on one exact base model variant and inferring on another (even “same family”) often produces major drift.

Fast test

  • Open your AI Toolkit training config/YAML.
  • Find the exact base model identifier/path.
  • Verify ComfyUI/Diffusers is loading the same base model (not a similarly named one).

Common traps

  • Mixing FLUX variants (e.g., dev vs schnell, FLUX.1 vs FLUX.2)
  • Mixing Qwen Image generation vs Qwen Image Edit variants
  • Mixing Z‑Image Turbo vs DeTurbo
  • Using the wrong WAN 2.2 task family (T2V vs I2V)

Fix

Treat your training config as the source of truth: use the training config file (YAML) to select the base model, not memory or guesswork.
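
As a sketch, you can read the identifier straight from the config instead of retyping it. The field layout below matches common AI Toolkit configs (config → process → model → name_or_path), but check your own YAML if the keys differ.

import yaml

# Path to the training config AI Toolkit actually used (placeholder path)
with open("config/my_lora.yaml") as f:
    cfg = yaml.safe_load(f)

# Common AI Toolkit layout; adjust the keys if your config differs
base_model = cfg["config"]["process"][0]["model"]["name_or_path"]
print("Base model used for training:", base_model)

# Load exactly this identifier in your inference stack, not a similarly named variant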


2) Resolution handling mismatch: you think it’s 1024, but it isn’t

Symptoms

  • Composition shifts, sharpness changes, details smear.
  • Drift is much worse on few‑step/turbo models.

Why it happens

Many inference implementations snap width/height down to a multiple of a divisor (often 32). If the AI Toolkit preview snaps down but your inference stack doesn't (or vice versa), you aren't actually running the same input size.

Example pattern:

width  = (width  // divisor) * divisor
height = (height // divisor) * divisor

Fast test

  • Force your inference run to a clean multiple of 32 (e.g., 1024×1024, 1216×832).
  • In ComfyUI, check for hidden resize/crop/latent scaling nodes.

Fix

For parity, lock width/height to what the training preview effectively used (including snapping rules).
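
To check what size a request effectively runs at, here's a tiny helper reproducing the snapping rule above (assuming a divisor of 32; the actual divisor depends on the model family):

def snap(value: int, divisor: int = 32) -> int:
    # Round down to the nearest multiple, as many pipelines do internally
    return (value // divisor) * divisor

# Example: a 1000x760 request silently becomes 992x736 after snapping
print(snap(1000), snap(760))   # 992 736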


3) Steps/scheduler/guidance semantics differ: few‑step models punish “SDXL habits”

Symptoms

  • The result turns blurry/dirty, loses “training preview crispness.”
  • Or it explodes into overcooked artifacts.

Why it happens

Distilled/turbo/few‑step models often expect low steps and low guidance (sometimes guidance ≈ 1.0). If you apply SD/SDXL defaults (steps 20–30, CFG 5–8), you can push the model out of its intended regime.

Fast test

  • Force steps and guidance to match the Training Sample exactly.
  • Don’t change schedulers while troubleshooting parity.

Fix

Start from the Training Sample’s settings as ground truth. Once you can reproduce it, then tune for your desired look.
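
As a sketch (reusing pipe from the earlier example; the numbers are placeholders to be replaced with your Training Sample's values):

import torch

# Copied from the Training Sample you're reproducing (placeholders)
sample = {"steps": 8, "guidance": 1.0, "seed": 42}

image = pipe(
    prompt="tr1gger_word portrait, soft window light",
    num_inference_steps=sample["steps"],   # don't fall back to SD/SDXL-style 20-30 steps
    guidance_scale=sample["guidance"],     # few-step/distilled models often expect values near 1.0
    generator=torch.Generator("cpu").manual_seed(sample["seed"]),
).images[0]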


4) Seed/RNG semantics differ: same seed ≠ same noise stream across stacks

Symptoms

  • Same prompt + same seed, but outputs are wildly different.

Why it happens

Different stacks can implement seeding differently:

  • global seeding vs per‑node seeding
  • CPU generator vs GPU generator
  • extra RNG consumption (random crops, jitter, etc.)

Fast test

  • First ensure you can reproduce the image within the same stack (run 3 times, get the same output).
  • Then compare across stacks.

Fix

Align seed handling as closely as possible (global seed + explicit generator semantics).
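
In Diffusers, for example, the generator's device is part of the seed semantics: the same integer seed produces different noise on a CPU generator than on a CUDA generator. A small sketch (the second randn call needs a GPU):

import torch

seed = 42

# Same seed, different generator devices -> different noise streams
noise_cpu = torch.randn((4,), generator=torch.Generator("cpu").manual_seed(seed))
noise_gpu = torch.randn((4,), device="cuda", generator=torch.Generator("cuda").manual_seed(seed))
print(noise_cpu)
print(noise_gpu.cpu())  # will not match the CPU values

# For parity, create one explicit generator and keep its device fixed across runs
generator = torch.Generator("cpu").manual_seed(seed)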


5) LoRA application differs: adapter vs fuse/merge (and “wrong loader”)

Symptoms

  • LoRA effect is weaker/stronger than in preview.
  • Or it looks like the LoRA isn’t applied.

Why it happens

Two common differences:

  • Adapter application: dynamically applies LoRA with a scale during inference.
  • Fuse/merge application: merges LoRA weights into the model and unloads adapters.

Those can behave differently. Also, using a LoRA loader/pipeline that doesn’t match the model family can silently “apply nothing.”

Fast test

  • Match the LoRA scale to the Training Sample.
  • Confirm your loader is correct for that model family (don’t mix pipelines across families).

Fix

Use a reference implementation that’s known to support AI Toolkit LoRAs for that model family, then reproduce the sample, then port the same method into your preferred stack.
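
In Diffusers terms, the two application modes look roughly like this (a sketch; the scale is a placeholder, and exact method availability depends on the pipeline class):

# Reusing pipe from the earlier sketch
pipe.load_lora_weights("output/my_lora/my_lora.safetensors", adapter_name="my_lora")

# Option A -- adapter-style: the LoRA stays attached and is scaled at inference time
pipe.set_adapters(["my_lora"], adapter_weights=[1.0])  # match the scale used for the training samples

# Option B -- fuse/merge-style: bake the LoRA into the base weights, then drop the adapter
# pipe.fuse_lora(lora_scale=1.0)
# pipe.unload_lora_weights()

# Pick one mode and keep it fixed while you chase parity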


6) Prompt / negative / trigger word mismatch: one token can break parity

Symptoms

  • The style feels close, but the “signature details” are gone.
  • Or it behaves like a generic base model.

High‑frequency pitfalls

  • Training negative prompt is empty, but your inference UI injects a default negative prompt.
  • Trigger word is missing, misspelled, or moved.
  • Different prompt parsing/weight syntax across tools.

Fast test

  • Set negative prompt to empty (match training).
  • Copy/paste the exact prompt text from the training sample.

Fix

For parity testing, eliminate hidden defaults. Run “clean” first.
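
A low-tech but effective check is to diff the exact strings, since a single changed token (or an injected default negative) breaks parity. A sketch with placeholder prompt text:

import difflib

sample_prompt = "tr1gger_word portrait, soft window light"   # copied from the training sample (placeholder)
my_prompt     = "tr1gger_word portrait, soft window light"   # what your inference UI actually sends
my_negative   = ""                                           # keep empty if training used none

for line in difflib.unified_diff([sample_prompt], [my_prompt], lineterm=""):
    print(line)   # any output here means the prompts are not identical

assert my_negative == "", "An inference default is injecting a negative prompt"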


7) Wrong pipeline family / missing conditioning inputs (edit/control/I2V)

Symptoms

  • The output logic is completely wrong.
  • Or inference errors out.

Why it happens

Some model families require extra inputs (control images, edit inputs, I2V conditioning). Training preview might be using that wiring, but your inference run might be prompt‑only.

Fast test

  • Does the model require a control image or edit input?
  • Are you using the correct pipeline family for the task?

Fix

Switch to the correct pipeline and provide the required inputs.
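
If you're in Diffusers and unsure whether the loaded pipeline expects a conditioning input, a quick signature check can tell you (a sketch reusing pipe from earlier; the parameter names to look for vary by family):

import inspect

params = inspect.signature(pipe.__call__).parameters
conditioning = [name for name in ("image", "control_image", "video") if name in params]
print("Conditioning inputs this pipeline accepts:", conditioning or "none")
# If the list is non-empty and the training preview used such an input, a prompt-only run won't match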


Don’t want to debug it yourself? Get inference parity on RunComfy

RunComfy already provides training & inference parity pipelines for the AI Toolkit model families, so you don't have to rebuild the exact preview pipeline by hand.

All you need is your LoRA and the Training config file (YAML) from your AI Toolkit run. Import them into Trainer → LoRA Assets, then click Run (Run LoRA) to start inference immediately. RunComfy will run the correct base‑model pipeline with the same parity‑critical behavior so your results match the training previews.

For the exact workflow, see: AI Toolkit LoRA Training‑Inference Parity.

Here’s what that looks like in practice (AI Toolkit Training Sample vs RunComfy inference, with training config applied):

(Image comparisons, sets 1–3: AI Toolkit training sample vs. RunComfy Run LoRA inference result, Playground/API)


FAQ

“My AI Toolkit LoRA works in Training Samples, but does nothing in ComfyUI.”

Start with base model mismatch and LoRA loader mismatch. Most “it does nothing” reports are one of:

  • wrong base model variant
  • wrong LoRA injection method/loader
  • hidden negative prompt/defaults

“Why are AI Toolkit Samples sharper than my Diffusers output?”

Usually one of:

  • steps/guidance regime mismatch (especially few‑step models)
  • resolution snapping differences
  • scheduler/timestep differences

“How do I make inference match training preview reliably?”

Treat the training config as ground truth and lock:

  • base model
  • width/height (including snapping rules)
  • steps/guidance/scheduler family
  • LoRA injection method and scale
  • seed semantics

If you want this as a repeatable Run LoRA workflow (Playground/API), build your inference around that same config.

Ready to start training?