Z-Image LoRA Training (Z-Image Turbo + De-Turbo) with Ostris AI Toolkit

This guide explains how to train a high-quality Z-Image LoRA with Ostris AI Toolkit by choosing the right base (Turbo + training adapter vs De-Turbo), then tuning dataset, rank/LR/steps, and sampling settings to get stable results.


Z‑Image is a 6B‑parameter image generation model from Tongyi‑MAI built on a Scalable Single‑Stream Diffusion Transformer (S3‑DiT). It’s unusually efficient for its size and is designed to run at 1024×1024 on consumer GPUs.

This guide covers the two most common, real-world approaches to Z‑Image LoRA training:

1) Z‑Image Turbo (w/ Training Adapter) — best when you want your LoRA to run with true 8‑step Turbo speed after training.

2) Z‑Image De‑Turbo (De‑Distilled) — best when you want a de‑distilled base you can train without an adapter, or push longer fine-tunes.

By the end of this guide, you’ll be able to:

  • Pick the right Z‑Image base (Turbo+adapter vs De‑Turbo) for your goal.
  • Prepare a dataset that works with Turbo-style distilled training.
  • Configure Ostris AI Toolkit (locally or on RunComfy Cloud AI Toolkit) panel‑by‑panel.
  • Understand why each parameter matters, so you can tune instead of copy‑pasting.

This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this guide.

Quick start (recommended baseline)

Option A — Turbo + training adapter (recommended for most LoRAs)

Use this if you want your LoRA to keep Turbo’s fast 8‑step behavior after training.

Why this matters:

  • Turbo is a distilled "student" model: it compresses a slower multi-step diffusion process into ~8 steps.
  • If you train on Turbo like a normal model, your updates can undo the distillation ("Turbo drift"), and you’ll start needing more steps / more CFG to get the same quality.
  • The training adapter temporarily "de‑distills" Turbo during training so your LoRA learns your concept without breaking Turbo’s 8‑step behavior. At inference you remove the adapter and keep only your LoRA.

Baseline settings:

  1. MODEL → Model Architecture: Z‑Image Turbo (w/ Training Adapter)
  2. MODEL → Name or Path: Tongyi-MAI/Z-Image-Turbo
  3. MODEL → Training Adapter Path:
    • Keep the default if your UI auto-fills it (RunComfy often defaults to v2), or set explicitly:
      • v1: ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v1.safetensors
      • v2: ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v2.safetensors
  4. TARGET → Linear Rank: 16
  5. TRAINING → Learning Rate: 0.0001
  6. TRAINING → Steps: 2500–3000 (for 10–30 images)
  7. DATASETS → Resolutions: 512 / 768 / 1024 and Cache Latents = ON
  8. SAMPLE (for previews):
    • 1024×1024, 8 steps (or 9 if your pipeline treats 9 as "8 DiT forwards")
    • Guidance scale = 0 (Turbo is guidance‑distilled)
    • Sample every 250 steps

Option B — De‑Turbo (de‑distilled base)

Use this if you want to train without a training adapter or you plan longer training runs.

What changes compared to Turbo:

  • De‑Turbo behaves more like a "normal" diffusion model for training and sampling.
  • You typically sample with more steps and low (but non-zero) CFG.

Baseline settings:

  1. MODEL → Model Architecture: Z‑Image De‑Turbo (De‑Distilled)
  2. MODEL → Name or Path: ostris/Z-Image-De-Turbo (or whatever your AI Toolkit build pre-selects)
  3. Training Adapter Path: none (not needed)
  4. Keep the same LoRA settings (rank/LR/steps) as a baseline.
  5. SAMPLE (for previews):
    • 20–30 steps
    • CFG (guidance scale) ≈ 2–3
    • Sample every 250 steps

Want zero setup? Use the RunComfy Cloud AI Toolkit and follow the exact same panels.


1. Which Z‑Image base should you train on? (Turbo+adapter vs De‑Turbo)

AI Toolkit exposes two "model architecture" choices for Z‑Image LoRA training:

1.1 Z‑Image Turbo (w/ Training Adapter)

Best for: typical LoRAs (character, style, product), where your end goal is to run inference on Turbo at 8 steps.

Why it exists:

  • Z‑Image Turbo is a step‑distilled model. If you train LoRAs on a step‑distilled model "normally", the distillation can break down fast, and Turbo starts to behave like a slower non‑distilled model (quality shifts, needs more steps, etc.).
  • The training adapter acts like a temporary "de‑distillation LoRA" during training. Your LoRA learns your concept while Turbo’s fast 8‑step behavior stays stable.
  • At inference time, you remove the training adapter and keep your LoRA on top of the real Turbo base.

Practical signals you chose the right path:

  • Your preview samples look good at 8 steps with guidance ≈ 0.
  • Your LoRA doesn’t suddenly start requiring 20–30 steps to look clean (a common sign of Turbo drift).

1.2 Z‑Image De‑Turbo (De‑Distilled)

Best for: training without adapter, or longer fine‑tunes where Turbo+adapter would eventually drift.

What it is:

  • De‑Turbo is a de‑distilled version of Turbo, designed to behave more like a normal diffusion model for training.
  • It can be trained directly without an adapter and also used for inference (typically 20–30 steps with low CFG).

1.3 Quick decision guide

Pick Turbo + training adapter if:

  • You want the LoRA to run at Turbo speed (8 steps) after training.
  • You are doing a normal LoRA run (a few thousand to tens of thousands of steps).

Pick De‑Turbo if:

  • You want "normal model" behavior for training and sampling.
  • You want to train longer, or you’re experimenting with workflows that don’t support the training adapter cleanly.

2. Z‑Image training adapter v1 vs v2 (what changes, when to use)

In the training adapter repo you’ll often see two files:

  • ..._v1.safetensors
  • ..._v2.safetensors

What you need to know (practically):

  • v1 is the safe baseline.
  • v2 is a newer variant that can change training dynamics and results.

Recommendation: treat this as an A/B test:

  • Keep dataset, LR, steps, rank identical
  • Train once with v1, once with v2
  • Compare sample grids at the same checkpoints

If your RunComfy UI defaults to v2 and your training looks stable, just keep it. If you see instability (noise, Turbo drift, weird artifacts), switch to v1.


3. Z‑Image / Z‑Image‑Turbo in a nutshell (for LoRA training)

From the official Z‑Image sources:

  • 6B parameters, S3‑DiT architecture — text tokens, visual semantic tokens, and VAE latents are concatenated into a single transformer stream.
  • Model family — Turbo, Base, and Edit variants exist in the Z‑Image series.
  • Turbo specifics — optimized for fast inference; guidance is typically 0 for Turbo inference.

A helpful mental model for LoRA training:

  • High-noise timesteps mostly control composition (layout, pose, global color tone).
  • Low-noise timesteps mostly control details (faces, hands, textures).

This is why timestep settings and bias can noticeably change whether a LoRA feels "global style" vs "identity/detail".
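To make that intuition concrete, here is a toy weighting function over the noise schedule. This is a hypothetical curve for illustration only, not AI Toolkit's actual Weighted schedule:

```python
def timestep_weights(n_timesteps: int, bias: str = "balanced") -> list[float]:
    """Illustrative sampling weights over timesteps (hypothetical curve).

    Index 0 = low noise (details), index n_timesteps-1 = high noise (composition).
    """
    ts = [i / (n_timesteps - 1) for i in range(n_timesteps)]  # 0.0 -> 1.0
    if bias == "high_noise":
        raw = [0.5 + t for t in ts]   # weight grows with noise level -> more "style/layout" signal
    elif bias == "low_noise":
        raw = [1.5 - t for t in ts]   # weight shrinks with noise level -> more "detail" signal
    else:  # balanced
        raw = [1.0 for _ in ts]
    total = sum(raw)
    return [w / total for w in raw]   # normalize to a probability distribution
```

Shifting probability mass toward one end of the schedule is the whole mechanism: the LoRA simply gets more gradient updates at the timesteps that control the behavior you care about.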


4. Where to train Z‑Image: local vs cloud AI Toolkit

4.1 Local AI Toolkit

The AI Toolkit by Ostris is open source on GitHub. It supports Z‑Image, FLUX, Wan, Qwen and more through a unified UI and config system.

Local makes sense if:

  • You already have an NVIDIA GPU and don’t mind Python / Git setup.
  • You want full control over files, logs and custom changes.

Repo: ostris/ai-toolkit


4.2 RunComfy Cloud AI Toolkit

If you’d rather skip CUDA installs and driver issues, use RunComfy Cloud AI Toolkit:

  • Zero setup — open a browser and train.
  • Consistent VRAM — easier to follow guides without hardware friction.
  • Persistent storage — easier iteration and checkpoint management.

👉 Open it here: Cloud AI Toolkit on RunComfy


5. Designing datasets for Z‑Image LoRA training

5.1 How many images do you actually need?

  • 10–30 images is a good range for most character or style LoRAs.
  • Above ~50 images you often hit diminishing returns unless your style range is very wide.

Z‑Image picks up training signal quickly (it "learns hot"), so dataset quality and variety matter more than raw image count:

  • Too few images + too much training often shows up as overfit faces, repeated poses, or messy backgrounds.
  • A small but diverse dataset (angles, lighting, backgrounds) tends to generalize better than a large repetitive one.
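A quick way to sanity-check the "too few images + too much training" failure mode is to estimate how often each image is seen. The helper below is illustrative arithmetic, not an AI Toolkit function:

```python
def passes_per_image(steps: int, batch_size: int, num_images: int) -> float:
    """Rough count of how many times each training image is seen
    (ignores resolution buckets and any per-dataset repeats)."""
    return steps * batch_size / num_images
```

At the baseline of 2500–3000 steps with batch size 1, 10 images means 250–300 passes each, while 30 images means under ~100 — one reason small datasets overfit sooner at identical step counts.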

5.2 Character vs style LoRAs

Character LoRA

  • Aim for 12–30 images of the same subject.
  • Mix close‑ups and full‑body, angles, lighting, outfits.
  • Captions can be literal and consistent; optional trigger token.

Style LoRA

  • Aim for 15–40 images across varied subjects (people, interiors, landscapes, objects).
  • Caption the scene normally; don’t over-describe the style unless you want it to be trigger-only.
    • This teaches: "render anything in this style," rather than "only do the style when I say a special keyword."

5.3 Captions, trigger word and text files

  • Each image is paired with a same‑named caption file: image_01.png → image_01.txt
  • If there is no .txt, AI Toolkit uses Default Caption.
  • You can use [trigger] in captions and set Trigger Word in the JOB panel.
    • This is especially useful if you later enable DOP (Differential Output Preservation) to make the LoRA more "opt-in".
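The caption rules above can be sketched as a small lookup function. This mimics the behavior described in this section; it is illustrative, not AI Toolkit source code:

```python
from pathlib import Path

def resolve_caption(image_path: Path, default_caption: str, trigger: str = "") -> str:
    """Caption lookup mimicking the pairing rule above (illustrative sketch).

    image_01.png uses image_01.txt if present, else the default caption;
    the [trigger] placeholder is expanded when a trigger word is set.
    """
    txt = image_path.with_suffix(".txt")
    caption = txt.read_text(encoding="utf-8").strip() if txt.exists() else default_caption
    if trigger:
        caption = caption.replace("[trigger]", trigger)
    return caption
```

For example, a caption file containing "a photo of [trigger] smiling" with Trigger Word zchar_redhair would train on "a photo of zchar_redhair smiling".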

6. Z‑Image LoRA configuration in AI Toolkit – parameter by parameter

In this section we walk through the UI panels and explain what each important field does.

6.1 JOB panel

  • Training Name — descriptive label like zimage_char_redhair_v1
  • GPU ID — local GPU selector; on cloud keep default
  • Trigger Word (optional) — e.g. zchar_redhair / zstyle_pencil

6.2 MODEL panel (most important)

This is where the two base choices matter:

If you pick Turbo + adapter

  • Model Architecture — Z‑Image Turbo (w/ Training Adapter)
  • Name or Path — Tongyi-MAI/Z-Image-Turbo
  • Training Adapter Path — keep default or choose:
    • v1: ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v1.safetensors
    • v2: ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v2.safetensors

Tip: if you accidentally train Turbo without the adapter, the most common symptom is that your LoRA "works" only when you raise steps/CFG, which defeats the point of Turbo.

If you pick De‑Turbo

  • Model Architecture — Z‑Image De‑Turbo (De‑Distilled)
  • Name or Path — ostris/Z-Image-De-Turbo
  • Training Adapter Path — none

Options:

  • Low VRAM / Layer Offloading — enable if you’re VRAM constrained

6.3 QUANTIZATION panel

  • On 24+ GB, prefer BF16/none for fidelity
  • On 16 GB, float8 is usually the best trade-off
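The trade-off is easy to see from weight-only arithmetic. This is a back-of-the-envelope sketch; real usage adds activations, the text encoder, the VAE, and optimizer state on top:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight-only memory footprint in GiB (everything else comes on top)."""
    return n_params * bytes_per_param / 1024**3

# 6B params: BF16 (2 bytes/param) ~11.2 GiB, float8 (1 byte/param) ~5.6 GiB
```

Halving the transformer's weight footprint is what makes float8 the practical choice on a 16 GB card, at a small fidelity cost.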

6.4 TARGET panel – LoRA configuration

  • Target Type — LoRA
  • Linear Rank — start with 8–16
    • 16 for stronger styles/textures
    • 8 for smaller, subtler LoRAs

6.5 SAVE panel

  • Data Type — BF16
  • Save Every — 250
  • Max Step Saves to Keep — 4–12

6.6 TRAINING panel – core hyperparameters

  • Batch Size — 1
  • Optimizer — AdamW8Bit
  • Learning Rate — start at 0.0001

    If unstable/noisy, drop to 0.00005–0.00008.

    Avoid pushing too high (e.g. 0.0002+) — Turbo-style models can become unstable quickly.

  • Weight Decay — 0.0001
  • Steps — 2500–3000 for 10–30 images

    If your dataset is very small (<10 images), consider 1500–2200 to reduce overfitting.

  • Loss Type — Mean Squared Error
  • Timestep Type — Weighted
  • Timestep Bias — Balanced
    • Favor High Noise if you want stronger global style / mood.
    • Favor Low Noise if you’re chasing identity/detail (advanced; start with Balanced).
  • EMA — OFF

Text Encoder:

  • Cache Text Embeddings — ON if captions are static and VRAM is tight

    (then set Caption Dropout to 0)

  • Unload TE — keep OFF for caption-driven training

Regularization:

  • DOP — keep OFF for first run; add later for production trigger-only LoRAs

    (DOP is powerful but adds complexity; it’s easiest once you already have a stable baseline.)


6.7 DATASETS panel

  • Caption Dropout Rate
    • 0.05 if not caching text embeddings
    • 0 if caching embeddings
  • Cache Latents — ON
  • Resolutions — 512 / 768 / 1024 is a strong baseline

6.8 SAMPLE panel (match your base!)

If training Turbo:

  • 1024×1024, 8 steps, guidance = 0, sample every 250

If training De‑Turbo:

  • 1024×1024, 20–30 steps, CFG 2–3, sample every 250

Use 5–10 prompts that reflect real usage, and include a couple of prompts without the trigger to detect leakage.
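A simple way to set this up is to sample every base prompt twice, with and without the trigger; if the trigger-off variants still show your concept, the LoRA is leaking. A minimal sketch (the comma-joined prompt format is just a convention, not a toolkit requirement):

```python
def build_preview_prompts(base_prompts: list[str], trigger: str) -> list[str]:
    """Pair every base prompt with a trigger-on and a trigger-off variant."""
    pairs = []
    for p in base_prompts:
        pairs.append(f"{trigger}, {p}")  # trigger ON: should show the concept
        pairs.append(p)                  # trigger OFF: should stay close to the base model
    return pairs
```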


6.9 ADVANCED panel – Differential Guidance (optional)

  • Do Differential Guidance — ON if you want faster convergence
  • Scale — start at 3

    If samples look overly sharp/noisy early, reduce to 2. If learning is slow, you can test 4 later.


7. Practical recipes for Z‑Image LoRA training

A strong baseline for Turbo LoRAs:

  • Turbo + training adapter (v1 or v2)
  • rank=16, lr=1e-4, steps=2500–3000
  • 512/768/1024 buckets, cache latents ON
  • samples every 250 steps, 8 steps, guidance 0

If your LoRA feels "too strong":

  • Keep training the same, but plan to run inference at a lower LoRA weight (e.g. 0.6–0.8).
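Lowering the LoRA weight works because a LoRA merges linearly into the base weights: W_eff = W + scale · (up @ down). A toy sketch with plain lists — real implementations do this per attention/linear layer on tensors:

```python
def matmul(x, y):
    """Plain-list matrix multiply for the toy example below."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def lora_effective_weight(w, down, up, scale):
    """W_eff = W + scale * (up @ down); a scale of 0.6-0.8 softens the LoRA linearly."""
    delta = matmul(up, down)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

Because the update is linear in `scale`, dialing the weight from 1.0 down to 0.7 shrinks the LoRA's contribution by exactly 30% without retraining anything.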

8. Troubleshooting

"My LoRA destroyed Turbo—now I need more steps / CFG."

  • Most common causes:
    • trained on Turbo without the training adapter, or
    • LR too high for too long.
  • Fix:
    • use Turbo + training adapter architecture
    • keep LR ≤ 1e‑4
    • reduce steps if you see drift early

"The style is too strong."

  • Lower LoRA weight at inference (0.6–0.8)
  • Use trigger + DOP for production LoRAs (opt‑in behavior)

"Hands/backgrounds are messy."

  • Add a few images that include those cases
  • Consider slightly favoring low-noise timesteps (advanced)

"Out of VRAM / too slow."

  • Disable high buckets (keep 512–1024)
  • Enable Low VRAM + offloading
  • Quantize to float8
  • Cache latents (and optionally cache text embeddings)

9. Use your Z‑Image LoRA

Load the final .safetensors checkpoint on top of the same base you trained against, and match the sampler to that base: Turbo at ~8 steps with guidance 0, De‑Turbo at 20–30 steps with CFG 2–3. The training adapter is for training only — leave it out of your inference setup.


FAQ

Should I use the Z‑Image training adapter v1 or v2?

Start with your UI default. If results are unstable or you see Z‑Image Turbo drift, test the other version with all other settings held constant.

Should I train Z‑Image on Turbo+adapter or De‑Turbo?

Turbo+adapter for most Z‑Image LoRAs that must keep 8‑step Turbo behavior. De‑Turbo if you want adapter‑free training or longer fine‑tunes.

What Z‑Image inference settings should I use after training?

Z‑Image Turbo typically uses low/no CFG and ~8 steps. De‑Turbo behaves more like a normal model (20–30 steps, low CFG). Always match your sampling settings to the base you’re actually using.

