Z-Image LoRA Training (Z-Image Turbo + De-Turbo) with Ostris AI Toolkit

This guide explains how to train a high-quality Z-Image LoRA with Ostris AI Toolkit by choosing the right base (Turbo + training adapter vs De-Turbo), then tuning dataset, rank/LR/steps, and sampling settings to get stable results.


Z‑Image is a 6B‑parameter image generation model from Tongyi‑MAI built on a Scalable Single‑Stream Diffusion Transformer (S3‑DiT). It’s unusually efficient for its size and is designed to run at 1024×1024 on consumer GPUs.

This guide covers the two most common, real-world approaches to Z‑Image LoRA training:

1) Z‑Image Turbo (w/ Training Adapter) — best when you want your LoRA to run with true 8‑step Turbo speed after training.

2) Z‑Image De‑Turbo (De‑Distilled) — best when you want a de‑distilled base you can train without an adapter, or push longer fine-tunes.

By the end of this guide, you’ll be able to:

  • Pick the right Z‑Image base (Turbo+adapter vs De‑Turbo) for your goal.
  • Prepare a dataset that works with Turbo-style distilled training.
  • Configure Ostris AI Toolkit (locally or on RunComfy Cloud AI Toolkit) panel‑by‑panel.
  • Understand why each parameter matters, so you can tune instead of copy‑pasting.

This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this guide.

Quick start (recommended baseline)

Option A — Turbo + training adapter (recommended for most LoRAs)

Use this if you want your LoRA to keep Turbo’s fast 8‑step behavior after training.

Why this matters:

  • Turbo is a distilled "student" model: it compresses a slower multi-step diffusion process into ~8 steps.
  • If you train on Turbo like a normal model, your updates can undo the distillation ("Turbo drift"), and you’ll start needing more steps / more CFG to get the same quality.
  • The training adapter temporarily "de‑distills" Turbo during training so your LoRA learns your concept without breaking Turbo’s 8‑step behavior. At inference you remove the adapter and keep only your LoRA.

Baseline settings:

  1. MODEL → Model Architecture: Z‑Image Turbo (w/ Training Adapter)
  2. MODEL → Name or Path: Tongyi-MAI/Z-Image-Turbo
  3. MODEL → Training Adapter Path:
    • Keep the default if your UI auto-fills it (RunComfy often defaults to v2), or set explicitly:
      • v1: ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v1.safetensors
      • v2: ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v2.safetensors
  4. TARGET → Linear Rank: 16
  5. TRAINING → Learning Rate: 0.0001
  6. TRAINING → Steps: 2500–3000 (for 10–30 images)
  7. DATASETS → Resolutions: 512 / 768 / 1024 and Cache Latents = ON
  8. SAMPLE (for previews):
    • 1024×1024, 8 steps (or 9 if your pipeline treats 9 as "8 DiT forwards")
    • Guidance scale = 0 (Turbo is guidance‑distilled)
    • Sample every 250 steps

Option B — De‑Turbo (de‑distilled base)

Use this if you want to train without a training adapter or you plan longer training runs.

What changes compared to Turbo:

  • De‑Turbo behaves more like a "normal" diffusion model for training and sampling.
  • You typically sample with more steps and low (but non-zero) CFG.

Baseline settings:

  1. MODEL → Model Architecture: Z‑Image De‑Turbo (De‑Distilled)
  2. MODEL → Name or Path: ostris/Z-Image-De-Turbo (or whatever your AI Toolkit build pre-selects)
  3. Training Adapter Path: none (not needed)
  4. Keep the same LoRA settings (rank/LR/steps) as a baseline.
  5. SAMPLE (for previews):
    • 20–30 steps
    • CFG (guidance scale) ≈ 2–3
    • Sample every 250 steps

Want zero setup? Use the RunComfy Cloud AI Toolkit and follow the exact same panels.


1. Which Z‑Image base should you train on? (Turbo+adapter vs De‑Turbo)

AI Toolkit exposes two "model architecture" choices for Z‑Image LoRA training:

1.1 Z‑Image Turbo (w/ Training Adapter)

Best for: typical LoRAs (character, style, product), where your end goal is to run inference on Turbo at 8 steps.

Why it exists:

  • Z‑Image Turbo is a step‑distilled model. If you train LoRAs on a step‑distilled model "normally", the distillation can break down fast, and Turbo starts to behave like a slower non‑distilled model (quality shifts, needs more steps, etc.).
  • The training adapter acts like a temporary "de‑distillation LoRA" during training. Your LoRA learns your concept while Turbo’s fast 8‑step behavior stays stable.
  • At inference time, you remove the training adapter and keep your LoRA on top of the real Turbo base.

Practical signals you chose the right path:

  • Your preview samples look good at 8 steps with guidance ≈ 0.
  • Your LoRA doesn’t suddenly start requiring 20–30 steps to look clean (a common sign of Turbo drift).

1.2 Z‑Image De‑Turbo (De‑Distilled)

Best for: training without adapter, or longer fine‑tunes where Turbo+adapter would eventually drift.

What it is:

  • De‑Turbo is a de‑distilled version of Turbo, designed to behave more like a normal diffusion model for training.
  • It can be trained directly without an adapter and also used for inference (typically 20–30 steps with low CFG).

1.3 Quick decision guide

Pick Turbo + training adapter if:

  • You want the LoRA to run at Turbo speed (8 steps) after training.
  • You are doing a normal LoRA run (a few thousand to tens of thousands of steps).

Pick De‑Turbo if:

  • You want "normal model" behavior for training and sampling.
  • You want to train longer, or you’re experimenting with workflows that don’t support the training adapter cleanly.

2. Z‑Image training adapter v1 vs v2 (what changes, when to use)

In the training adapter repo you’ll often see two files:

  • ..._v1.safetensors
  • ..._v2.safetensors

What you need to know (practically):

  • v1 is the safe baseline.
  • v2 is a newer variant that can change training dynamics and results.

Recommendation: treat this as an A/B test:

  • Keep dataset, LR, steps, rank identical
  • Train once with v1, once with v2
  • Compare sample grids at the same checkpoints

If your RunComfy UI defaults to v2 and your training looks stable, just keep it. If you see instability (noise, Turbo drift, weird artifacts), switch to v1.


3. Z‑Image / Z‑Image‑Turbo in a nutshell (for LoRA training)

From the official Z‑Image sources:

  • 6B parameters, S3‑DiT architecture — text tokens, visual semantic tokens, and VAE latents are concatenated into a single transformer stream.
  • Model family — Turbo, Base, and Edit variants exist in the Z‑Image series.
  • Turbo specifics — optimized for fast inference; guidance is typically 0 for Turbo inference.

A helpful mental model for LoRA training:

  • High-noise timesteps mostly control composition (layout, pose, global color tone).
  • Low-noise timesteps mostly control details (faces, hands, textures).

This is why timestep settings and bias can noticeably change whether a LoRA feels "global style" vs "identity/detail".
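To make that intuition concrete, here is a toy weighting function over the noise schedule. This is a hypothetical curve for illustration only, not AI Toolkit's actual Weighted schedule:

```python
def timestep_weights(n_timesteps: int, bias: str = "balanced") -> list[float]:
    """Illustrative sampling weights over timesteps (hypothetical curve).

    Index 0 = low noise (details), index n_timesteps-1 = high noise (composition).
    """
    ts = [i / (n_timesteps - 1) for i in range(n_timesteps)]  # 0.0 -> 1.0
    if bias == "high_noise":
        raw = [0.5 + t for t in ts]   # weight grows with noise level -> more "style/layout" signal
    elif bias == "low_noise":
        raw = [1.5 - t for t in ts]   # weight shrinks with noise level -> more "detail" signal
    else:  # balanced
        raw = [1.0 for _ in ts]
    total = sum(raw)
    return [w / total for w in raw]   # normalize to a probability distribution
```

Shifting probability mass toward one end of the schedule is the whole mechanism: the LoRA simply gets more gradient updates at the timesteps that control the behavior you care about.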


4. Where to train Z‑Image: local vs cloud AI Toolkit

4.1 Local AI Toolkit

The AI Toolkit by Ostris is open source on GitHub. It supports Z‑Image, FLUX, Wan, Qwen and more through a unified UI and config system.

Local makes sense if:

  • You already have an NVIDIA GPU and don’t mind Python / Git setup.
  • You want full control over files, logs and custom changes.

Repo: ostris/ai-toolkit


4.2 RunComfy Cloud AI Toolkit

If you’d rather skip CUDA installs and driver issues, use RunComfy Cloud AI Toolkit:

  • Zero setup — open a browser and train.
  • Consistent VRAM — easier to follow guides without hardware friction.
  • Persistent storage — easier iteration and checkpoint management.

👉 Open it here: Cloud AI Toolkit on RunComfy


5. Designing datasets for Z‑Image LoRA training

5.1 How many images do you actually need?

  • 10–30 images is a good range for most character or style LoRAs.
  • Above ~50 images you often hit diminishing returns unless your style range is very wide.

Z‑Image picks up training signal quickly (it "learns hot"), so dataset quality and variety matter more than raw image count:

  • Too few images + too much training often shows up as overfit faces, repeated poses, or messy backgrounds.
  • A small but diverse dataset (angles, lighting, backgrounds) tends to generalize better than a large repetitive one.
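A quick way to sanity-check the "too few images + too much training" failure mode is to estimate how often each image is seen. The helper below is illustrative arithmetic, not an AI Toolkit function:

```python
def passes_per_image(steps: int, batch_size: int, num_images: int) -> float:
    """Rough count of how many times each training image is seen
    (ignores resolution buckets and any per-dataset repeats)."""
    return steps * batch_size / num_images
```

At the baseline of 2500–3000 steps with batch size 1, 10 images means 250–300 passes each, while 30 images means under ~100 — one reason small datasets overfit sooner at identical step counts.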

5.2 Character vs style LoRAs

Character LoRA

  • Aim for 12–30 images of the same subject.
  • Mix close‑ups and full‑body, angles, lighting, outfits.
  • Captions can be literal and consistent; optional trigger token.

Style LoRA

  • Aim for 15–40 images across varied subjects (people, interiors, landscapes, objects).
  • Caption the scene normally; don’t over-describe the style unless you want it to be trigger-only.
    • This teaches: "render anything in this style," rather than "only do the style when I say a special keyword."

5.3 Captions, trigger word and text files

  • Each image is paired with a same‑named caption file: image_01.png → image_01.txt
  • If there is no .txt, AI Toolkit uses Default Caption.
  • You can use [trigger] in captions and set Trigger Word in the JOB panel.
    • This is especially useful if you later enable DOP (Differential Output Preservation) to make the LoRA more "opt-in".
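The caption rules above can be sketched as a small lookup function. This mimics the behavior described in this section; it is illustrative, not AI Toolkit source code:

```python
from pathlib import Path

def resolve_caption(image_path: Path, default_caption: str, trigger: str = "") -> str:
    """Caption lookup mimicking the pairing rule above (illustrative sketch).

    image_01.png uses image_01.txt if present, else the default caption;
    the [trigger] placeholder is expanded when a trigger word is set.
    """
    txt = image_path.with_suffix(".txt")
    caption = txt.read_text(encoding="utf-8").strip() if txt.exists() else default_caption
    if trigger:
        caption = caption.replace("[trigger]", trigger)
    return caption
```

For example, a caption file containing "a photo of [trigger] smiling" with Trigger Word zchar_redhair would train on "a photo of zchar_redhair smiling".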

6. Z‑Image LoRA configuration in AI Toolkit – parameter by parameter

In this section we walk through the UI panels and explain what each important field does.

6.1 JOB panel

  • Training Name — descriptive label like zimage_char_redhair_v1
  • GPU ID — local GPU selector; on cloud keep default
  • Trigger Word (optional) — e.g. zchar_redhair / zstyle_pencil

6.2 MODEL panel (most important)

This is where the two base choices matter:

If you pick Turbo + adapter

  • Model Architecture — Z‑Image Turbo (w/ Training Adapter)
  • Name or Path — Tongyi-MAI/Z-Image-Turbo
  • Training Adapter Path — keep default or choose:
    • v1: ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v1.safetensors
    • v2: ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v2.safetensors

Tip: if you accidentally train Turbo without the adapter, the most common symptom is that your LoRA "works" only when you raise steps/CFG, which defeats the point of Turbo.

If you pick De‑Turbo

  • Model Architecture — Z‑Image De‑Turbo (De‑Distilled)
  • Name or Path — ostris/Z-Image-De-Turbo
  • Training Adapter Path — none

Options:

  • Low VRAM / Layer Offloading — enable if you’re VRAM constrained

6.3 QUANTIZATION panel

  • On 24+ GB, prefer BF16/none for fidelity
  • On 16 GB, float8 is usually the best trade-off
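The trade-off is easy to see from weight-only arithmetic. This is a back-of-the-envelope sketch; real usage adds activations, the text encoder, the VAE, and optimizer state on top:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight-only memory footprint in GiB (everything else comes on top)."""
    return n_params * bytes_per_param / 1024**3

# 6B params: BF16 (2 bytes/param) ~11.2 GiB, float8 (1 byte/param) ~5.6 GiB
```

Halving the transformer's weight footprint is what makes float8 the practical choice on a 16 GB card, at a small fidelity cost.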

6.4 TARGET panel – LoRA configuration

  • Target Type — LoRA
  • Linear Rank — start with 8–16
    • 16 for stronger styles/textures
    • 8 for smaller, subtler LoRAs

6.5 SAVE panel

  • Data Type — BF16
  • Save Every — 250
  • Max Step Saves to Keep — 4–12

6.6 TRAINING panel – core hyperparameters

  • Batch Size — 1
  • Optimizer — AdamW8Bit
  • Learning Rate — start at 0.0001

    If unstable/noisy, drop to 0.00005–0.00008.

    Avoid pushing too high (e.g. 0.0002+) — Turbo-style models can become unstable quickly.

  • Weight Decay — 0.0001
  • Steps — 2500–3000 for 10–30 images

    If your dataset is very small (<10 images), consider 1500–2200 to reduce overfitting.

  • Loss Type — Mean Squared Error
  • Timestep Type — Weighted
  • Timestep Bias — Balanced
    • Favor High Noise if you want stronger global style / mood.
    • Favor Low Noise if you’re chasing identity/detail (advanced; start with Balanced).
  • EMA — OFF

Text Encoder:

  • Cache Text Embeddings — ON if captions are static and VRAM is tight

    (then set Caption Dropout to 0)

  • Unload TE — keep OFF for caption-driven training

Regularization:

  • DOP — keep OFF for first run; add later for production trigger-only LoRAs

    (DOP is powerful but adds complexity; it’s easiest once you already have a stable baseline.)


6.7 DATASETS panel

  • Caption Dropout Rate
    • 0.05 if not caching text embeddings
    • 0 if caching embeddings
  • Cache Latents — ON
  • Resolutions — 512 / 768 / 1024 is a strong baseline

6.8 SAMPLE panel (match your base!)

If training Turbo:

  • 1024×1024, 8 steps, guidance = 0, sample every 250

If training De‑Turbo:

  • 1024×1024, 20–30 steps, CFG 2–3, sample every 250

Use 5–10 prompts that reflect real usage, and include a couple of prompts without the trigger to detect leakage.
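A simple way to set this up is to sample every base prompt twice, with and without the trigger; if the trigger-off variants still show your concept, the LoRA is leaking. A minimal sketch (the comma-joined prompt format is just a convention, not a toolkit requirement):

```python
def build_preview_prompts(base_prompts: list[str], trigger: str) -> list[str]:
    """Pair every base prompt with a trigger-on and a trigger-off variant."""
    pairs = []
    for p in base_prompts:
        pairs.append(f"{trigger}, {p}")  # trigger ON: should show the concept
        pairs.append(p)                  # trigger OFF: should stay close to the base model
    return pairs
```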


6.9 ADVANCED panel – Differential Guidance (optional)

  • Do Differential Guidance — ON if you want faster convergence
  • Scale — start at 3

    If samples look overly sharp/noisy early, reduce to 2. If learning is slow, you can test 4 later.


7. Practical recipes for Z‑Image LoRA training

A strong baseline for Turbo LoRAs:

  • Turbo + training adapter (v1 or v2)
  • rank=16, lr=1e-4, steps=2500–3000
  • 512/768/1024 buckets, cache latents ON
  • samples every 250 steps, 8 steps, guidance 0

If your LoRA feels "too strong":

  • Keep training the same, but plan to run inference at a lower LoRA weight (e.g. 0.6–0.8).
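Lowering the LoRA weight works because a LoRA merges linearly into the base weights: W_eff = W + scale · (up @ down). A toy sketch with plain lists — real implementations do this per attention/linear layer on tensors:

```python
def matmul(x, y):
    """Plain-list matrix multiply for the toy example below."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def lora_effective_weight(w, down, up, scale):
    """W_eff = W + scale * (up @ down); a scale of 0.6-0.8 softens the LoRA linearly."""
    delta = matmul(up, down)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

Because the update is linear in `scale`, dialing the weight from 1.0 down to 0.7 shrinks the LoRA's contribution by exactly 30% without retraining anything.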

8. Troubleshooting

"My LoRA destroyed Turbo—now I need more steps / CFG."

  • Most common causes:
    • trained on Turbo without the training adapter, or
    • LR too high for too long.
  • Fix:
    • use Turbo + training adapter architecture
    • keep LR ≤ 1e‑4
    • reduce steps if you see drift early

"The style is too strong."

  • Lower LoRA weight at inference (0.6–0.8)
  • Use trigger + DOP for production LoRAs (opt‑in behavior)

"Hands/backgrounds are messy."

  • Add a few images that include those cases
  • Consider slightly favoring low-noise timesteps (advanced)

"Out of VRAM / too slow."

  • Disable high buckets (keep 512–1024)
  • Enable Low VRAM + offloading
  • Quantize to float8
  • Cache latents (and optionally cache text embeddings)

9. Use your Z‑Image LoRA

Load the final .safetensors checkpoint on top of the same base you trained against, and match the sampler to that base: Turbo at ~8 steps with guidance 0, De‑Turbo at 20–30 steps with CFG 2–3. The training adapter is for training only — leave it out of your inference setup.


FAQ

Should I use the Z‑Image training adapter v1 or v2?

Start with your UI default. If results are unstable or you see Z‑Image Turbo drift, test the other version with all other settings held constant.

Should I train Z‑Image on Turbo+adapter or De‑Turbo?

Turbo+adapter for most Z‑Image LoRAs that must keep 8‑step Turbo behavior. De‑Turbo if you want adapter‑free training or longer fine‑tunes.

What Z‑Image inference settings should I use after training?

Z‑Image Turbo typically uses low/no CFG and ~8 steps. De‑Turbo behaves more like a normal model (20–30 steps, low CFG). Always match your sampling settings to the base you’re actually using.

