AI Toolkit LoRA Training Guides

Z-Image Turbo LoRA Training with Ostris AI Toolkit

Learn how to train Z-Image Turbo LoRAs with the Ostris AI Toolkit while preserving true 8-step Turbo speed and quality. This guide explains the Z-Image Turbo architecture, the built-in training adapter, recommended dataset sizes and ranks, and the exact panel settings you need to get sharp, bilingual LoRAs running on 16-24GB local GPUs or RunComfy's cloud.

Z‑Image is a 6‑billion‑parameter image generation model from Tongyi‑MAI built on a Scalable Single‑Stream Diffusion Transformer (S3‑DiT). Text tokens and image tokens travel through a single transformer, which makes Z‑Image unusually efficient for its size and lets it run at 1024×1024 on consumer GPUs.

The variant you’ll train on here is Z‑Image‑Turbo – an 8‑step, guidance‑free distilled model that still hits photorealistic quality and strong bilingual (EN/zh) text rendering while fitting comfortably into 16–24 GB of VRAM.

By the end of this guide, you’ll be able to:

  • Understand what makes Z‑Image Turbo different from SDXL / FLUX / Wan.
  • Prepare a dataset that works well with its distilled 8‑step schedule.
  • Configure AI Toolkit (locally or on RunComfy cloud AI Toolkit) panel‑by‑panel.
  • Know why each parameter matters, so you can tune it instead of just copying.

This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this Z‑Image Turbo guide.

1. Z‑Image / Z‑Image‑Turbo in a nutshell (for LoRA training)

1.1 Architecture and capabilities

From the official Z‑Image GitHub repository and the Z‑Image‑Turbo model card on Hugging Face:

  • 6B parameters, S3‑DiT architecture – text, visual semantic tokens and VAE latents are concatenated into one sequence. This single‑stream design is why Z‑Image can match or beat much larger models while running on 16 GB GPUs.
  • Three variants – Base, Turbo, and Edit. At the time of writing, Turbo is the main open‑weights model for inference and custom LoRAs; Base and Edit are coming for heavier finetunes.
  • Turbo specifics – 8 effective denoising steps, guidance‑distilled (recommended guidance_scale = 0 at inference), optimized for 1024×1024 but flexible in aspect ratio.
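
Since Turbo is guidance‑distilled and tuned for 8 steps, inference settings look different from a normal diffusion model. Below is a minimal sketch, assuming the Hugging Face repo can be loaded through a standard diffusers pipeline; if that assumption doesn’t hold for your diffusers version, use the inference code from the official Z‑Image repository instead.

```python
import torch
from diffusers import DiffusionPipeline

# Assumption: Tongyi-MAI/Z-Image-Turbo exposes a diffusers-format pipeline.
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="a woman holding a coffee cup, in a beanie, sitting at a cafe",
    num_inference_steps=8,   # Turbo is distilled for 8 denoising steps
    guidance_scale=0.0,      # guidance-distilled: CFG is baked into the weights
    height=1024,
    width=1024,
).images[0]
image.save("zimage_turbo_sample.png")
```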

1.2 Why LoRAs instead of full finetunes

LoRAs make much more sense than full model finetunes for Z‑Image Turbo:

  • A LoRA file is tiny compared to a 6B checkpoint and can be swapped or combined easily.
  • You don’t have to host your own fork of Tongyi-MAI/Z-Image-Turbo—you just load the base model and attach your LoRA.
  • Training is orders of magnitude cheaper: you only update small rank‑reduced matrices instead of the entire transformer.

For most real projects (characters, styles, products) you get 90–95% of the benefit with a well‑tuned LoRA.
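
To make "tiny" concrete, here is a rough back‑of‑the‑envelope comparison. The hidden width, layer count and number of targeted linear layers are illustrative assumptions, not Z‑Image’s exact architecture:

```python
# Rough parameter-count comparison: full 6B checkpoint vs a rank-16 LoRA.
hidden = 3072      # assumed transformer width (illustrative)
n_layers = 30      # assumed number of transformer blocks (illustrative)
rank = 16

full_params = 6e9  # ~6B weights in the base model
# One LoRA pair (A: hidden x rank, B: rank x hidden) per targeted linear layer,
# assuming ~4 targeted linears per block (e.g. attention projections).
lora_params = n_layers * 4 * (hidden * rank * 2)

print(f"full model  : {full_params / 1e9:.1f}B params (~{full_params * 2 / 1e9:.0f} GB in bf16)")
print(f"rank-16 LoRA: {lora_params / 1e6:.1f}M params (~{lora_params * 2 / 1e6:.0f} MB in bf16)")
```

Even with generous assumptions, the LoRA lands in the tens of megabytes while the full checkpoint is on the order of 12 GB in bf16.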


1.3 Why Z‑Image Turbo needs a training adapter

Z‑Image Turbo is not a normal diffusion model. It’s a distilled student that learned to copy a larger "teacher" model which used 20–50 steps with classifier‑free guidance (CFG) and a different noise schedule. Turbo squeezes that behaviour into 8 single‑pass steps so it generates images quickly.

If you train LoRAs on Turbo as if it were a regular diffusion model, your updates don’t just teach it your new style or character – they also start to undo the distillation:

  • The model slowly drifts back toward the original 20–50‑step, CFG‑style behaviour it was distilled from.
  • In practice, people see Turbo’s sharp 8‑step images degrade after a few hundred training steps when they train LoRAs directly on it without any extra tricks: more noise, mushy details, and "broken" Turbo quality.

To avoid that, AI Toolkit includes a dedicated Z‑Image Turbo training adapter (ostris/zimage_turbo_train). This adapter:

  • Temporarily "de‑distills" Turbo during training, so under the hood it behaves more like the original slow teacher model. Your LoRA now sees a stable, teacher‑like diffusion process instead of the compressed 8‑step student.
  • Is automatically merged into the base model at the start of training, and removed again at inference time, so when you actually use your LoRA you still get true 8‑step Turbo speed and behaviour.

You don’t have to manage this adapter yourself. In the AI Toolkit UI, simply choose "Z‑Image Turbo (with training adapter)" in the Model Architecture field, and the toolkit will wire everything up for you.

The adapter is designed for normal LoRA‑scale runs (a few thousand to tens of thousands of steps) where you’re adding a style, character or concept. It is not meant for giant multi‑million‑step full‑model fine‑tunes – if you push it that far, Turbo will eventually start to lose its distilled behaviour no matter what.


2. Where to train Z‑Image: local vs cloud AI Toolkit

2.1 Local AI Toolkit

The AI Toolkit by Ostris is open source on GitHub. It supports Z‑Image, FLUX, Wan, Qwen and many more models through a unified UI and config system.

Running locally makes sense if:

  • You already have a 24–48 GB NVIDIA GPU and don’t mind doing Python / Git setup.
  • You want full control over data, logs and custom changes to AI Toolkit.

The GitHub README walks you through installing Python, cloning the repo and starting the Gradio UI. (GitHub)


2.2 RunComfy cloud AI Toolkit

If you’d rather skip CUDA installs and driver issues, use RunComfy’s cloud AI Toolkit environment.

In practice this gives you:

  • Zero setup – just open a browser and start using the same UI as the local version. The backend GPU (A100/H100‑class) is already configured.
  • Consistent VRAM – it’s much easier to follow guides like this when you’re not fighting random local hardware limitations.
  • Persistent storage for datasets and checkpoints – ideal when you’re iterating on multiple Z‑Image LoRAs or maintaining a small internal library.

3. Designing datasets for Z‑Image LoRA training

3.1 How many images do you actually need?

Z‑Image Turbo is efficient but not "magic": LoRA sample complexity depends more on what you want to teach than on the base model’s parameter count.

From Z‑Image LoRA tutorials and real training runs:

  • 10–30 images is a good range for most character or style LoRAs.
  • Above ~50 images you usually get diminishing returns unless you’re covering a huge stylistic range.

Why this many if the model is "only" 6B?

  • The LoRA is learning a difference between base Z‑Image and your concept. It has to see the concept under different poses, lighting and backgrounds to separate "your character / style" from "random person / scene".
  • Z‑Image "learns hot" – it responds strongly to gradients. With too few images, a small dataset plus a strong learning rate can lead to overfitting and weird artefacts.

So the model is efficient, but you still need enough variety to tell it what to ignore.


3.2 Character vs style LoRAs

Character LoRA

  • Aim for 12–30 images of the same person / subject.
  • Include a mix of close‑ups and full‑body shots, multiple outfits, angles and lighting.
  • Captions can be fairly literal: "a photo of [name] wearing a blue jacket, studio lighting". If you want explicit control, you can optionally add a Trigger Word such as zimage_char01 to your captions and set the same word in the JOB panel.

Style LoRA

  • Aim for 15–40 images covering different subjects in that style (people, interiors, landscapes, objects).
  • Captions work best when you pretend the style is normal: "a woman holding a coffee cup, in a beanie, sitting at a cafe", not "child’s crayon drawing of a woman…".
  • This teaches the model: "whenever someone asks for a woman / house / owl, render it in this style," rather than requiring a special keyword every time.

Trigger words are still useful, for example if you want the style to be opt‑in only, but Z‑Image Turbo handles "always‑on" style LoRAs very well.


3.3 Captions, trigger word and text files

AI Toolkit expects a folder of images plus .txt files with the same base name, just like its other model guides describe.

  • If image_01.png exists, AI Toolkit will look for image_01.txt.
  • If there is no .txt, it uses the Default Caption you set in the DATASETS panel.
  • You can insert [trigger] into captions (for example: [trigger] a portrait of a woman in soft lighting). AI Toolkit replaces [trigger] with the JOB panel Trigger Word at training time, which is important if you later use Differential Output Preservation (DOP).
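
Before launching a long run, it is worth sanity‑checking that every image has the caption file AI Toolkit expects and previewing how [trigger] will be substituted. A small helper sketch along these lines (the folder path and trigger word are placeholders for your own values):

```python
from pathlib import Path

dataset = Path("datasets/zimage_char_redhair")  # placeholder path
trigger = "zchar_redhair"                       # placeholder trigger word

image_exts = {".png", ".jpg", ".jpeg", ".webp"}
for img in sorted(p for p in dataset.iterdir() if p.suffix.lower() in image_exts):
    caption_file = img.with_suffix(".txt")
    if not caption_file.exists():
        print(f"missing caption: {img.name} (the Default Caption will be used)")
        continue
    caption = caption_file.read_text(encoding="utf-8").strip()
    # Preview what the text encoder will see after [trigger] substitution.
    print(img.name, "->", caption.replace("[trigger]", trigger))
```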

4. Z‑Image LoRA configuration in AI Toolkit – parameter by parameter

In this section we walk through the AI Toolkit UI panels one by one and explain what each important field actually does and how it affects Z‑Image Turbo.

The suggested values below are a solid baseline for a first Z‑Image Turbo LoRA on a 24 GB GPU.

4.1 JOB panel

  • Training Name – A label used for the output folder (checkpoints, samples, logs). Use something descriptive like zimage_char_redhair_v1 so you can find it later.
  • GPU ID – On a local install this selects your physical GPU. On the cloud AI Toolkit on RunComfy, leave this at the default; the actual GPU type (H100 / H200, etc.) is chosen later when you start the job from the Training Queue.
  • Trigger Word (optional) – A token like zchar_redhair or zstyle_pencil.

    When you combine Trigger Word + [trigger] in captions and (optionally) Differential Output Preservation, AI Toolkit can teach the LoRA to only activate when the trigger is present and preserve base Z‑Image behaviour otherwise.

    For always‑on style LoRAs where you caption images as if the style is normal, you can leave this blank.


4.2 MODEL panel

  • Model Architecture – Choose "Z‑Image Turbo (w/ training adapter)" (the exact wording may vary slightly). This sets internal hooks so the training adapter is merged at the start and removed again for inference.
  • Name or Path – Sets the base model path. Leave it at the default Tongyi-MAI/Z-Image-Turbo to download the official Z‑Image Turbo checkpoint from Hugging Face, or point it to a local path if you want to use a custom Z‑Image Turbo base.
  • Training Adapter Path – For Turbo, leave the default ostris/zimage_turbo_train. This is the de‑distillation LoRA described earlier; changing or removing it is the fastest way to break Turbo’s 8‑step behaviour.
  • Options → Low VRAM / Layer Offloading – These toggle more aggressive offloading strategies:
    • Turn Low VRAM = ON and Layer Offloading = ON on 16 GB cards or when training at 1024×1024 with quantization.
    • On 24–48 GB GPUs you can leave them OFF to keep training simpler and faster.

4.3 QUANTIZATION panel

AI Toolkit lets you load the base model in reduced precision so LoRA training fits smaller GPUs.

  • Transformer – For most users:
    • On 24+ GB VRAM, set this to none or BF16 for maximum training fidelity (Z‑Image is usually run in bfloat16 for inference as recommended in the official code).
    • On 16 GB VRAM, keep it at float8 (default). This 8‑bit quantization cuts memory dramatically while still working well for LoRA training in practice.
  • Text Encoder – Same logic as the transformer:
    • Keep float8 on tight VRAM.
    • Use BF16 or none if you have plenty of VRAM and want the cleanest gradients, especially for text‑heavy concepts where prompt wording matters a lot.

Quantizing the base model doesn’t change your LoRA precision (the LoRA updates themselves are still stored in higher precision), so it’s usually a good trade‑off unless you are debugging very subtle artefacts.
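
For a rough sense of why quantization matters, weights‑only memory for the 6B transformer scales linearly with bytes per parameter; everything else (text encoder, VAE, activations, gradients, optimizer state) comes on top:

```python
# Back-of-the-envelope VRAM for the base transformer weights alone (6B params).
params = 6e9
for precision, bytes_per_param in [("bf16 / none", 2), ("float8", 1)]:
    print(f"{precision:12s}: ~{params * bytes_per_param / 1e9:.0f} GB for weights alone")
```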


4.4 TARGET panel – LoRA configuration

  • Target Type – Choose LoRA.
  • Linear Rank – Controls the capacity of the LoRA and its file size.

    For Z‑Image Turbo, a good starting point is 8–16:

    • Higher rank (16) captures more complex styles and textures (great for very textured pencil / oil‑paint styles) at the cost of larger files and slightly higher VRAM.
    • Lower rank (8) produces smaller, more subtle LoRAs but can underfit if your style is extreme.

4.5 SAVE panel

  • Data Type – Use BF16. Z‑Image’s own examples load the model in bfloat16, and saving LoRAs in the same dtype keeps file sizes manageable and avoids conversion issues.
  • Save Every – How often AI Toolkit writes a checkpoint (in steps).

    A value like 250 is a good balance: you get 10–12 checkpoints over a 2500–3000‑step run without filling the disk.

  • Max Step Saves to Keep – How many checkpoints are kept before old ones are deleted.

    4 is fine if you’re watching the run live; bump it to 8–12 if you like to "let it cook" overnight and choose the best checkpoint later.


4.6 TRAINING panel – core hyperparameters

These are the settings that most strongly affect training dynamics.

  • Batch Size – Keep this at 1 for Z‑Image LoRAs. With 1024×1024 images and the training adapter, batch sizes above 1 start to demand a lot of VRAM; increasing Steps is usually a safer way to improve quality.
  • Gradient Accumulation – For typical runs, leave it at 1. You can treat Batch Size × Grad Accum as your effective batch; increasing grad accumulation lets you simulate bigger batches at the cost of slower wall‑clock time. Most Z‑Image LoRAs don’t need this unless you’re doing very heavy regularization.
  • Optimizer – Use AdamW8Bit.

    8‑bit Adam drastically reduces memory usage while acting like standard AdamW; it’s the recommended optimizer in other AI Toolkit docs and Z‑Image training walkthroughs.

  • Learning Rate – For Z‑Image Turbo, 0.0001 is the sweet spot:
    • This is a standard diffusion LR and works well with LoRA updates.
    • Pushing it to 0.0002 (or higher) has been observed to "explode the model"—Turbo aggressively un‑distills and quality collapses.
    • If your samples look noisy or unstable, drop to 0.00005–0.00008 instead of going higher.
  • Weight Decay – Set 0.0001. This mild regularizer helps keep LoRA weights from drifting too far when training on small datasets.
  • Steps – How long to train.
    • For 10–30 images, start with 2500–3000 steps.
    • For very small datasets (<10 images), use 1500–2200 to avoid overfitting.
    • Z‑Image "learns hot", so it often reaches usable quality surprisingly early. Start sampling at 250 steps and keep an eye on when improvement slows down.
  • Loss Type – Keep Mean Squared Error. Z‑Image’s distillation and DMDR finetuning are based on squared‑error objectives; using the same loss means your LoRA optimizes the same quantity as the base model and behaves predictably across timesteps.
  • Timestep Type – This tells AI Toolkit which part of the diffusion trajectory to sample more often.

    Under the hood, diffusion can be thought of as moving from high noise → low noise over ~1000 notional steps. Early (high noise) steps control composition, global colour and tone, while late (low noise) steps refine details like eyes, fingers and textures.

    AI Toolkit offers several shapes (Linear, Sigmoid, Shift, Weighted). For Z‑Image LoRAs:

    • Use Weighted as your default. It’s an AI‑Toolkit‑tuned curve that gives good coverage while leaning into the timesteps that matter most for typical style/character LoRAs.
    • Sigmoid clusters training around mid‑noise—useful for very detail‑heavy characters.
    • Linear is a neutral "even coverage" option if you suspect Weighted is doing something unusual. (A short, purely illustrative sketch after this list shows roughly how these sampling shapes differ.)
  • Timestep Bias – Works with Timestep Type to decide where the LoRA focuses:
    • Balanced – spread training across early and late steps; safest default for most Z‑Image LoRAs.
    • Favor High Noise – emphasizes early steps (composition, global style). Use this if you’re building a strong style LoRA that should rewrite colour grading and overall mood more than micro‑details.
    • Favor Low Noise – emphasizes late steps (faces, textures). This is mostly for highly identity‑focused character LoRAs, and even then it’s advanced tuning—Balanced is still recommended first.
  • EMA (Use EMA) – Leave OFF. EMA averages weights over time and is more useful for full‑model training; for LoRAs it mainly costs VRAM and adds complexity.
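
The exact timestep‑sampling curves are internal to AI Toolkit, but the purely illustrative sketch below shows the general idea of how different choices shift training attention along the 0–1000 noise trajectory. None of these are AI Toolkit’s actual implementations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of simulated training steps

# Illustrative sampling shapes (t close to 1000 = high noise / early steps):
linear = rng.uniform(0, 1000, n)                        # even coverage of all timesteps
sigmoid = 1000 / (1 + np.exp(-rng.normal(0, 1.0, n)))   # clusters around mid-noise (~t=500)
high_noise_bias = 1000 * rng.beta(2.0, 1.0, n)          # skewed toward high-noise steps

for name, t in [("linear", linear), ("sigmoid", sigmoid), ("favor high noise", high_noise_bias)]:
    share_high = (t > 700).mean()  # fraction of steps spent in the high-noise region
    print(f"{name:17s} mean t = {t.mean():6.1f}, share above t=700: {share_high:.0%}")
```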

Text Encoder Optimizations

These control how the text encoder is handled in VRAM.

  • Unload TE – When ON, AI Toolkit uses a static embedding (typically from your Trigger Word) and ignores dataset captions, unloading the text encoder between steps. That’s great for trigger‑only LoRAs, but wrong for caption‑driven Z‑Image LoRAs where every image has its own description. Keep it OFF unless you know you’re doing pure trigger training.
  • Cache Text Embeddings – When ON, AI Toolkit encodes each caption once, saves the embeddings, and then frees the text encoder. This is very helpful if you:
    • have static captions, and
    • are not using Differential Output Preservation or dynamic caption tricks.

For a first Z‑Image LoRA with modest VRAM, you can take one of two safe patterns:

  • VRAM‑rich: Cache Text Embeddings = OFF, Unload TE = OFF, Caption Dropout Rate ≈ 0.05.
  • VRAM‑constrained: Cache Text Embeddings = ON, Unload TE = OFF, Caption Dropout Rate = 0.

Regularization – Differential Output Preservation (DOP) & Blank Prompt Preservation

  • Differential Output Preservation – DOP compares the base model’s prediction with and without the LoRA and penalises differences except where your trigger / captions say "change this".

    For Z‑Image Turbo:

    • Turn DOP OFF for your very first LoRA. Turbo already has strong base behaviour and the training adapter; DOP roughly doubles compute and complicates text‑encoder settings.
    • Turn DOP ON later if you build production LoRAs that must leave non‑trigger prompts almost identical to base Z‑Image (for example commercial product LoRAs). Then you’ll need a Trigger Word and you must keep Cache Text Embeddings OFF so AI Toolkit can re‑encode the rewritten prompts.
  • Blank Prompt Preservation – A niche setting that protects behaviour for empty prompts. You can safely leave this OFF for Z‑Image LoRAs.

4.7 DATASETS panel – tying it together

Inside Dataset 1 you’ll see several key fields.

  • Target Dataset – Choose the folder containing your training images and captions.
  • LoRA Weight – For a single dataset, leave at 1. If you add extra datasets later (e.g. a regularization dataset of generic people), you can rebalance their influence via LoRA Weight.
  • Default Caption – Used only when an image has no .txt file. For trigger‑based setups you might use something like:

    "[trigger] a portrait of a woman with red hair, natural light".

  • Caption Dropout Rate – Probability that captions are dropped on a given step:
    • 0.05 is a good starting point when you are not caching text embeddings; roughly one in twenty steps uses a blank caption, which stops the LoRA from overfitting to exact wording.
    • Set this to 0 if you turn Cache Text Embeddings ON, because dropout requires re‑encoding captions each step and doesn’t work correctly with cached embeddings.
  • Settings → Cache Latents – Turn this ON. AI Toolkit encodes your images into VAE latents once and then trains purely in latent space, removing the heavy VAE from VRAM and speeding up training.
  • Settings → Is Regularization – Leave this OFF for your main dataset. If you later add a second dataset just to keep the model grounded (e.g. generic people photos), you’d mark that dataset as regularization so the LoRA doesn’t drift too far.
  • Resolutions – Buckets like 512 / 768 / 1024 / 1280 / 1536.

    Z‑Image is happiest at 1024×1024, but leaving multiple buckets lets AI Toolkit automatically match your original aspect ratios without cropping. For a first LoRA, you can:

    • Enable 512, 768, 1024 if your dataset contains a mix of portrait and landscape images.
    • Keep only 1024 if you curated everything as near‑square and want simpler behaviour.

4.8 SAMPLE panel – how you monitor learning

Sampling doesn’t affect training gradients directly, but it’s how you decide when to stop and which checkpoint to publish.

Key fields:

  • Sample Every – How many steps between sample grids. 250 works well for 2500–3000‑step training; you’ll see ~10–12 snapshots.
  • Sampler – Use FlowMatch or whichever sampler AI Toolkit pre‑selects for Z‑Image Turbo. It’s tuned to match the internal noise schedule so previews reflect the real model behaviour.
  • Width / Height – Set these to 1024 × 1024 unless you’re training solely for another resolution.
  • Guidance Scale – Leave at 1 for training samples. Z‑Image‑Turbo’s official inference examples use guidance_scale = 0 (since it is guidance‑distilled), but AI Toolkit’s trainer internally handles guidance for Turbo models, so you usually don’t need to touch this.
  • Sample Steps – Use 8 to match Turbo’s 8‑step design. Training previews should use the same step count you’ll use in production.
  • Sample Prompts – Here you add 5–10 prompts that represent how you plan to use the LoRA:

    "woman with red hair, playing chess at the park", "a woman holding a coffee cup, in a beanie, sitting at a cafe", etc. In character LoRAs include some prompts that don’t mention the character to check for style or identity leakage.

If you enable Walk Seed, AI Toolkit will slightly vary seeds between samples so you see a range of poses instead of the same composition every time.


4.9 ADVANCED panel – Differential Guidance (optional but powerful)

Differential Guidance is an AI‑Toolkit‑specific trick that effectively amplifies the "you’re wrong here" signal without increasing the learning rate.

  • Do Differential Guidance – Toggle ON if you want faster convergence, especially on challenging styles.
  • Differential Guidance Scale – Start with 3.
    • Higher values push the model harder toward (and slightly past) the target; combined with a high LR they can cause instability.
    • If your samples look overly sharp or noisy early in training, drop to 2. If learning feels slow even at 0.0001 learning rate, you can experiment with 4 later.

For many Z‑Image Turbo LoRAs, enabling Differential Guidance at scale 3 with LR 0.0001 is a very effective combo; it’s the same trick used to train the Turbo adapter itself.


5. Practical recipes for Z‑Image LoRA training

Putting everything together, a solid starting configuration is:

  • Model: Tongyi-MAI/Z-Image-Turbo, Z‑Image Turbo (w/ training adapter), adapter path ostris/zimage_turbo_train.
  • Quantization: Transformer = float8 or BF16 (if you have VRAM), Text Encoder = same.
  • LoRA: Target Type = LoRA, Linear Rank = 16.
  • Training:
    • Batch Size = 1, Grad Accum = 1.
    • Optimizer = AdamW8Bit.
    • Learning Rate = 0.0001.
    • Steps = 2500–3000.
    • Weight Decay = 0.0001.
    • Loss Type = Mean Squared Error.
    • Timestep Type = Weighted.
    • Timestep Bias = Balanced.
    • EMA = OFF.
    • Cache Text Embeddings = OFF (first run), Unload TE = OFF.
  • Dataset:
    • 15–30 images with .txt captions.
    • Caption Dropout Rate = 0.05.
    • Cache Latents = ON.
    • Resolutions: 512, 768, 1024.
  • Sampling:
    • Sample Every = 250.
    • Sampler = FlowMatch.
    • Width/Height = 1024×1024.
    • Sample Steps = 8.
    • Guidance Scale = 1.
    • 5–10 diverse prompts.
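
If you prefer editing a config file over clicking through the UI, the same baseline can be sketched roughly as follows. The field names mirror the example configs bundled with AI Toolkit, but the Z‑Image‑specific keys (architecture name, training adapter) are assumptions here; export a job from the UI or check the repository’s example configs for the authoritative schema.

```python
import json

# Sketch only: the Z-Image-specific keys below are hypothetical placeholders.
zimage_lora_config = {
    "job": "extension",
    "config": {
        "name": "zimage_char_redhair_v1",
        "process": [{
            "type": "sd_trainer",
            "trigger_word": "zchar_redhair",  # optional
            "model": {
                "name_or_path": "Tongyi-MAI/Z-Image-Turbo",
                "arch": "zimage_turbo",                           # hypothetical key
                "training_adapter": "ostris/zimage_turbo_train",  # hypothetical key
                "quantize": True,  # float8 transformer on tight VRAM
            },
            "network": {"type": "lora", "linear": 16, "linear_alpha": 16},
            "save": {"dtype": "bf16", "save_every": 250, "max_step_saves_to_keep": 4},
            "train": {
                "batch_size": 1,
                "gradient_accumulation_steps": 1,
                "optimizer": "adamw8bit",
                "lr": 1e-4,
                "steps": 3000,
                "dtype": "bf16",
            },
            "datasets": [{
                "folder_path": "datasets/zimage_char_redhair",
                "caption_ext": "txt",
                "caption_dropout_rate": 0.05,
                "cache_latents_to_disk": True,
                "resolution": [512, 768, 1024],
            }],
            "sample": {
                "sampler": "flowmatch",
                "sample_every": 250,
                "width": 1024,
                "height": 1024,
                "guidance_scale": 1,
                "sample_steps": 8,
                "prompts": ["zchar_redhair a portrait of a woman with red hair, natural light"],
            },
        }],
    },
}

print(json.dumps(zimage_lora_config, indent=2))  # dump to YAML with pyyaml if preferred
```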

Train, watch the samples every 250 steps, and export whichever checkpoint gives you the best trade‑off between likeness/style and generalisation.

6. Troubleshooting for Z‑Image LoRA training

"My LoRA destroyed Turbo—now I need many more steps and CFG."

  • Most often: you trained on Turbo without the training adapter, or you pushed LR too high (≥2e‑4) for too long. Turbo then "forgets" its distillation and drifts toward a normal 20–50‑step model.
  • Fix: make sure Model Architecture uses the training‑adapter variant, keep LR ≤0.0001, and consider slightly fewer steps (2000–2500) on powerful GPUs.

"The style is way too strong; everything becomes my style even without the trigger."

  • Lower LoRA strength at inference (e.g. 0.6–0.8 instead of 1.0).
  • Consider enabling DOP with a trigger word for production LoRAs. That tells the model "only change things when this trigger appears; otherwise behave like base Z‑Image."
  • You can also tilt Timestep Bias toward High Noise to make the LoRA more about global tone and less about fine details, so identity doesn’t morph as much.

"The character looks right but hands / backgrounds are messy."

  • This usually means your dataset under‑represents those cases. Add a few more images with clear hands / complex backgrounds and captions that mention them.
  • For highly detail‑sensitive characters, you can cautiously experiment with Timestep Bias = Favor Low Noise to push training more toward later detail‑oriented steps.

"Training is too slow or runs out of VRAM."

  • Lower resolution buckets (disable 1280/1536, keep 512–1024).
  • Turn Low VRAM and Layer Offloading ON.
  • Switch Transformer / Text Encoder quantization to float8.
  • Turn Cache Latents ON and, if captions are static, Cache Text Embeddings ON with Caption Dropout Rate = 0.

On RunComfy’s cloud AI Toolkit you can also simply pick a larger GPU; Z‑Image’s 6B size is modest compared with the 14B video models covered in other AI Toolkit guides.


7. Export and use your Z-Image Turbo LoRA

Once training is complete, you can use your Z-Image Turbo LoRA in two simple ways:

  • Model playground – open the Z-Image Turbo LoRA playground and paste the URL of your trained LoRA to quickly see how it behaves on top of the base model.
  • ComfyUI workflows – start a ComfyUI instance and either build your own workflow or load a ready‑made Z‑Image workflow, then add your LoRA and fine‑tune the LoRA weight and other settings for more detailed control.
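
If you also want to test the LoRA outside ComfyUI, most LoRAs trained this way can be attached to the base model with a few lines of diffusers code. This is a hedged sketch: it assumes the base model loads through a standard diffusers pipeline and that the exported safetensors file is in a diffusers/PEFT‑compatible format; paths and the trigger word are placeholders.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")

# Placeholder path to the checkpoint exported by AI Toolkit.
pipe.load_lora_weights("output/zimage_char_redhair_v1/zimage_char_redhair_v1.safetensors")
pipe.fuse_lora(lora_scale=0.8)  # try 0.6-1.0; go lower if the style is too strong

image = pipe(
    prompt="zchar_redhair a portrait of a woman with red hair, natural light",
    num_inference_steps=8,   # match Turbo's 8-step design
    guidance_scale=0.0,
).images[0]
image.save("lora_test.png")
```

If the LoRA overwhelms every prompt, lowering lora_scale is the quickest fix, as noted in the troubleshooting section above.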
