Z‑Image (Base) LoRA training with Ostris AI Toolkit
Z‑Image (Base) is the full Z‑Image checkpoint (not the 8‑step Turbo). It’s designed for high‑quality text‑to‑image with CFG + negative prompts and more sampling steps, and it’s also the best choice if your goal is a clean, fully controllable LoRA (character, style, product, typography-heavy concepts).
By the end of this guide, you’ll be able to:
- Train a Z‑Image LoRA in AI Toolkit by Ostris (local or cloud).
- Pick defaults that actually match Z‑Image Base inference behavior (steps + CFG + resolution).
- Avoid the most common Z‑Image Base gotchas (Turbo settings, “LoRA does nothing”, Base↔Turbo mismatch).
- Export checkpoints you can use right away in your inference UI.
This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this guide:
https://www.runcomfy.com/trainer/ai-toolkit/getting-started
Table of contents
- 1. Z‑Image overview: what it can do (and how it differs from Turbo)
- 2. Environment options: local AI Toolkit vs cloud AI Toolkit on RunComfy
- 3. Hardware & VRAM requirements for Z‑Image Base LoRA
- 4. Building a Z‑Image Base LoRA training dataset
- 5. Step‑by‑step: train a Z‑Image Base LoRA in AI Toolkit
- 6. Recommended Z‑Image Base LoRA configs by VRAM tier
- 7. Common Z‑Image Base training issues and how to fix them
- 8. Using your Z‑Image Base LoRA after training
1. Z‑Image overview: what it can do (and how it differs from Turbo)
1.1 What “Z‑Image Base” means
“Z‑Image Base” refers to the non‑distilled Z‑Image checkpoint. In practice:
- It expects more sampling steps (think ~30–50, not 8).
- It uses CFG and negative prompts effectively.
- It’s the better target for LoRA fine‑tuning when you want maximum control and quality.
1.2 Base vs Turbo (the important training implication)
A frequent mistake is training (or evaluating) Base like Turbo.
- Turbo settings (8 steps, low/no CFG) will make Base outputs look under‑baked and can make you think your LoRA “isn’t working”.
- Base settings (30–50 steps + normal CFG) are the correct way to judge checkpoints.
Rule of thumb:
If you trained a Base LoRA, evaluate it on Base with Base‑style sampling.
2. Environment options: local AI Toolkit vs cloud AI Toolkit on RunComfy
You can run AI Toolkit in two ways for this tutorial:
- Local AI Toolkit (your own GPU)
Install AI Toolkit from the GitHub repo, then run the Web UI. Local training is ideal if you have an NVIDIA GPU, you’re comfortable managing CUDA/drivers, and you want a persistent setup for repeated LoRA iteration.
https://github.com/ostris/ai-toolkit
- Cloud AI Toolkit on RunComfy (H100 / H200)
AI Toolkit runs in the browser on large GPUs:
- No installs (just open the UI)
- Big VRAM for higher resolution buckets (1280 / 1536)
- Persistent workspace for datasets, configs, and past runs
The workflow is the same in both environments; only the GPU location changes.
3. Hardware & VRAM requirements for Z‑Image Base LoRA
Z‑Image can run on relatively modest GPUs for inference, but LoRA training still scales strongly with:
- Resolution bucket (768 vs 1024 vs 1536)
- Quantization (float8)
- LoRA rank
- Sampling settings during training (preview resolution + preview steps)
A practical way to think about it:
- 12–16GB VRAM: doable at 512/768 with careful settings
- 24GB VRAM: comfortable for 1024 LoRA training
- 48GB+ VRAM: easiest path for 1280/1536 buckets and faster iteration
If you’re targeting typography-heavy concepts or product fidelity, plan for higher resolution and expect VRAM needs to rise quickly.
4. Building a Z‑Image Base LoRA training dataset
Z‑Image Base isn’t “special” about dataset formats — but it is sensitive to how you evaluate quality. So your dataset should be designed to match the behavior you want at inference time (CFG + higher steps).
4.1 Choose your goal (and dataset shape)
- Character / likeness: 15–50 images
Mix close‑ups + mid shots + lighting variety.
- Style: 30–200 images
Maximize subject variety so the model learns “style cues”, not one scene.
- Product / concept: 20–80 images
Consistent framing and clear captions for defining features (materials, label text, shape).
4.2 Captions + trigger (keep it simple)
- Use a trigger if you want a clean “on/off” switch (recommended for character/product).
- Keep captions short and consistent. Long captions increase accidental binding (haircut/background becomes “part of the trigger”).
Quick templates
- Character: [trigger] or photo of [trigger], portrait, natural lighting
- Style: in a [style] illustration style, soft shading, muted palette
- Product: product photo of [trigger], studio lighting, clean background
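
If you caption with per-image .txt files (the convention referenced in the DATASETS panel later), a small script can stamp out a consistent template. This is a minimal sketch; the trigger word and dataset path are placeholders, and hand-written captions are left untouched:

```python
from pathlib import Path

TRIGGER = "zimgAlice"  # placeholder trigger from the example in 5.1
DATASET_DIR = Path("datasets/zimage_base_character_v1")  # placeholder dataset folder
TEMPLATE = f"photo of {TRIGGER}, portrait, natural lighting"

# Write one short, consistent caption per image as a sibling .txt file.
for img in sorted(DATASET_DIR.glob("*")):
    if img.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    caption = img.with_suffix(".txt")
    if not caption.exists():  # never clobber hand-written captions
        caption.write_text(TEMPLATE + "\n", encoding="utf-8")
```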
5. Step‑by‑step: train a Z‑Image Base LoRA in AI Toolkit
This section is written to match the AI Toolkit UI panels you see when creating a new job.
5.1 JOB panel (Training Name, GPU ID, Trigger Word)
- Training Name: a descriptive run name (e.g., zimage_base_character_v1)
- GPU ID: pick your GPU (local) or leave default (cloud)
- Trigger Word (optional but recommended for character/product): e.g., zimgAlice
5.2 MODEL panel (Model Architecture, Name or Path, Options)
- Model Architecture: choose Z‑Image
- Name or Path: set the base model repo, typically Tongyi-MAI/Z-Image
- Options
- Low VRAM: ON if you’re on ≤ 24GB
- Layer Offloading: OFF by default; turn ON only if you still OOM after lowering resolution/rank
5.3 QUANTIZATION panel (Transformer, Text Encoder)
- Transformer: float8 (the default) is a strong choice for fitting larger buckets
- Text Encoder: float8 (the default) if you need VRAM headroom
If you have lots of VRAM, you can run with less (or no) quantization for simplicity, but float8 is usually a safe baseline.
5.4 TARGET panel (Target Type, Linear Rank)
- Target Type: LoRA
- Linear Rank (practical defaults)
- 16: style LoRAs, low VRAM runs
- 32: character/product LoRAs, higher fidelity
- 48+: only if you have lots of VRAM and you know you’re underfitting
5.5 SAVE panel (Data Type, Save Every, Max Step Saves to Keep)
- Data Type: BF16
- Save Every: 250 (enough checkpoints to pick the best one)
- Max Step Saves to Keep: 4 (prevents disk bloat)
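
Max Step Saves to Keep does this pruning for you inside the UI; if you archive checkpoints outside AI Toolkit, the same keep-the-newest-N logic looks like this (the folder and filename pattern are hypothetical):

```python
from pathlib import Path

KEEP = 4  # mirrors "Max Step Saves to Keep"
ckpt_dir = Path("output/zimage_base_character_v1")  # placeholder output folder

# Assumes checkpoint files end in the step number, e.g. my_lora_000000250.safetensors;
# sort by step and drop everything except the newest KEEP files.
ckpts = sorted(
    ckpt_dir.glob("*_*.safetensors"),
    key=lambda p: int(p.stem.rsplit("_", 1)[-1]),
)
for old in ckpts[:-KEEP]:
    print("pruning", old.name)
    old.unlink()
```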
5.6 TRAINING panel (Batch Size, Steps, Optimizer, LR, Timesteps)
Stable baseline
- Batch Size: 1
- Gradient Accumulation: 1 (increase for a larger effective batch at no extra VRAM cost)
- Steps: see below (goal-based ranges)
- Optimizer: AdamW8Bit
- Learning Rate: 0.0001 (drop to 0.00005 if unstable)
- Weight Decay: 0.0001
- Timestep Type: Weighted
- Timestep Bias: Balanced
- Loss Type: Mean Squared Error
- EMA: OFF for most LoRA runs
Steps: a Z‑Image Base‑friendly guideline
Z‑Image Base often tolerates longer training than distilled Turbo-style models, but you still want to stop before prompt fidelity collapses.
- Character / likeness: 3000–7000 steps (common sweet spot depends on dataset size)
- Style: 2000–6000 steps
- Product / concept: 2500–6500 steps
If you want a quick “smoke test”, run 1000–1500 steps, check samples, then commit to a full run.
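
Step counts only mean something relative to dataset size, so a quick estimate of how many times training will see each image helps you place a run inside the ranges above. A minimal sketch:

```python
def passes_per_image(steps: int, batch_size: int, grad_accum: int, num_images: int) -> float:
    """Rough count of how many times each training image is seen."""
    return steps * batch_size * grad_accum / num_images

# A 30-image character set at the 5.6 baseline (batch 1, accumulation 1):
print(passes_per_image(5000, 1, 1, 30))  # ~166.7 passes per image
```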
5.7 Text Encoder Optimizations + Regularization (right side)
- Unload TE: keep OFF unless you know you want trigger-only behavior and no captions
- Cache Text Embeddings: only enable if you use static captions and no caption dropout
Differential Output Preservation (DOP)
If your UI build includes it:
- Enable Differential Output Preservation when you care about “LoRA only activates when prompted”
- If DOP is ON, do NOT cache text embeddings (they conflict conceptually)
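
These interactions are easy to get wrong, so here is a small pre-flight check encoding the rules above (the option names are illustrative, not AI Toolkit’s internal config keys):

```python
def preflight(cache_text_embeddings: bool, caption_dropout_rate: float, dop_enabled: bool) -> None:
    """Fail fast on the option combinations this guide warns against."""
    if dop_enabled and cache_text_embeddings:
        raise ValueError("DOP conflicts with cached text embeddings; disable one.")
    if cache_text_embeddings and caption_dropout_rate > 0:
        raise ValueError("Cached text embeddings require a caption dropout rate of 0.")

preflight(cache_text_embeddings=False, caption_dropout_rate=0.05, dop_enabled=True)  # passes
```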
5.8 ADVANCED panel
- Do Differential Guidance: leave OFF unless you already use it in your normal workflow and know what you’re tuning.
5.9 DATASETS panel (Target Dataset, Caption Dropout, Cache Latents, Resolutions)
Set the dataset fields as follows:
- Target Dataset: select your dataset
- Default Caption: optional short template (or leave blank if you use per-image .txt files)
- Caption Dropout Rate: 0.05 (set to 0 if you cache text embeddings)
- Cache Latents: ON for speed
- Is Regularization: OFF for your main dataset
- Flip X / Flip Y: OFF by default (especially for logos/text)
- Resolutions (the most important lever)
- Low VRAM: enable 512 + 768
- 24GB: enable 768 + 1024 (or 1024 only if dataset is consistent)
- High VRAM: add 1280 / 1536 for best product/text fidelity
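
Before enabling a high bucket, check that your images can actually feed it; upscaled inputs soften exactly the fine detail (text, logos) you enabled the bucket for. A minimal audit sketch (assumes Pillow is installed; the path is a placeholder):

```python
from pathlib import Path
from PIL import Image  # pip install pillow

DATASET_DIR = Path("datasets/zimage_base_character_v1")  # placeholder path
MAX_BUCKET = 1024  # largest resolution you plan to enable

for img_path in sorted(DATASET_DIR.glob("*")):
    if img_path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    with Image.open(img_path) as im:
        short_side = min(im.size)
    if short_side < MAX_BUCKET:
        # Prefer dropping the bucket or replacing the image over upscaling.
        print(f"{img_path.name}: short side {short_side}px < {MAX_BUCKET}px bucket")
```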
5.10 SAMPLE panel (this is where Base vs Turbo matters most)
This is the #1 place people misconfigure Z‑Image Base.
Recommended Base sampling defaults
- Sample Every: 250
- Sampler: FlowMatch (match the training scheduler family)
- Guidance Scale: 4 (typical Base range is ~3–5; adjust to taste)
- Sample Steps: 30–50 (start at 30)
- Width / Height: match your main bucket (1024×1024 is a good baseline)
- Add a small set of prompts that cover:
- the trigger (if you use one)
- different compositions
- at least one “hard” prompt that stresses identity/style/product geometry
Optional negative prompt (Base supports it well)
Use a short negative prompt for previews to reduce artifacts, e.g.:
low quality, blurry, deformed, bad anatomy, watermark, text artifacts
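
One way to keep the preview prompt set consistent across runs is to generate it from the trigger. The prompts below are placeholders to adapt; they follow the coverage advice above (trigger alone, varied compositions, one deliberately hard prompt):

```python
TRIGGER = "zimgAlice"  # placeholder trigger
NEGATIVE = "low quality, blurry, deformed, bad anatomy, watermark, text artifacts"

sample_prompts = [
    f"photo of {TRIGGER}, portrait, natural lighting",
    f"photo of {TRIGGER}, full body, walking down a rainy street at night",
    f"photo of {TRIGGER} in profile, reading a newspaper, harsh side lighting",  # "hard" prompt
]
for prompt in sample_prompts:
    print(prompt, "| negative:", NEGATIVE)
```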
5.11 Launch training & monitor
Start the job and watch:
- Samples every checkpoint interval (250 steps)
- Prompt fidelity (are prompts still respected?)
- Overfit signals (same face/texture appears everywhere, backgrounds collapse)
Pick the checkpoint where the LoRA is strong without turning into an always-on filter.
6. Recommended Z‑Image Base LoRA configs by VRAM tier
Tier 1 — 12–16GB (tight VRAM)
- Low VRAM: ON
- Quantization: float8 for Transformer + Text Encoder
- Linear Rank: 16
- Resolutions: 512 + 768
- Sample Steps: 30 (keep preview size at 768 if needed)
- Steps: 2000–5000 depending on dataset size
Tier 2 — 24GB (most practical local tier)
- Low VRAM: ON (you can try OFF once stable)
- Quantization: float8
- Linear Rank: 32 (character/product), 16–32 (style)
- Resolutions: 768 + 1024 (or 1024 only if consistent)
- Sample Steps: 30–40
- Steps: 3000–7000 depending on goal
Tier 3 — 48GB+ (or cloud H100/H200)
- Low VRAM: OFF (optional)
- Quantization: optional (float8 still fine)
- Linear Rank: 32–48
- Resolutions: 1024 + 1280 + 1536 (if your dataset supports it)
- Sample Steps: 40–50 for best preview quality
- Steps: same goal-based ranges; you just iterate faster
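
If you keep run presets in scripts, the three tiers collapse into a small lookup. The values mirror the lists above and are starting points, not hard rules; the keys and structure are illustrative:

```python
# Starting points per VRAM tier; adjust per run.
ZIMAGE_BASE_TIERS = {
    "12-16GB": {"low_vram": True,  "rank": 16, "resolutions": [512, 768],
                "sample_steps": 30, "train_steps": (2000, 5000)},
    "24GB":    {"low_vram": True,  "rank": 32, "resolutions": [768, 1024],
                "sample_steps": 30, "train_steps": (3000, 7000)},
    "48GB+":   {"low_vram": False, "rank": 32, "resolutions": [1024, 1280, 1536],
                "sample_steps": 40, "train_steps": (3000, 7000)},
}
print(ZIMAGE_BASE_TIERS["24GB"])
```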
7. Common Z‑Image Base training issues and how to fix them
These are Z‑Image Base–specific problems (not generic AI Toolkit errors).
“Base looks undercooked / low detail”
Likely cause: too few steps and/or too low resolution.
Fix
- Increase sample steps to 40–50
- Try a higher bucket (1280/1536) if your VRAM allows
- If your inference workflow has a “shift” parameter, some users report improved coherence with shift in the mid range (e.g., ~4–6). Use this only as a fine-tuning knob after steps/CFG are correct.
“My Base LoRA works on Base but not on Turbo”
This is expected in many cases:
- Turbo is distilled and behaves differently (especially around CFG/negatives and “how strongly LoRAs bite”).
Fix
- If you need Turbo deployment, consider training in a Turbo-focused workflow instead of assuming Base↔Turbo transfer will be 1:1.
- For best results, train and deploy on the same family (Base→Base).
“Text/logos are inconsistent”
Z‑Image Base can do great typography, but it’s sensitive to resolution and sampling.
Fix
- Train at 1024+ (and consider 1280/1536 if possible)
- Use 40–50 sampling steps for evaluation
- Avoid Flip X if text matters
- Caption the key text feature consistently (don’t rely on the trigger to imply it)
8. Using your Z‑Image Base LoRA after training
Run LoRA: open the Z‑Image Run LoRA page. On this base‑model inference page you can select a LoRA you trained on RunComfy, or import a LoRA file you trained with AI Toolkit, then run inference via the playground or the API. RunComfy uses the same base model and the full AI Toolkit pipeline definition from your training config, so inference output stays consistent with the samples you saw during training. You can also deploy your LoRA as a dedicated endpoint from the Deployments page.
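
If you prefer to run the exported LoRA locally, here is a minimal sketch. It assumes your installed diffusers version supports Z‑Image through the generic DiffusionPipeline loader; check the Tongyi-MAI/Z-Image model card for the officially supported loading code. Paths and prompts are placeholders:

```python
import torch
from diffusers import DiffusionPipeline

# ASSUMPTION: diffusers can load Z-Image via the generic pipeline loader.
pipe = DiffusionPipeline.from_pretrained("Tongyi-MAI/Z-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.load_lora_weights("output/zimage_base_character_v1.safetensors")  # your exported LoRA

# Base-style sampling: 30-50 steps, CFG ~3-5, optional negative prompt.
image = pipe(
    prompt="photo of zimgAlice, portrait, natural lighting",
    negative_prompt="low quality, blurry, deformed, watermark",
    num_inference_steps=30,
    guidance_scale=4.0,
).images[0]
image.save("preview.png")
```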
More AI Toolkit LoRA training guides
- Z-Image-Turbo & De-Turbo LoRA training with AI Toolkit
- FLUX.2 Dev LoRA training with AI Toolkit
- Qwen-Image-Edit-2511 LoRA training with AI Toolkit
- Qwen-Image-Edit-2509 LoRA training with AI Toolkit
- Wan 2.2 I2V 14B image-to-video LoRA training
- Wan 2.2 T2V 14B text-to-video LoRA training
- Qwen Image 2512 LoRA training
- LTX-2 LoRA training with AI Toolkit
Ready to start training?