Z‑Image (Base) LoRA training with Ostris AI Toolkit
Z‑Image (Base) is the full Z‑Image checkpoint (not the 8‑step Turbo). It’s designed for high‑quality text‑to‑image with CFG + negative prompts and more sampling steps, and it’s also the best choice if your goal is a clean, fully controllable LoRA (character, style, product, typography-heavy concepts).
By the end of this guide, you’ll be able to:
- Train a Z‑Image LoRA in AI Toolkit by Ostris (local or cloud).
- Pick defaults that actually match Z‑Image Base inference behavior (steps + CFG + resolution).
- Avoid the most common Z‑Image Base gotchas (Turbo settings, “LoRA does nothing”, Base↔Turbo mismatch).
- Export checkpoints you can use right away in your inference UI.
This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this guide:
https://www.runcomfy.com/trainer/ai-toolkit/getting-started
Table of contents
- 1. Z‑Image overview: what it can do (and how it differs from Turbo)
- 2. Environment options: local AI Toolkit vs cloud AI Toolkit on RunComfy
- 3. Hardware & VRAM requirements for Z‑Image Base LoRA
- 4. Building a Z‑Image Base LoRA training dataset
- 5. Step‑by‑step: train a Z‑Image Base LoRA in AI Toolkit
- 6. Recommended Z‑Image Base LoRA configs by VRAM tier
- 7. Common Z‑Image Base training issues and how to fix them
- 8. Using your Z‑Image Base LoRA after training
1. Z‑Image overview: what it can do (and how it differs from Turbo)
1.1 What “Z‑Image Base” means
“Z‑Image Base” refers to the non‑distilled Z‑Image checkpoint. In practice:
- It expects more sampling steps (think ~30–50, not 8).
- It uses CFG and negative prompts effectively.
- It’s the better target for LoRA fine‑tuning when you want maximum control and quality.
1.2 Base vs Turbo (the important training implication)
A frequent mistake is training (or evaluating) Base like Turbo.
- Turbo settings (8 steps, low/no CFG) will make Base outputs look under‑baked and can make you think your LoRA “isn’t working”.
- Base settings (30–50 steps + normal CFG) are the correct way to judge checkpoints.
Rule of thumb:
If you trained a Base LoRA, evaluate it on Base with Base‑style sampling.
2. Environment options: local AI Toolkit vs cloud AI Toolkit on RunComfy
You can run AI Toolkit in two ways for this tutorial:
- Local AI Toolkit (your own GPU)
Install AI Toolkit from the GitHub repo, then run the Web UI. Local training is ideal if you have an NVIDIA GPU, you’re comfortable managing CUDA/drivers, and you want a persistent setup for repeated LoRA iteration.
https://github.com/ostris/ai-toolkit
- Cloud AI Toolkit on RunComfy (H100 / H200)
AI Toolkit runs in the browser on large GPUs:
- No installs (just open the UI)
- Big VRAM for higher resolution buckets (1280 / 1536)
- Persistent workspace for datasets, configs, and past runs
The workflow is the same in both environments; only the GPU location changes.
3. Hardware & VRAM requirements for Z‑Image Base LoRA
Z‑Image can run on relatively modest GPUs for inference, but LoRA training still scales strongly with:
- Resolution bucket (768 vs 1024 vs 1536)
- Quantization (float8)
- LoRA rank
- Sampling settings during training (preview resolution + preview steps)
A practical way to think about it:
- 12–16GB VRAM: doable at 512/768 with careful settings
- 24GB VRAM: comfortable for 1024 LoRA training
- 48GB+ VRAM: easiest path for 1280/1536 buckets and faster iteration
If you’re targeting typography-heavy concepts or product fidelity, plan for higher resolution and expect VRAM needs to rise quickly.
4. Building a Z‑Image Base LoRA training dataset
Z‑Image Base isn’t “special” about dataset formats — but it is sensitive to how you evaluate quality. So your dataset should be designed to match the behavior you want at inference time (CFG + higher steps).
4.1 Choose your goal (and dataset shape)
- Character / likeness: 15–50 images
Mix close‑ups + mid shots + lighting variety.
- Style: 30–200 images
Maximize subject variety so the model learns “style cues”, not one scene.
- Product / concept: 20–80 images
Consistent framing and clear captions for defining features (materials, label text, shape).
4.2 Captions + trigger (keep it simple)
- Use a trigger if you want a clean “on/off” switch (recommended for character/product).
- Keep captions short and consistent. Long captions increase accidental binding (haircut/background becomes “part of the trigger”).
Quick templates
- Character: [trigger] or photo of [trigger], portrait, natural lighting
- Style: in a [style] illustration style, soft shading, muted palette
- Product: product photo of [trigger], studio lighting, clean background
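
If you caption with per-image .txt files (the convention referenced in the DATASETS panel later), a small script can stamp out a consistent template. This is a minimal sketch; the trigger word and dataset path are placeholders, and hand-written captions are left untouched:

```python
from pathlib import Path

TRIGGER = "zimgAlice"  # placeholder trigger from the example in 5.1
DATASET_DIR = Path("datasets/zimage_base_character_v1")  # placeholder dataset folder
TEMPLATE = f"photo of {TRIGGER}, portrait, natural lighting"

# Write one short, consistent caption per image as a sibling .txt file.
for img in sorted(DATASET_DIR.glob("*")):
    if img.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    caption = img.with_suffix(".txt")
    if not caption.exists():  # never clobber hand-written captions
        caption.write_text(TEMPLATE + "\n", encoding="utf-8")
```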
5. Step‑by‑step: train a Z‑Image Base LoRA in AI Toolkit
This section is written to match the AI Toolkit UI panels you see when creating a new job.
5.1 JOB panel (Training Name, GPU ID, Trigger Word)
- Training Name: a descriptive run name (e.g., zimage_base_character_v1)
- GPU ID: pick your GPU (local) or leave default (cloud)
- Trigger Word (optional but recommended for character/product): e.g., zimgAlice
5.2 MODEL panel (Model Architecture, Name or Path, Options)
- Model Architecture: choose Z‑Image
- Name or Path: set the base model repo, typically Tongyi-MAI/Z-Image
- Options
- Low VRAM: ON if you’re on ≤ 24GB
- Layer Offloading: OFF by default; turn ON only if you still OOM after lowering resolution/rank
5.3 QUANTIZATION panel (Transformer, Text Encoder)
- Transformer: float8 (the default) is a strong choice for fitting larger buckets
- Text Encoder: float8 (the default) if you need VRAM headroom
If you have lots of VRAM, you can run with less (or no) quantization for simplicity, but float8 is usually a safe baseline.
5.4 TARGET panel (Target Type, Linear Rank)
- Target Type: LoRA
- Linear Rank (practical defaults)
- 16: style LoRAs, low VRAM runs
- 32: character/product LoRAs, higher fidelity
- 48+: only if you have lots of VRAM and you know you’re underfitting
5.5 SAVE panel (Data Type, Save Every, Max Step Saves to Keep)
- Data Type: BF16
- Save Every: 250 (enough checkpoints to pick the best one)
- Max Step Saves to Keep: 4 (prevents disk bloat)
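
Max Step Saves to Keep does this pruning for you inside the UI; if you archive checkpoints outside AI Toolkit, the same keep-the-newest-N logic looks like this (the folder and filename pattern are hypothetical):

```python
from pathlib import Path

KEEP = 4  # mirrors "Max Step Saves to Keep"
ckpt_dir = Path("output/zimage_base_character_v1")  # placeholder output folder

# Assumes checkpoint files end in the step number, e.g. my_lora_000000250.safetensors;
# sort by step and drop everything except the newest KEEP files.
ckpts = sorted(
    ckpt_dir.glob("*_*.safetensors"),
    key=lambda p: int(p.stem.rsplit("_", 1)[-1]),
)
for old in ckpts[:-KEEP]:
    print("pruning", old.name)
    old.unlink()
```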
5.6 TRAINING panel (Batch Size, Steps, Optimizer, LR, Timesteps)
Stable baseline
- Batch Size: 1
- Gradient Accumulation: 1 (increase for a larger effective batch at no extra VRAM cost)
- Steps: see below (goal-based ranges)
- Optimizer: AdamW8Bit
- Learning Rate: 0.0001 (drop to 0.00005 if unstable)
- Weight Decay: 0.0001
- Timestep Type: Weighted
- Timestep Bias: Balanced
- Loss Type: Mean Squared Error
- EMA: OFF for most LoRA runs
Steps: a Z‑Image Base‑friendly guideline
Z‑Image Base often tolerates longer training than distilled Turbo-style models, but you still want to stop before prompt fidelity collapses.
- Character / likeness: 3000–7000 steps (common sweet spot depends on dataset size)
- Style: 2000–6000 steps
- Product / concept: 2500–6500 steps
If you want a quick “smoke test”, run 1000–1500 steps, check samples, then commit to a full run.
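
Step counts only mean something relative to dataset size, so a quick estimate of how many times training will see each image helps you place a run inside the ranges above. A minimal sketch:

```python
def passes_per_image(steps: int, batch_size: int, grad_accum: int, num_images: int) -> float:
    """Rough count of how many times each training image is seen."""
    return steps * batch_size * grad_accum / num_images

# A 30-image character set at the 5.6 baseline (batch 1, accumulation 1):
print(passes_per_image(5000, 1, 1, 30))  # ~166.7 passes per image
```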
5.7 Text Encoder Optimizations + Regularization (right side)
- Unload TE: keep OFF unless you know you want trigger-only behavior and no captions
- Cache Text Embeddings: only enable if you use static captions and no caption dropout
Differential Output Preservation (DOP)
If your UI build includes it:
- Enable Differential Output Preservation when you care about “LoRA only activates when prompted”
- If DOP is ON, do NOT cache text embeddings (they conflict conceptually)
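
These interactions are easy to get wrong, so here is a small pre-flight check encoding the rules above (the option names are illustrative, not AI Toolkit’s internal config keys):

```python
def preflight(cache_text_embeddings: bool, caption_dropout_rate: float, dop_enabled: bool) -> None:
    """Fail fast on the option combinations this guide warns against."""
    if dop_enabled and cache_text_embeddings:
        raise ValueError("DOP conflicts with cached text embeddings; disable one.")
    if cache_text_embeddings and caption_dropout_rate > 0:
        raise ValueError("Cached text embeddings require a caption dropout rate of 0.")

preflight(cache_text_embeddings=False, caption_dropout_rate=0.05, dop_enabled=True)  # passes
```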
5.8 ADVANCED panel
- Do Differential Guidance: leave OFF unless you already use it in your normal workflow and know what you’re tuning.
5.9 DATASETS panel (Target Dataset, Caption Dropout, Cache Latents, Resolutions)
Set the dataset fields as follows:
- Target Dataset: select your dataset
- Default Caption: optional short template (or leave blank if you use per-image .txt files)
- Caption Dropout Rate: 0.05 (set to 0 if you cache text embeddings)
- Cache Latents: ON for speed
- Is Regularization: OFF for your main dataset
- Flip X / Flip Y: OFF by default (especially for logos/text)
- Resolutions (the most important lever)
- Low VRAM: enable 512 + 768
- 24GB: enable 768 + 1024 (or 1024 only if dataset is consistent)
- High VRAM: add 1280 / 1536 for best product/text fidelity
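
Before enabling a high bucket, check that your images can actually feed it; upscaled inputs soften exactly the fine detail (text, logos) you enabled the bucket for. A minimal audit sketch (assumes Pillow is installed; the path is a placeholder):

```python
from pathlib import Path
from PIL import Image  # pip install pillow

DATASET_DIR = Path("datasets/zimage_base_character_v1")  # placeholder path
MAX_BUCKET = 1024  # largest resolution you plan to enable

for img_path in sorted(DATASET_DIR.glob("*")):
    if img_path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    with Image.open(img_path) as im:
        short_side = min(im.size)
    if short_side < MAX_BUCKET:
        # Prefer dropping the bucket or replacing the image over upscaling.
        print(f"{img_path.name}: short side {short_side}px < {MAX_BUCKET}px bucket")
```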
5.10 SAMPLE panel (this is where Base vs Turbo matters most)
This is the #1 place people misconfigure Z‑Image Base.
Recommended Base sampling defaults
- Sample Every: 250
- Sampler: FlowMatch (match the training scheduler family)
- Guidance Scale: 4 (typical Base range is ~3–5; adjust to taste)
- Sample Steps: 30–50 (start at 30)
- Width / Height: match your main bucket (1024×1024 is a good baseline)
- Add a small set of prompts that cover:
- the trigger (if you use one)
- different compositions
- at least one “hard” prompt that stresses identity/style/product geometry
Optional negative prompt (Base supports it well)
Use a short negative prompt for previews to reduce artifacts, e.g.:
low quality, blurry, deformed, bad anatomy, watermark, text artifacts
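
One way to keep the preview prompt set consistent across runs is to generate it from the trigger. The prompts below are placeholders to adapt; they follow the coverage advice above (trigger alone, varied compositions, one deliberately hard prompt):

```python
TRIGGER = "zimgAlice"  # placeholder trigger
NEGATIVE = "low quality, blurry, deformed, bad anatomy, watermark, text artifacts"

sample_prompts = [
    f"photo of {TRIGGER}, portrait, natural lighting",
    f"photo of {TRIGGER}, full body, walking down a rainy street at night",
    f"photo of {TRIGGER} in profile, reading a newspaper, harsh side lighting",  # "hard" prompt
]
for prompt in sample_prompts:
    print(prompt, "| negative:", NEGATIVE)
```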
5.11 Launch training & monitor
Start the job and watch:
- Samples every checkpoint interval (250 steps)
- Prompt fidelity (are prompts still respected?)
- Overfit signals (same face/texture appears everywhere, backgrounds collapse)
Pick the checkpoint where the LoRA is strong without turning into an always-on filter.
6. Recommended Z‑Image Base LoRA configs by VRAM tier
Tier 1 — 12–16GB (tight VRAM)
- Low VRAM: ON
- Quantization: float8 for Transformer + Text Encoder
- Linear Rank: 16
- Resolutions: 512 + 768
- Sample Steps: 30 (keep preview size at 768 if needed)
- Steps: 2000–5000 depending on dataset size
Tier 2 — 24GB (most practical local tier)
- Low VRAM: ON (you can try OFF once stable)
- Quantization: float8
- Linear Rank: 32 (character/product), 16–32 (style)
- Resolutions: 768 + 1024 (or 1024 only if consistent)
- Sample Steps: 30–40
- Steps: 3000–7000 depending on goal
Tier 3 — 48GB+ (or cloud H100/H200)
- Low VRAM: OFF (optional)
- Quantization: optional (float8 still fine)
- Linear Rank: 32–48
- Resolutions: 1024 + 1280 + 1536 (if your dataset supports it)
- Sample Steps: 40–50 for best preview quality
- Steps: same goal-based ranges; you just iterate faster
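
If you keep run presets in scripts, the three tiers collapse into a small lookup. The values mirror the lists above and are starting points, not hard rules; the keys and structure are illustrative:

```python
# Starting points per VRAM tier; adjust per run.
ZIMAGE_BASE_TIERS = {
    "12-16GB": {"low_vram": True,  "rank": 16, "resolutions": [512, 768],
                "sample_steps": 30, "train_steps": (2000, 5000)},
    "24GB":    {"low_vram": True,  "rank": 32, "resolutions": [768, 1024],
                "sample_steps": 30, "train_steps": (3000, 7000)},
    "48GB+":   {"low_vram": False, "rank": 32, "resolutions": [1024, 1280, 1536],
                "sample_steps": 40, "train_steps": (3000, 7000)},
}
print(ZIMAGE_BASE_TIERS["24GB"])
```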
7. Common Z‑Image Base training issues and how to fix them
These are Z‑Image Base–specific problems (not generic AI Toolkit errors).
“Base looks undercooked / low detail”
Likely cause: too few steps and/or too low resolution.
Fix
- Increase sample steps to 40–50
- Try a higher bucket (1280/1536) if your VRAM allows
- If your inference workflow has a “shift” parameter, some users report improved coherence with shift in the mid range (e.g., ~4–6). Use this only as a fine-tuning knob after steps/CFG are correct.
“My Base LoRA works on Base but not on Turbo”
This is expected in many cases:
- Turbo is distilled and behaves differently (especially around CFG/negatives and “how strongly LoRAs bite”).
Fix
- If you need Turbo deployment, consider training in a Turbo-focused workflow instead of assuming Base↔Turbo transfer will be 1:1.
- For best results, train and deploy on the same family (Base→Base).
“Text/logos are inconsistent”
Z‑Image Base can do great typography, but it’s sensitive to resolution and sampling.
Fix
- Train at 1024+ (and consider 1280/1536 if possible)
- Use 40–50 sampling steps for evaluation
- Avoid Flip X if text matters
- Caption the key text feature consistently (don’t rely on the trigger to imply it)
8. Using your Z‑Image Base LoRA after training
Run LoRA: open the Z‑Image Run LoRA page. On this base‑model inference page you can select a LoRA you trained on RunComfy, or import a LoRA file you trained with AI Toolkit, then run inference via the playground or the API. RunComfy uses the same base model and the full AI Toolkit pipeline definition from your training config, so inference output stays consistent with the samples you saw during training. You can also deploy your LoRA as a dedicated endpoint from the Deployments page.
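
If you prefer to run the exported LoRA locally, here is a minimal sketch. It assumes your installed diffusers version supports Z‑Image through the generic DiffusionPipeline loader; check the Tongyi-MAI/Z-Image model card for the officially supported loading code. Paths and prompts are placeholders:

```python
import torch
from diffusers import DiffusionPipeline

# ASSUMPTION: diffusers can load Z-Image via the generic pipeline loader.
pipe = DiffusionPipeline.from_pretrained("Tongyi-MAI/Z-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.load_lora_weights("output/zimage_base_character_v1.safetensors")  # your exported LoRA

# Base-style sampling: 30-50 steps, CFG ~3-5, optional negative prompt.
image = pipe(
    prompt="photo of zimgAlice, portrait, natural lighting",
    negative_prompt="low quality, blurry, deformed, watermark",
    num_inference_steps=30,
    guidance_scale=4.0,
).images[0]
image.save("preview.png")
```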
More AI Toolkit LoRA training guides
- Z-Image-Turbo & De-Turbo LoRA training with AI Toolkit
- FLUX.2 Dev LoRA training with AI Toolkit
- Qwen-Image-Edit-2511 LoRA training with AI Toolkit
- Qwen-Image-Edit-2509 LoRA training with AI Toolkit
- Wan 2.2 I2V 14B image-to-video LoRA training
- Wan 2.2 T2V 14B text-to-video LoRA training
- Qwen Image 2512 LoRA training
- LTX-2 LoRA training with AI Toolkit
Ready to start training?