Qwen‑Image‑Edit‑2509 is a 20B multi‑image edit model that can take up to three input images at once (for example a person, a piece of clothing, and a design) to perform precise, geometry‑aware edits. This guide shows you how to train practical LoRAs on top of it. By the end, you’ll be able to:
- Train a Qwen-Image-Edit-2509 LoRA for reliable targeted edit tasks (e.g., putting any design onto a shirt) using AI Toolkit by Ostris.
- Run the whole workflow either locally (even on <10GB VRAM using layer offloading), or in the browser with the cloud AI Toolkit on RunComfy on H100 / H200 (80GB / 141GB VRAM).
- Understand why key parameters matter for this model: the Match Target Res and Low VRAM options, Transformer/Text Encoder quantization, Layer Offloading, Cache Text Embeddings, Differential Output Preservation, Differential Guidance, plus core hyper‑parameters like Batch Size, Steps, and LoRA Rank.
- Confidently tune configs for your own edit LoRAs (relighting, clothing try‑on, skin, object replacements…).
This article is part of the AI Toolkit LoRA training series. If you’re new to Ostris AI Toolkit, start with the AI Toolkit LoRA training overview before diving into this Qwen‑Image‑Edit‑2509 guide.
Table of contents
- 1. Qwen‑Image‑Edit‑2509 overview: what this edit model can do
- 2. Environment options: local AI Toolkit vs cloud AI Toolkit on RunComfy
- 3. Hardware & VRAM requirements for Qwen‑Image‑Edit‑2509 LoRA
- 4. Building a Qwen‑Image‑Edit‑2509 LoRA training dataset
- 5. Step‑by‑step: train a Qwen‑Image‑Edit‑2509 LoRA in AI Toolkit
- 6. Recommended Qwen‑Image‑Edit‑2509 LoRA configs by VRAM tier
- 7. Common Qwen‑Image‑Edit‑2509 training issues and how to fix them
- 8. Using your Qwen‑Image‑Edit‑2509 LoRA after training
1. Qwen‑Image‑Edit‑2509 overview: what this edit model can do
Qwen‑Image‑Edit‑2509 (often shortened to Qwen Edit 2509 or Qwen Image Edit Plus) is the September 2025 iteration of the Qwen‑Image‑Edit model. It is built on top of the 20B Qwen‑Image base, with official weights on the Qwen‑Image‑Edit‑2509 model page on Hugging Face.
Compared to the first Qwen‑Image‑Edit release, 2509 adds:
- Multi‑image editing – the model can take 1–3 input images at once (e.g., person + clothing + pose, or source photo + lighting reference).
- Image concatenation behaviour – in the official pipelines each input image is resized to about 1 megapixel and then processed together. The model effectively sees a fixed pixel budget even when you supply multiple controls.
- Better text and detail editing – powered by Qwen2.5-VL and a dedicated VAE, it handles small text, logos, and fine details much better.
Typical LoRA use‑cases where people already use Qwen‑Image‑Edit‑2509 include:
- Clothing try‑on / outfit swap – Qwen‑Image‑Edit‑2509‑Clothing‑Tryon‑LoRA.
- Relighting / lighting refinement – Qwen‑Image‑Edit‑2509‑Relight‑LoRA.
- Multi‑effect style & detail fusion – Qwen‑Image‑Edit‑2509‑Multi‑Effect‑Fusion‑LoRA.
- Light restoration, plus turning white-background shots into full scenes – Qwen‑Image‑Edit‑2509‑White‑Film‑To‑Rendering‑LoRA.
- Photo to anime stylization – Qwen‑Image‑Edit‑2509‑Anime‑Stylization‑LoRA.
- Romantic / kissing pose editing – Qwen‑Image‑Edit‑2509‑Passionate‑Kiss‑LoRA.
- Caricature / exaggerated portrait style – Qwen‑Image‑Edit‑2509‑Caricature‑LoRA.
Qwen‑Image‑Edit and Qwen‑Image share essentially the same base. Community tests show LoRAs trained on Qwen‑Image are compatible with Qwen‑Image‑Edit / 2509 and vice versa, because the adapters attach to the same backbone.
2. Environment options: local AI Toolkit vs cloud AI Toolkit on RunComfy
2.1 Local AI Toolkit (your own GPU)
Install AI Toolkit from the AI Toolkit GitHub repository, then run the Web UI. Local training is a good fit if you already have a 24GB+ NVIDIA card, you’re comfortable managing CUDA / drivers / disk space, and you don’t mind letting training run overnight.
2.2 Cloud AI Toolkit on RunComfy (H100 / H200)
With the cloud AI Toolkit on RunComfy, AI Toolkit runs entirely in the cloud:
- You do not install anything – you just open a browser, log in, and you are in the AI Toolkit UI.
- You have access to big GPUs like H100 (80GB) and H200 (141GB) for heavy Qwen‑Image‑Edit‑2509 LoRA runs.
- You get a persistent workspace – datasets, configs, and past jobs stay attached to your account so you can come back and iterate.
👉 Open it here: Cloud AI Toolkit on RunComfy
The rest of this tutorial works identically in both environments; only the place where the GPU lives is different.
3. Hardware & VRAM requirements for Qwen‑Image‑Edit‑2509 LoRA
Qwen‑Image‑Edit‑2509 is a heavy model:
- The base model is around 20B parameters.
- The edit pipeline can feed up to 3 × ~1MP images through the transformer at once.
In the stock 32GB example config for 2509 (train_lora_qwen_image_edit_2509_32gb.yaml), users report roughly:
- 27–28.5GB VRAM for 1024×1024 training.
- 25–26GB VRAM for 768×768 training, which still does not fit on 24GB cards.
That’s why the official example is explicitly a 32GB config. But with 3‑bit ARA quantization + Low VRAM mode + Layer Offloading (RAMTorch), Ostris shows you can push Qwen‑Image‑Edit‑2509 down to ~8–9GB GPU VRAM, at the cost of high CPU RAM (60GB+) and slower training.
| Tier | Where | Example hardware | What it looks like |
|---|---|---|---|
| Low VRAM (~10–12GB) | Local | RTX 3060 12GB, 4070, etc. | You must enable quantization in the QUANTIZATION panel (3‑bit ARA for the base model) and use aggressive Layer Offloading. Expect ~8–9GB GPU VRAM and 60GB+ CPU RAM, with ~10–12s/step on a mid‑range CPU. This comfortably trains this guide’s config (2 control streams) up to 1024×1024; treat 1024² as your practical max resolution on this tier. |
| Tight 24GB | Local | RTX 3090 / 4090 / 5090 | 24GB cannot fit the stock 32GB Qwen‑Edit LoRA config at 1024² with 2 controls and no offloading (it peaks around ~24.7GB VRAM), so you still need Low VRAM tricks such as 3‑bit ARA, gradient checkpointing, and/or partial offload. Treat 768×768 as the practical max target resolution with 2 controls unless you add some offloading. |
| Comfortable 32GB | Local | RTX 4090 32GB, newer cards | This is the tier the official train_lora_qwen_image_edit_32gb.yaml is tuned for: 3‑bit ARA quantization, 1024² resolution buckets, mid‑range LoRA rank, and no offloading. With 32GB you can treat 1024×1024 (with 2–3 control streams) as a normal working resolution. |
| High VRAM (80–141GB) | Cloud AI Toolkit on RunComfy | H100 80GB / H200 141GB | You can keep configs simple (quantization on, offloading off), use larger batches (4–8), and train at 1024×1024 by default without worrying about OOM. On this tier you can also experiment with slightly higher resolutions (e.g. 1280–1536px) if you accept higher VRAM use, but 1024² remains the safest, best‑tested target size. |
On a 4090 with full offloading, Ostris’ example hits ~9GB VRAM and ~64GB CPU RAM, running ~5k steps in about a day. On a 5090 without offload, iterations are roughly 2–3× faster.
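Before committing to a run, it helps to sanity‑check how long it will take. Here is a tiny sketch that turns step count and step time into wall‑clock hours; the step times are the illustrative numbers quoted above, so measure your own before trusting the estimate:

```python
# Rough training-time estimate based on the numbers quoted in this section.
# It ignores sampling, checkpoint saving, and caching overhead.
def estimate_hours(total_steps: int, seconds_per_step: float) -> float:
    """Approximate wall-clock duration in hours for pure training steps."""
    return total_steps * seconds_per_step / 3600

# ~5k steps at ~11 s/step (4090 with full offloading) is roughly a day:
print(f"{estimate_hours(5000, 11):.1f} h")   # ~15.3 h
# The same run 2-3x faster without offloading (roughly 4 s/step):
print(f"{estimate_hours(5000, 4):.1f} h")    # ~5.6 h
```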
4. Building a Qwen‑Image‑Edit‑2509 LoRA training dataset
We’ll mirror the "shirt design" example from Ostris’ walkthrough and generalise it so you can adapt it to other tasks.
4.1 Three logical streams of images
For a clothing design LoRA, the model should learn: Given a person wearing a blank shirt and a design image, put this design on their shirt while preserving pose, lighting, and wrinkles.
- Target images (what you want as result) – a person wearing a shirt with the design already on it. These are the outputs you want the model to reproduce.
- Control images (blank shirts, same people) – the same subjects and poses as the targets, but without the design (or with a plain shirt). These control geometry, wrinkles, lighting, and occlusions (arms, hair, necklaces, etc.).
- Design images – the design itself on a neutral background (gray, black, or white). You can include a few variants (different background colours) to increase robustness.
In Ostris’ example, around 26 triplets (person + blank shirt + design) were enough to get very strong performance, including QR codes and complex logos mapping correctly onto fabric. For production LoRAs, starting with 20–60 well‑curated triplets (target + control + design) is a good baseline.
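If your design images are not already square on a neutral background, a few lines of Pillow are enough to prepare them. This is a minimal sketch; the folder names and the gray background value are just examples, not anything AI Toolkit requires:

```python
# Minimal sketch: pad each design onto a square, neutral background so it can
# join the design dataset. Input/output folder names are hypothetical examples.
from pathlib import Path
from PIL import Image

def pad_to_square(src: Path, dst: Path, size: int = 1024, bg=(128, 128, 128)):
    img = Image.open(src).convert("RGB")
    img.thumbnail((size, size), Image.LANCZOS)            # fit inside size x size
    canvas = Image.new("RGB", (size, size), bg)           # neutral gray background
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    canvas.save(dst)

out_dir = Path("shirt_design")
out_dir.mkdir(exist_ok=True)
for f in sorted(Path("raw_designs").glob("*.png")):       # hypothetical input folder
    pad_to_square(f, out_dir / f.name)
```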
4.2 Resolution & aspect ratio
Qwen‑Image‑Edit‑2509:
- Resizes each input to about 1MP internally (e.g., 1024×1024 or equivalent).
- Works best when your training images are either square or near‑square (we’ll use 1024×1024 here), or a consistent aspect ratio (e.g., all 3:4).
In this tutorial we assume square images so bucketing is simple:
- Targets, controls, and designs all around 1024×1024. AI Toolkit will bucket into 512 / 768 / 1024 buckets depending on what you enable in the DATASETS panel.
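To get targets and controls into that square 1024×1024 shape, a small preprocessing pass is usually enough. This is a minimal sketch assuming the example dataset folder names from this guide and raw photos stored under a hypothetical raw/ directory:

```python
# Minimal sketch: centre-crop targets and controls to square and resize to
# 1024x1024 so they all land in the same bucket. Adjust extensions as needed.
from pathlib import Path
from PIL import Image, ImageOps

def to_square_1024(src: Path, dst: Path, size: int = 1024):
    img = Image.open(src).convert("RGB")
    img = ImageOps.fit(img, (size, size), Image.LANCZOS)  # centre-crop + resize
    dst.parent.mkdir(parents=True, exist_ok=True)
    img.save(dst)

for folder in ("shirt_target", "shirt_control"):
    for f in sorted(Path("raw", folder).glob("*.jpg")):    # hypothetical raw folders
        to_square_1024(f, Path(folder) / f.name)
```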
4.3 Captions
For this clothing‑design LoRA, Ostris used no per‑image captions, only a single default caption at dataset level: "put this design on their shirt".
This works because:
- The semantics are simple and identical across all samples.
- The control and design images carry most of the interesting information.
For more complex edit LoRAs (like "relight like studio rim light" vs "golden hour"), you should use per‑image captions describing the desired edit.
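If you do go the per‑image caption route, AI Toolkit reads a .txt file with the same name as each image. A minimal sketch for writing those files (the folder name and caption texts are purely illustrative):

```python
# Minimal sketch: write one .txt caption per image for a more complex edit LoRA.
# The captions dict is purely illustrative; AI Toolkit expands [trigger] to your
# Trigger Word at training time.
from pathlib import Path

captions = {
    "img_0001.jpg": "[trigger] studio rim light, dark background",
    "img_0002.jpg": "[trigger] golden hour, warm backlight",
}

dataset = Path("relight_target")   # hypothetical dataset folder
for image_name, caption in captions.items():
    (dataset / image_name).with_suffix(".txt").write_text(caption, encoding="utf-8")
```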
5. Step‑by‑step: train a Qwen‑Image‑Edit‑2509 LoRA in AI Toolkit
5.1 Step 0 – Choose where you’ll run AI Toolkit
You can run AI Toolkit in two ways for this tutorial:
- Local AI Toolkit (your own GPU) – install AI Toolkit, run the Web UI, and open it locally. Make sure you have an NVIDIA GPU with at least 10–12GB VRAM (24GB+ preferred) and enough CPU RAM (ideally 64GB+ if you plan to use Layer Offloading).
- Cloud AI Toolkit on RunComfy – log into the cloud AI Toolkit on RunComfy. You land directly in the AI Toolkit UI running in the cloud. When you start a job from the Training Queue you pick an H100 (80GB) or H200 (141GB) machine.
5.2 Step 1 – Create datasets in AI Toolkit
In the AI Toolkit UI, open the Datasets tab.
Create three datasets (names are just examples):
- shirt_target
- shirt_control
- shirt_design
Upload your images so each dataset has a clear role:
- shirt_target – 20–60 photos of people wearing shirts with designs.
- shirt_control – the same people and poses without designs (or with a blank shirt).
- shirt_design – square design images on simple backgrounds (gray, black, or white).
If you don’t have captions prepared as .txt files, leave per‑image captions empty for now. We’ll add a single Default Caption at job level later.
Important pairing note
Target and control images should be paired in order (same person, same pose) as much as possible. To keep the pairing stable, use matching filenames across folders so alphabetical order lines up, for example: shirt_target/img_0001.jpg, shirt_control/img_0001.jpg, shirt_design/img_0001.png. Each target image should have a corresponding control and design image with the same index.
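A quick script can confirm the pairing before you start a multi‑hour run. This sketch assumes the example folder names above and simply reports files that lack a full triplet:

```python
# Minimal sketch: verify that target, control, and design datasets pair up by
# filename stem before training. Folder names follow the example datasets above.
from pathlib import Path

folders = ["shirt_target", "shirt_control", "shirt_design"]
extensions = {".jpg", ".jpeg", ".png", ".webp"}
stems = [{p.stem for p in Path(f).iterdir() if p.suffix.lower() in extensions}
         for f in folders]

common = set.intersection(*stems)
for folder, s in zip(folders, stems):
    unmatched = sorted(s - common)
    if unmatched:
        print(f"{folder}: {len(unmatched)} file(s) without a full triplet, e.g. {unmatched[:3]}")
print(f"{len(common)} complete target/control/design triplets found.")
```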
5.3 Step 2 – Create a new Job
Open the New Job tab. Let's configure each panel in the order they appear.
5.3.1 JOB panel – job name, GPU, trigger word
- Training Name – set any descriptive name, for example qwen_edit2509_shirt_lora_v1. This becomes the job name and the folder name where checkpoints are saved.
- GPU ID – on a local install, choose the GPU on your machine. In the cloud AI Toolkit on RunComfy, leave GPU ID at the default. The actual machine type (H100 / H200) is chosen later when you start the job from the Training Queue.
- Trigger Word – enter the phrase you want to type at inference time, for example: put this design on their shirt. In your dataset captions you can use [trigger] as a placeholder; AI Toolkit replaces [trigger] with the Trigger Word during training. A clear trigger phrase gives you a clean on/off switch for the LoRA: prompts that do not contain it should stay close to the base Qwen‑Image‑Edit‑2509 behaviour, especially if you also enable Differential Output Preservation (DOP) as recommended later.
5.3.2 MODEL panel – base model & VRAM options
- Model Architecture – select
Qwen‑Image‑Edit‑2509. - Name or Path – lets you override the default Hugging Face / model hub path for
Qwen/Qwen-Image-Edit-2509. Leave it blank or at the default value and AI Toolkit will download the recommended base model from Hugging Face. Or point it to a local path if you want to use a custom Qwen-Image-Edit checkpoint.
In Options:
- Low VRAM – turn ON for GPUs with ≤ 24GB VRAM. This enables extra checkpointing and memory‑saving tricks inside the backbone so the large Qwen model fits more easily.
- Match Target Res – turn ON for Qwen‑Image‑Edit jobs. This resizes control images to match the same resolution bucket as the target image (e.g., 768×768 or 1024×1024). It keeps edit geometry aligned and avoids wasting VRAM on oversized controls.
- Layer Offloading – treat this as a safety valve. Turn it ON on very small GPUs if you still hit CUDA OOM after enabling Low VRAM and quantization; this will offload some layers to CPU RAM at the cost of slower steps. Leave it OFF on 24GB+ or cloud GPUs on RunComfy for best speed.
5.3.3 QUANTIZATION panel – fitting the big transformer
Qwen‑Image‑Edit‑2509 is large enough that quantization is almost always a good idea.
- Transformer – set to
float8 (default). In AI Toolkit this typically corresponds to a 3‑bit ARA base with an 8‑bit "recovery" adapter, so you get VRAM usage close to a 3‑bit model with quality close to full precision. - Text Encoder – set to
float8 (default)as well. The text encoder is big, and running it in fp8 saves a lot of VRAM with minimal quality loss.
You do not need to manually configure ARA files in the UI; selecting the float8 options is enough.
5.3.4 TARGET panel – LoRA type and rank
This panel tells AI Toolkit that you’re training a LoRA and how much capacity it should have.
- Target Type – choose
LoRA. - Linear Rank – for Qwen‑Image‑Edit‑2509,
32is a strong default. It is expressive enough for behaviours like "put this design on their shirt" but still light to train and load. On very small GPUs you can drop to 16; for more complex behaviours you can experiment with 48–64 (watch closely for overfitting at higher ranks).
5.3.5 SAVE panel – checkpoint type & frequency
- Data Type – choose
BF16. Qwen‑Image‑Edit‑2509 is typically run in bfloat16, and saving LoRA weights in BF16 keeps them compatible and reasonably small. - Save Every –
250steps is a practical default; you’ll get a checkpoint every 250 training steps. - Max Step Saves to Keep –
4keeps the last four checkpoints and automatically deletes older ones so your disk does not fill up.
5.3.6 TRAINING panel – core hyper‑parameters
The TRAINING panel controls how aggressively we fine‑tune Qwen‑Image‑Edit‑2509.
Recommended starting values for a single‑dataset LoRA (10–40 images at 768–1024px):
- Batch Size – set this to 1 by default. Use 2 only on very large GPUs (A100 / H100 / H200 tier).
- Gradient Accumulation – start at 1. If you want a larger effective batch size without more VRAM, increase this to 2–4. Effective batch size is Batch Size × Gradient Accumulation (see the quick calculation after this list).
- Steps – use 2500–3000. For the shirt‑design example with ~20–30 triplets, 3000 works well. If your dataset is tiny (<15 images), consider 1500–2200 to avoid overfitting.
- Optimizer – choose AdamW8Bit. 8‑bit Adam dramatically reduces memory while behaving like standard AdamW.
- Learning Rate – set 0.0001. If training looks noisy or unstable, reduce this to 0.00005.
- Weight Decay – set 0.0001 as a mild regulariser so the LoRA does not drift too far on small datasets.
- Timestep Type – set to Weighted. This biases training toward the noise levels that matter most for Qwen‑Image‑Edit.
- Timestep Bias – set to Balanced, which is a safe default that doesn’t over‑emphasise very early or very late timesteps.
- Loss Type – leave this at Mean Squared Error, the standard choice for diffusion / rectified‑flow‑style training.
- EMA (Exponential Moving Average → Use EMA) – leave OFF for LoRAs. EMA is more useful when training full models.
5.3.7 Regularization & text‑encoder section (right side of TRAINING panel)
On the right side of the TRAINING panel you’ll see two important areas: Text Encoder Optimizations and Regularization.
Text Encoder Optimizations
- Cache Text Embeddings – for Qwen‑Image‑Edit + Differential Output Preservation (DOP), this must stay OFF. DOP rewrites the prompt text internally every batch, so cached embeddings would no longer match the real prompts. When DOP is OFF and your captions are static, you can turn Cache Text Embeddings ON to encode all captions once, store the embeddings on disk, and then free the text encoder from VRAM.
- Unload Text Encoder (Unload TE) – this is a special trigger‑only mode. When you turn it ON, AI Toolkit caches the embeddings for your Trigger Word and Sample prompts once, unloads the text encoder from VRAM, and ignores all dataset captions. For Qwen‑Image‑Edit‑2509 LoRAs that rely on normal captions (and especially when Differential Output Preservation is ON), you should leave Unload TE OFF.
Because caption dropout is implemented by randomly dropping captions during training, it relies on fresh text encoding each step. If you enable Cache Text Embeddings, you should set Caption Dropout Rate = 0 in the DATASETS panel (see below) so there is no mismatch between cached embeddings and the intended dropout behaviour.
Regularization → Differential Output Preservation
- Differential Output Preservation – turn this ON for most real projects. It is crucial for Qwen‑Image‑Edit: it lets the base model behave normally when the trigger phrase is missing and only injects your behaviour when the trigger is present.
- DOP Loss Multiplier – leave this at 1 to start. You can increase it slightly if you see too much style leaking into non‑trigger prompts.
- DOP Preservation Class – use a neutral class word that describes what you edit most. For people‑focused edits, person is a good default; for product‑only edits, use something like product or object.
How DOP ties back to your captions and Trigger Word:
- Suppose a caption is "[trigger] a person walking down the street, wearing the design on their shirt"
- With Trigger Word = put this design on their shirt
- And DOP Preservation Class = person
AI Toolkit internally creates two prompts:
- put this design on their shirt a person walking down the street, wearing the design on their shirt – the LoRA path.
- person a person walking down the street, wearing the design on their shirt – the base‑model path.
The LoRA is trained only on the difference between these two. Generations without the trigger phrase stay much closer to vanilla Qwen‑Image‑Edit‑2509 because DOP explicitly preserves that behaviour.
- Blank Prompt Preservation – leave this OFF unless you have a very specific reason to preserve behaviour for empty prompts.
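To make the DOP prompt construction above concrete, here is a conceptual sketch of the substitution. It mirrors the example in this section only; it is not AI Toolkit’s actual implementation:

```python
# Conceptual sketch of how the two DOP prompts above are assembled from a caption.
# This only mirrors the worked example in this section, not AI Toolkit's code.
caption = "[trigger] a person walking down the street, wearing the design on their shirt"
trigger_word = "put this design on their shirt"
preservation_class = "person"

lora_prompt = caption.replace("[trigger]", trigger_word)        # the LoRA path
base_prompt = caption.replace("[trigger]", preservation_class)  # the preserved base path

print(lora_prompt)
print(base_prompt)
# The LoRA is optimised on the difference between the model's behaviour on these
# two prompts, so prompts without the trigger stay close to the base model.
```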
5.3.8 ADVANCED panel – Differential Guidance
- Do Differential Guidance – turn this ON.
- Differential Guidance Scale – start with 3.
Differential Guidance is an AI Toolkit‑specific trick that scales the error signal the LoRA sees. A larger scale makes the "you’re wrong here" signal louder so the LoRA usually learns the desired change faster without increasing the learning rate.
If samples look unstable or overly "sharp" early in training, lower this to 2. If learning feels very slow, you can experiment with 4 later.
5.3.9 DATASETS panel – wiring target, control and design images
For Qwen‑Image‑Edit‑2509 you must provide at least one target dataset and one control dataset.
Inside Dataset 1:
- Target Dataset – choose your output / edited dataset, i.e. images that represent "after applying the LoRA behaviour".
- Control Dataset 1 – choose the dataset containing your input images (the original photos you want to edit). Each file should match a target image by name (e.g., scene_001.png → scene_001.png).
- Control Dataset 2 / 3 – these are optional. For the shirt LoRA, set Control Dataset 2 to shirt_design so the model sees the logo or artwork as a second control stream. Leave control slots empty unless you have extra conditions like depth maps or keypoints.
- LoRA Weight – leave at 1 unless you add more datasets. When you do add more datasets you can rebalance their influence here.
- Default Caption – if your images already have .txt captions, you can leave this empty. Otherwise enter something like: "[trigger] put this design on their shirt, full‑body street photo". Remember: [trigger] will be replaced by the Trigger Word from the JOB panel.
- Caption Dropout Rate – 0.05 is a good starting value when you are not caching text embeddings; roughly one in twenty steps will ignore the caption so the model doesn’t overfit to exact wording. If you plan to turn Cache Text Embeddings ON in the TRAINING panel, set Caption Dropout Rate = 0, because dropout requires re‑encoding captions each step and does not work correctly with cached embeddings.
- Settings → Cache Latents – turn this ON. AI Toolkit encodes each target image to VAE latents once and reuses them, which removes the heavy VAE from the GPU after caching and speeds up training significantly.
- Settings → Is Regularization – leave this OFF for your main dataset. If you later add a second dataset purely for regularisation images (for example generic people photos), you would set that second dataset’s Is Regularization to ON.
- Flipping (Flip X / Flip Y) – for most people / product LoRAs leave both OFF, unless you are sure mirror flips are safe for your subject (Flip X will mirror any text on shirts).
- Resolutions – enable the buckets you want Qwen‑Image‑Edit to train at, for example 512, 768, and 1024. 768 is a sweet spot for many Qwen LoRAs; adding 512 and 1024 makes training robust to slight resolution changes (see the bucket preview sketch below).
You can add additional datasets with Add Dataset (e.g., a regularisation dataset with LoRA Weight < 1), but a single Dataset 1 with one target + one or two control sets is enough for most "put this design on their shirt" use cases.
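If you are unsure which bucket your images will end up in, a rough preview by pixel count can help. This is a simple approximation, not AI Toolkit’s exact bucketing algorithm:

```python
# Rough preview of which enabled resolution bucket each image would land in,
# approximated by total pixel count. Not AI Toolkit's exact bucketing logic.
from pathlib import Path
from PIL import Image

buckets = [512, 768, 1024]            # the Resolutions enabled above

def nearest_bucket(path: Path) -> int:
    w, h = Image.open(path).size
    # pick the bucket whose square pixel budget is closest to the image's area
    return min(buckets, key=lambda b: abs(b * b - w * h))

for f in sorted(Path("shirt_target").glob("*.jpg"))[:5]:
    print(f.name, "->", nearest_bucket(f))
```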
5.3.10 SAMPLE panel – training previews
The SAMPLE panel controls periodic previews while training. These samples do not affect the training loss; they are only for monitoring.
- Sample Every – set this to 250 so you generate previews every 250 steps, which lines up nicely with your checkpoint schedule.
- Width / Height – match your main training resolution, for example 1024 × 1024 or 768 × 1024 depending on your dataset.
- Seed – choose a stable seed such as 42. You can enable Walk Seed if you want each preview batch to use consecutive seeds and show more variety.
- Sampler – choose FlowMatch (or the default Qwen sampler in your build). This should match the FlowMatch scheduler used in TRAINING.
- Guidance Scale – set 4 for previews. When you do inference later in ComfyUI or other UIs, you’ll usually experiment between 3–6.
- Sample Steps – around 25 steps is a good quality‑vs‑speed compromise for previews.
- Advanced Sampling – you can leave Skip First Sample, Force First Sample, and Disable Sampling all OFF. Turn Disable Sampling ON only if you’re debugging or want maximum speed with no previews at all.
- Sample Prompts – add 4–8 prompts that represent realistic use cases for your LoRA, for example "put this design on their shirt, a man standing outdoors in daylight, front view of the shirt".
5.4 Step 3 – Launch training & monitor
After you configure the job, go to the Training Queue tab, select your job, and get it ready to run.
Click Start / Play and mainly watch two things:
- GPU VRAM / CPU RAM – especially on low‑VRAM cards using Layer Offloading, keep an eye on system RAM usage.
- Sample images – the design should stay on the shirt and follow wrinkles and pose. If it starts bleeding into the whole image or colours become extreme, consider stopping early or reducing total steps.
6. Recommended Qwen‑Image‑Edit‑2509 LoRA configs by VRAM tier
If you just want one safe default for 24GB local GPUs and all H100/H200 cloud runs, use the settings from the step‑by‑step walkthrough in section 5: Low VRAM = ON, Transformer/Text Encoder quantization = float8, Batch Size = 1, LoRA Rank = 32, Resolutions = 512 / 768 / 1024, Differential Output Preservation = ON, Cache Text Embeddings = OFF.
Below are only the settings that really change with hardware. Anything not mentioned here (Steps, Learning Rate, Optimizer, etc.) can stay at the earlier recommendations.
Tier 1 – Low VRAM (~10–12GB local)
- MODEL → Low VRAM: turn ON. This enables extra checkpointing and shuffling so Qwen‑Image‑Edit‑2509 fits on a 10–12GB card.
- MODEL → Layer Offloading: turn ON if you still hit CUDA OOM. Expect high CPU RAM usage (≈60GB+) and slower steps, but GPU VRAM can drop to around 8–9GB.
- QUANTIZATION → Transformer / Text Encoder: set both to float8. In this architecture, which uses the Qwen 3‑bit ARA adapters under the hood, float8 is the practical minimum for stable quality.
- TRAINING → Batch Size: lock this to 1. If you want a larger effective batch, increase Gradient Accumulation instead of Batch Size.
- DATASETS → Resolutions: enable 512 and 768 as your main buckets. You can add 1024 if you accept slower, more fragile runs; treat 1024×1024 with two control streams as the practical upper limit on this tier.
- TRAINING → Text Encoder Optimizations / Regularization: if you cannot fit Differential Output Preservation even with Low VRAM and Layer Offloading, turn DOP OFF and turn Cache Text Embeddings ON so captions are encoded once and the text encoder is freed from VRAM. You lose some base‑model preservation but gain several GB of headroom.
Tier 2 – Tight 24GB (3090 / 4090 / 5090‑class)
What you can relax compared to Tier 1:
- MODEL → Low VRAM: keep ON for safety on 24GB; you can experiment with turning it OFF once you know your resolution and control setup fits comfortably.
- MODEL → Layer Offloading: usually OFF. Only enable it if you still hit OOM at your chosen resolution and number of control streams.
- QUANTIZATION → Transformer / Text Encoder: keep both at float8. Disabling quantization on this tier rarely helps and just burns VRAM you could spend on resolution or batch size.
- TRAINING → Batch Size: 1 is still the default. Batch Size 2 is sometimes possible at 768×768 with two controls if Low VRAM is ON and quantization stays ON.
- DATASETS → Resolutions: enable 512, 768, and 1024. Consider 768 your “always safe” bucket and 1024 the high‑end bucket that may need Low VRAM and possibly partial offload.
- TRAINING → Text Encoder Optimizations / Regularization: you can usually keep Differential Output Preservation ON and Cache Text Embeddings OFF, especially if you train primarily at 768×768. If you absolutely need 1024×1024 on a 24GB card and still hit OOM after other tweaks, the next lever is to turn DOP OFF and turn Cache Text Embeddings ON.
Tier 3 – Comfortable 32GB+ local and cloud H100/H200
On 32GB local cards and 80–141GB cloud GPUs (H100 / H200), you stop fighting VRAM and can simplify the config:
- MODEL → Low VRAM: optional. You can turn it OFF on 32GB+ local GPUs and H100/H200 for slightly faster steps and simpler traces.
- MODEL → Layer Offloading: keep OFF. All Qwen‑Image‑Edit‑2509 components can stay resident on the GPU.
- QUANTIZATION → Transformer / Text Encoder: leave both at float8 by default. On H100/H200 you can experiment with disabling Text Encoder quantization if you want, but it is not required for good quality and offers little benefit compared to using that VRAM for batch size or resolution.
- TRAINING → Batch Size: use 1–2 on 32GB local GPUs, and 2–4 on H100/H200 at 1024×1024 with two control streams.
- TARGET → LoRA Rank: 32 is a comfortable default. You can try 48–64 on H100/H200 for very complex behaviours (e.g., multi‑effect edit LoRAs) if you watch for overfitting.
- DATASETS → Resolutions: train primarily at 768 and 1024. You can usually drop 512 unless you specifically care about low‑resolution behaviour.
- TRAINING → Text Encoder Optimizations / Regularization: run with Differential Output Preservation ON and Cache Text Embeddings OFF as the default. VRAM is sufficient to keep the text encoder resident, and you get the cleanest separation between “with trigger” and “without trigger” behaviour.
7. Common Qwen‑Image‑Edit‑2509 training issues and how to fix them
7.1 Mis‑paired datasets (wrong order / mismatched people)
Symptom: designs appear, but on the wrong spot, wrong person, or warped.
Check that target and control datasets are aligned: shirt_target/img_0001.jpg should pair with shirt_control/img_0001.jpg and shirt_design/img_0001.png, and so on. If you shuffle images manually, keep filenames paired so alphabetical order still lines up. The pairing‑check sketch in section 5.2 will catch these mismatches before you commit to a long run.
7.2 VRAM OOM even with quantization
If you train with a small target resolution (for example 512×512) but your control datasets still use 1024×1024 as their highest bucket and Match Target Res is turned OFF, each control stream will be encoded at 1024×1024 while the target is only 512×512. With two or three such control streams, the total latent size becomes much larger than expected and you can easily hit CUDA OOM even with quantization enabled.
To fix this:
- Either turn Match Target Res ON in the MODEL panel so all control images are automatically resized to the same resolution bucket as the target (e.g. they all become 512×512 when the target sample is 512×512), or
- Keep Match Target Res OFF but lower the highest resolution bucket for your control datasets to match the target (drop 1024 and stick to 512/768).
On H100/H200 in the cloud you can afford to keep 1024×1024 buckets for both target and controls and rely less on these tricks, but the safest rule is: avoid mixing tiny targets with very large controls when Match Target Res is disabled.
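A back‑of‑the‑envelope token count shows why this bites. The 8× VAE downscale and 2× patching used below are typical for this model family, but read them as assumptions rather than official numbers:

```python
# Rough token count for the OOM scenario above: a small 512x512 target combined
# with two full-size 1024x1024 controls when Match Target Res is OFF.
# The 8x VAE downscale and 2x patch size are assumptions for illustration.
def latent_tokens(width: int, height: int, vae_factor: int = 8, patch: int = 2) -> int:
    return (width // (vae_factor * patch)) * (height // (vae_factor * patch))

target = latent_tokens(512, 512)             # 1024 tokens
controls = 2 * latent_tokens(1024, 1024)     # 8192 tokens across two controls
print(target, controls, controls / target)   # the controls dominate the budget ~8:1
```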
7.3 Training never converges / looks random
Check the following:
- In the TRAINING panel the noise scheduler and timestep settings still correspond to FlowMatch. In the exported YAML you should see noise_scheduler: "flowmatch", and in the SAMPLE panel the sampler should also be set to FlowMatch; if the sampler uses a different scheduler, previews can look like pure noise even if the LoRA is training correctly.
- The Learning Rate is not too high. 0.0001 is a safe default for Qwen‑Image‑Edit‑2509 LoRAs; if previews keep oscillating or look very unstable after a few hundred steps, drop it to 0.00005 and resume from the last good checkpoint.
7.4 LoRA overfits (design bleeds everywhere)
Possible fixes:
- Reduce total Steps (e.g., from 5000 down to 3000).
- Consider a slightly lower LoRA Rank (16 instead of 32).
- Diversify the dataset with different people, poses, and lighting.
- Ensure Differential Output Preservation is enabled and, if needed, increase the DOP Loss Multiplier a bit so the base behaviour is preserved more strongly.
7.5 Environment hell
Typical local issues include CUDA version mismatch, wrong PyTorch build, or drivers not matching your GPU / OS. In the cloud AI Toolkit on RunComfy these issues disappear: AI Toolkit and dependencies are pre‑installed and you start directly from the UI with configs and datasets.
If you find yourself spending more time fixing CUDA than training, that is usually the point where it’s easier to move this specific job to the cloud.
8. Using your Qwen‑Image‑Edit‑2509 LoRA after training
Once training is complete, you can use your Qwen‑Image‑Edit‑2509 LoRA in two simple ways:
- Model playground – open the Qwen‑Image‑Edit‑2509 LoRA playground and paste the URL of your trained LoRA to quickly see how it behaves on top of the base model.
- ComfyUI workflows – start a ComfyUI instance and either build your own workflow or load one like Qwen Edit 2509 MultipleAngles, swap in your LoRA in the LoRA loader node, and fine‑tune the LoRA weight and other settings for more detailed control.
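If you prefer testing from Python instead of a UI, you can also load the LoRA with diffusers. Treat this as a hedged sketch: it assumes a recent diffusers build that ships a Qwen‑Image‑Edit‑2509 ("Plus") pipeline, and the class name, LoRA path, and call arguments should be checked against your installed version:

```python
# Hedged sketch: run the trained LoRA on top of Qwen-Image-Edit-2509 via diffusers.
# Class name and call arguments are assumptions; verify them for your diffusers build.
import torch
from diffusers import QwenImageEditPlusPipeline   # assumed pipeline class for 2509
from diffusers.utils import load_image

pipe = QwenImageEditPlusPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("output/qwen_edit2509_shirt_lora_v1.safetensors")  # your checkpoint

person = load_image("person_blank_shirt.jpg")   # control 1: person in a blank shirt
design = load_image("design.png")               # control 2: the design to apply

image = pipe(
    image=[person, design],                     # multi-image input, same order as training
    prompt="put this design on their shirt",    # the Trigger Word
    num_inference_steps=25,
).images[0]
image.save("shirt_with_design.png")
```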
More AI Toolkit LoRA training guides
Ready to start training?

