Create Coherent Scenes (Qwen Image Edit & Wan 2.2)
Create Coherent Scenes (Qwen Image Edit & Wan 2.2) is a production‑ready ComfyUI workflow for building story‑driven, multi‑shot videos where characters, lighting, and composition remain consistent from shot to shot. It pairs Qwen Image Edit for precise, reference‑guided stills with Wan 2.2 image‑to‑video for cinematic motion, then lets you stitch scenes, smooth motion with frame interpolation, and add generated foley audio to finish. Ideal for narrative art, animation, previz, and concept reels, the workflow helps you move from a single establishing keyframe to a cohesive sequence with minimal hand‑retouching.
The pipeline is organized into three parts: Part 1 creates and edits coherent keyframes, Part 2 animates each shot with Wan 2.2 and joins them into one cut, and Part 3 generates scene‑aware foley audio. Everywhere you see Create Coherent Scenes (Qwen Image Edit & Wan 2.2) in this README, it refers to the complete, end‑to‑end process.
Key models in the ComfyUI Create Coherent Scenes (Qwen Image Edit & Wan 2.2) workflow
- Wan 2.2 Image‑to‑Video 14B (high‑noise and low‑noise variants). Core video generator used to animate your scene images while preserving spatial layout and style. Packaged for ComfyUI with text encoder and VAE assets. Reference: Comfy‑Org/Wan_2.2_ComfyUI_Repackaged.
- Qwen‑Image‑Edit 2509 + Qwen 2.5 VL text encoder + Qwen Image VAE. Semantic, reference‑aware image editing used to create next‑scene keyframes that match your narrative while keeping character and scene continuity. References: Comfy‑Org/Qwen‑Image‑Edit_ComfyUI and Comfy‑Org/Qwen‑Image_ComfyUI.
- FLUX.1 dev (text‑to‑image). Optional foundation model for the very first establishing keyframe before editing. Reference: Comfy‑Org/FLUX.1‑Krea‑dev_ComfyUI.
- RIFE Video Frame Interpolation. Used to boost frame rate and smooth motion on the combined cut. Reference: hzwer/Practical‑RIFE.
- HunyuanVideo‑Foley. A generative audio model that creates synchronized foley from images or video plus a short text cue; used to add diegetic sound per scene or for the final cut. Reference: phazei/HunyuanVideo‑Foley.
- Optional helpers. MiniCPM‑V 4.5 can auto‑draft audio prompts from your cut to speed up foley ideation: OpenBMB/MiniCPM‑V.
How to use the ComfyUI Create Coherent Scenes (Qwen Image Edit & Wan 2.2) workflow
Overall logic
- Part 1 creates an establishing keyframe and then uses Qwen Image Edit to generate “next scene” stills that remain stylistically aligned.
- Part 2 animates each scene image into a short clip with Wan 2.2, then concatenates all clips into a single cut and optionally interpolates frames for smoother motion.
- Part 3 optionally generates foley audio per scene or for the combined cut and muxes it into the final video.
Model loader
- The model area loads Wan 2.2 high‑ and low‑noise variants and their VAE/CLIP once, with an option to accelerate via torch compile. You will also see a low‑VRAM route using quantized GGUF UNETs and block‑swap so you can run the same Create Coherent Scenes (Qwen Image Edit & Wan 2.2) process on smaller GPUs.
- LoRAs for Wan 2.2 and the Qwen Image Edit Lightning LoRA are prewired to influence motion style and editing speed without complicating the graph.
- If you change models, keep the text encoder/UNET/VAE families consistent to avoid latent‑space mismatches; the sketch below shows matched pairings.
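For instance, a minimal sketch of consistent pairings might look like the following. The filenames mirror the Comfy‑Org repackaged releases but are assumptions here; verify them against your local models folders.

```python
# Hypothetical pairing table: each diffusion model family with a matching
# text encoder and VAE. Filenames follow the Comfy-Org repackaged releases
# but may differ in your install; treat them as placeholders.
MODEL_FAMILIES = {
    "wan2.2_i2v_14B": {
        "unet": ["wan2.2_i2v_high_noise_14B_fp8_scaled.safetensors",
                 "wan2.2_i2v_low_noise_14B_fp8_scaled.safetensors"],
        "text_encoder": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
        "vae": "wan_2.1_vae.safetensors",
    },
    "qwen_image_edit_2509": {
        "unet": ["qwen_image_edit_2509_fp8_e4m3fn.safetensors"],
        "text_encoder": "qwen_2.5_vl_7b_fp8_scaled.safetensors",
        "vae": "qwen_image_vae.safetensors",
    },
}

def same_family(unet: str, text_encoder: str, vae: str) -> bool:
    """Return True only if all three assets come from one family."""
    for assets in MODEL_FAMILIES.values():
        if unet in assets["unet"]:
            return (text_encoder == assets["text_encoder"]
                    and vae == assets["vae"])
    return False
```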
Settings
- Global controls set the working width, height, seed, and scene length so every scene inherits identical canvas geometry and temporal cadence (a minimal sketch follows this list). This shared geometry is one key to Create Coherent Scenes (Qwen Image Edit & Wan 2.2) consistency.
- A comprehensive negative prompt is provided and routed globally; you can override it any time to fit your art direction.
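As a concrete illustration, the sketch below centralizes those globals in one place so every scene reads identical values. The names and defaults are illustrative assumptions, not the workflow's actual node parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlobalSettings:
    """One source of truth every scene inherits, so geometry and cadence never drift.
    All defaults below are illustrative, not the shipped graph's values."""
    width: int = 1280
    height: int = 720
    seed: int = 123456789        # one shared seed locks the overall motion "feel"
    frames: int = 81             # identical clip length keeps the cutting rhythm even
    fps: int = 16
    negative_prompt: str = "blurry, deformed hands, watermark, text artifacts"

SETTINGS = GlobalSettings()

# Every scene sampler reads the same values instead of hard-coding its own:
for scene_id in range(1, 7):
    print(f"Scene {scene_id}: {SETTINGS.width}x{SETTINGS.height}, "
          f"{SETTINGS.frames} frames @ {SETTINGS.fps} fps, seed {SETTINGS.seed}")
```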
Part 1 — Text‑to‑Image establishing keyframe
- Start by describing your opening shot. The prompt feeds a base text‑to‑image sampler that outputs a “Start_” frame for the project.
- That image is cached and becomes the reference for the next scene in the Qwen track. The workflow scales the image to an editing‑friendly resolution and encodes it to latents; a sketch of that rescaling logic follows.
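The rescale step can be thought of as snapping to a fixed pixel budget while preserving aspect ratio. A minimal sketch, assuming a roughly 1‑megapixel target and multiple‑of‑8 dimensions (both assumptions, not the graph's exact values):

```python
import math

def edit_friendly_size(w: int, h: int, target_mp: float = 1.0, multiple: int = 8):
    """Scale (w, h) to roughly target_mp megapixels, rounding each side to a
    safe multiple. The 1 MP target and multiple-of-8 rounding are illustrative
    assumptions, not the workflow's exact parameters."""
    scale = math.sqrt(target_mp * 1_000_000 / (w * h))
    def snap(v: int) -> int:
        return max(multiple, round(v * scale / multiple) * multiple)
    return snap(w), snap(h)

print(edit_friendly_size(1920, 1080))  # (1336, 752): aspect kept, ~1 MP total
```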
Part 1 — Qwen Image Edit next‑scene keyframes
- For each subsequent shot, write a short “Next Scene” instruction; the editor conditions on the prior scene image so character identity, wardrobe, lighting, and palette stay aligned. A prompt‑template sketch follows this list.
- The edited result is decoded, previewed, and saved as “Scene_1_…”, “Scene_2_…”, etc. These are your coherent stills. They also get stored into shared image slots so later prompts can reference them.
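A simple way to keep identity terms stable is to template the instruction, varying only the action and camera language per scene. The template below is illustrative, not part of the shipped graph.

```python
# Hypothetical prompt template: the fixed block carries identity and lighting
# terms, while each scene supplies only new action and camera direction.
IDENTITY = ("the same red-cloaked traveler with a brass lantern, "
            "soft dusk lighting, muted teal-and-amber palette")

def next_scene_prompt(action: str, camera: str) -> str:
    return f"Next scene: {action}, {camera}, keeping {IDENTITY}."

print(next_scene_prompt("she crosses a rope bridge over the gorge",
                        "wide shot, slow push-in"))
print(next_scene_prompt("she shelters beneath a pine as rain begins",
                        "medium close-up, handheld"))
```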
Part 2 — Scene inputs (1–6)
- If you already have concept frames, drop them into the six “LoadImage” nodes. Otherwise, use the Qwen‑generated stills from Part 1 as your start images.
- For each scene, add a short text prompt via the labeled prompt node. Think of these as cinematography notes that guide motion style rather than re‑describing the entire environment.
Part 2 — Scene sampling (1–6)
- Each scene runs a Wan 2.2 image‑to‑video pass to turn the start image into a latent clip. A three‑stage sampler chain then refines the latent sequence: a high‑noise pass, a low‑noise pass, and a no‑LoRA pass, arranged for stability (a step‑range sketch follows this list).
- The decoded frames feed a per‑scene video writer that saves an MP4 for quick review. Memory purge nodes after each render free VRAM before the next scene begins.
- Because all scenes share the same seed, size, and length, motion cadence and composition remain aligned, helping Create Coherent Scenes (Qwen Image Edit & Wan 2.2) feel like one continuous piece.
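Conceptually, the three stages hand off a partially denoised latent via KSamplerAdvanced's start_at_step/end_at_step and return_with_leftover_noise controls. The sketch below shows only that step bookkeeping; the boundary values are illustrative, not the workflow's exact numbers.

```python
# Sketch of the chained step ranges (boundary values are illustrative).
TOTAL_STEPS = 20
STAGES = [
    # (label,               start_at_step, end_at_step, return_with_leftover_noise)
    ("high-noise + LoRA",   0,  8,           "enable"),
    ("low-noise + LoRA",    8,  14,          "enable"),
    ("low-noise, no LoRA",  14, TOTAL_STEPS, "disable"),
]

for label, start, end, leftover in STAGES:
    print(f"{label}: start_at_step={start}, end_at_step={end}, "
          f"return_with_leftover_noise={leftover}")
# Only the first stage adds fresh noise; each later stage resumes from the
# previous stage's leftover-noise latent. If you change TOTAL_STEPS or the
# seed, change them uniformly so the handoff points stay aligned.
```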
Part 2 — Combine scenes
- The six rendered image sequences are concatenated in order, producing a “Combined” cut. You can reorder or omit scenes by rewiring the batch node that collects them; an external ffmpeg alternative is sketched below.
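Concatenation happens inside the graph via the batch node, but since each scene also exports its own MP4, you can join those files externally as a sanity check. The sketch below uses ffmpeg's concat demuxer; the filenames are placeholders.

```python
import subprocess
from pathlib import Path

# Placeholder filenames; match them to your actual per-scene exports.
scenes = [f"Scene_{i}.mp4" for i in range(1, 7)]

# The concat demuxer reads a list file with one entry per clip, in cut order.
list_file = Path("scenes.txt")
list_file.write_text("".join(f"file '{s}'\n" for s in scenes))

# Stream-copy avoids re-encoding; this works because all scenes share the
# same resolution, frame rate, and codec settings.
subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
     "-i", str(list_file), "-c", "copy", "Combined.mp4"],
    check=True,
)
```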
Part 2 — Optional frame interpolation
- An interpolation pass uses RIFE to increase the apparent frame rate, creating an “Interpolated” export with smoother camera and subject motion while retaining the same look; a command‑line sketch follows.
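If you want to run the same interpolation outside ComfyUI, a hedged sketch of invoking Practical‑RIFE from its repository checkout is below. The script name and --multi flag follow the hzwer/Practical‑RIFE README, but flags have changed between releases, so verify against your checkout's --help output.

```python
import subprocess

# Hedged sketch: double the frame rate of the combined cut with Practical-RIFE.
# Verify the script name and flags against your clone; they are taken from the
# repository README and may differ in your version.
subprocess.run(
    ["python", "inference_video.py",
     "--video", "Combined.mp4",
     "--multi", "2"],          # 2x interpolation; use 4 for 4x
    cwd="Practical-RIFE",      # path to your cloned repository
    check=True,
)
```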
Part 3 — Video‑to‑Audio foley
- Load the combined cut or any individual scene into the audio section. A built‑in vision‑language helper can auto‑draft a textual scene description; edit it to taste to reflect rhythm, mood, and key actions.
- The foley model synthesizes synchronized audio, and a mux node combines it with your frames into an audio‑enabled MP4. For best results, generate audio per scene and then stitch; a muxing sketch follows.
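Muxing happens in the graph, but if you render audio separately (for example, per scene) you can marry it to the video externally. A minimal ffmpeg sketch with placeholder filenames:

```python
import subprocess

# Placeholder filenames; swap in your real scene video and generated foley track.
video, audio, out = "Scene_1.mp4", "Scene_1_foley.wav", "Scene_1_with_audio.mp4"

# Copy the video stream untouched, encode the foley to AAC, and trim to the
# shorter of the two so a slightly long audio take never pads the clip.
subprocess.run(
    ["ffmpeg", "-y", "-i", video, "-i", audio,
     "-c:v", "copy", "-c:a", "aac", "-shortest", out],
    check=True,
)
```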
Key nodes in the ComfyUI Create Coherent Scenes (Qwen Image Edit & Wan 2.2) workflow
- WanImageToVideo (#111). Converts a single reference frame into a coherent latent video while honoring positive and negative text. Use it to set each shot’s duration and canvas size and to supply the start image you want animated. Backed by the Wan 2.2 I2V 14B models packaged here: Comfy‑Org/Wan_2.2_ComfyUI_Repackaged.
- TextEncodeQwenImageEditPlus (#360). Encodes “Next Scene” instructions together with a reference image so edits follow the story yet match identity and lighting. Keep nouns and stylistic tags consistent across scenes to reinforce continuity. Model references: Comfy‑Org/Qwen‑Image‑Edit_ComfyUI and Comfy‑Org/Qwen‑Image_ComfyUI.
- KSamplerAdvanced (#159). The core denoiser for each animated scene. This workflow chains three samplers that target different noise regimes and LoRA mixes to improve temporal stability. If you change steps or seeds, do so uniformly across the chained samplers to keep motion behavior predictable.
- ImageBatchMulti (#308). Gathers scene frame batches into one long timeline. Use it to reorder, drop, or swap scenes before export without touching the sampling paths.
- RIFE VFI (#94). Performs frame interpolation to increase perceived frame rate. It is especially effective for slow camera moves and fluid subject motion. Reference: hzwer/Practical‑RIFE.
- HunyuanFoleySampler (#331). Generates synchronized foley from frames plus a short text prompt, then passes audio to the video muxer. For model details and files, see phazei/HunyuanVideo‑Foley.
Tips
- For fastest iteration, use the quantized GGUF Wan 2.2 route with block‑swap when VRAM is tight; switch back to full precision for final renders.
- Keep width, height, and scene length identical across the whole project to reinforce rhythm and framing continuity.
- In Qwen prompts, preserve core identifiers (names, outfit, props) and lighting terms; vary only the action and camera language between scenes.
- Use the global seed to lock the project’s overall “feel”. Change it only when you want a different motion character across all scenes.
- Interpolate only after you are happy with timing, then render the audio version per scene and combine; per‑scene foley tends to sound more natural.
- FLUX.1 dev is a great base for the very first keyframe; once established, rely on Qwen edits to progress the story while keeping the look: Comfy‑Org/FLUX.1‑Krea‑dev_ComfyUI.
Acknowledgements
This workflow implements and builds upon the following works and resources. We gratefully acknowledge the creators of the Qwen Image Edit model, the developers of the Wan 2.2 model, and the author (@Benji’s AI Playground) of the “Create Coherent Scenes (Qwen Image Edit & Wan 2.2)” YouTube tutorial for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories linked below.
Resources
- YouTube/Create Coherent Scenes (Qwen Image Edit & Wan 2.2)
Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.