LTX-2 ComfyUI: real-time text, image, depth and pose to video with synchronized audio
This all-in-one LTX-2 ComfyUI workflow lets you generate and iterate on short videos with audio in seconds. It ships with routes for text to video (T2V), image to video (I2V), depth to video, pose to video, and canny to video, so you can start from a prompt, a still, or structured guidance and keep the same creative loop.
Built around LTX-2’s low-latency AV pipeline and multi-GPU sequence parallelism, the graph emphasizes fast feedback. Describe motion, camera, look, and sound once, then adjust width, height, frame count, or control LoRAs to refine the result without re-wiring anything.
Note on LTX-2 workflow compatibility: LTX-2 includes five workflows. Text to Video and Image to Video run on all machine types, while Depth to Video, Canny to Video, and Pose to Video require a 2X-Large machine or larger; running these ControlNet workflows on smaller machines may result in errors.
Key models in the LTX-2 ComfyUI workflow
- LTX-2 19B (dev FP8) checkpoint. Core audio-visual generative model that produces video frames and synchronized audio from multimodal conditioning. Lightricks/LTX-2
- LTX-2 19B Distilled checkpoint. Lighter, faster variant useful for quick drafts or canny-controlled runs. Lightricks/LTX-2
- Gemma 3 12B IT text encoder. Primary text understanding backbone used by the workflow’s prompt encoders. Comfy-Org/ltx-2 split files
- LTX-2 Spatial Upscaler x2. Latent upsampler that doubles spatial detail mid-graph for cleaner outputs. Lightricks/LTX-2
- LTX-2 Audio VAE. Encodes and decodes audio latents so sound can be generated and muxed alongside video. Included with the LTX-2 release above.
- Lotus Depth D v1‑1. Depth UNet used to derive robust depth maps from images before depth-guided video generation. Comfy‑Org/lotus
- SD VAE (MSE, EMA pruned). VAE used in the depth preprocessor branch. stabilityai/sd-vae-ft-mse-original
- Control LoRAs for LTX‑2. Optional, plug‑and‑play LoRAs (camera, depth, pose, and canny) that steer motion and structure; see the LoRA guidance under Optional extras. Lightricks/LTX-2
How to use the LTX-2 ComfyUI workflow
The graph contains five routes that you can run independently. All routes share the same export path and use the same prompt-to-conditioning logic, so once you learn one, the others feel familiar.
T2V: generate video and audio from a prompt
The T2V path starts with CLIP Text Encode (Prompt) (#3) and an optional negative in CLIP Text Encode (Prompt) (#4). LTXVConditioning (#22) binds your text and the chosen frame rate to the model. EmptyLTXVLatentVideo (#43) and LTX LTXV Empty Latent Audio (#26) create video and audio latents that are fused by LTX LTXV Concat AV Latent (#28). The denoising loop runs through LTXVScheduler (#9) and SamplerCustomAdvanced (#41), after which VAE Decode (#12) and LTX LTXV Audio VAE Decode (#14) produce frames and audio. Video Combine 🎥🅥🅗🅢 (#15) saves an H.264 MP4 with synchronized sound.
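If you prefer to drive this route headlessly, the whole graph can also be queued through ComfyUI's standard HTTP API. A minimal sketch, assuming the graph was exported with "Save (API Format)" and that node 3 is the positive CLIP Text Encode as listed above; the filename is hypothetical:

```python
import json
import urllib.request

# Load an API-format export of the T2V graph (hypothetical filename).
with open("ltx2_t2v_api.json") as f:
    workflow = json.load(f)

# Node "3" is the positive CLIP Text Encode (Prompt) per the IDs in this article.
workflow["3"]["inputs"]["text"] = (
    "A slow dolly-in on a rain-soaked street at night; "
    "neon reflections, distant thunder, soft synth pads."
)

# POST to ComfyUI's /prompt queue endpoint (default local server).
payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # returns a prompt_id on success
```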
I2V: animate a still
Load a still image with LoadImage (#98) and resize with ResizeImageMaskNode (#99). Inside the T2V subgraph, LTX LTXV Img To Video Inplace injects the first frame into the latent sequence so motion builds from your still rather than pure noise. Keep your textual prompt focused on motion, camera, and ambience; the content comes from the image.
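Because LTX-2 expects dimensions that are multiples of 32 (see the sizing rules below), it can help to pre-size stills before LoadImage rather than leaving all the work to ResizeImageMaskNode. A minimal Pillow sketch, with hypothetical filenames:

```python
from PIL import Image

def snap_to_multiple(value: int, multiple: int = 32) -> int:
    # Round down to the nearest multiple so the result stays within the source.
    return max(multiple, (value // multiple) * multiple)

# Pre-size a still so its dimensions already satisfy the multiple-of-32 rule.
img = Image.open("still.png")  # hypothetical input
w, h = snap_to_multiple(img.width), snap_to_multiple(img.height)
img.resize((w, h), Image.LANCZOS).save("still_32.png")
```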
Depth to video: structure‑aware motion from depth maps
Use the “Image to Depth Map (Lotus)” preprocessor to transform an input into a depth image, decoded by VAEDecode and optionally inverted for correct polarity. The “Depth to Video (LTX 2.0)” route then feeds depth guidance through LTX LTXV Add Guide so the model respects global scene structure as it animates. The path reuses the same scheduler, sampler, and upscaler stages, and ends with tiled decode to images and muxed audio for export.
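The polarity inversion is a one-liner if you ever need to do it outside the graph. A sketch using NumPy and Pillow, assuming an 8-bit grayscale depth map and hypothetical filenames; which polarity the guide expects is something to verify on your own renders:

```python
import numpy as np
from PIL import Image

# Flip near/far polarity of a decoded depth map before it feeds
# LTX LTXV Add Guide, mirroring the optional inversion step above.
depth = np.asarray(Image.open("depth.png").convert("L"), dtype=np.float32) / 255.0
inverted = 1.0 - depth
Image.fromarray((inverted * 255).astype(np.uint8)).save("depth_inverted.png")
```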
Pose to video: drive motion from human pose
Import a clip with VHS_LoadVideo (#198); DWPreprocessor (#158) estimates human pose reliably across frames. The “Pose to Video (LTX 2.0)” subgraph combines your prompt, the pose conditioning, and an optional Pose Control LoRA to keep limbs, orientation, and beats consistent while allowing style and background to flow from the text. Use this for dance, simple stunts, or talk‑to‑camera shots where body timing matters.
Canny to video: edge‑faithful animation and distilled speed mode
Feed frames to Canny (#169) to get a stable edge map. The “Canny to Video (LTX 2.0)” branch accepts the edges plus an optional Canny Control LoRA for high fidelity to silhouettes, while “Canny to Video (LTX 2.0 Distilled)” offers a faster distilled checkpoint for quick iterations. Both variants let you optionally inject the first frame and choose image strength, then export either via CreateVideo or VHS_VideoCombine.
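For intuition, this is roughly what the Canny node computes per frame, shown here with OpenCV; the thresholds are hypothetical starting points, and keeping them fixed across frames matters more than any particular pair:

```python
import cv2

# Compute an edge map for one frame; stable thresholds across the whole
# clip keep the edge guidance temporally consistent.
frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(frame, 100, 200)  # low/high thresholds: hypothetical defaults
cv2.imwrite("edges_0001.png", edges)
```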
Video settings and export
Set width and height via Width (#175) and Height (#173), the total frames with Frame Count (#176), and toggle Enable First Frame (#177) if you want to lock in an initial reference. Use VHS_VideoCombine nodes at the end of each route to control crf, frame_rate, pix_fmt, and metadata saving. A dedicated SaveVideo (#180) is provided for the distilled canny route when you prefer direct VIDEO output.
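For reference, here is a roughly equivalent offline mux with ffmpeg: a sketch of the kind of output VHS_VideoCombine produces, not its actual internals. Paths and frame rate are hypothetical; crf around 18 suits finals, while higher values (e.g., 28) speed up previews:

```python
import subprocess

# Mux a PNG sequence and a WAV track into H.264 MP4, matching the
# crf / frame_rate / pix_fmt knobs exposed by VHS_VideoCombine.
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "24",          # keep in sync with the fps set in conditioning
    "-i", "frames/%05d.png",
    "-i", "audio.wav",
    "-c:v", "libx264", "-crf", "18",
    "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-shortest",
    "out.mp4",
], check=True)
```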
Performance and multi‑GPU
The graph applies LTXVSequenceParallelMultiGPUPatcher (#44) with torch_compile enabled to split sequences across GPUs for lower latency. KSamplerSelect (#8) lets you pick between samplers including Euler and gradient‑estimation styles; smaller frame counts and lower steps reduce turnaround so you can iterate quickly and scale up when satisfied.
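As a rough mental model, per-run cost scales with steps × frames × pixels, so draft settings pay off multiplicatively. A back-of-the-envelope sketch; the proportionality is approximate, and real latency also depends on GPU count and sequence-parallel overhead:

```python
# Relative-cost estimate: halving steps and frames during drafting cuts
# turnaround to roughly a quarter before any resolution savings.
def relative_cost(steps: int, frames: int, width: int, height: int) -> float:
    return steps * frames * width * height

draft = relative_cost(steps=8,  frames=57,  width=768,  height=512)
final = relative_cost(steps=30, frames=121, width=1280, height=704)
print(f"draft is ~{final / draft:.0f}x cheaper than a final render")
```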
Key nodes in the LTX-2 ComfyUI workflow
- LTX Multimodal Guider (#17). Coordinates how text conditioning steers both the video and audio branches. Adjust cfg and modality in the linked LTX Guider Parameters nodes (#18 for VIDEO, #19 for AUDIO) to balance faithfulness against creativity; raise cfg for tighter prompt adherence and increase modality_scale to emphasize a specific branch.
- LTXVScheduler (#9). Builds a sigma schedule tailored to LTX‑2's latent space. Use steps to trade speed for quality; when prototyping, fewer steps cut latency, then raise steps for final renders.
- SamplerCustomAdvanced (#41). The denoiser that ties together RandomNoise, the chosen sampler from KSamplerSelect (#8), the scheduler's sigmas, and the AV latent. Switch samplers for different motion textures and convergence behavior.
- LTX LTXV Img To Video Inplace (see the I2V branches, e.g., #107). Injects an image into a video latent so the first frame anchors content while the model synthesizes motion. Tune strength for how strictly the first frame is preserved.
- LTX LTXV Add Guide (in the guided routes: depth, pose, canny). Adds a structural guide (image, pose, or edges) directly in latent space. Use strength to balance guide fidelity with generative freedom, and enable the first frame only when you want temporal anchoring.
- Video Combine 🎥🅥🅗🅢 (#15 and siblings). Packages decoded frames and the generated audio into MP4. For previews, raise crf (more compression); for finals, lower crf and confirm frame_rate matches what you set in conditioning.
- LTXVSequenceParallelMultiGPUPatcher (#44). Enables sequence‑parallel inference with compile optimizations. Leave it on for best throughput; disable it only when debugging device placement.
Optional extras
- Prompting tips for LTX-2 ComfyUI
- Describe core actions over time, not just static appearance.
- Specify important visual details you must see in the video.
- Write the soundtrack: ambience, foley, music, and any dialog.
- Sizing rules and frame rate
- Use width and height that are multiples of 32 (for example 1280×704).
- Use frame counts of the form 8n + 1 (for example 121, the template default); see the validation sketch below.
- Keep the frame rate consistent everywhere it appears; the graph includes both float and int frame-rate boxes, and they must match.
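A minimal validation sketch for the rules above (pure Python, no ComfyUI dependencies):

```python
# Check LTX-2 sizing rules: width/height divisible by 32,
# frame count of the form 8n + 1.
def check_settings(width: int, height: int, frame_count: int) -> None:
    assert width % 32 == 0 and height % 32 == 0, "width/height must be multiples of 32"
    assert frame_count % 8 == 1, "frame count must be 8n + 1 (e.g., 121)"

check_settings(1280, 704, 121)  # passes: both dimensions /32, 121 = 8*15 + 1
```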
- LoRA guidance
- Camera, depth, pose, and canny LoRAs are integrated; start with strength 1 for camera moves, then add a second LoRA only when needed. Browse the official collection at Lightricks/LTX‑2.
- Faster iterations
- Lower the frame count, reduce steps in LTXVScheduler, and try the distilled checkpoint for the canny route. When the motion works, scale resolution and steps back up for finals.
- Reproducibility
- Lock noise_seed in the RandomNoise nodes to get repeatable results while you tune prompts, sizes, and LoRAs; see the seed-sweep sketch after this list.
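Building on the API sketch from the T2V section, a seed sweep keeps everything else in the graph fixed while varying only noise_seed. The RandomNoise node id "40" is hypothetical; check your own API-format export for the real id:

```python
import json
import urllib.request

# Queue the same graph several times with fixed seeds for repeatable A/B runs.
with open("ltx2_t2v_api.json") as f:  # hypothetical API-format export
    workflow = json.load(f)

for seed in (7, 42, 1234):
    workflow["40"]["inputs"]["noise_seed"] = seed  # "40": hypothetical node id
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt", data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```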
Acknowledgements
This workflow implements and builds upon the following works and resources. We gratefully acknowledge Lightricks for the LTX-2 multimodal video generation model and the LTX-Video research codebase, and Comfy Org for the ComfyUI LTX-2 partner nodes and integration. For authoritative details, please refer to the original documentation and repositories linked below.
Resources
- Comfy Org/LTX-2 Now Available in ComfyUI!
- GitHub: Lightricks/LTX-Video
- Hugging Face: Lightricks/LTX-Video-ICLoRA-detailer-13b-0.9.8
- arXiv: 2501.00103
- Docs / Release Notes: LTX-2 Now Available in ComfyUI!
Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.

