LTX 2.3 First Last Frame in ComfyUI | Keyframe to Smooth Video

ComfyUI LTX 2.3 First Last Frame Workflow

LTX 2.3 First Last Frame to Video

LTX 2.3 First Last Frame to Video is a ComfyUI workflow that turns two still images into a smooth, continuous video with synchronized audio. You provide a first frame, a last frame, and a natural-language prompt describing motion, scene details, and sound. Powered by the LTX-2.3 22B distilled FP8 checkpoint, the pipeline interpolates between the images while maintaining consistent appearance and timing. It is ideal for editors, motion designers, and storyboard artists who need a seamless transition or a short looping clip created directly inside ComfyUI.

This LTX 2.3 First Last Frame workflow emphasizes efficient inference and high prompt fidelity. FP8 weights keep VRAM use in check, while a Gemma 3 12B text encoder improves semantic understanding of both visual and audio instructions. The result is a coherent visual passage from first to last frame that honors your prompt and stays in sync with generated audio.

Key models in ComfyUI LTX 2.3 First Last Frame workflow

LTX-2.3 22B Distilled FP8 checkpoint by Lightricks. The core video generation model, distilled for efficient inference. Here it synthesizes temporally consistent frames conditioned on the two image guides and the text prompt.
Gemma 3 12B IT text encoder. Provides robust language understanding for both the visual and audio aspects of the prompt, enabling accurate motion, scene attributes, and soundtrack cues.
LTX-2.3 latent VAEs for video and audio. These components encode images and waveform audio into compact latents and decode them back, preserving quality while keeping sampling efficient. They ship with the LTX-2.3 FP8 release.

How to use ComfyUI LTX 2.3 First Last Frame workflow

This workflow takes two reference images and a prompt, builds conditioning with first and last frame guides, samples a video latent with synchronized audio, and decodes everything to a playable file.

Settings

Set your target resolution, frame count, and frame rate in the Settings group. Width and height define the working canvas; the input frames are resized to match so the model can interpolate cleanly. Frame count controls how long the transition lasts, and frame rate sets playback speed. Choose an aspect ratio that matches your sources to avoid unwanted cropping. The nodes WIDTH (#113), HEIGHT (#98), Length (#102), and Frame Rate(int) (#114) anchor these choices.
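As a rough illustration of how these settings interact, the Python sketch below computes clip duration from frame count and frame rate and snaps values to model-friendly sizes. The `8n + 1` frame-count rule and the multiple-of-32 dimension rule are assumptions carried over from earlier LTX-Video releases, not something this workflow states, and the function names are hypothetical.

```python
def snap_frame_count(frames: int, step: int = 8) -> int:
    """Round up to the nearest count of the form step * n + 1 (assumed constraint)."""
    if frames < 1:
        raise ValueError("frames must be at least 1")
    n = -(-(frames - 1) // step)  # ceiling division of (frames - 1) by step
    return step * n + 1


def snap_dimension(pixels: int, multiple: int = 32) -> int:
    """Round a width or height down to a multiple (assumed constraint)."""
    if pixels < multiple:
        raise ValueError("dimension is smaller than the required multiple")
    return (pixels // multiple) * multiple


def clip_duration_seconds(frames: int, fps: float) -> float:
    """Playback duration of a clip with the given frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


if __name__ == "__main__":
    length = snap_frame_count(120)           # 121 under the 8n + 1 assumption
    width, height = snap_dimension(1280), snap_dimension(720)
    print(length, width, height, clip_duration_seconds(length, 24))
```

Rounding dimensions down rather than to the nearest multiple avoids upscaling the source frames; rounding the frame count up keeps the clip at least as long as requested.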

First Frame

Load your starting image in Load First Frame (#31). It is resized by ResizeImageMaskNode (#124) to the target dimensions and normalized by LTXVPreprocess (#104). This prepares the first frame to act as a strong structural and color guide at the beginning of the clip. Use a sharp, well-lit image for best results.

Last Frame

Load your ending image in Load Last Frame (#39). The image is matched to the same size with ResizeImageMaskNode (#125) and normalized by LTXVPreprocess (#99). This ensures the final look and layout you want at the end of the transition. For loops, make the last frame visually compatible with the first.
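Because both frames are resized to the same target dimensions, it can help to pre-crop each source image to the target aspect ratio yourself so you control what gets cut. The sketch below computes a centered crop box in plain Python. It does not reproduce what ResizeImageMaskNode does internally, and the function name is hypothetical.

```python
def center_crop_box(src_w: int, src_h: int, target_w: int, target_h: int):
    """Return (left, top, right, bottom) of the largest centered region
    of the source that has the target aspect ratio."""
    if min(src_w, src_h, target_w, target_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # Compare aspect ratios by cross-multiplication to stay in integers.
    if src_w * target_h > src_h * target_w:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = src_h
        crop_w = src_h * target_w // target_h
    else:
        # Source is taller than (or equal to) the target: keep full width.
        crop_w = src_w
        crop_h = src_w * target_h // target_w
    left = (src_w - crop_w) // 2
    top = (src_h - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


if __name__ == "__main__":
    # A 1920x1080 source cropped for a square canvas keeps the middle 1080 columns.
    print(center_crop_box(1920, 1080, 1024, 1024))
```

Applying the same box logic to the first and last frames keeps their framing comparable, which matters when the model has to interpolate between them.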

Prompt

The LTXAVTextEncoderLoader (#103) provides the text encoder, and two CLIPTextEncode nodes capture your positive and negative prompts. In the positive prompt (CLIPTextEncode (#128)), describe camera motion, subjects, lighting, and also include audio cues such as “Music: ambient pads with soft percussion” or “Dialogue: brief whisper.” The negative prompt (CLIPTextEncode (#112)) can list artifacts or traits you want to suppress.
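One way to keep visual and audio cues organized is to assemble the positive prompt from separate pieces. The helper below is a hypothetical convenience, not part of the workflow; it only follows the "Music:" and "Dialogue:" labeling shown in the examples above.

```python
def build_prompt(visual, music="", dialogue=""):
    """Join visual descriptions and optional audio cues into one prompt string."""
    # Normalize each visual cue to end with exactly one period.
    parts = [cue.strip().rstrip(".") + "." for cue in visual if cue.strip()]
    if music.strip():
        parts.append("Music: " + music.strip().rstrip(".") + ".")
    if dialogue.strip():
        parts.append("Dialogue: " + dialogue.strip().rstrip(".") + ".")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        ["camera trucks forward", "lights dim as we approach"],
        music="sparse piano with soft reverb",
    ))
```

The resulting string can be pasted into the positive CLIPTextEncode node as-is.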

Conditioning

LTXVConditioning (#109) merges the text conditioning with timing information so motion and audio align with your chosen frame rate. EmptyLTXVLatentVideo (#108) creates a video latent at your resolution and length. Two passes of LTXVAddGuide first attach the first frame (LTXVAddGuide (#115)) and then the last frame (LTXVAddGuide (#111)) so the model knows where to start and where to end. LTXVEmptyLatentAudio (#101) initializes an audio latent of matching duration, and LTXVConcatAVLatent (#119) bundles audio and video latents for sampling.
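The conditioning plan described above can be summarized as a small data sketch: one guide pinned to the first frame, one pinned to the last, and an audio latent spanning the same duration. The `Guide` type, its field names, and the strength range are hypothetical stand-ins for illustration; this code does not call any ComfyUI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guide:
    name: str          # label for the guide image
    frame_index: int   # frame the guide is pinned to
    strength: float    # assumed 0..1 influence of the guide


def plan_guides(length: int, first_strength: float = 1.0, last_strength: float = 1.0):
    """Place one guide at the first frame and one at the last frame."""
    if length < 2:
        raise ValueError("need at least two frames to have distinct endpoints")
    for s in (first_strength, last_strength):
        if not 0.0 <= s <= 1.0:
            raise ValueError("strength is assumed to lie in [0, 1]")
    return [
        Guide("first", 0, first_strength),
        Guide("last", length - 1, last_strength),
    ]


def audio_duration_seconds(length: int, fps: float) -> float:
    """Audio must span the same duration as the video frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return length / fps


if __name__ == "__main__":
    for guide in plan_guides(121):
        print(guide)
    print(audio_duration_seconds(121, 24))
```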

Model

CheckpointLoaderSimple (#127) loads the LTX-2.3 22B distilled FP8 weights and the video VAE, while LTXVAudioVAELoader (#126) provides the audio VAE. These are preconfigured so you can focus on creative inputs rather than setup details.

Sampling

CFGGuider (#116) balances adherence to your text and guide frames against creative freedom. RandomNoise (#100) sets a seed for reproducibility. The sampler uses SamplerEulerAncestral (#117) with a custom schedule from ManualSigmas (#118), orchestrated by SamplerCustomAdvanced (#120), to progressively refine the latent into a coherent sequence that follows your motion and audio instructions.
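ManualSigmas supplies an explicit noise schedule. Assuming the schedule is entered as a comma-separated list of decreasing values (an assumption about the node's input format), a quick sanity check before sampling might look like the sketch below. The example schedule is illustrative only and is not the one shipped with the workflow.

```python
from __future__ import annotations


def parse_sigmas(text: str) -> list[float]:
    """Parse a comma- or space-separated schedule and check that it decreases."""
    values = [float(token) for token in text.replace(",", " ").split()]
    if len(values) < 2:
        raise ValueError("a schedule needs at least two sigma values")
    for current, following in zip(values, values[1:]):
        if following >= current:
            raise ValueError("sigmas must be strictly decreasing")
    return values


def step_count(sigmas: list[float]) -> int:
    """Each adjacent pair of sigmas corresponds to one denoising step."""
    return len(sigmas) - 1


if __name__ == "__main__":
    schedule = parse_sigmas("1.0, 0.75, 0.5, 0.25, 0.0")  # illustrative values
    print(schedule, step_count(schedule))
```

A schedule with N values yields N - 1 sampler steps, which is a convenient way to estimate relative render cost when comparing schedules.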

Decoding

After sampling, LTXVSeparateAVLatent (#121) splits the combined latent back into video and audio. LTXVCropGuides (#106) refines spatial guidance to reduce edge artifacts before image decoding. VAEDecodeTiled (#105) produces the frame sequence, and LTXVAudioVAEDecode (#107) generates the audio waveform. CreateVideo (#122) muxes frames and sound at your selected fps, and SaveVideo (#68) writes the final file to your ComfyUI output folder.

Key nodes in ComfyUI LTX 2.3 First Last Frame workflow

EmptyLTXVLatentVideo (#108)

Defines the working resolution and duration of your clip. Adjust width, height, and length here to set visual scale and transition time. Longer durations need stronger motion cues in the prompt to avoid stagnation.

LTXVAddGuide (#115)

Injects the first frame as a structural and color anchor at the start of the sequence. If the opening drifts from your source, increase this guide’s influence; if it feels overconstrained, reduce it slightly to allow more motion.

LTXVAddGuide (#111)

Anchors the target look at the end of the clip using the last frame. If the transition overshoots or never quite lands on your last frame, raise the guide influence; if it snaps too hard near the end, ease it down.

CFGGuider (#116)

Controls how strongly the model follows text and image conditioning. Higher guidance emphasizes your prompt and guides but can reduce smoothness; lower values feel freer but may deviate from the intended look. Tweak in small steps and re-use the same seed when comparing.
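For intuition, the standard classifier-free guidance formula combines a conditional and an unconditional prediction as `uncond + scale * (cond - uncond)`. The sketch below applies it to plain lists of floats. Whether CFGGuider applies exactly this form to the joint audio and video latent is an assumption, and the function name is hypothetical.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance on two equally sized predictions."""
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    # Push the result away from the unconditional prediction, toward the
    # conditional one, by a factor of `scale`.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond, uncond = [1.0, 2.0], [0.0, 1.0]
    print(cfg_combine(cond, uncond, 1.0))  # equals cond
    print(cfg_combine(cond, uncond, 3.0))  # extrapolates past cond
```

With this formula, a scale of 1 returns the conditional prediction unchanged, so the negative prompt has no effect at that setting; larger scales extrapolate further from the unconditional prediction.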

SamplerCustomAdvanced (#120) with SamplerEulerAncestral (#117) and ManualSigmas (#118)

Drives denoising with a consistent schedule for stable motion. Shorter schedules render faster but can be rough; longer or gentler schedules improve consistency at additional compute cost. Keep the schedule consistent when A/B testing other parameters.

CreateVideo (#122)

Muxes decoded frames and audio into a final clip at your chosen frame rate. Use the same fps you conditioned with so lip shapes, footsteps, or music pulses remain aligned.
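To see why matching frame rates matters, the sketch below estimates how far the video track ends up from the audio track when the mux frame rate differs from the one used for conditioning. It assumes the audio was generated for the conditioned duration; the function name is hypothetical.

```python
def sync_drift_seconds(frames: int, conditioned_fps: float, mux_fps: float) -> float:
    """Video duration at mux_fps minus the audio duration implied by conditioned_fps.

    A negative value means the video finishes before the audio does.
    """
    if conditioned_fps <= 0 or mux_fps <= 0:
        raise ValueError("frame rates must be positive")
    video_duration = frames / mux_fps
    audio_duration = frames / conditioned_fps
    return video_duration - audio_duration


if __name__ == "__main__":
    print(sync_drift_seconds(120, 24, 24))  # no drift when the rates match
    print(sync_drift_seconds(120, 24, 30))  # video ends a full second early
```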

Optional extras

Write prompts with verbs and timing: “camera trucks forward,” “lights dim as we approach,” “Music: sparse piano with soft reverb.” Clear verbs help the LTX 2.3 First Last Frame pipeline infer motion and rhythm.
Match the aspect ratio and orientation of your two images. Large mismatches can introduce unwanted cropping or stretching.
For seamless loops, make the last frame a near match to the first and keep camera motion cyclical.
Reuse a seed in RandomNoise to reproduce a look while iterating on prompts or guide strengths; change the seed to explore fresh variations.
If you need implementation details or custom node references, see ComfyUI’s LTX integrations and utilities such as ComfyUI-LTXTricks.

Acknowledgements

This workflow builds on the following works and resources. We gratefully acknowledge Lightricks for the LTX-2.3 22B Distilled FP8 checkpoint, Google for the Gemma 3 12B IT FP4 text encoder, logtd for the ComfyUI-LTXTricks custom nodes, and Comfy.org for the official workflow, along with their ongoing maintenance. For authoritative details, refer to the original documentation and repositories listed below.

Resources

Lightricks/LTX-2.3 22B Distilled FP8 Checkpoint
- Hugging Face: Lightricks/LTX-2.3-fp8
Google/Gemma 3 12B IT FP4 Text Encoder
- Hugging Face: google/gemma-3-12b-it
logtd/ComfyUI-LTXTricks Custom Nodes
- GitHub: logtd/ComfyUI-LTXTricks
Comfy.org/Official Workflow
- Docs / Release Notes: comfy.org/workflows/video_ltx2_3_flf2v

Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
