Transform raw footage into precisely segmented, stable video outputs using text or image prompts, with open-vocabulary tracking, real-time accuracy, and seamless frame-by-frame identity control.
SAM 3 is a high-fidelity video-to-video segmentation and tracking model built to preserve structure, identity, and temporal coherence. It converts raw footage into stable, frame-accurate masks driven by concise natural-language prompts, enabling targeted operations without regenerating full frames. With open-vocabulary understanding and frame-by-frame identity control, SAM 3 maintains consistent masks through motion, occlusion, and lighting changes. Its design emphasizes reliability and responsiveness for iterative workflows that demand precise boundaries, low drift on long shots, and results consistent enough for downstream editorial and VFX pipelines. Key capabilities of SAM 3:

- Open-vocabulary prompting: name the target in plain language rather than choosing from fixed categories
- Frame-by-frame identity control: masks stay attached to the same subject across a shot
- Robust tracking: consistent masks through motion, occlusion, and lighting changes
- Low drift on long shots: boundaries precise enough for downstream editorial and VFX pipelines
When using SAM 3, start by supplying video_url and a focused text_prompt that names the subject to segment and track. Describe the target with clear nouns and a few attributes, and state what to exclude or preserve so the model stays focused. Use consistent terminology across shots to maintain continuity. For crowded scenes, begin with the primary subject, review the mask, then refine with small prompt adjustments to keep results stable. SAM 3 supports open-vocabulary phrasing, but specificity improves stability, especially across fast motion or partial occlusion. A minimal example request is sketched below.
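As a rough illustration of that workflow, this Python snippet posts video_url and a focused text_prompt to a hypothetical HTTP endpoint; the endpoint path, authentication header, and response shape are assumptions for illustration, not the documented RunComfy API.

```python
import requests

# Minimal sketch of a SAM 3 video segmentation request. Only video_url and
# text_prompt come from the guidance above; the endpoint URL, auth header,
# and response fields are illustrative assumptions.
API_URL = "https://api.runcomfy.com/v1/sam3/video-segmentation"  # hypothetical
API_KEY = "YOUR_RUNCOMFY_API_KEY"  # placeholder credential

payload = {
    "video_url": "https://example.com/clips/street-scene.mp4",
    # A clear noun, a few attributes, and an explicit exclusion to keep
    # the model focused on one subject in a crowded scene.
    "text_prompt": "the cyclist in the red jacket, not the pedestrians on the sidewalk",
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=300,
)
response.raise_for_status()
result = response.json()
print(result)  # e.g. a URL to the masked or annotated output video
```

The prompt phrasing follows the guidance above: a clear noun ("cyclist"), a couple of attributes ("red jacket"), and an explicit exclusion ("not the pedestrians").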
SAM 3 is Meta’s Segment Anything Model 3, designed for advanced detection, tracking, and segmentation across images and videos. Its video-to-video functionality allows users to apply segmentations consistently across frames, maintaining object identities and visual coherence throughout moving scenes.
SAM 3 enhances video-to-video segmentation by introducing 'promptable concept segmentation,' which supports natural-language and exemplar-based inputs. This makes it significantly more flexible and accurate than SAM 1 and SAM 2, especially for tracking objects under motion, occlusion, or lighting changes.
Access to SAM 3’s video-to-video features requires Runcomfy credits. Users receive free trial credits upon account creation, after which they can purchase additional credits depending on their usage and generation needs.
SAM 3 delivers stable identity tracking, natural prompt support, and robust object segmentation under motion and lighting changes, making it ideal for professional video-to-video editing tasks such as visual effects, object removal, and selective modifications.
SAM 3’s video-to-video segmentation tools are widely used by content creators, filmmakers, video editors, and developers integrating AI vision into applications. It’s also valuable for e-commerce previews, AR experiences, and AI research in computer vision.
SAM 3 achieves approximately 75–80% of human cgF1 accuracy on complex benchmarks. Its DETR-based architecture and presence token system ensure precise mask selection and continuity in video-to-video tracking tasks.
SAM 3 accepts text, box, point, or exemplar-mask prompts as inputs, and produces detailed segmented frames or annotated video outputs. This versatility lets users craft refined video-to-video datasets or visual compositions; a hypothetical payload combining these prompt types is sketched below.
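To make the four input modalities concrete, here is a hypothetical prompt payload showing them side by side; every field name is an illustrative assumption, not SAM 3's documented schema.

```python
# Hypothetical payload combining the four prompt types mentioned above:
# text, boxes, points, and exemplar masks. Field names and coordinate
# conventions are assumptions for illustration only.
prompt = {
    "text": "the brown dog",                     # open-vocabulary phrase
    "boxes": [[120, 80, 360, 300]],              # [x_min, y_min, x_max, y_max] in pixels
    "points": [{"xy": [240, 190], "label": 1}],  # 1 = foreground, 0 = background
    "exemplar_mask_url": "https://example.com/masks/dog_frame0.png",
}
```

In practice one would typically start with the text prompt alone and add boxes, points, or an exemplar mask only when the initial mask needs correction.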
Users can access SAM 3 through the Runcomfy AI Playground platform at runcomfy.com/playground, which supports web and mobile browsers. After logging in, users can test the video-to-video segmentation capabilities directly through the interface.
While SAM 3 is powerful, performance may vary with very long videos, extreme occlusions, or high motion blur. The video-to-video results can also be influenced by input prompt quality and available compute resources on Runcomfy.