meta/sam-3/video-to-video

Transform raw footage into precisely segmented, stable video outputs using text or image prompts, with open-vocabulary tracking, real-time accuracy, and seamless frame-by-frame identity control.

Inputs:
  • video_url: The URL of the video to be segmented.
  • text_prompt: Text prompt for segmentation.
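
For example, the two inputs above could be packaged as a simple payload. The example URL and prompt text below are placeholders, and the payload shape is an assumption for sketch purposes rather than a documented Runcomfy API contract.

```python
import json

# Minimal sketch of the two inputs described above. Only the field names
# video_url and text_prompt come from this page; the example values and the
# idea of a JSON payload are assumptions.
payload = {
    "video_url": "https://example.com/clips/street-scene.mp4",  # video to be segmented
    "text_prompt": "red sedan car, foreground only",            # open-vocabulary target
}

print(json.dumps(payload, indent=2))  # inspect the request body before submitting
```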

Introduction to SAM 3 Video-to-Video Generator

Released by Meta Platforms on November 19, 2025, SAM 3 is the newest evolution of the Segment Anything series and marks a significant leap in intelligent visual understanding. Built as a unified image and video foundation model with roughly 848 million parameters, SAM 3 delivers open-vocabulary segmentation and tracking driven by natural language as well as visual cues such as boxes, points, or exemplar images. Its upgraded DETR-based architecture with a presence token maintains consistent object identity in challenging video scenes while handling motion, occlusion, and nuanced concepts with near human-level accuracy. For creators, editors, and developers, SAM 3's video-to-video capabilities set a new benchmark in precision and automation for real-time visual tasks.

The SAM 3 video-to-video generation tool converts raw footage into dynamic, intelligently segmented outputs. Designed for video editors, content creators, and developers building AI-powered applications, it lets you track, isolate, and manipulate objects with text or exemplar prompts. The tool delivers visual editing freedom, accurate segmentation, and stable identities frame by frame, helping you complete video-to-video projects faster and with greater clarity.

Examples of SAM 3 in Action

[Example video gallery: six sample clips.]

What makes SAM 3 stand out

SAM 3 is a high-fidelity video-to-video segmentation and tracking model built to preserve structure, identity, and temporal coherence. It converts raw footage into stable, frame-accurate masks driven by concise natural-language prompts, enabling targeted operations without regenerating full frames. With open-vocabulary understanding and frame-by-frame identity control, SAM 3 maintains consistent masks through motion, occlusion, and lighting changes. Its design emphasizes reliability and responsiveness for iterative workflows that demand precise boundaries, low drift on long shots, and consistent results suitable for downstream editorial and VFX pipelines. Key capabilities of SAM 3:

  • Open-vocabulary segmentation from concise text prompts.
  • Identity-stable tracking with persistent IDs across frames.
  • Structure- and motion-aware boundaries that respect edges and articulation.
  • Occlusion-tolerant propagation with minimal temporal jitter.
  • Frame-consistent masks ready for compositing, grading, and clean plate work.
  • Real-time responsiveness from SAM 3 for interactive review and rapid iteration.
  • Fine-grained regional control to isolate subjects or restrict changes.
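
As a rough illustration of how frame-consistent, ID-keyed masks can feed a compositing step, the sketch below assumes a simple per-frame output format (a dictionary mapping a persistent object ID to a boolean mask). This schema is illustrative, not SAM 3's documented output format.

```python
import numpy as np

# Assumed output shape: for each frame, a dict mapping a persistent object ID
# to a boolean mask the same size as the frame. Illustrative only.
def composite_subject(frames, masks_per_frame, object_id, background):
    """Paste the tracked object onto a static background, frame by frame."""
    out = []
    for frame, masks in zip(frames, masks_per_frame):
        mask = masks.get(object_id)
        if mask is None:                 # object fully occluded or absent in this frame
            out.append(background.copy())
            continue
        comp = background.copy()
        comp[mask] = frame[mask]         # persistent ID keeps the same subject across frames
        out.append(comp)
    return out

# Toy usage with random arrays standing in for decoded video frames.
h, w = 4, 4
frames = [np.random.randint(0, 255, (h, w, 3), dtype=np.uint8) for _ in range(3)]
masks = [{7: np.zeros((h, w), dtype=bool)} for _ in range(3)]
masks[1][7][1:3, 1:3] = True
result = composite_subject(frames, masks, object_id=7,
                           background=np.zeros((h, w, 3), dtype=np.uint8))
```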

Prompting guide for SAM 3

When using SAM 3, start by supplying video_url and a focused text_prompt that names the subject to segment and track. Describe the target with clear nouns and a few attributes, and state what to exclude or preserve to help SAM 3 focus. Use consistent terminology across shots to help SAM 3 maintain continuity. For crowded scenes, begin with the primary subject, review the mask, then refine with small prompt adjustments so SAM 3 stays stable. SAM 3 supports open-vocabulary phrasing, but specificity improves stability, especially across fast motion or partial occlusion. Examples:

  • red sedan car, foreground only, include wheels and windows
  • person with yellow jacket, exclude hands and face, background unchanged
  • soccer ball only, ignore players and crowd, maintain full silhouette
  • brown dog, keep ears and tail, do not segment grass
  • neon sign on the left wall, background element only

Pro tips for SAM 3:

  • Be explicit about scope: say what to isolate and what must not change.
  • Use spatial cues like left, right, center, near camera, background only.
  • Prefer a few strong attributes over long descriptive lists.
  • Keep prompts consistent across cuts for SAM 3; re-run at clear scene boundaries.
  • Higher-resolution, less-compressed footage yields cleaner edges and steadier masks in SAM 3.
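
The guide above amounts to an iterate-and-refine loop: submit a focused prompt, review the mask, then tighten the wording. The sketch below illustrates that loop under stated assumptions; submit_job and mask_is_acceptable are hypothetical stand-ins for your own client or review code, and only video_url and text_prompt come from this page.

```python
# All names below are illustrative; submit_job / mask_is_acceptable stand in
# for whatever client and review process you actually use.

def submit_job(video_url: str, text_prompt: str) -> dict:
    """Stub for whatever API or SDK call actually launches a SAM 3 run."""
    return {"video_url": video_url, "text_prompt": text_prompt, "status": "queued"}

def mask_is_acceptable(job: dict) -> bool:
    """Stub for a human or automated review of the returned mask."""
    return "exclude" in job["text_prompt"]   # placeholder acceptance criterion

def refine_prompt(video_url: str, prompts: list[str]):
    """Walk candidate prompt phrasings in order until one produces an acceptable mask."""
    for text_prompt in prompts:
        job = submit_job(video_url, text_prompt)
        if mask_is_acceptable(job):
            return text_prompt, job
    raise RuntimeError("no candidate prompt produced an acceptable mask")

candidates = [
    "person with yellow jacket, exclude hands and face, background unchanged",
    "person with yellow jacket, background unchanged",
    "person in yellow jacket",
]
best_prompt, job = refine_prompt("https://example.com/clip.mp4", candidates)
```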


Frequently Asked Questions

What is SAM 3 and how does its video-to-video functionality work?

SAM 3 is Meta’s Segment Anything Model 3, designed for advanced detection, tracking, and segmentation across images and videos. Its video-to-video functionality allows users to apply segmentations consistently across frames, maintaining object identities and visual coherence throughout moving scenes.

How does SAM 3 improve video-to-video segmentation compared to earlier versions?

SAM 3 enhances video-to-video segmentation by introducing 'promptable concept segmentation,' which supports natural language and exemplar-based inputs. This makes it significantly more flexible and accurate compared to SAM 1 and SAM 2, especially for tracking objects under motion, occlusion, or lighting changes.

Is there a cost to using SAM 3’s video-to-video features?

Access to SAM 3’s video-to-video features requires Runcomfy credits. Users receive free trial credits upon account creation, after which they can purchase additional credits depending on their usage and generation needs.

What makes SAM 3 suitable for professional video-to-video editing workflows?

SAM 3 delivers stable identity tracking, natural prompt support, and robust object segmentation under motion and lighting changes, making it ideal for professional video-to-video editing tasks such as visual effects, object removal, and selective modifications.

Who are the main users of SAM 3 and its video-to-video segmentation tools?

SAM 3’s video-to-video segmentation tools are widely used by content creators, filmmakers, video editors, and developers integrating AI vision into applications. It’s also valuable for e-commerce previews, AR experiences, and AI research in computer vision.

How accurate is SAM 3 for video-to-video object tracking?

SAM 3 achieves approximately 75–80% of human cgF1 accuracy on complex benchmarks. Its DETR-based architecture and presence token system ensure precise mask selection and continuity in video-to-video tracking tasks.

Which inputs and outputs does SAM 3 support in video-to-video processing?

SAM 3 supports prompts via text, boxes, points, or exemplar masks as inputs, and produces detailed segmented frames or annotated video outputs. This versatility allows users to craft refined video-to-video datasets or visual compositions.
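
For a sense of how those prompt types might be expressed as structured inputs, the sketch below uses illustrative field names; the actual schema accepted by the playground may differ.

```python
# Illustrative representations of the prompt types listed above; the field
# names and structure are assumptions, not SAM 3's documented input schema.
text_prompt = {"type": "text", "value": "brown dog, keep ears and tail"}

box_prompt = {"type": "box", "frame": 0, "xyxy": [120, 80, 360, 420]}   # pixel coordinates

point_prompt = {"type": "points", "frame": 0,
                "coords": [[240, 250]], "labels": [1]}                  # 1 = foreground click

exemplar_prompt = {"type": "exemplar",
                   "image_url": "https://example.com/reference-dog.png"}
```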

Where can users access SAM 3 and its video-to-video capabilities?

Users can access SAM 3 through the Runcomfy AI Playground platform at runcomfy.com/playground, which supports web and mobile browsers. After logging in, users can test the video-to-video segmentation capabilities directly through the interface.

Are there any limitations to SAM 3’s video-to-video performance?

While SAM 3 is powerful, performance may vary with very long videos, extreme occlusions, or high motion blur. The video-to-video results can also be influenced by input prompt quality and available compute resources on Runcomfy.