This workflow packages ComfyUI Grounding into three practical paths for image batches, single images, and videos. It turns natural-language prompts into object bounding boxes and high-quality masks, then previews RGBA composites or writes annotated videos with preserved audio. Artists, editors, and VFX generalists can quickly isolate subjects, knock out backgrounds, and generate clean overlays for compositing.
Built on open-vocabulary detection and modern segmentation, ComfyUI Grounding is reliable for varied subjects and scenes. You can drive selection with short prompts, refine with segmentation, and keep frame timing intact when round-tripping video.
The workflow contains four self-contained groups. Pick the path that matches your task; each can be run independently.
This path processes a folder of images and outputs RGBA composites. LoadImagesFromFolderKJ (#9) reads your batch, while GroundingModelLoader (#3) brings in Florence-2. Provide a short prompt in GroundingDetector (#1) to propose boxes around your target; adjust confidence if you see misses or false positives. DownLoadSAM2Model (#12) loads SAM 2 and Sam2Segment (#11) converts the boxes to a clean mask. Optionally flip the selection with InvertMask (#15) and preview the cutout with alpha using JoinImageWithAlpha (#14) and PreviewImage (#17).
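The box-to-mask-to-alpha chain above can be sketched in plain Python. This is an illustrative toy, not the ComfyUI node implementations: a detector box is rasterized into a binary mask (standing in for Sam2Segment's refined output), optionally inverted as InvertMask would, and then used as the alpha channel.

```python
# Toy sketch of the batch path's mask logic. Function names and the
# (x0, y0, x1, y1) box format are assumptions for illustration only.

def box_to_mask(width, height, box):
    """Rasterize one (x0, y0, x1, y1) box into a 0/1 mask, row-major."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]

def invert_mask(mask):
    """What InvertMask does conceptually: flip subject and background."""
    return [[1 - v for v in row] for row in mask]

# A 4x3 frame with a box covering columns 1-2 of rows 0-1:
mask = box_to_mask(4, 3, (1, 0, 3, 2))
inv = invert_mask(mask)
```

In the real workflow, SAM 2 refines the box into a tight silhouette rather than a rectangle, but the inversion and alpha semantics are the same.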
Use this for quick prompt checks on a single frame. LoadImage (#24) brings in your image and GroundingDetector (#25) draws labeled boxes based on your text prompt. PreviewImage (#26) shows the annotated result so you can iterate on wording before batch or video work.
This path creates a one-step, text-driven segmentation overlay. GroundingMaskModelLoader (#21) loads the mask model and LoadImage (#18) supplies the frame. Type a descriptive instruction in GroundingMaskDetector (#22) to directly obtain a mask and an overlaid preview; PreviewImage (#20) displays the composite, while PreviewAny (#19) shows the resolved instruction string. It is ideal when you want a fast semantic selection without separate detection and refinement.
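The overlaid preview this path produces amounts to blending a highlight color over the pixels the mask selects. A minimal per-pixel sketch, where the color and blend factor are illustrative defaults, not node parameters:

```python
# Hypothetical per-pixel overlay, approximating what the mask preview
# shows: selected pixels get a semi-transparent highlight, background
# pixels pass through unchanged.

def overlay_pixel(rgb, mask_value, color=(255, 0, 0), alpha=0.5):
    """Blend `color` over `rgb` where the mask is on."""
    if not mask_value:
        return rgb
    return tuple(round((1 - alpha) * c + alpha * o) for c, o in zip(rgb, color))

highlighted = overlay_pixel((100, 100, 100), 1)
untouched = overlay_pixel((100, 100, 100), 0)
```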
This path overlays detections on video frames and re-encodes a synced clip. VHS_LoadVideo (#32) imports frames and audio, and GroundingModelLoader (#30) provides Florence-2. Set a prompt such as “faces” in GroundingDetector (#28) to draw boxes per frame. VHS_VideoInfo (#40) forwards the loaded frame rate to VHS_VideoCombine (#39), which writes an MP4 with the original audio and matched timing. The result is a ready-to-share annotated video for review or shot planning.
GroundingDetector (#1): Core detector turning your text prompt into bounding boxes. Raise the score threshold for fewer false positives; lower it if the target is small or partially occluded. Keep prompts short and specific, for example “red umbrella” rather than long sentences. Use this node to drive both segmentation and visualization stages downstream.
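The threshold behavior described above is a simple score filter. A minimal sketch, assuming detections arrive as (label, score, box) tuples, which is an illustrative format rather than the node's actual output:

```python
# Hedged sketch of confidence-threshold filtering. Raising the
# threshold trims false positives; lowering it recovers small or
# partially occluded targets.

def filter_detections(detections, score_threshold=0.3):
    """Keep only detections whose score meets the threshold."""
    return [d for d in detections if d[1] >= score_threshold]

dets = [
    ("red umbrella", 0.82, (10, 10, 50, 80)),    # confident hit
    ("red umbrella", 0.21, (200, 40, 230, 90)),  # likely false positive
]
kept = filter_detections(dets, 0.3)
```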
Sam2Segment (#11): Refines coarse boxes into crisp masks using SAM 2. Feed it boxes from GroundingDetector; add a few positive or negative points only when the boundary needs extra guidance. If the subject and background flip, pair with InvertMask for the intended cutout. Use the result wherever an alpha matte is required.
GroundingMaskDetector (#22): Generates a semantic mask directly from a natural-language instruction. This is best when you want a one-click selection without assembling a detection-to-segmentation chain. Tighten the text and increase confidence if multiple regions are being picked up; broaden the wording to include variations when the subject is missed.
JoinImageWithAlpha (#14): Composites the original image with the mask into an RGBA output for downstream editors. Use it when you need transparent backgrounds, selective effects, or layered comp work. Combine with InvertMask to switch between isolating the subject and cutting the subject out.
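Conceptually, joining an image with alpha means pairing each RGB pixel with its mask value as an 8-bit alpha channel, so downstream editors read masked-out areas as transparent. A pixel-level sketch with assumed flat-list inputs, not the node's actual implementation:

```python
# Illustrative RGB + mask -> RGBA join. Mask values are assumed to be
# in [0, 1] and are scaled to 8-bit alpha.

def join_image_with_alpha(rgb_pixels, mask):
    """rgb_pixels and mask are parallel flat lists of equal length."""
    return [(r, g, b, round(255 * m)) for (r, g, b), m in zip(rgb_pixels, mask)]

rgba = join_image_with_alpha([(10, 20, 30), (40, 50, 60)], [1, 0])
```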
VHS_LoadVideo (#32): Splits a video into frames and extracts audio for processing. If your source has a variable frame rate, rely on the loaded frame rate it reports to keep timing consistent. This node is the entry point for any frame-by-frame detection or segmentation across a clip.
VHS_VideoCombine (#39): Re-encodes processed frames into an MP4 while preserving audio. Match the frame rate to the value reported upstream to avoid time drift. Use the filename prefix to keep different runs organized in your output folder.
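Why the frame-rate match matters: even a small fps mismatch changes the clip's duration relative to the untouched audio track, and the offset grows with clip length. A quick back-of-the-envelope calculation (the numbers are illustrative):

```python
# Duration drift when frames are written at the wrong rate. Positive
# means the re-encoded video runs longer than the original audio.

def drift_seconds(n_frames, true_fps, encoded_fps):
    """Difference between re-encoded and original clip duration."""
    return n_frames / encoded_fps - n_frames / true_fps

# ~30 s of 29.97 fps footage re-encoded at a rounded 30 fps drifts
# by roughly -0.03 s; a 30-minute clip would drift by about 1.8 s.
d = drift_seconds(899, 29.97, 30.0)
```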
This workflow implements and builds upon the following works and resources. We gratefully acknowledge PozzettiAndrea, the author of ComfyUI-Grounding, for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories linked below.
Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
RunComfy is the premier ComfyUI platform, offering a ComfyUI online environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI Playground, enabling artists to harness the latest AI tools to create incredible art.