This workflow packages ComfyUI Grounding into three practical paths for image batches, single images, and videos. It turns natural-language prompts into object bounding boxes and high-quality masks, then previews RGBA composites or writes annotated videos with preserved audio. Artists, editors, and VFX generalists can quickly isolate subjects, knock out backgrounds, and generate clean overlays for compositing.
Built on open-vocabulary detection and modern segmentation, ComfyUI Grounding is reliable for varied subjects and scenes. You can drive selection with short prompts, refine with segmentation, and keep frame timing intact when round-tripping video.
The workflow contains four self-contained groups. Pick the path that matches your task; each can be run independently.
This path processes a folder of images and outputs RGBA composites. LoadImagesFromFolderKJ (#9) reads your batch, while GroundingModelLoader (#3) brings in Florence-2. Provide a short prompt in GroundingDetector (#1) to propose boxes around your target; adjust confidence if you see misses or false positives. DownLoadSAM2Model (#12) loads SAM 2 and Sam2Segment (#11) converts the boxes to a clean mask. Optionally flip the selection with InvertMask (#15) and preview the cutout with alpha using JoinImageWithAlpha (#14) and PreviewImage (#17).
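The box-to-mask-to-alpha chain above can be sketched in plain Python. This is an illustrative toy, not the ComfyUI node implementations: a detector box is rasterized into a binary mask (standing in for Sam2Segment's refined output), optionally inverted as InvertMask would, and then used as the alpha channel.

```python
# Toy sketch of the batch path's mask logic. Function names and the
# (x0, y0, x1, y1) box format are assumptions for illustration only.

def box_to_mask(width, height, box):
    """Rasterize one (x0, y0, x1, y1) box into a 0/1 mask, row-major."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]

def invert_mask(mask):
    """What InvertMask does conceptually: flip subject and background."""
    return [[1 - v for v in row] for row in mask]

# A 4x3 frame with a box covering columns 1-2 of rows 0-1:
mask = box_to_mask(4, 3, (1, 0, 3, 2))
inv = invert_mask(mask)
```

In the real workflow, SAM 2 refines the box into a tight silhouette rather than a rectangle, but the inversion and alpha semantics are the same.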
Use this for quick prompt checks on a single frame. LoadImage (#24) brings in your image and GroundingDetector (#25) draws labeled boxes based on your text prompt. PreviewImage (#26) shows the annotated result so you can iterate on wording before batch or video work.
This path creates a one-step, text-driven segmentation overlay. GroundingMaskModelLoader (#21) loads the mask model and LoadImage (#18) supplies the frame. Type a descriptive instruction in GroundingMaskDetector (#22) to directly obtain a mask and an overlaid preview; PreviewImage (#20) displays the composite, while PreviewAny (#19) shows the resolved instruction string. It is ideal when you want a fast semantic selection without separate detection and refinement.
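The overlaid preview this path produces amounts to blending a highlight color over the pixels the mask selects. A minimal per-pixel sketch, where the color and blend factor are illustrative defaults, not node parameters:

```python
# Hypothetical per-pixel overlay, approximating what the mask preview
# shows: selected pixels get a semi-transparent highlight, background
# pixels pass through unchanged.

def overlay_pixel(rgb, mask_value, color=(255, 0, 0), alpha=0.5):
    """Blend `color` over `rgb` where the mask is on."""
    if not mask_value:
        return rgb
    return tuple(round((1 - alpha) * c + alpha * o) for c, o in zip(rgb, color))

highlighted = overlay_pixel((100, 100, 100), 1)
untouched = overlay_pixel((100, 100, 100), 0)
```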
This path overlays detections on video frames and re-encodes a synced clip. VHS_LoadVideo (#32) imports frames and audio, and GroundingModelLoader (#30) provides Florence-2. Set a prompt such as “faces” in GroundingDetector (#28) to draw boxes per frame. VHS_VideoInfo (#40) forwards the loaded frame rate to VHS_VideoCombine (#39), which writes an MP4 with the original audio and matched timing. The result is a ready-to-share annotated video for review or shot planning.
GroundingDetector (#1): Core detector turning your text prompt into bounding boxes. Raise the score threshold for fewer false positives; lower it if the target is small or partially occluded. Keep prompts short and specific, for example “red umbrella” rather than long sentences. Use this node to drive both segmentation and visualization stages downstream.
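The threshold behavior described above is a simple score filter. A minimal sketch, assuming detections arrive as (label, score, box) tuples, which is an illustrative format rather than the node's actual output:

```python
# Hedged sketch of confidence-threshold filtering. Raising the
# threshold trims false positives; lowering it recovers small or
# partially occluded targets.

def filter_detections(detections, score_threshold=0.3):
    """Keep only detections whose score meets the threshold."""
    return [d for d in detections if d[1] >= score_threshold]

dets = [
    ("red umbrella", 0.82, (10, 10, 50, 80)),    # confident hit
    ("red umbrella", 0.21, (200, 40, 230, 90)),  # likely false positive
]
kept = filter_detections(dets, 0.3)
```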
Sam2Segment (#11): Refines coarse boxes into crisp masks using SAM 2. Feed it boxes from GroundingDetector; add a few positive or negative points only when the boundary needs extra guidance. If the subject and background flip, pair with InvertMask for the intended cutout. Use the result wherever an alpha matte is required.
GroundingMaskDetector (#22): Generates a semantic mask directly from a natural-language instruction. This is best when you want a one-click selection without assembling a detection-to-segmentation chain. Tighten the text and increase confidence if multiple regions are being picked up; broaden the wording to include variations when the subject is missed.
JoinImageWithAlpha (#14): Composites the original image with the mask into an RGBA output for downstream editors. Use it when you need transparent backgrounds, selective effects, or layered comp work. Combine with InvertMask to switch between isolating the subject and cutting the subject out.
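Conceptually, joining an image with alpha means pairing each RGB pixel with its mask value as an 8-bit alpha channel, so downstream editors read masked-out areas as transparent. A pixel-level sketch with assumed flat-list inputs, not the node's actual implementation:

```python
# Illustrative RGB + mask -> RGBA join. Mask values are assumed to be
# in [0, 1] and are scaled to 8-bit alpha.

def join_image_with_alpha(rgb_pixels, mask):
    """rgb_pixels and mask are parallel flat lists of equal length."""
    return [(r, g, b, round(255 * m)) for (r, g, b), m in zip(rgb_pixels, mask)]

rgba = join_image_with_alpha([(10, 20, 30), (40, 50, 60)], [1, 0])
```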
VHS_LoadVideo (#32): Splits a video into frames and extracts audio for processing. If your source has a variable frame rate, rely on the loaded frame rate it reports to keep timing consistent. This node is the entry point for any frame-by-frame detection or segmentation across a clip.
VHS_VideoCombine (#39): Re-encodes processed frames into an MP4 while preserving audio. Match the frame rate to the value reported upstream to avoid time drift. Use the filename prefix to keep different runs organized in your output folder.
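Why the frame-rate match matters: even a small fps mismatch changes the clip's duration relative to the untouched audio track, and the offset grows with clip length. A quick back-of-the-envelope calculation (the numbers are illustrative):

```python
# Duration drift when frames are written at the wrong rate. Positive
# means the re-encoded video runs longer than the original audio.

def drift_seconds(n_frames, true_fps, encoded_fps):
    """Difference between re-encoded and original clip duration."""
    return n_frames / encoded_fps - n_frames / true_fps

# ~30 s of 29.97 fps footage re-encoded at a rounded 30 fps drifts
# by roughly -0.03 s; a 30-minute clip would drift by about 1.8 s.
d = drift_seconds(899, 29.97, 30.0)
```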
This workflow implements and builds upon the following works and resources. We gratefully acknowledge PozzettiAndrea, the author of ComfyUI-Grounding, for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories linked below.
Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.
RunComfy is the premier ComfyUI platform, offering a ComfyUI online environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI Playground, enabling artists to harness the latest AI tools to create incredible art.