Grounding Detector:
The GroundingDetector node detects objects in images using grounding models. It selects the detection method from the model type, so a single node covers several grounding tasks. This helps AI artists and developers who want object detection in their projects without handling model selection and configuration themselves. Using the GroundingDINO detection logic, the node takes an image and a text prompt and returns the matching objects annotated with bounding boxes and labels. Typical uses include automated image tagging, content creation, and interactive media development.
Grounding Detector Input Parameters:
model_dict
The model_dict parameter is a dictionary that contains the model, processor, type, and framework information necessary for the detection process. It serves as the blueprint for the node to understand which model to use and how to process the input data. This parameter is crucial as it directly influences the detection method and accuracy. There are no specific minimum or maximum values, but it must be a valid dictionary with the required keys.
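As a rough illustration of what "a valid dictionary with the required keys" could mean, the sketch below checks a dictionary before it is passed to the node. The key names ("model", "processor", "type", "framework") are inferred from the description above, and the helper name and the "grounding_dino" value are hypothetical; the node's actual keys and values may differ.

```python
# Hypothetical key names, inferred from the parameter description; the
# node's real keys may be spelled differently.
REQUIRED_KEYS = ("model", "processor", "type", "framework")


def missing_model_dict_keys(model_dict):
    """Return the required keys that are absent or set to None."""
    if not isinstance(model_dict, dict):
        raise TypeError("model_dict must be a dictionary")
    return [key for key in REQUIRED_KEYS if model_dict.get(key) is None]


# Placeholder objects stand in for a loaded model and processor.
example = {"model": object(), "processor": object(), "type": "grounding_dino"}
print(missing_model_dict_keys(example))  # ['framework']
```

Running a check like this before detection makes a missing entry visible early instead of surfacing as an error inside the node.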
image
The image parameter is a tensor representing the image to be processed, with dimensions (B, H, W, C), where B is the batch size, H is the height, W is the width, and C is the number of color channels. This parameter is the primary input for the detection process, and its quality and resolution can significantly impact the results. There are no specific constraints on the image size, but higher resolution images may yield more accurate detections.
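The (B, H, W, C) layout can be checked before the image reaches the node. The following is a minimal sketch that works on a plain shape tuple; the helper name check_bhwc is hypothetical and not part of the node.

```python
def check_bhwc(shape):
    """Validate a (B, H, W, C) shape and return it as named fields."""
    if len(shape) != 4:
        raise ValueError(
            f"expected 4 dimensions (B, H, W, C), got {len(shape)}"
        )
    batch, height, width, channels = shape
    if min(batch, height, width, channels) < 1:
        raise ValueError(f"all dimensions must be positive, got {tuple(shape)}")
    return {"batch": batch, "height": height, "width": width, "channels": channels}


info = check_bhwc((1, 512, 768, 3))
print(info)  # {'batch': 1, 'height': 512, 'width': 768, 'channels': 3}
```

With a real tensor, the same check can be applied to tuple(image.shape). A single image stored as (H, W, C) fails the check and needs a leading batch dimension first.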
prompt
The prompt parameter is a text string used to guide the detection process. It can contain multiple object labels separated by periods, which the node will use to identify and annotate objects within the image. This parameter is essential for specifying the objects of interest and can greatly affect the detection outcome. There are no specific constraints on the prompt length, but it should be clear and concise for optimal results.
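To show how a period-separated prompt maps to individual labels, here is a small sketch. It only illustrates the convention described above; the node's own parsing may normalize the text differently, and the function name split_prompt is hypothetical.

```python
def split_prompt(prompt):
    """Split a period-separated prompt into trimmed, non-empty labels."""
    parts = [part.strip() for part in prompt.split(".")]
    return [part for part in parts if part]


print(split_prompt("a cat. a dog . remote control."))
# ['a cat', 'a dog', 'remote control']
```

Empty segments, such as the one after a trailing period, are dropped, so "cat." and "cat" yield the same single label.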
box_threshold
The box_threshold parameter is a numerical value that sets the confidence threshold for the bounding boxes. It determines the minimum confidence level required for a box to be considered valid. Raising the threshold discards more low-confidence boxes, which favors precision; lowering it keeps more candidate boxes, which favors recall. The typical range is between 0 and 1, with a default value that balances precision and recall.
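The effect of box_threshold can be sketched as a simple filter over scored detections. The detection records and the strict ">" comparison below are assumptions made for illustration; the node performs its own filtering internally.

```python
def filter_by_box_threshold(detections, box_threshold):
    """Keep detections whose score is above the threshold."""
    if not 0.0 <= box_threshold <= 1.0:
        raise ValueError("box_threshold must be between 0 and 1")
    return [det for det in detections if det["score"] > box_threshold]


# Made-up detections, for illustration only.
detections = [
    {"label": "cat", "score": 0.82},
    {"label": "dog", "score": 0.31},
    {"label": "cat", "score": 0.12},
]

loose = filter_by_box_threshold(detections, 0.3)
strict = filter_by_box_threshold(detections, 0.5)
print([d["label"] for d in loose])   # ['cat', 'dog']
print([d["label"] for d in strict])  # ['cat']
```

The stricter threshold removes the weak "dog" detection, which shows the precision versus recall trade-off in miniature.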
text_threshold
The text_threshold parameter is a numerical value that sets the confidence threshold for matching the prompt text to detections. It determines the minimum confidence level required for a piece of prompt text to be treated as a valid label for a detected box. This keeps weakly matching prompt text from being attached to detections. The typical range is between 0 and 1, with a default value that optimizes detection accuracy.
single_box_mode
The single_box_mode parameter is a boolean flag that, when enabled, instructs the node to return only the highest confidence box for each detected object. This parameter is useful for applications where only the most prominent object is of interest. It does not have a specific range, as it is a true or false setting.
single_box_per_prompt_mode
The single_box_per_prompt_mode parameter is a boolean flag that, when enabled, instructs the node to return the highest confidence box for each label specified in the prompt. This parameter is beneficial for ensuring that each object of interest is represented by its most confident detection. It is a true or false setting without a specific range.
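The per-label selection described above can be sketched as follows. This is an illustrative reimplementation under assumed data structures (a list of dictionaries with "label" and "score" keys), not the node's actual code.

```python
def best_box_per_label(detections):
    """Return the highest-scoring detection for each distinct label."""
    best = {}
    for det in detections:
        current = best.get(det["label"])
        if current is None or det["score"] > current["score"]:
            best[det["label"]] = det
    # Labels come back in the order they were first seen.
    return list(best.values())


# Made-up detections with boxes as [x1, y1, x2, y2], for illustration only.
detections = [
    {"label": "cat", "score": 0.82, "box": [10, 20, 110, 220]},
    {"label": "dog", "score": 0.31, "box": [200, 40, 380, 300]},
    {"label": "cat", "score": 0.90, "box": [15, 25, 120, 230]},
]

for det in best_box_per_label(detections):
    print(det["label"], det["score"])
# cat 0.9
# dog 0.31
```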
bbox_output_format
The bbox_output_format parameter specifies the format in which the bounding box data should be output. This parameter is important for ensuring compatibility with downstream processes and applications. It typically offers options such as "dict_with_data" or other structured formats.
output_masks
The output_masks parameter is a boolean flag that, when enabled, instructs the node to output masks for the detected objects in addition to bounding boxes. This parameter is useful for applications that require detailed object segmentation. It is a true or false setting without a specific range.
format_output_fn
The format_output_fn parameter is a function that formats the output data according to the specified requirements. This parameter is essential for customizing the output to meet specific needs and ensuring that the results are presented in a usable format. It does not have a specific range, as it is a function.
Grounding Detector Output Parameters:
annotated_image
The annotated_image output is an image tensor that contains the original image with detected objects annotated with bounding boxes and labels. This output is crucial for visualizing the detection results and verifying the accuracy of the process. It provides a clear and intuitive representation of the detected objects within the image.
boxes
The boxes output is an array of bounding box coordinates for the detected objects. This output is essential for applications that require precise object localization, such as automated cropping or object tracking. It provides the spatial information needed to identify the position and size of each detected object.
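For a use case such as automated cropping, box coordinates usually need to be rounded and clamped to the image bounds first. The sketch below assumes boxes are given as [x1, y1, x2, y2] in pixel coordinates, which this document does not confirm; the actual layout depends on bbox_output_format. The helper name clamp_box is hypothetical.

```python
def clamp_box(box, width, height):
    """Round an [x1, y1, x2, y2] box and clamp it to the image bounds."""
    x1, y1, x2, y2 = box
    x1 = max(0, min(int(round(x1)), width))
    y1 = max(0, min(int(round(y1)), height))
    x2 = max(0, min(int(round(x2)), width))
    y2 = max(0, min(int(round(y2)), height))
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box has no area inside the image")
    return x1, y1, x2, y2


print(clamp_box((-4.2, 10.6, 700.0, 530.9), width=640, height=480))
# (0, 11, 640, 480)
```

With an image tensor in (B, H, W, C) layout, the clamped values can then be used as slice bounds, for example image[0, y1:y2, x1:x2, :].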
labels
The labels output is a list of labels corresponding to the detected objects. This output is important for understanding the types of objects present in the image and for applications that require object classification. It provides a textual representation of the detected objects, making it easier to interpret the results.
scores
The scores output is an array of confidence scores for each detected object. This output is crucial for assessing the reliability of the detections and for filtering out low-confidence results. It provides a quantitative measure of the detection accuracy, allowing users to make informed decisions about the results.
masks
The masks output is an optional array of masks for the detected objects, providing detailed segmentation information. This output is important for applications that require precise object boundaries, such as image editing or augmented reality. It offers a pixel-level representation of the detected objects, enabling fine-grained analysis and manipulation.
Grounding Detector Usage Tips:
- Ensure that the model_dict is correctly configured with the appropriate model and processor to achieve optimal detection results.
- Use clear and specific prompts to guide the detection process and improve the accuracy of the results.
- Adjust the box_threshold and text_threshold parameters to balance precision and recall based on the specific requirements of your application.
- Enable single_box_mode or single_box_per_prompt_mode if you are interested in only the most prominent detections for each object or label.
Grounding Detector Common Errors and Solutions:
Model not found in model_dict
- Explanation: This error occurs when the model_dict does not contain a valid model entry.
- Solution: Ensure that the model_dict is correctly populated with the necessary model information before running the detection process.
Invalid image tensor shape
- Explanation: This error arises when the input image tensor does not have the expected shape (B, H, W, C).
- Solution: Verify that the input image tensor is correctly formatted and matches the expected dimensions.
Prompt parsing error
- Explanation: This error occurs when the prompt string is not correctly formatted, leading to issues in parsing the text queries.
- Solution: Ensure that the prompt string is well-formed, using periods to separate multiple object labels and avoiding unnecessary punctuation.
Output format function error
- Explanation: This error happens when the format_output_fn does not correctly format the output data.
- Solution: Check that the format_output_fn is properly implemented and compatible with the expected output structure.
