Prompt Engineering
The GroundingDino model encodes text prompts into a learned latent space. Altering the prompts can lead to different text features, which can affect the performance of the detector. To enhance prediction performance, it’s advisable to experiment with multiple prompts, choosing the one that delivers the best results. It’s important to note that while writing this article I had to try several prompts before finding the ideal one, sometimes encountering unexpected results.
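For example, once the setup described in the next section is done, a minimal sketch using the repository's Python inference utilities (groundingdino.util.inference) could compare a few candidate prompts on one image; the prompt list and image path below are purely illustrative:
from groundingdino.util.inference import load_model, load_image, predict

# Load the model once (paths match the setup in the next section)
model = load_model('groundingdino/config/GroundingDINO_SwinT_OGC.py',
                   'groundingdino_swint_ogc.pth')
image_source, image = load_image('tomatoes_dataset/tomatoes1.jpg')

# Candidate prompts to compare (illustrative)
for prompt in ['tomato', 'red tomato', 'ripened tomato']:
    boxes, logits, phrases = predict(
        model=model, image=image, caption=prompt,
        box_threshold=0.35, text_threshold=0.01)
    mean_conf = logits.mean().item() if len(logits) else 0.0
    print(f'{prompt!r}: {len(boxes)} boxes, mean confidence {mean_conf:.2f}')
A quick comparison like this makes it easy to see which phrasing yields the detections you actually want before committing to a prompt.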
Getting Started
To begin, we’ll clone the GroundingDino repository from GitHub, set up the environment by installing the necessary dependencies, and download the pre-trained model weights.
# Clone:
!git clone https://github.com/IDEA-Research/GroundingDINO.git
# Install
%cd GroundingDINO/
!pip install -r requirements.txt
!pip install -q -e .
# Get weights
!wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
Inference on an image
We’ll start our exploration of the object detection algorithm by applying it to a single image of tomatoes. Our initial goal is to detect all the tomatoes in the image, so we’ll use the text prompt tomato. If you want to use several category names, you can separate them with a dot (.); an illustrative multi-category example follows the command below. Note that the colors of the bounding boxes are random and have no particular meaning.
python3 demo/inference_on_a_image.py \
--config_file 'groundingdino/config/GroundingDINO_SwinT_OGC.py' \
--checkpoint_path 'groundingdino_swint_ogc.pth' \
--image_path 'tomatoes_dataset/tomatoes1.jpg' \
--text_prompt 'tomato' \
--box_threshold 0.35 \
--text_threshold 0.01 \
--output_dir 'outputs'
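As a quick example of a multi-category prompt, the same command can be run with several names separated by a dot. The leaf category here is purely illustrative and may not be relevant to your images:
python3 demo/inference_on_a_image.py \
--config_file 'groundingdino/config/GroundingDINO_SwinT_OGC.py' \
--checkpoint_path 'groundingdino_swint_ogc.pth' \
--image_path 'tomatoes_dataset/tomatoes1.jpg' \
--text_prompt 'tomato . leaf' \
--box_threshold 0.35 \
--text_threshold 0.01 \
--output_dir 'outputs'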
GroundingDino not only detects objects as categories, such as tomato, but also comprehends the input text, a task known as Referring Expression Comprehension (REC). Let’s change the text prompt from tomato to ripened tomato and see the result:
python3 demo/inference_on_a_image.py \
--config_file 'groundingdino/config/GroundingDINO_SwinT_OGC.py' \
--checkpoint_path 'groundingdino_swint_ogc.pth' \
--image_path 'tomatoes_dataset/tomatoes1.jpg' \
--text_prompt 'ripened tomato' \
--box_threshold 0.35 \
--text_threshold 0.01 \
--output_dir 'outputs'
Remarkably, the model can ‘understand’ the text and differentiate between a ‘tomato’ and a ‘ripened tomato’. It even tags partially ripened tomatoes that aren’t fully red. If our task requires tagging only fully ripened red tomatoes, we can adjust the box_threshold from the default 0.35 to 0.5.
python3 demo/inference_on_a_image.py \
--config_file 'groundingdino/config/GroundingDINO_SwinT_OGC.py' \
--checkpoint_path 'groundingdino_swint_ogc.pth' \
--image_path 'tomatoes_dataset/tomatoes1.jpg' \
--text_prompt 'ripened tomato' \
--box_threshold 0.5 \
--text_threshold 0.01 \
--output_dir 'outputs'
Generation of tagged dataset
Even though GroundingDino has remarkable capabilities, it’s a large and slow model. If real-time object detection is needed, consider using a faster model like YOLO. Training YOLO and similar models requires a lot of tagged data, which can be expensive and time-consuming to produce. However, if your data isn’t unique or highly specialized, you can use GroundingDino to tag it. To learn more about efficient YOLO training, refer to my previous article [4].
The GroundingDino repository includes a script to annotate image datasets in the COCO format, which is suitable for YOLOx, for instance.
from demo.create_coco_dataset import main

main(
    image_directory='tomatoes_dataset',
    text_prompt='tomato',
    box_threshold=0.35,
    text_threshold=0.01,
    export_dataset=True,
    view_dataset=False,
    export_annotated_images=True,
    weights_path='groundingdino_swint_ogc.pth',
    config_path='groundingdino/config/GroundingDINO_SwinT_OGC.py',
    subsample=None,
)
- export_dataset — If set to True, the COCO format annotations will be saved in a directory named ‘coco_dataset’.
- view_dataset — If set to True, the annotated dataset will be displayed for visualization in the FiftyOne app.
- export_annotated_images — If set to True, the annotated images will be stored in a directory named ‘images_with_bounding_boxes’.
- subsample (int) — If specified, only this number of images from the dataset will be annotated.
Different YOLO algorithms require different annotation formats. If you’re planning to train YOLOv5 or YOLOv8, you’ll need to export your dataset in the YOLOv5 format. Although the export type is hard-coded in the main script, you can easily change it by adjusting the dataset_type argument passed to dataset.export in demo/create_coco_dataset.py (line 72), from fo.types.COCODetectionDataset to fo.types.YOLOv5Dataset. To keep things organized, we’ll also change the output directory name from ‘coco_dataset’ to ‘yolov5_dataset’. After changing the script, run create_coco_dataset.main again.
if export_dataset:
dataset.export(
'yolov5_dataset',
dataset_type=fo.types.YOLOv5Dataset
)
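Once the YOLOv5-format dataset is exported, it can be handed to a YOLO trainer. Below is a minimal sketch assuming the ultralytics package is installed and that the export produced a dataset.yaml file; you may need to adjust the split names or paths inside that file before training:
# Sketch only: train a small YOLOv8 model on the exported annotations.
# Assumes 'pip install ultralytics' and a 'yolov5_dataset/dataset.yaml' created by the export.
from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # pre-trained nano checkpoint as a starting point
model.train(data='yolov5_dataset/dataset.yaml', epochs=50, imgsz=640)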
GroundingDino offers a significant leap in object detection annotations by using text prompts. In this tutorial, we have explored how to use the model for automated labeling of an image or a whole dataset. It’s crucial, however, to manually review and verify these annotations before they are utilized in training subsequent models.
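One convenient way to perform that review is to load the exported annotations back into the FiftyOne app. A minimal sketch, assuming the COCO export above was kept in its default ‘coco_dataset’ directory:
import fiftyone as fo

# Load the COCO-format export (images + labels.json) produced by create_coco_dataset
dataset = fo.Dataset.from_dir(
    dataset_dir='coco_dataset',
    dataset_type=fo.types.COCODetectionDataset,
)
session = fo.launch_app(dataset)  # browse and verify the predicted boxes interactively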
A user-friendly Jupyter notebook containing the complete code is included for your convenience:
Want to learn more?
[1] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, 2023.
[2] DINO: DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection, 2022.
[3] An Open and Comprehensive Pipeline for Unified Object Grounding and Detection, 2023.
[4] The practical guide for Object Detection with YOLOv5 algorithm, by Dr. Lihi Gur Arie.