Object Detection Models
Object detection is the task of localizing and classifying objects in a given image. In part 1, we developed an understanding of the basic concepts and the general framework for object detection. In this article, we will briefly cover a number of important object detection models, focusing on their key contributions.
The general object detection framework highlights the fact that object detection involves a few interim steps. Building on this thought process, researchers have come up with a number of innovative architectures that solve the task. One way of segregating such models is by how they tackle it. Models that leverage multiple models and/or steps are called multi-stage object detectors; the Region-based CNN (R-CNN) family of models is a prime example. Subsequently, a number of improvements led to architectures that solve the task using a single model. Such models are called single-stage object detectors. We will cover single-stage models in a subsequent article. For now, let us have a look under the hood of some of these multi-stage object detectors.
Region Based Convolutional Neural Networks
Region based Convolutional Neural Networks (R-CNNs) were initially presented by Girshick et al. in their paper titled “Rich feature hierarchies for accurate object detection and semantic segmentation” in 2013. R-CNN is a multi-stage object detection model which became the starting point for faster and more sophisticated variants in the following years. Let’s get started with this base idea before we understand the improvements achieved through the Fast R-CNN and Faster R-CNN models.
The R-CNN model is made up of four main components:
- Region Proposal: The extraction of regions of interest (ROIs) is the first and foremost step in this pipeline. The R-CNN model makes use of an algorithm called Selective Search for region proposal. Selective Search is a greedy search algorithm proposed by Uijlings et al. in 2012. Without going into too many details, selective search makes use of a bottom-up multi-scale iterative approach to identify ROIs. In every iteration the algorithm groups similar regions until the whole image becomes a single region. Similarity between regions is calculated based on color, texture, brightness, etc. Selective search generates a lot of false positive (background) ROIs but has a high recall. The list of ROIs is passed on to the next step for processing.
- Feature Extraction: The R-CNN network makes use of pre-trained CNNs such as VGG or ResNet for extracting features from each of the ROIs identified in the previous step. Before the regions/crops are passed as inputs to the pre-trained network, they are reshaped or warped to the required dimensions (each pretrained network accepts inputs of specific dimensions only). The pre-trained network is used without its final classification layer. The output of this stage is a long list of tensors, one for each ROI from the previous stage.
- Classification Head: The original R-CNN paper made use of Support Vector Machines (SVMs) as the classifier to identify the class of object in the ROI. SVM is a traditional supervised algorithm widely used for classification purposes. The output from this step is a classification label for every ROI.
- Regression Head: This module takes care of the localization aspect of the object detection task. As discussed in the previous section, bounding boxes can be uniquely identified using 4 coordinates (top-left (x, y) coordinates along with width and height of the box). The regressor outputs these 4 values for every ROI.
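The four components above can be tied together as a pipeline. The following is a minimal runnable sketch of that flow; every function body here is a stand-in (random proposals, a dummy feature vector, a toy threshold classifier), since the real components are Selective Search, a pretrained CNN, SVMs, and a learned regressor.

```python
import random

def propose_regions(image, n=50):
    """Stand-in for Selective Search: emit candidate (x, y, w, h) boxes.
    A real implementation merges similar regions bottom-up."""
    h_img, w_img = len(image), len(image[0])
    boxes = []
    for _ in range(n):
        x = random.randrange(w_img)
        y = random.randrange(h_img)
        w = random.randrange(1, w_img - x + 1)
        h = random.randrange(1, h_img - y + 1)
        boxes.append((x, y, w, h))
    return boxes

def extract_features(image, box):
    """Stand-in for the warped crop + pretrained CNN forward pass;
    returns a dummy fixed-length feature vector."""
    return [float(sum(box))]

def classify(features):
    """Stand-in for the SVM classification head."""
    return "object" if features[0] > 100 else "background"

def regress_box(features, box):
    """Stand-in for the bounding box regression head; a trained
    regressor would refine the box, here we return it unchanged."""
    return box

def rcnn_pipeline(image):
    detections = []
    for box in propose_regions(image):
        feats = extract_features(image, box)
        label = classify(feats)
        if label != "background":
            detections.append((regress_box(feats, box), label))
    return detections

image = [[0] * 64 for _ in range(64)]  # dummy 64x64 grayscale image
dets = rcnn_pipeline(image)
```

Note how each stage is independent of the others, which is exactly why the original R-CNN is trained in separate pieces rather than end to end.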
This pipeline is visually depicted in figure 1 for reference. As shown in the figure, the network requires multiple independent forward passes (one for each ROI) through the pretrained network. This is one of the primary reasons why the R-CNN model is slow, both during training and inference. The authors of the paper mention that it requires 80+ hours to train the network along with an immense amount of disk space. The second bottleneck is the selective search algorithm itself.
The R-CNN model is a good example of how different ideas can be leveraged as building blocks to solve a complex problem. While we will have a detailed hands-on exercise on object detection in the context of transfer learning later, R-CNN makes use of transfer learning even in its original setup.
The R-CNN model was slow, but it provided a good base for the object detection models that came down the line. The computationally expensive and slow feature extraction step was the main issue addressed by the Fast R-CNN implementation, presented by Ross Girshick in 2015. This implementation boasts not just faster training and inference but also improved mAP on the PASCAL VOC 2012 dataset.
The key contributions from the Fast R-CNN paper can be summarized as follows:
- Region Proposal: For the base R-CNN model, we discussed how the selective search algorithm generates thousands of ROIs, upon which a pretrained network works to extract features. Fast R-CNN changes this step to derive maximum impact. Instead of running the feature extraction step thousands of times (once per ROI), Fast R-CNN runs it only once: the whole input image is processed through the pretrained network a single time, and the ROIs identified by selective search are then projected onto the resulting feature map. Sharing the feature computation across all ROIs reduces the computational requirements and the performance bottleneck to a good extent.
- ROI Pooling Layer: The ROIs identified in the previous step can be of arbitrary size (as identified by the selective search algorithm), but the fully connected layers that follow accept only fixed-size feature maps as inputs. The ROI pooling layer (the paper mentions an output size of 7×7) transforms these arbitrarily sized ROIs into fixed-size output vectors. It works by first dividing the ROI into a fixed grid of roughly equal-sized sections, then taking the largest value in each section (similar to a max-pooling operation). The output is just the max value from each section. The ROI pooling layer speeds up inference and training times considerably.
- Multi-task Loss: As opposed to the two separate components (SVM and bounding box regressor) in the R-CNN implementation, Fast R-CNN makes use of a multi-headed network. This setup enables the network to be trained jointly for both tasks using a multi-task loss function. The multi-task loss is a weighted sum of the classification and regression losses for the object classification and bounding box regression tasks respectively. The loss function is given as:
Lₘₜ = Lₒ + λ[u ≥ 1]Lᵣ
where the indicator [u ≥ 1] equals 1 if the ROI contains an object (its true class u is not background) and 0 otherwise, and λ is a hyper-parameter balancing the two losses (set to 1 in the paper). The classification loss is simply a negative log loss, while the regression loss used in the original implementation is the smooth L1 loss.
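To make the ROI pooling idea concrete, here is a minimal single-channel NumPy sketch: an arbitrary (x, y, w, h) ROI on a 2-D feature map is divided into a 7×7 grid of roughly equal sections, and the max of each section is kept. This is an illustration only; real implementations handle batches, channels, and the feature-map stride.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """Max-pool an arbitrary-sized ROI of a 2-D feature map down to a
    fixed output_size x output_size grid (single channel for simplicity)."""
    x, y, w, h = roi
    region = feature_map[y:y + h, x:x + w]
    out = np.empty((output_size, output_size))
    # Bin edges splitting the ROI into a roughly equal grid
    row_edges = np.linspace(0, h, output_size + 1).astype(int)
    col_edges = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            # Guard against empty bins when the ROI is smaller than the grid
            cell = region[row_edges[i]:max(row_edges[i + 1], row_edges[i] + 1),
                          col_edges[j]:max(col_edges[j + 1], col_edges[j] + 1)]
            out[i, j] = cell.max()
    return out

fmap = np.arange(400, dtype=float).reshape(20, 20)
pooled = roi_pool(fmap, (2, 3, 13, 11))  # a 13x11 ROI at (2, 3)
# pooled.shape == (7, 7), regardless of the ROI's size
```

Whatever the ROI's dimensions, the output is always 7×7, which is exactly what lets a single set of fully connected layers consume every proposal.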
The original paper details a number of experiments which highlight performance improvements based on various combinations of hyper-parameters and of layers fine-tuned in the pre-trained network. The original implementation made use of a pretrained VGG-16 as the feature extraction network. A number of faster and improved networks such as MobileNet, ResNet, etc. have come up since Fast R-CNN’s original implementation. These can be swapped in place of VGG-16 to improve performance further.
Faster R-CNN is the final member of this family of multi-stage object detectors. It is by far the most complex and fastest variant of them all. While Fast R-CNN improved training and inference times considerably, it was still being penalized by the selective search algorithm. The Faster R-CNN model, presented in 2015 by Ren et al. in their paper titled “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, primarily addresses the region proposal aspect. This network builds on top of the Fast R-CNN network by introducing a novel component called the Region Proposal Network (RPN). The overall Faster R-CNN network is depicted in figure 2 for reference.
The RPN is a fully convolutional network (FCN) that helps in generating ROIs. As shown in figure 3, the RPN consists of two layers only: a 3×3 convolutional layer with 512 filters, followed by two parallel 1×1 convolutional layers (one each for classification and regression). The 3×3 convolutional filter is applied to the feature map output of the pre-trained network (the input to which is the original image). Please note that the classification layer in the RPN is a binary classification layer for determining the objectness score (not the object class). The bounding box regression is performed using 1×1 convolutional filters on anchor boxes. The setup proposed in the paper uses K = 9 anchor boxes per window, so the RPN generates 18 objectness scores (2×K) and 36 location coordinates (4×K) at each spatial location. The use of the RPN (instead of selective search) improves training and inference times by orders of magnitude.
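The RPN head described above is small enough to sketch directly. The following PyTorch snippet mirrors the layer sizes from the paper (3×3 conv with 512 filters, two parallel 1×1 convs); the class name, the random input, and its 38×50 spatial size are illustrative assumptions, not part of the original implementation.

```python
import torch
import torch.nn as nn

K = 9  # anchor boxes per spatial location, as in the paper

class RPNHead(nn.Module):
    """Minimal RPN head sketch: a shared 3x3 conv followed by two
    parallel 1x1 convs for objectness scores and box regression."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * K, kernel_size=1)  # objectness (2 per anchor)
        self.reg = nn.Conv2d(512, 4 * K, kernel_size=1)  # box coords (4 per anchor)

    def forward(self, x):
        h = torch.relu(self.shared(x))
        return self.cls(h), self.reg(h)

# Fake backbone feature map: batch of 1, 512 channels, 38x50 spatial grid
features = torch.randn(1, 512, 38, 50)
scores, coords = RPNHead()(features)
# scores has 18 channels (2xK) and coords has 36 channels (4xK),
# one prediction set per spatial location of the feature map
```

Because both heads are 1×1 convolutions, the RPN slides over the whole feature map in a single forward pass, which is where the speed-up over selective search comes from.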
The Faster R-CNN network is an end-to-end object detection network. Unlike the base R-CNN and Fast R-CNN models, which relied on a number of independently trained components, Faster R-CNN can be trained as a whole.
This concludes our discussion on the R-CNN family of object detectors. We discussed key contributions to better understand how these networks work.