A Survey of Deep Learning Based Object Detection Models
Object Detection is the task of classifying and localizing objects in an image or video. It has gained prominence in recent years due to its widespread applications. This article surveys recent developments in deep learning based object detectors.
The goal of object detection is to detect all instances of the predefined classes and provide coarse localization of each in the image with axis-aligned boxes. The detector should be able to identify all instances of the object classes and draw a bounding box around each. Object detection is generally seen as a supervised learning problem. Modern object detection models have access to large sets of labelled images for training and are evaluated on various canonical benchmarks.
Key Challenges in Object Detection
- Intra-class variation: Intra-class variation between instances of the same object class is relatively common in nature. This variation can stem from many factors such as occlusion, illumination, pose and viewpoint. These unconstrained external factors can have a dramatic effect on an object's appearance. Objects may also undergo non-rigid deformation, or appear rotated, scaled or blurred. Some objects blend into inconspicuous surroundings, making feature extraction difficult.
- Number of categories: The sheer number of object classes available to classify makes object detection a challenging problem. It also demands more high-quality annotated data, which is hard to come by. Training a detector with fewer examples is an open research question.
- Efficiency: Present-day models need substantial computational resources to generate accurate detection results. With mobile and edge devices becoming commonplace, efficient object detectors are crucial for further development in the field of computer vision.
Backbone Architectures
Backbone architectures are among the most important components of an object detector. These networks extract the features from the input image that the rest of the model operates on. AlexNet, VGG, GoogLeNet/Inception, ResNet, ResNeXt, CSPNet and EfficientNet are backbone architectures commonly used in object detection.
Object Detectors
There exist two types of detectors: two-stage and single-stage detectors.
Two-Stage Detectors
A network that has a separate module to generate region proposals is termed a two-stage detector.
R-CNN: The Region-based Convolutional Neural Network (R-CNN) was the first paper in the R-CNN family and demonstrated how CNNs can be used to immensely improve detection performance.
A mean-subtracted input image is first passed through the region proposal module, which produces 2000 object candidates. This module uses Selective Search to find parts of the image that have a higher probability of containing an object. These candidates are then warped and propagated through a CNN, which extracts a 4096-dimensional feature vector for each proposal.
The feature vectors are then passed to trained, class-specific Support Vector Machines (SVMs) to obtain confidence scores. Non-maximum suppression (NMS) is then applied to the scored regions based on their IoU and class. Once the class has been identified, the algorithm predicts the bounding box using a trained bounding-box regressor, which outputs four parameters: the center coordinates of the box along with its width and height.
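Since NMS recurs in nearly every detector covered below, here is a minimal NumPy sketch of the greedy, class-agnostic variant; R-CNN applies it per class, and the IoU threshold here is illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over axis-aligned boxes.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes that survive suppression.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with every remaining box.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Suppress boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_threshold]
    return keep
```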
SPP-Net: SPP stands for the Spatial Pyramid Pooling layer, which lets the network process images of arbitrary size or aspect ratio.
SPP-Net shifted the convolution layers of the CNN before the region proposal module and added a pooling layer, thereby making the network independent of input size/aspect ratio and reducing the computation. The Selective Search algorithm is used to generate candidate windows. Feature maps are obtained by passing the input image through the convolution layers of a ZF-5 network. The candidate windows are then mapped onto the feature maps and converted into fixed-length representations by the spatial bins of a pyramidal pooling layer. This vector is passed to the fully connected layers and, ultimately, to SVM classifiers to predict class and score.
SPP-Net is considerably faster than the R-CNN model with comparable accuracy. It can process images of any size/aspect ratio and thus avoids the object deformation caused by input warping.
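A minimal PyTorch sketch of the pyramidal pooling idea, using adaptive max pooling in place of the paper's hand-computed bin sizes; the 4x4/2x2/1x1 pyramid and channel count below are illustrative.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """Pool an (N, C, H, W) feature map into a fixed-length vector.

    Each pyramid level divides the map into level x level spatial bins
    and max-pools within each bin, so the output length depends only on
    C and the chosen levels, never on the input H or W.
    """
    n, c = feature_map.shape[:2]
    pooled = [
        F.adaptive_max_pool2d(feature_map, output_size=level).view(n, -1)
        for level in levels
    ]
    return torch.cat(pooled, dim=1)  # shape: (N, C * sum(level**2))

# A 4-2-1 pyramid over 256 channels always yields 256 * (16 + 4 + 1) = 5376
# values, whether the input map is 13x17 or 31x9.
print(spatial_pyramid_pool(torch.randn(1, 256, 13, 17)).shape)  # (1, 5376)
```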
Fast R-CNN: The network takes as input an image and its object proposals.
The image is passed through a set of convolution layers, and the object proposals are mapped onto the resulting feature map. An RoI pooling layer is connected to two fully connected layers and then branches into an (N+1)-class SoftMax layer and a bounding-box regressor layer, the latter with a fully connected layer as well.
The model also changed the loss function of the bounding-box regressor from L2 to smooth L1 for better performance, while introducing a multi-task loss to train the network.
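The smooth L1 loss is simple enough to write out; below is a short PyTorch version, with the paper's transition point at |x| = 1 generalized into a beta parameter (a common implementation convention, not from the paper).

```python
import torch

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss from Fast R-CNN: quadratic for small residuals,
    linear for large ones, so outlier boxes do not dominate the gradient
    the way they would under plain L2."""
    diff = torch.abs(pred - target)
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()

# Small residuals are penalized quadratically, large ones only linearly.
print(smooth_l1(torch.tensor([0.1, 5.0]), torch.zeros(2)))
```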
Faster R-CNN: Faster R-CNN is essentially Fast R-CNN with an RPN as its region proposal module.
It has a fully convolutional network as the region proposal network (RPN), which takes an arbitrary input image and outputs a set of candidate windows. Each such window has an associated objectness score which determines the likelihood of it containing an object. The RPN introduces anchor boxes to address the size variance of objects: it uses multiple bounding boxes of different scales and aspect ratios and regresses over them to localize objects.
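A small NumPy sketch of anchor generation for a single feature-map location, assuming the paper's three ratios and three scales; the base size and exact parameterization differ across implementations.

```python
import numpy as np

def make_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate the anchor boxes centred at one feature-map cell.

    For every (ratio, scale) pair this emits an [x1, y1, x2, y2] box around
    the origin; the RPN slides this same set over every spatial position.
    """
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)  # width shrinks as the box gets taller
            h = w * ratio              # so that h / w == ratio and w * h == area
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4): 3 ratios x 3 scales per location
```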
The input image is first passed through the CNN to obtain a set of feature maps. These are forwarded to the RPN, which produces bounding boxes and their classification. Selected proposals are then mapped back onto the feature maps from the earlier CNN layers in the RoI pooling layer, and ultimately fed to fully connected layers, whose output is sent to the classifier and the bounding-box regressor.
FPN: Feature Pyramid Network (FPN) has a top-down architecture with lateral connections to build high-level semantic features at different scales.
It has two pathways: a bottom-up pathway, a ConvNet computing a feature hierarchy at several scales, and a top-down pathway, which upsamples coarse feature maps from higher levels into high-resolution features. These pathways are joined by lateral connections, 1x1 convolution operations, to enhance the semantic information in the features. FPN is used as the region proposal network of a ResNet-101 based Faster R-CNN.
It also led to the development of other improved networks like PANet, NAS-FPN, and the BiFPN used in EfficientDet, a current state-of-the-art detector.
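To make the top-down pathway concrete, here is a hedged PyTorch sketch of one FPN merge step; the 256 output channels and nearest-neighbour upsampling follow the paper, while the module name and example shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    """One FPN merge step: upsample the coarser map, add the lateral."""

    def __init__(self, c_lateral, out_ch=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_lateral, out_ch, kernel_size=1)  # 1x1 lateral
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, top, bottom_up):
        # Upsample the semantically strong but coarse top-level feature ...
        top = F.interpolate(top, size=bottom_up.shape[-2:], mode="nearest")
        # ... and fuse it with the spatially finer bottom-up feature.
        merged = top + self.lateral(bottom_up)
        return self.smooth(merged)  # 3x3 conv reduces upsampling aliasing

p5 = torch.randn(1, 256, 8, 8)     # coarse pyramid level
c4 = torch.randn(1, 1024, 16, 16)  # backbone feature one stage below
print(TopDownMerge(c_lateral=1024)(p5, c4).shape)  # (1, 256, 16, 16)
```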
R-FCN: The Region-based Fully Convolutional Network (R-FCN) shares almost all computation within the network, unlike previous two-stage detectors, which applied resource-intensive techniques to each proposal.
The R-FCN detector is a combination of four convolutional networks. The input image is first passed through ResNet-101 to obtain feature maps. An intermediate output (the Conv4 layer) is passed to a Region Proposal Network (RPN) to identify RoI proposals, while the final output is further processed through a convolutional layer and fed to the classifier and regressor. The classification branch combines the generated position-sensitive score maps with the RoI proposals to produce predictions, while the regression branch outputs the bounding-box details.
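A heavily simplified, single-RoI PyTorch sketch of position-sensitive RoI pooling, the operation that lets R-FCN share computation across proposals: each of the k x k bins pools only from its own dedicated score map. The real operator is batched, per-class, and uses finer bin boundaries.

```python
import torch

def ps_roi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling for one RoI and one class.

    score_maps: (k*k, H, W) tensor, one map per spatial bin; roi is
    (x1, y1, x2, y2) in feature-map coordinates. The 'top-left' bin reads
    only the map trained to respond to top-left object parts, and so on;
    the k*k bin scores are then averaged into a single vote.
    """
    x1, y1, x2, y2 = roi
    xs = torch.linspace(x1, x2, k + 1).long().tolist()
    ys = torch.linspace(y1, y2, k + 1).long().tolist()
    bins = []
    for i in range(k):      # bin row
        for j in range(k):  # bin column
            patch = score_maps[i * k + j, ys[i]:ys[i + 1] + 1, xs[j]:xs[j + 1] + 1]
            bins.append(patch.mean())
    return torch.stack(bins).mean()

print(ps_roi_pool(torch.randn(9, 32, 32), roi=(4, 4, 20, 20)))
```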
Single-Stage Detectors
Single-stage detectors classify and localize semantic objects in a single shot using dense sampling.
YOLO: Two-stage detectors solve object detection as a classification problem: a module presents candidates, which the network classifies as either an object or background.
YOLO, or You Only Look Once, reframed it as a regression problem, directly predicting objects and their bounding-box attributes from the image pixels. In YOLO, the input image is divided into an S x S grid, and the cell in which an object's center falls is responsible for detecting it. Each grid cell predicts multiple bounding boxes, and each prediction array consists of 5 elements: the center of the bounding box (x and y), the dimensions of the box (w and h), and a confidence score.
At training time, each grid cell predicts only one class, as this converges better, but the number can be increased at inference time. A multi-task loss, the combined loss of all predicted components, is used to optimize the model.
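A sketch of how one prediction array maps back to image coordinates, assuming the paper's parameterization (cell-relative centre, image-relative width and height); the single-box-per-cell tensor layout is a simplification.

```python
import torch

def decode_yolo(pred, S=7, img_size=448):
    """Decode a (S, S, 5) YOLO prediction grid into absolute boxes.

    Each cell holds (x, y, w, h, conf): x, y are offsets inside the cell,
    w, h are fractions of the whole image. Returns (S*S, 5) rows of
    (cx, cy, w, h, conf) in pixels.
    """
    cell = img_size / S
    gy, gx = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
    cx = (gx + pred[..., 0]) * cell  # cell offset -> absolute centre x
    cy = (gy + pred[..., 1]) * cell
    w = pred[..., 2] * img_size      # width/height relative to the image
    h = pred[..., 3] * img_size
    return torch.stack([cx, cy, w, h, pred[..., 4]], dim=-1).view(-1, 5)

print(decode_yolo(torch.rand(7, 7, 5)).shape)  # (49, 5)
```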
SSD: The Single Shot MultiBox Detector was the first single-stage detector to match the accuracy of contemporary two-stage detectors like Faster R-CNN, while maintaining real-time speed.
SSD was built on VGG-16, with additional auxiliary structures to improve performance. These auxiliary convolution layers, added to the end of the model, decrease progressively in size. SSD detects smaller objects earlier in the network, where the image features are not too coarse, while the deeper layers are responsible for larger objects, predicting offsets relative to default boxes of different scales and aspect ratios.
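The scales of those default boxes follow a simple linear rule in the paper; the sketch below reproduces it for the six feature maps of the SSD300 configuration.

```python
def ssd_scales(num_maps=6, s_min=0.2, s_max=0.9):
    """SSD default-box scales: s_k = s_min + (s_max - s_min)(k - 1)/(m - 1).

    Shallow (large) feature maps get small boxes, deep (small) maps get
    large ones, so each layer specializes in one object size range.
    """
    m = num_maps
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print([round(s, 2) for s in ssd_scales()])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```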
YOLOv2 and YOLO9000: YOLOv2, an improvement on YOLO, offered an easy tradeoff between speed and accuracy, while the YOLO9000 model could predict 9000 object classes in real time. The GoogLeNet backbone was replaced with DarkNet-19. It incorporated several effective techniques: Batch Normalization to improve convergence, joint training of the classification and detection systems to increase the number of detectable classes, removal of the fully connected layers to increase speed, and learnt anchor boxes to improve recall and provide better priors.
RetinaNet: RetinaNet predicts objects by dense sampling of the input image in location, scale and aspect ratio.
The cross-entropy loss was reshaped into the Focal loss as a means to remedy the class imbalance between foreground and background. The focal loss parameter reduces the loss contribution from easy examples. The network uses a ResNet augmented by a Feature Pyramid Network (FPN) as the backbone, and two similar subnets: a classifier and a bounding-box regressor. Each layer from the FPN is passed to the subnets, enabling the detection of objects at various scales. The classification subnet predicts the object score for each location, while the box regression subnet regresses the offset from each anchor to the ground truth.
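A compact PyTorch sketch of the binary focal loss with the paper's defaults (alpha = 0.25, gamma = 2); the plain mean reduction is a simplification, as RetinaNet normalizes by the number of positive anchors.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    The (1 - p_t)**gamma factor shrinks the loss of well-classified
    examples, so the flood of easy background anchors cannot drown out
    the rare foreground ones; alpha balances positives vs. negatives.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```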
CenterNet: CenterNet predicts the object as a single point at the center of the bounding box.
The input image is passed through an FCN that generates a heatmap, whose peaks correspond to the centers of detected objects. It uses an ImageNet-pretrained stacked Hourglass-104 as the feature extractor network and has 3 heads: a heatmap head to determine the object center, a dimension head to estimate the size of the object, and an offset head to correct the offset of the object point.
The multi-task loss of all three heads is backpropagated to the feature extractor during training. During inference, the output of the offset head is used to refine the object point, and finally a box is generated.
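A sketch of the inference-time peak extraction: CenterNet replaces NMS with a 3x3 max-pooling check for local maxima. The decoding below is simplified to the heatmap head alone, without the size and offset corrections.

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, k=100):
    """Extract candidate object centres from a (C, H, W) class heatmap.

    A location is a peak if it equals the max of its 3x3 neighbourhood,
    which max pooling checks in one pass; the top-k surviving scores are
    returned as (score, class, y, x) tuples of tensors.
    """
    pooled = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (heatmap == pooled)  # zero out non-maxima
    c, h, w = peaks.shape
    scores, idx = peaks.view(-1).topk(k)
    cls = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w
    return scores, cls, ys, xs

scores, cls, ys, xs = heatmap_peaks(torch.rand(80, 128, 128))
```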
EfficientDet: EfficientDet builds on the idea of a scalable detector with higher accuracy and efficiency.
It introduces two key ideas: efficient multi-scale features via BiFPN, and model scaling. BiFPN is a bi-directional feature pyramid network with learnable weights for the cross connections between input features at different scales.
EfficientDet utilizes EfficientNet as the backbone network, with multiple sets of BiFPN layers stacked in series as the feature extraction network. Each output from the final BiFPN layer is sent to the class and box prediction networks. The model is trained using the SGD optimizer along with synchronized batch normalization, and uses the swish activation.
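A sketch of BiFPN's "fast normalized fusion", the learnable-weight combination of cross-scale inputs mentioned above; inputs are assumed to be pre-resized to a common shape, which the real network handles with resampling ops.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """BiFPN fusion: out = sum(w_i * x_i) / (sum(w_i) + eps).

    The scalar weights are learnable and kept non-negative with ReLU;
    the paper uses this as a cheaper stand-in for a softmax over inputs.
    """
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return fused / (w.sum() + self.eps)

fuse = FastNormalizedFusion(num_inputs=2)
out = fuse([torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)])
print(out.shape)  # (1, 64, 32, 32)
```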
Conclusion
Performance of object detectors is influenced by a number of factors: input image size and scale, feature extractor, GPU architecture, number of proposals, training methodology, loss function, etc. This makes it difficult to compare models without a common benchmark environment.
While two-stage detectors are generally more accurate, they are slow and cannot be used for real-time applications like self-driving cars or security. However, this has changed in the last few years, with single-stage detectors now being equally accurate and much faster than the former.
References
[1] "A Survey of Modern Deep Learning based Object Detection Models." [Online]. Available: https://arxiv.org/abs/2104.11892
[2] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network." [Online]. Available: http://arxiv.org/abs/2011.08036
[3] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks." [Online]. Available: http://arxiv.org/abs/1905.11946
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916.
[5] S. Qiao, L.-C. Chen, and A. Yuille, “DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution.” [Online]. Available: http://arxiv.org/abs/2006.02334