April 8, 2022 • 10 min read

Face Detectors: Understand DSFD and the State-of-the-art Algorithms

Written by Bastien Ponchon

Let’s dive into the recent Dual Shot Face Detector (DSFD) through a review of two famous detection algorithms: Faster R-CNN and the Single Shot Detector.

The State Of The Art in Object Detection

Face detection is a fundamental step for many applications, from recognition to image processing. It is a challenging task, as faces in real-world images present a very high degree of variability in scale, pose, occlusion, expression, appearance, and illumination. Blur, makeup, and reflection are good examples of variability that explain why face detection is still extensively studied.

To date, the best-performing approaches can be roughly divided into two categories:

Methods based on region proposal, and more specifically on a Region Proposal Network (RPN).
Single-shot methods such as the Single Shot Detector.

These two architectures are classic, well-known algorithms in the broader field of object detection. Let us dig a little deeper into the two approaches. The aim in object detection is to predict a set of bounding boxes around the objects in an image or a video frame, as well as their respective classes.

Note that in the specific field of face detection, there are only two classes: faces and not-faces (or background).

Region-proposal-based methods

The idea of using region proposal to perform object detection was first introduced in 2014, with the R-CNN article. It was based on the observation that a detection task is similar to a classification task performed on various regions of the input image.

A simple representation of the idea behind methods based on region-proposal.

The term Region Proposal Network was coined in 2015 by the authors of the Faster R-CNN network and is the core component of this kind of architecture. These methods work as two-stage detection schemes:

A Region Proposal Network hypothesizes possible object locations in the image. This region proposal is class-agnostic: it detects areas that are likely to contain objects instead of just background.
A region-based convolutional network performs class-specific detections: it classifies the objects located in the proposed regions and refines their bounding-box coordinates.

The main components of the Faster R-CNN architecture. The Region Proposal Network outputs coarse regions of interest that are taken into account by the subsequent layers of the architecture to perform detection. Here the region-based detection is performed by the Fast R-CNN network, which shares some of its convolutional layers with the RPN.

Note that the two components need not be fully separate networks. Faster R-CNN indeed brought a great improvement in computation time by sharing the layers of a fully convolutional network between the RPN and the class-specific detection network (which is a Fast R-CNN network in the case of the Faster R-CNN architecture).

Details of the Region Proposal Network, as presented in the original article. A convolutional layer with 256 3x3 filters outputs a 256-dimensional vector at each position of the feature map. This vector is then used to classify the corresponding receptive field as object or background for each of k possible reference boxes, called anchors, and to predict an offset to the coordinates of each of these k anchors. cls is a classification layer. reg is a regression layer.

One key concept in the Faster R-CNN architecture is the use of anchor boxes. Anchors are reference boxes with various shapes and scales that will parametrize the k proposed regions at each point of the feature map. At each position of a sliding window over the convolutional feature map, a region is proposed for each anchor:

The regressor layer outputs the coordinates of a refined version of the anchor.
The classifier layer outputs a confidence score in a binary object/background classification task of the anchor.
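To make the role of the regressor concrete, here is a minimal sketch of how predicted offsets can be turned into a refined box. It assumes the offset parametrization described in the Faster R-CNN paper (center shifts scaled by the anchor size, log-scale width and height changes). The function name decode_box and the example values are illustrative only.

```python
import math

def decode_box(anchor, deltas):
    """Apply regression offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h)."""
    cx_a, cy_a, w_a, h_a = anchor
    tx, ty, tw, th = deltas
    cx = cx_a + tx * w_a          # shift the center, in units of anchor width
    cy = cy_a + ty * h_a          # shift the center, in units of anchor height
    w = w_a * math.exp(tw)        # rescale the width
    h = h_a * math.exp(th)        # rescale the height
    return (cx, cy, w, h)

# Illustrative anchor and offsets: move right and up, double the height.
refined = decode_box((100.0, 100.0, 64.0, 64.0), (0.25, -0.5, 0.0, math.log(2.0)))
print(refined)
```

With all offsets at zero, the refined box is the anchor itself, which is why anchors act as reference boxes.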

The common configuration uses 9 pre-defined anchors, combining three different scales and three different height-width ratios (usually 1:1, 2:1 and 1:2). As a region is proposed for each anchor box at each point of the feature map, this step outputs k ⨉ (number of points in the feature map) proposed regions. Finally, the proposed regions are filtered to keep only the best ones (for instance, those with the highest object classification score).
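As a rough illustration of this configuration, the sketch below builds the 9 anchor shapes and counts the proposed regions. The scales 128, 256 and 512 are the ones reported in the Faster R-CNN paper. The 40 ⨉ 60 feature map is only an example size, and the function names are hypothetical.

```python
import math

def make_anchors(scales=(128, 256, 512), ratios=(1.0, 2.0, 0.5)):
    """Return one (width, height) pair per scale/ratio combination.

    Each ratio is height / width; the area is kept equal to scale**2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((w, h))
    return anchors

def count_proposals(feat_h, feat_w, k):
    """One region is proposed per anchor at every feature-map position."""
    return feat_h * feat_w * k

anchors = make_anchors()
k = len(anchors)
print(k)                           # 9
print(count_proposals(40, 60, k))  # 21600
```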

In order to train the RPN, we need to determine a matching strategy between the predicted bounding boxes (one per anchor) and the ground-truth bounding boxes. In Faster R-CNN, a predicted box is matched to a ground-truth box in either of two cases:

it is the predicted box with the highest Intersection over Union (IoU, or Jaccard index) overlap with that ground-truth box.
or it has an IoU overlap higher than 0.7 with any ground-truth box.

Matching strategy between the predicted box (black) and the ground-truth boxes (color).

These predicted boxes are assigned a positive label, that is, their ground truth is considered to be the object class. The predicted boxes that do not meet either of those two criteria and have an IoU lower than 0.3 with all ground-truth boxes are assigned a negative label. Their ground truth is considered to be the background class. The remaining predicted boxes are ignored.
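The labeling rule can be sketched in a few lines of plain Python. This is a simplified illustration, assuming boxes in (x1, y1, x2, y2) format and the rule from the Faster R-CNN paper: a box is positive if it is the best match for some ground-truth box or if its IoU exceeds 0.7, negative if its IoU stays below 0.3 with every ground-truth box, and ignored otherwise. The function names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 = object, 0 = background, -1 = ignored."""
    if not anchors:
        return []
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row) if row else 0.0
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    # The best-overlapping anchor of each ground-truth box is always positive.
    for j in range(len(gt_boxes)):
        best_i = max(range(len(anchors)), key=lambda i: overlaps[i][j])
        if overlaps[best_i][j] > 0:
            labels[best_i] = 1
    return labels

anchors = [(0, 0, 10, 10), (0, 0, 10, 20), (20, 20, 30, 30), (50, 50, 60, 66)]
gt_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
print(label_anchors(anchors, gt_boxes))  # [1, -1, 0, 1]
```

In this toy example the last anchor only reaches an IoU of 0.625, below the 0.7 threshold, but it is still positive because it is the best match for the second ground-truth box.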

Once each predicted box has been assigned a label, the RPN network is trained by minimizing the mean over all the predicted boxes of the following loss function:
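For reference, the per-box multi-task loss from the Faster R-CNN paper can be written as below. Here p is the predicted probability that the box contains an object, p* is its assigned label (1 for a positive box, 0 for a negative one), and t and t* are the predicted and ground-truth box offsets. In the paper, L_cls is a log loss and L_reg is a smooth L1 loss. Note that the paper normalizes the classification and regression sums separately rather than taking a single mean, so this per-box form is a simplification.

```latex
L(p, t) = L_{cls}(p, p^{*}) + \lambda \, p^{*} \, L_{reg}(t, t^{*})
```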

This loss function is a sum of two other loss functions, weighted by λ:

The classification loss of the predicted box.
The regression loss of the predicted box. Note that this regression loss is only computed for the positive anchors (p* = 0 for negative anchors), as the negative anchors are not matched with any ground-truth box.

The detector is trained using a similar loss function. The main difference lies in the classification loss. Indeed the detector component performs a multi-class classification task, instead of a binary one.

Single Shot Face Detection

The DSFD architecture is mainly based on the 2016 SSD: Single Shot MultiBox Detector architecture, from Wei Liu et al. This architecture differs from RPN-based networks in that there is no region proposal step. The coordinates and the content of the bounding box are predicted directly from the feature map, hence the name of the network and its shorter prediction times.

Additionally, instead of using a single feature map for detection, classifiers and regressors are run on several feature maps, located at various depths of a core network, as pictured in the figure below. This core network is composed of the layers of the VGG-16 network (truncated before the classification layers) followed by extra feature convolutional layers. If you want a quick overview of the VGG-16 architecture, you can refer to this blog post. Also note that the VGG-16 layers can be replaced with layers from other fully convolutional networks, such as ResNet.

The architecture of the SSD network. Point-wise classifiers and regressors are run on feature maps at various depths of the base network. The classifiers predict for each point of the feature map a vector of size 84 = (20 classes + background) ⨉ 4 anchors. Similarly, the regressors predict for each point a vector of size 16 = 4 coordinates ⨉ 4 anchors. Hence 4 boxes are predicted for each position in the feature map. For instance, 38⨉38⨉4 = 5776 boxes are predicted for the shallowest feature map, but only 1⨉1⨉4 = 4 for the deepest one.
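The arithmetic in this caption can be checked with a short sketch. It assumes the six feature map sizes of SSD300 (38, 19, 10, 5, 3 and 1) and, to follow the caption, 4 anchors at every position. The original SSD300 configuration actually uses 6 anchors on some of the maps, which gives 8732 boxes in total, so the total below is a simplification.

```python
FEATURE_MAP_SIZES = (38, 19, 10, 5, 3, 1)  # SSD300 feature maps, shallow to deep
NUM_CLASSES = 20                           # object classes, background excluded
ANCHORS_PER_POINT = 4                      # simplification used in the caption

def boxes_per_map(sizes, anchors_per_point):
    """Number of predicted boxes for each square feature map."""
    return [s * s * anchors_per_point for s in sizes]

cls_channels = (NUM_CLASSES + 1) * ANCHORS_PER_POINT  # class scores per point
reg_channels = 4 * ANCHORS_PER_POINT                  # box coordinates per point
counts = boxes_per_map(FEATURE_MAP_SIZES, ANCHORS_PER_POINT)

print(cls_channels, reg_channels)  # 84 16
print(counts)                      # [5776, 1444, 400, 100, 36, 4]
print(sum(counts))                 # 7760
```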

Each feature map corresponds to various receptive field sizes. The receptive field of a feature map is the area in the input image whose pixels have been involved in the computation of each point of the feature map. The deeper the feature map, the wider the receptive field.

Intuitively, this means that the deep feature maps enable the detection of large objects (taking a larger area in the input image), while the shallow feature maps enable the detection of smaller objects.
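The growth of the receptive field with depth can be computed with a standard recurrence: each layer adds (kernel size - 1) times the cumulative stride of the layers before it. The sketch below applies it to the first two blocks of a VGG-style network (3x3 convolutions with stride 1, 2x2 poolings with stride 2). The function name and the layer list are illustrative.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, after each (kernel_size, stride) layer."""
    rf, jump = 1, 1   # jump = distance in input pixels between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# conv3x3, conv3x3, pool2x2, conv3x3, conv3x3, pool2x2
vgg_like = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
print(receptive_field(vgg_like))  # [3, 5, 6, 10, 14, 16]
```

Each pooling layer doubles the jump, so later convolutions widen the receptive field faster than earlier ones.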

As in RPN-based architectures, reference boxes (anchors) are used to parametrize the detection. These boxes are also called priors, as their coordinates are refined by the regressors. In the case of the SSD architecture, a smaller number of anchors is required, only to account for the various possible shapes (width-height ratios) of the bounding boxes, as the detection is already performed at different scales. The scale of the anchors is hence fixed for each feature map and depends on the depth at which we are performing detection.

The number of positive bounding boxes, that is the ones not associated with background by the classifiers, is finally reduced using non-maximum suppression based on the confidence of the classifiers. We will not cover this subject here.

A matching strategy is used to match predicted boxes to ground-truth ones, or to the background class if they share only a small IoU overlap with all ground-truth boxes. The network is then trained using a loss function similar to the one used by the Faster R-CNN architecture. Note however that there is no need here for a region proposal loss function.

Now that we have reviewed these two famous baselines in object detection, let us not forget that the DSFD face detector is the architecture we are interested in. It is time to dive deeper into the novel ideas proposed by its authors.

The Contributions of DSFD

The article introduces three novel alterations to the previous SSD architecture:

A new way of computing the feature maps on which classifications and regressions are conducted.
A variant of the loss functions to be minimized during the training of the architecture.
An improved strategy to match the predictions to the faces in the image.

Feature Enhance Module

The framework of DSFD is illustrated in the following figure. It uses the same backbone network as the SSD network. One key difference here is that the six feature maps at various depths are transformed into six “enhanced” feature maps by a module that the authors call the Feature Enhance Module. The objective of this module, which is illustrated hereafter, is to feed the object classifiers and the bounding-box regressors with more flexible and robust features.

DSFD architecture from the original article. Just like in the SSD architecture, 6 feature maps are computed for each input image. They are called the Original Feature Shot. The difference here is that these feature maps are enhanced into an Enhanced Feature Shot using a Feature Enhance Module, which is shown in the figure below. Classifiers and regressors are trained and applied to both the original and the enhanced feature maps, but are not pictured here. First Shot PAL and Second Shot PAL are two loss functions optimized during learning.

Feature Enhance Module from the DSFD architecture. The computation of the enhanced feature map at a given depth of the core architecture requires both the feature map at this depth level and the feature map at the following level (the “up feature map”, which is smaller).

As shown in the figure above, the Feature Enhance Module (FEM) begins with an element-wise product of the current-level feature map and an up-sampled version of the feature map at the next depth level. The resulting map is split into three parts. Each part undergoes a series of one to three dilated convolutions of rate 3, before the parts are concatenated back into a complete feature map of the same size as the current input. A dilated convolution is a convolution whose kernel is applied not to adjacent pixels but to pixels spaced apart by the rate parameter of the convolution. For more explanations about dilated convolution, you can refer to this blog post.
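To illustrate what the rate parameter does, here is a minimal one-dimensional sketch in plain Python. It is not the FEM itself, only a toy dilated convolution without padding. The function names are hypothetical.

```python
def dilated_conv1d(signal, kernel, rate=1):
    """1-D convolution (no padding) whose kernel taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1   # input samples covered by the kernel
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out

def effective_kernel_size(kernel_size, rate):
    """Extent of the input covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

signal = list(range(10))
print(dilated_conv1d(signal, [1, 1, 1], rate=1))  # [3, 6, 9, 12, 15, 18, 21, 24]
print(dilated_conv1d(signal, [1, 1, 1], rate=3))  # [9, 12, 15, 18]
print(effective_kernel_size(3, 3))                # 7
```

With a rate of 3, a 3x3 kernel therefore covers a 7x7 area of its input while keeping only 9 weights, which widens the receptive field at no extra parameter cost.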

Both original and enhanced feature maps are fed into classifiers and regressors, similar to the ones in the SSD architecture. At training time, the results of the classifiers and regressors of both shots are used to compute the loss functions to optimize:

First Shot Progressive Anchor Loss (PAL) for the original feature shot.
Second Shot PAL for the enhanced feature shot.

At test time, only the classifiers and regressors of the enhanced feature shot are run, and their results are taken as the output of the detection.

Progressive Anchor Loss

As explained in the previous section, during the training of the architecture, the minimized objective of the network is the weighted sum of two loss functions of the set of anchors:
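Based on the description in this section and on the notation of the DSFD paper, the objective should take the form below, where FSL stands for First Shot Loss, SSL for Second Shot Loss, and sa and a are the two sets of anchors. The placement of the weight λ on the second term is an assumption to check against the paper.

```latex
L_{PAL} = L_{FSL}(sa) + \lambda \, L_{SSL}(a)
```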

Each term of this objective function is similar to the detection loss functions presented in the sections about RPN-based detection architectures. However here, two different sets of anchors are used:

The set of anchors a is used to compute the Second Shot Loss (right term).
A set of smaller anchors sa is used to compute the First Shot Loss (left term).

Indeed, as each enhanced feature map is computed using both the corresponding original feature map and a one-level deeper original feature map, the receptive fields of the feature maps are wider in the enhanced feature shot (second shot) than in the original shot. On average, the faces that can be detected by the original shot are smaller. To account for this difference, the authors use a set of smaller anchors in the computation of the First Shot Loss. Remember that only the features from the second shot are used at prediction/test time.

Just as in SSD, a different anchor scale is used for each feature map. The anchor scale in the enhanced shot is twice the anchor scale in the original shot. Only one shape is used for all anchors, 1:1.5, based on the statistics of faces.

Improved Anchor Matching

One of the main issues in object and face detection is that the choice of anchors usually fails to properly cover the space of possible shapes and scales an object or a face can take. This leads to a significant imbalance between the number of positive and negative bounding boxes during training (with many more negative boxes), resulting in a less stable and slower optimization of the regressors and classifiers at each feature map.

To address this issue and make the model more robust to various shapes and sizes, SSD and RPN-based approaches implement various solutions, from sampling the predicted boxes at a fixed positive/negative boxes ratio, to data augmentation strategies to vary the relative scale of the objects in the input image.

The latter approach is the one adopted by DSFD. During training, with a probability of 40%, anchor-based sampling is applied to the input image:

One of the ground-truth faces in the image is randomly selected.
One of the possible anchor scales in the second shot is randomly selected (the anchor scales of the enhanced feature shot are 16, 32, 64, 128, 256, 512).
The image is cropped to a sub-image containing the selected face. The size of the crop is chosen so that the size ratio between sub-image and selected face is 640/S where S is the selected scale.
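The three steps above can be sketched as follows. This is a simplified illustration: the face size is taken as the square root of the box area, which is an assumption, and the position of the crop and its clipping to the image borders are left out. The function names are hypothetical.

```python
import math
import random

ENHANCED_ANCHOR_SCALES = (16, 32, 64, 128, 256, 512)  # second-shot anchor scales
TARGET_SIZE = 640                                     # numerator of the 640/S ratio

def crop_side_for_face(face_size, scale, target_size=TARGET_SIZE):
    """Crop side such that crop_side / face_size equals target_size / scale."""
    return face_size * target_size / scale

def anchor_based_sample(faces, rng=random):
    """Pick a face and an anchor scale at random, return them with the crop side.

    `faces` is a list of (x1, y1, x2, y2) ground-truth boxes.
    """
    x1, y1, x2, y2 = rng.choice(faces)
    scale = rng.choice(ENHANCED_ANCHOR_SCALES)
    face_size = math.sqrt((x2 - x1) * (y2 - y1))  # assumed definition of face size
    return (x1, y1, x2, y2), scale, crop_side_for_face(face_size, scale)

# A 100-pixel face with selected scale 64 gives a 1000-pixel crop.
print(crop_side_for_face(100, 64))  # 1000.0
```

If the crop is then resized to 640 pixels, the selected face ends up about S pixels wide, so it matches one of the anchor scales.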

For the remaining probability, a data augmentation strategy similar to the one used in SSD is applied to the input, that is:

Random crop of the input image (possibly with a minimum overlap with an object). If an object is present, the crop is centered around it.
Resizing to a fixed size.
Random horizontal flip.
Photometric distortions.

Results

The DSFD architecture achieves high accuracy on two major face datasets: WIDER FACE and FDDB. As you can see in the following figures, it overcomes many challenges in face detection tasks: occlusion, makeup, reflections, blur, pose...

Illustration of performance of DSFD on various challenges in face detection, from the original paper. Blue bounding boxes are only drawn when the detector (classifier) confidence is above 0.8.

In the article, the authors claimed first rank on both datasets and conducted an ablation study to show the relative impact of each of their three contributions. On all three subsets of the WIDER dataset (easy, medium and hard), each contribution came with an improvement in Average Precision:

Between 0.4% (easy WIDER) and 5.5% (hard WIDER) for the Feature Enhance Module
Between 0.3% and 0.6% for the Progressive Anchor Loss
Between 0.1% and 0.3% for the Improved Anchor Matching

It is quite surprising that the authors focused their results section on face detection tasks and did not test their algorithm on the classic object detection datasets and challenges. It is true that a large share of Chinese research in computer vision is directed toward face detection and recognition, as shown by the omnipresence of these topics in the latest major contributions to the field.

What can equal the hasty pace of research in computer vision and deep learning? By the time it was published at CVPR 2019, DSFD no longer ranked first on the WIDER FACE dataset.

The authors of DSFD claimed the best performance on the WIDER FACE and FDDB datasets. But research in deep learning and face detection is a true rat race. DSFD has already been beaten on the WIDER FACE dataset, on which it stands in second place on the podium... for now!

The site Papers With Code provides great state-of-the-art leaderboards for various research challenges across many domains (such as computer vision). As shown in the figure above, the performance of DSFD on the WIDER FACE datasets has been beaten by two other approaches since the early months of 2019:

RetinaFace, from the article RetinaFace: Single-stage Dense Face Localisation in the Wild (May 2019)
AInnoFace, from the article Accurate Face Detection for High Performance (May 2019)

Also note that Papers With Code does not compare all the existing methods.

Maybe we will publish an article about those methods too. I hope you enjoyed this article and how it reviewed two major object detection paradigms. Stay tuned!

If you are looking for Computer Vision experts, don't hesitate to contact us!
