
Overview of Deep Learning Object Detection Methods and Detailed Implementation of Faster R‑CNN

This article reviews major deep‑learning object detection approaches—including one‑stage YOLO and SSD and two‑stage RCNN, Fast RCNN, and Faster RCNN—then provides a step‑by‑step explanation of Faster RCNN’s architecture, region‑proposal network, RoI pooling, loss functions, and sample PyTorch code.

Qunar Tech Salon

About the author: Liu Ze joined Qunar's international ticketing technology team in 2014 and now works on big data and artificial-intelligence projects in the strategic development data team, focusing mainly on image and short-video recognition, detection, and classification.

1. Object Detection Methods in Deep Learning

Object detection predicts both the class and the location (x, y, w, h) of each object. Current methods are divided into one‑stage (e.g., YOLO, SSD) and two‑stage (e.g., RCNN, Fast RCNN, Faster RCNN) approaches.

One‑Stage Methods

YOLO (You Only Look Once), 2015: the image is resized to 448×448, passed through the network in a single forward pass, and the raw predictions are filtered. YOLO divides the image into an S×S grid and treats detection as a regression problem, outputting a vector such as [pc, px, py, ph, pw, c1, c2, c3] for each cell, where pc is the objectness confidence, (px, py, ph, pw) encode the box, and c1..c3 are class probabilities.
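To make the grid encoding concrete, here is a small sketch (not from the original article) that decodes one cell's prediction vector back into an absolute box; the grid size, input resolution, and the decode_cell helper are illustrative assumptions:

```python
import numpy as np

S = 7                      # grid size (YOLO v1 uses a 7x7 grid)
img_w, img_h = 448, 448    # YOLO input resolution

# Hypothetical prediction for the cell at row 3, col 4:
# [pc, px, py, ph, pw, c1, c2, c3]
pred = np.array([0.9, 0.5, 0.5, 0.4, 0.3, 0.1, 0.8, 0.1])

def decode_cell(pred, row, col):
    """Convert a cell-relative prediction to absolute image coordinates."""
    pc, px, py, ph, pw = pred[:5]
    cell_w, cell_h = img_w / S, img_h / S
    # (px, py) are offsets within the cell; (pw, ph) are fractions of the image
    cx = (col + px) * cell_w
    cy = (row + py) * cell_h
    w, h = pw * img_w, ph * img_h
    cls = int(np.argmax(pred[5:]))      # most likely class for this cell
    return pc, (cx, cy, w, h), cls

conf, box, cls = decode_cell(pred, row=3, col=4)
```

In practice each cell predicts B boxes (B = 2 in YOLO v1), so the full output tensor has shape S × S × (B·5 + C).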

The network uses 1×1 convolutions to reduce feature dimensions. YOLO runs in real time (around 45 fps for the base model) while maintaining reasonable accuracy.

SSD (Single Shot MultiBox Detector) 2016

SSD improves YOLO by using multi‑scale default boxes and a feature‑pyramid architecture, which increases accuracy for objects of varying sizes. SSD can achieve up to 59 fps, far faster than Faster RCNN’s 7 fps.
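The multi-scale default boxes follow a simple linear rule in the SSD paper: each of the m feature maps k = 1..m gets a scale s_k interpolated between s_min and s_max. A minimal sketch of that formula (ssd_scales is a hypothetical helper name, with the paper's defaults):

```python
def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    """Default-box scale for each of the m feature maps (SSD paper defaults).

    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), for k = 1..m
    """
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
```

With m = 6 feature maps this yields scales 0.2, 0.34, 0.48, 0.62, 0.76, 0.9 (fractions of the input size); each scale is then combined with several aspect ratios to form the default boxes.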

Key concepts introduced: IoU (Intersection‑over‑Union) for measuring overlap and NMS (Non‑Maximum Suppression) for filtering redundant boxes.
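IoU can be computed directly from corner coordinates; a minimal reference implementation (the iou helper is illustrative, not from the article):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    xx1 = max(box_a[0], box_b[0])
    yy1 = max(box_a[1], box_b[1])
    xx2 = min(box_a[2], box_b[2])
    yy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, xx2 - xx1) * max(0.0, yy2 - yy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

NMS uses exactly this quantity: boxes whose IoU with a higher-scoring box exceeds a threshold are suppressed, as in the py_cpu_nms implementation shown later.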

Two‑Stage Methods

Two‑stage methods add a region‑proposal step before classification.

RCNN (Regions with CNN) 2013

RCNN warps each region proposal generated by Selective Search to a fixed size, extracts features with a deep CNN, and classifies them with per-class SVMs. The pipeline is slow because every proposal is pushed through the full network independently.

Fast RCNN 2015

Fast RCNN introduces RoI pooling to share convolutional features across all proposals, reducing computation. It also replaces the SVM with a fully‑connected classification layer, enabling end‑to‑end training with a multi‑task loss.

Faster RCNN 2015

Faster RCNN replaces Selective Search with a Region Proposal Network (RPN) that predicts objectness scores and bounding‑box offsets directly from the shared feature map, greatly speeding up proposal generation.
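The RPN evaluates a fixed set of anchors at every feature-map location, typically 3 aspect ratios × 3 scales = 9 anchors. A sketch of anchor-base generation in the style of common Faster RCNN implementations (function name and defaults are illustrative):

```python
import numpy as np

def generate_anchor_base(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """3 ratios x 3 scales = 9 anchors centered on one feature-map cell,
    returned as (y_min, x_min, y_max, x_max) in input-image pixels."""
    cy = cx = base_size / 2.0
    anchors = np.zeros((len(ratios) * len(scales), 4), dtype=np.float32)
    for i, r in enumerate(ratios):
        for j, s in enumerate(scales):
            # Scale sets the area; the ratio reshapes it while preserving area
            h = base_size * s * np.sqrt(r)
            w = base_size * s / np.sqrt(r)
            idx = i * len(scales) + j
            anchors[idx] = [cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2]
    return anchors
```

This base set is then shifted to every spatial position of the feature map (stride 16 for VGG-16), giving H × W × 9 candidate anchors per image.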

2. Implementation of Faster RCNN

Faster RCNN consists of three parts:

Backbone network for feature extraction (e.g., ResNet, VGG‑16).

RPN that generates region proposals.

Top‑level network that performs final classification and bounding‑box regression.

2.1 Backbone Network (Feature Extraction)

Example code using VGG-16 (the first 30 layers of model.features as the feature extractor, the fully connected layers minus the final classification layer as the classifier head), freezing the first 10 layers so the earliest convolutional features stay fixed:

import torch.nn as nn
from torchvision.models import vgg16

def decom_vgg16():
    # Use VGG-16 for feature extraction; load ImageNet weights so that
    # the layers frozen below carry pretrained features
    model = vgg16(pretrained=True)
    # Conv/ReLU/pool layers up to (not including) the last max-pool
    features = list(model.features)[:30]
    # Drop the final 1000-way classification layer; keep the 4096-dim FC layers
    classifier = list(model.classifier)[:-1]
    classifier = nn.Sequential(*classifier)
    # Freeze the first 10 layers (the first four conv layers and their ReLU/pool)
    for layer in features[:10]:
        for p in layer.parameters():
            p.requires_grad = False
    return nn.Sequential(*features), classifier
2.2 Region Proposal Network (RPN)

The RPN consists of a 3×3 convolution followed by two 1×1 convolutions that output objectness scores and bbox regressions:

self.conv1 = nn.Conv2d(in_channels, mid_channels, 3, 1, 1)
self.score = nn.Conv2d(mid_channels, n_anchor * 2, 1, 1, 0)   # objectness
self.loc   = nn.Conv2d(mid_channels, n_anchor * 4, 1, 1, 0)   # bbox offsets

The RPN loss combines a smooth L1 loss for bbox regression and a cross‑entropy loss for objectness:

import torch
import torch.nn.functional as F

def _smooth_l1_loss(x, t, in_weight, sigma):
    # Quadratic for |diff| < 1/sigma^2, linear beyond; in_weight zeroes out
    # entries that should not contribute (e.g., negative anchors)
    sigma2 = sigma ** 2
    diff = in_weight * (x - t)
    abs_diff = diff.abs()
    flag = (abs_diff < (1.0 / sigma2)).float()
    y = (flag * (sigma2 / 2.0) * (diff ** 2) +
         (1 - flag) * (abs_diff - 0.5 / sigma2))
    return y.sum()

# Objectness loss; anchors labeled -1 (neither positive nor negative) are ignored
rpn_cls_loss = F.cross_entropy(rpn_score, gt_rpn_label.cuda(), ignore_index=-1)

After obtaining rpn_locs and rpn_scores, proposals are filtered with NMS (implemented in pure Python for illustration):

import numpy as np

def py_cpu_nms(dets, thresh):
    x1 = dets[:, 0]; y1 = dets[:, 1]
    x2 = dets[:, 2]; y2 = dets[:, 3]
    scores = dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]
    return keep
2.3 RoI Pooling and Final Classification/Regression

RoI pooling converts each proposal to a fixed‑size feature map. The implementation iterates over proposals, extracts the corresponding region from the shared feature map, and applies max‑pooling to obtain a 7×7 (or other) output.

class RoIPool(nn.Module):
    def __init__(self, pooled_height, pooled_width, spatial_scale):
        super(RoIPool, self).__init__()
        self.pooled_height = int(pooled_height)
        self.pooled_width = int(pooled_width)
        self.spatial_scale = float(spatial_scale)

    def forward(self, features, rois):
        # rois: (N, 5), each row is (batch_index, x1, y1, x2, y2)
        batch_size, num_channels, data_h, data_w = features.size()
        num_rois = rois.size(0)
        outputs = torch.zeros(num_rois, num_channels,
                              self.pooled_height, self.pooled_width,
                              device=features.device)
        for roi_ind, roi in enumerate(rois):
            batch_ind = int(roi[0].item())
            # Map RoI coordinates from image space to feature-map space
            roi_start_w, roi_start_h, roi_end_w, roi_end_h = np.round(
                roi[1:].detach().cpu().numpy() * self.spatial_scale).astype(int)
            roi_width = max(roi_end_w - roi_start_w + 1, 1)
            roi_height = max(roi_end_h - roi_start_h + 1, 1)
            bin_size_w = float(roi_width) / self.pooled_width
            bin_size_h = float(roi_height) / self.pooled_height
            for ph in range(self.pooled_height):
                hstart = int(np.floor(ph * bin_size_h))
                hend = int(np.ceil((ph + 1) * bin_size_h))
                hstart = min(data_h, max(0, hstart + roi_start_h))
                hend = min(data_h, max(0, hend + roi_start_h))
                for pw in range(self.pooled_width):
                    wstart = int(np.floor(pw * bin_size_w))
                    wend = int(np.ceil((pw + 1) * bin_size_w))
                    wstart = min(data_w, max(0, wstart + roi_start_w))
                    wend = min(data_w, max(0, wend + roi_start_w))
                    if hend <= hstart or wend <= wstart:
                        outputs[roi_ind, :, ph, pw] = 0
                    else:
                        # Max-pool each channel over this bin of the feature map
                        patch = features[batch_ind, :, hstart:hend, wstart:wend]
                        outputs[roi_ind, :, ph, pw] = patch.reshape(num_channels, -1).max(1)[0]
        return outputs

After RoI pooling, the pooled features are fed into two fully‑connected layers (each 4096‑dim). One branch predicts class scores, the other predicts bounding‑box offsets:

self.cls_loc = nn.Linear(4096, n_class * 4)   # bbox regression
self.score   = nn.Linear(4096, n_class)       # classification
fc7 = self.classifier(pool)
roi_cls_locs = self.cls_loc(fc7)
roi_scores   = self.score(fc7)

During inference, the input image (e.g., 375×500) is resized so that its shorter side has a fixed length (e.g., 600), anchors are generated (e.g., 9 per spatial location), the RPN keeps roughly 300 proposals after NMS, RoI pooling converts each proposal to a fixed-size feature, and the top-level network outputs the final detections.
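Both the RPN and the top-level head predict offsets (dy, dx, dh, dw) relative to anchors or proposals; decoding them back into boxes follows the standard R-CNN parameterization. A minimal NumPy sketch (loc2bbox is a hypothetical helper name):

```python
import numpy as np

def loc2bbox(src_bbox, loc):
    """Decode offsets (dy, dx, dh, dw) against source boxes (y1, x1, y2, x2).

    Center offsets are relative to the source box size; height and width
    are scaled exponentially, as in the R-CNN bbox-regression parameterization.
    """
    h = src_bbox[:, 2] - src_bbox[:, 0]
    w = src_bbox[:, 3] - src_bbox[:, 1]
    cy = src_bbox[:, 0] + 0.5 * h
    cx = src_bbox[:, 1] + 0.5 * w
    dy, dx, dh, dw = loc[:, 0], loc[:, 1], loc[:, 2], loc[:, 3]
    new_cy = dy * h + cy
    new_cx = dx * w + cx
    new_h = np.exp(dh) * h
    new_w = np.exp(dw) * w
    # Convert center/size back to corner coordinates
    dst = np.empty_like(src_bbox)
    dst[:, 0] = new_cy - 0.5 * new_h
    dst[:, 1] = new_cx - 0.5 * new_w
    dst[:, 2] = new_cy + 0.5 * new_h
    dst[:, 3] = new_cx + 0.5 * new_w
    return dst
```

Zero offsets return the source boxes unchanged; a dh of ln 2 doubles the box height around its center. The same decoding is applied per class to roi_cls_locs before the final NMS.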

Conclusion

The article introduced major object‑detection techniques—one‑stage YOLO/SSD and two‑stage RCNN variants—and provided a detailed walkthrough of Faster RCNN’s architecture, including backbone selection, RPN design, RoI pooling, loss functions, and sample PyTorch implementations. Object detection remains a vibrant research area with applications in autonomous driving, robotics, and aerial imaging.

References

Felzenszwalb & Huttenlocher, "Efficient Graph‑Based Image Segmentation", IJCV 2004.

Girshick, "Fast R‑CNN", arXiv:1504.08083, 2015.

Ren et al., "Faster R‑CNN: Towards Real‑Time Object Detection with Region Proposal Networks", arXiv:1506.01497, 2015.

Lin et al., "Feature Pyramid Networks for Object Detection", arXiv:1612.03144, 2016.

Dalal & Triggs, "Histograms of Oriented Gradients for Human Detection", CVPR 2005.

He et al., "Mask R‑CNN", arXiv:1703.06870, 2017.

Lowe, "Object recognition from local scale‑invariant features", ICCV 1999.

Uijlings et al., "Selective Search for Object Recognition", IJCV 2013.

He et al., "Deep Residual Learning for Image Recognition", arXiv:1512.03385, 2015.

Krizhevsky et al., "ImageNet classification with deep convolutional neural networks", CACM 2012.

GitHub repositories: jwyang/faster‑rcnn.pytorch, chenyuntc/simple‑faster‑rcnn‑pytorch, facebookresearch/Detectron, rbgirshick/fast‑rcnn, rbgirshick/py‑faster‑rcnn.
