
Integrating MobileNet Series into YOLOv4 for Efficient Object Detection

This guide explains how to replace YOLOv4's CSPDarknet53 backbone with MobileNetV1, V2, or V3, covering architecture analysis, code implementation, dataset preparation, training setup, and inference for building a lightweight object detection model.

Python Programming Learning Circle

This article introduces the concept of using the MobileNet series (V1, V2, V3) as a lightweight backbone for the YOLOv4 object detection model.

It first analyzes the YOLOv4 architecture, dividing it into three parts: the backbone (CSPDarknet53), feature enhancement (SPP and PANet), and the prediction head (YOLO head). It then explains why the first two parts are amenable to replacement.

Next, each MobileNet variant is described. MobileNetV1 relies on depthwise separable convolutions, MobileNetV2 adds inverted residual blocks with linear bottlenecks, and MobileNetV3 combines inverted residuals, SE attention, and h‑swish activation. Illustrative diagrams are referenced for each.
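The efficiency win behind MobileNetV1's depthwise separable convolution is easy to quantify: a standard k×k convolution needs k·k·Cin·Cout weights, while the depthwise-plus-pointwise pair needs only k·k·Cin + Cin·Cout. A quick arithmetic sketch (the channel sizes are illustrative):

```python
# Weight counts for one 3x3 layer: standard conv vs depthwise-separable pair
def standard_params(cin, cout, k=3):
    return k * k * cin * cout            # full k x k x Cin x Cout kernel

def separable_params(cin, cout, k=3):
    return k * k * cin + cin * cout      # depthwise k x k + pointwise 1x1

cin = cout = 256
print(standard_params(cin, cout))        # 589824
print(separable_params(cin, cout))       # 67840, roughly 8.7x fewer weights
```

The same ratio applies to multiply-accumulate operations, which is why stacking these pairs yields a backbone light enough for mobile inference.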

PyTorch implementations of the three MobileNet backbones are provided:

import torch
import torch.nn as nn

def conv_bn(inp, oup, stride=1):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup),
        nn.ReLU6(inplace=True)
    )

def conv_dw(inp, oup, stride=1):
    return nn.Sequential(
        nn.Conv2d(inp, inp, 3, stride, 1, groups=inp, bias=False),
        nn.BatchNorm2d(inp),
        nn.ReLU6(inplace=True),
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup),
        nn.ReLU6(inplace=True),
    )

class MobileNetV1(nn.Module):
    def __init__(self):
        super(MobileNetV1, self).__init__()
        self.stage1 = nn.Sequential(
            # 416,416,3 -> 208,208,32
            conv_bn(3, 32, 2),
            # 208,208,32 -> 208,208,64
            conv_dw(32, 64, 1),
            # 208,208,64 -> 104,104,128
            conv_dw(64, 128, 2),
            conv_dw(128, 128, 1),
            # 104,104,128 -> 52,52,256
            conv_dw(128, 256, 2),
            conv_dw(256, 256, 1),
        )
        # 52,52,256 -> 26,26,512
        self.stage2 = nn.Sequential(
            conv_dw(256, 512, 2),
            *[conv_dw(512, 512, 1) for _ in range(5)],
        )
        # 26,26,512 -> 13,13,1024
        self.stage3 = nn.Sequential(
            conv_dw(512, 1024, 2),
            conv_dw(1024, 1024, 1),
        )
        self.avg = nn.AdaptiveAvgPool2d((1,1))
        self.fc = nn.Linear(1024, 1000)
    def forward(self, x):
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.avg(x)
        x = x.view(-1, 1024)
        x = self.fc(x)
        return x

def mobilenet_v1(pretrained=False, progress=True):
    model = MobileNetV1()
    if pretrained:
        print("mobilenet_v1 has no pretrained model")
    return model

if __name__ == "__main__":
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = mobilenet_v1().to(device)
    from torchsummary import summary
    summary(model, input_size=(3, 416, 416))

MobileNetV2 and MobileNetV3 implementations follow a similar pattern, using InvertedResidual blocks, SE layers, and h‑swish activation. The full source code for these variants is included in the article.

from torch import nn

class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.ReLU6(inplace=True)
        )

class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]
        hidden_dim = int(round(inp * expand_ratio))
        self.use_res_connect = self.stride == 1 and inp == oup
        layers = []
        if expand_ratio != 1:
            layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1))
        layers.extend([
            ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim),
            nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup),
        ])
        self.conv = nn.Sequential(*layers)
    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)

# MobileNetV2 and MobileNetV3 classes omitted for brevity but follow the same structure.
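The two MobileNetV3 ingredients the omitted code relies on, SE attention and h-swish, can be sketched as follows. These class names and the reduction ratio are illustrative, not the article's exact definitions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    # h-swish(x) = x * ReLU6(x + 3) / 6: a cheap, quantization-friendly
    # approximation of the swish activation used throughout MobileNetV3
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

class SEBlock(nn.Module):
    # Squeeze-and-Excitation: pool to a per-channel descriptor, then
    # re-weight every channel through a small bottleneck whose gate is
    # a hard sigmoid (as in MobileNetV3)
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Hardsigmoid(),
        )
    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w
```

In MobileNetV3 the SE block sits inside the inverted residual, between the depthwise convolution and the projection, so it re-weights the expanded channels before they are compressed.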

To embed the chosen MobileNet backbone into YOLOv4, the article defines wrapper classes (MobileNetV1, MobileNetV2, MobileNetV3) that expose the three feature maps required by the YOLO head. It also provides utility modules such as SpatialPyramidPooling and Upsample, plus functions that build three- and five-layer convolution blocks.

class SpatialPyramidPooling(nn.Module):
    def __init__(self, pool_sizes=[5, 9, 13]):
        super(SpatialPyramidPooling, self).__init__()
        self.maxpools = nn.ModuleList([nn.MaxPool2d(s, 1, s//2) for s in pool_sizes])
    def forward(self, x):
        features = [p(x) for p in self.maxpools[::-1]]
        return torch.cat(features + [x], dim=1)

class Upsample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(Upsample, self).__init__()
        self.upsample = nn.Sequential(
            # conv2d: a Conv2d + BatchNorm + activation helper defined in the article's full source
            conv2d(in_channels, out_channels, 1),
            nn.Upsample(scale_factor=2, mode='nearest')
        )
    def forward(self, x):
        return self.upsample(x)

def make_three_conv(filters_list, in_filters):
    return nn.Sequential(
        conv2d(in_filters, filters_list[0], 1),
        conv_dw(filters_list[0], filters_list[1]),
        conv2d(filters_list[1], filters_list[0], 1)
    )

def make_five_conv(filters_list, in_filters):
    return nn.Sequential(
        conv2d(in_filters, filters_list[0], 1),
        conv_dw(filters_list[0], filters_list[1]),
        conv2d(filters_list[1], filters_list[0], 1),
        conv_dw(filters_list[0], filters_list[1]),
        conv2d(filters_list[1], filters_list[0], 1)
    )
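Because every pooling branch runs with stride 1 and padding k//2 on an odd kernel, SpatialPyramidPooling preserves spatial size and only grows the channel count by 4x. A quick self-contained check (the class is repeated here so the snippet runs on its own):

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):   # same logic as the article's module
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.maxpools = nn.ModuleList(nn.MaxPool2d(s, 1, s // 2) for s in pool_sizes)
    def forward(self, x):
        return torch.cat([p(x) for p in self.maxpools[::-1]] + [x], dim=1)

x = torch.randn(1, 512, 13, 13)           # a typical 13x13 deep feature map
y = SpatialPyramidPooling()(x)
print(y.shape)                            # torch.Size([1, 2048, 13, 13])
```

This is why the three-conv block that follows SPP in YOLOv4 expects four times the backbone's deepest channel count.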

The central YoloBody class assembles the backbone, SPP, up‑sampling paths, and detection heads. It supports selecting the backbone via the backbone argument ("mobilenetv1", "mobilenetv2", or "mobilenetv3") and automatically adjusts channel dimensions.

class YoloBody(nn.Module):
    def __init__(self, num_anchors, num_classes, backbone="mobilenetv2", pretrained=False):
        super(YoloBody, self).__init__()
        if backbone == "mobilenetv1":
            self.backbone = MobileNetV1(pretrained=pretrained)
            in_filters = [256, 512, 1024]
        elif backbone == "mobilenetv2":
            self.backbone = MobileNetV2(pretrained=pretrained)
            in_filters = [32, 96, 320]
        elif backbone == "mobilenetv3":
            self.backbone = MobileNetV3(pretrained=pretrained)
            in_filters = [40, 112, 160]
        else:
            raise ValueError('Unsupported backbone')
        # ... (omitted detailed layer construction for brevity) ...
    def forward(self, x):
        x2, x1, x0 = self.backbone(x)
        # ... (feature‑pyramid construction and detection heads) ...
        return out0, out1, out2
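The in_filters values above are not arbitrary: walking the standard MobileNetV2 stage schedule from the original paper (expansion t, channels c, repeats n, stride s) shows which channel counts are live at the three output strides YOLO taps, and the head's output width follows from the per-anchor prediction layout. A sketch under those standard configurations:

```python
# MobileNetV2 stage schedule from the original paper: (t, c, n, s)
cfg = [(1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
       (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1)]
size, taps = 416 // 2, {}                 # the stem conv halves a 416 input
for t, c, n, s in cfg:
    size //= s                            # only a stage's first block strides
    taps[size] = c                        # the last stage at a resolution wins
print([taps[52], taps[26], taps[13]])     # [32, 96, 320], YoloBody's in_filters

# Each head predicts, per anchor: 4 box offsets + 1 objectness + class scores
num_anchors, num_classes = 3, 20          # e.g. VOC's 20 classes
print(num_anchors * (5 + num_classes))    # 75 output channels per detection head
```

The same walk against MobileNetV3's schedule yields the [40, 112, 160] taps used in the mobilenetv3 branch.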

For training, the article walks through preparing a VOC-style dataset: placing annotations in VOCdevkit/VOC2007/Annotations and images in JPEGImages, generating 2007_train.txt and 2007_val.txt with voc_annotation.py, and configuring train.py parameters such as classes_path, anchors_path, input_shape, and backbone, along with optional tricks like mosaic augmentation, cosine learning-rate scheduling, and label smoothing.
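Annotation files of this kind commonly store one image per line: the image path followed by space-separated x_min,y_min,x_max,y_max,class_id tuples. The exact format voc_annotation.py emits may differ; a hedged sketch of the conversion, with a placeholder class list:

```python
import xml.etree.ElementTree as ET

# Placeholder class list: in practice this comes from classes_path
CLASSES = ["person", "car", "dog"]

def voc_to_line(xml_str, image_path):
    """Convert one VOC XML annotation into a 'path x1,y1,x2,y2,cls ...' line."""
    root = ET.fromstring(xml_str)
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in CLASSES:           # skip classes we are not training on
            continue
        bb = obj.find("bndbox")
        coords = [int(float(bb.find(k).text))
                  for k in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append(",".join(map(str, coords + [CLASSES.index(name)])))
    return " ".join([image_path] + boxes)
```

For example, an XML with a single car box (10, 20) to (110, 220) would produce a line like "path/to/img.jpg 10,20,110,220,1" under this class list.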

Finally, inference is performed with yolo.py and predict.py after updating model_path to the trained checkpoint and classes_path to the class list. The article notes that the backbone choice must match the pretrained weights used during training.

Tags: Computer Vision, Deep Learning, Object Detection, MobileNet, PyTorch, YOLOv4
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
