PP-LCNet: A Lightweight CPU-Optimized Convolutional Neural Network
PP-LCNet is a lightweight convolutional neural network designed for Intel CPUs. It combines MKLDNN acceleration, the H‑Swish activation, selectively placed SE modules, larger kernels, and an expanded fully‑connected layer to raise accuracy without increasing inference latency across image classification, detection, and segmentation tasks.
This article introduces PP-LCNet, a lightweight CNN optimized for Intel CPUs with MKLDNN acceleration, which aims to improve the performance of lightweight models on various tasks while keeping inference latency minimal.
Abstract
The paper proposes PP-LCNet, a lightweight network accelerated by MKLDNN, which improves accuracy without adding latency. Experiments on ImageNet and downstream tasks (object detection, semantic segmentation) show superior performance compared to prior architectures, with code and pretrained models available on PaddleClas.
1. Introduction
Convolutional neural networks (CNNs) dominate computer‑vision tasks such as image classification, detection, and segmentation. As model capacity grows, fast inference on ARM‑based mobile devices and x86 CPUs becomes challenging. Existing mobile‑friendly models do not run optimally on Intel CPUs with MKLDNN. This work revisits the design of lightweight models for Intel CPUs, focusing on three questions: (1) how to enhance feature representation while keeping latency low, (2) which factors boost accuracy on CPUs, and (3) how to combine design strategies effectively.
The main contribution is a collection of techniques that improve accuracy without increasing inference time, and a set of general principles for designing efficient CNNs on CPUs, providing new insights for NAS researchers.
2. Related Works
Two main streams improve model performance: manually designed CNN architectures and neural architecture search (NAS). Manual designs include VGG, GoogLeNet, MobileNetV1/V2, ShuffleNet, GhostNet, etc. NAS‑based methods such as EfficientNet, MobileNetV3, FBNet, DNANet, OFANet, and MixNet explore automated search spaces, often building on MobileNetV2‑style blocks.
3. Approach
Many lightweight networks perform well on ARM devices but are rarely evaluated on Intel CPUs with MKLDNN. We adopt Depthwise Separable Convolution (DepthSepConv) from MobileNetV1 as the basic module, avoiding shortcuts that hinder CPU speed. Stacking these modules forms a BaseNet, which is then combined with additional techniques to create PP‑LCNet.
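To make the DepthSepConv building block concrete, here is a minimal NumPy sketch of its arithmetic: a per-channel (depthwise) k×k convolution followed by a 1×1 (pointwise) convolution that mixes channels. This is an illustration of the operation, not the paper's PaddlePaddle implementation; batch norm and the activation are omitted for brevity.

```python
import numpy as np

def depth_sep_conv(x, dw_kernel, pw_kernel, stride=1):
    """DepthSepConv sketch: depthwise conv then pointwise 1x1 conv.
    x: (C, H, W), dw_kernel: (C, k, k), pw_kernel: (C_out, C).
    BN and activation omitted."""
    c, h, w = x.shape
    k = dw_kernel.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h_out = (h - 1) // stride + 1
    w_out = (w - 1) // stride + 1
    dw = np.zeros((c, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = xp[:, i * stride:i * stride + k, j * stride:j * stride + k]
            # depthwise: each channel is convolved with its own filter
            dw[:, i, j] = (patch * dw_kernel).sum(axis=(1, 2))
    # pointwise 1x1 conv mixes channels: (C_out, C) x (C, H, W) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', pw_kernel, dw)
```

Because each channel is filtered independently before the 1×1 mix, the cost is far lower than a full k×k convolution over all channel pairs, which is why stacking these blocks keeps the BaseNet fast on CPUs.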
3.1 Better activation function
Replacing ReLU with H‑Swish in the BaseNet yields a large performance boost with negligible impact on inference time.
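H‑Swish is cheap on CPUs because it replaces the sigmoid in Swish with a piecewise-linear ReLU6 approximation, x · ReLU6(x + 3) / 6. A pure-Python sketch:

```python
def relu6(x):
    # Clip to the range [0, 6]
    return min(max(x, 0.0), 6.0)

def h_swish(x):
    # H-Swish: x * ReLU6(x + 3) / 6, a piecewise-linear approximation of Swish
    return x * relu6(x + 3.0) / 6.0

# For x >= 3 it reduces to the identity; for x <= -3 it is exactly zero.
```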
3.2 SE modules at appropriate positions
SE modules improve channel attention but increase CPU latency. Experiments show that placing SE modules only at the network’s tail provides the best accuracy‑speed trade‑off.
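The squeeze-and-excitation operation itself is small: a global average pool, two fully-connected layers with a channel reduction in between, and a gate that rescales each channel. Below is a NumPy sketch assuming a hard-sigmoid gate (as in MobileNetV3-style SE blocks); biases and exact layer shapes are illustrative, not taken from the paper.

```python
import numpy as np

def hard_sigmoid(x):
    # Piecewise-linear sigmoid approximation, cheap on CPUs
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def se_block(x, w1, w2):
    """Squeeze-and-Excitation sketch. x: (C, H, W),
    w1: (C // r, C), w2: (C, C // r) for reduction ratio r. Biases omitted."""
    s = x.mean(axis=(1, 2))            # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)        # excitation FC1 + ReLU
    scale = hard_sigmoid(w2 @ z)       # excitation FC2 + gate -> (C,) in [0, 1]
    return x * scale[:, None, None]    # reweight each channel
```

The per-channel gating is why SE helps accuracy, and the two matmuls plus pooling are why it costs latency; restricting it to the tail, where feature maps are small, keeps that cost low.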
3.3 Larger convolution kernels
Using a single large kernel (5×5) only in the final layers, instead of mixing sizes within a layer, maintains low latency while enhancing accuracy.
3.4 Larger dimensional 1×1 conv layer after GAP
After global average pooling, a 1280‑dimensional 1×1 convolution (equivalent to a fully‑connected layer) stores richer features without significantly increasing inference time.
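Since the feature map is 1×1 after global average pooling, the 1×1 convolution degenerates into a plain matrix-vector product, which is why the 1280-dimensional expansion is nearly free at inference time. A minimal sketch (weight matrix shown is hypothetical):

```python
import numpy as np

def gap_head(features, w):
    """features: (C, H, W); w: (1280, C), the expanded 1x1 conv weights.
    After GAP the spatial dims are 1x1, so the 1x1 conv is just a matmul."""
    pooled = features.mean(axis=(1, 2))  # global average pool -> (C,)
    return w @ pooled                    # -> (1280,), fed to the classifier
```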
4. Experiment
4.1 Implementation Details
We re‑implemented MobileNetV1/V2/V3, ShuffleNetV2, PicoDet, and DeepLabV3+ in PaddlePaddle. Training used 4 V100 GPUs; CPU tests ran on an Intel Xeon Gold 6148 with batch size 1 and MKLDNN enabled.
4.2 Image Classification
PP‑LCNet was trained on ImageNet‑1k (1.28 M training images, 50 k validation images) using SGD (weight decay 3e‑5, momentum 0.9, batch 2048, cosine learning‑rate schedule for 360 epochs, initial LR 0.8). Standard data augmentations were applied. Results (top‑1/top‑5 accuracy and inference time) show PP‑LCNet outperforms other lightweight models; SSLD distillation further improves accuracy.
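The cosine learning-rate schedule decays the rate from the initial value down to zero over the full run. A sketch of the basic formula, using the paper's reported settings (initial LR 0.8, 360 epochs) and ignoring any warmup phase:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.8):
    # Cosine decay: base_lr at step 0, 0 at the final step
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```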
4.3 Object Detection
All models were trained on COCO‑2017 (80 classes, 118 k images) using PicoDet as the baseline. PP‑LCNet as backbone achieved higher mAP and faster inference than MobileNetV3.
4.4 Semantic Segmentation
PP‑LCNet was evaluated on Cityscapes using DeepLabV3+ with output stride 32. Compared to MobileNetV3‑large, PP‑LCNet‑0.5x improves mIoU by 2.94% while reducing inference time by 53 ms; PP‑LCNet‑1x also shows gains.
4.5 Ablation Study
We investigated the impact of SE module placement, large‑kernel positioning, and the cumulative effect of the four techniques. Results confirm that SE modules at the tail, 5×5 kernels at the tail, H‑Swish activation, and a larger post‑GAP fully‑connected layer each contribute to accuracy without notable latency increase.
5. Conclusion and Future Work
We summarized methods for designing lightweight Intel‑CPU networks that improve accuracy without extra latency. PP‑LCNet demonstrates strong performance across classification, detection, and segmentation, and reduces NAS search space. Future work will explore NAS to discover even faster and more powerful models.
Practical Guide
Task : Classify images as "someone" or "nobody" using PP‑LCNet.
Environment Installation
# CPU only
python3 -m pip install paddlepaddle==2.5.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CUDA 10.2
python3 -m pip install paddlepaddle-gpu==2.5.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CUDA 11.2
python3 -m pip install paddlepaddle-gpu==2.5.2.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# CUDA 11.6
python3 -m pip install paddlepaddle-gpu==2.5.2.post116 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# CUDA 11.7
python3 -m pip install paddlepaddle-gpu==2.5.2.post117 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# CUDA 12.0
python3 -m pip install paddlepaddle-gpu==2.5.2.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
Install PaddleClas:
pip install paddleclas
Data
wget https://paddleclas.bj.bcebos.com/data/PULC/person_exists.tar
Labels: 0 – nobody, 1 – someone.
Training
export CUDA_VISIBLE_DEVICES=0,1
python3 -m paddle.distributed.launch \
--gpus="0,1" \
tools/train.py \
-c ./ppcls/configs/PULC/person_exists/PPLCNet_x1_0.yaml
Evaluation
python3 tools/eval.py \
-c ./ppcls/configs/PULC/person_exists/PPLCNet_x1_0.yaml \
-o Global.pretrained_model="output/PPLCNet_x1_0/best_model"
Inference
Run the provided script:
python3 tools/infer.py \
-c ./ppcls/configs/PULC/person_exists/PPLCNet_x1_0.yaml \
-o Global.pretrained_model=output/PPLCNet_x1_0/best_model
Example output:
[{'class_ids': [1], 'scores': [0.9999976], 'label_names': ['someone'], 'file_name': 'deploy/images/PULC/person_exists/objects365_02035329.jpg'}]
Adjust -o Global.pretrained_model to point to other checkpoints, or set -o Infer.infer_imgs=xxx to predict other images. The default binary classification threshold is 0.5; it can be changed via -o Infer.PostProcess.threshold=0.9794 as needed.