PP-LCNet: A Lightweight CPU-Optimized Convolutional Neural Network
PP-LCNet is a lightweight convolutional neural network designed for Intel CPUs. It combines MKLDNN acceleration, the H‑Swish activation, selectively placed SE modules, larger kernels, and an expanded fully‑connected layer to raise accuracy without increasing inference latency across image classification, detection, and segmentation tasks.
This article introduces PP-LCNet, a lightweight CNN optimized for Intel CPUs with MKLDNN acceleration, which aims to improve the performance of lightweight models on various tasks while keeping inference latency minimal.
Abstract
The paper proposes PP-LCNet, a lightweight network accelerated by MKLDNN, which improves accuracy without adding latency. Experiments on ImageNet and downstream tasks (object detection, semantic segmentation) show superior performance compared to prior architectures, with code and pretrained models available on PaddleClas.
1. Introduction
Convolutional neural networks (CNNs) dominate computer‑vision tasks such as image classification, detection, and segmentation. As model capacity grows, fast inference on ARM‑based mobile devices and x86 CPUs becomes challenging. Existing mobile‑friendly models do not run optimally on Intel CPUs with MKLDNN. This work revisits the design of lightweight models for Intel CPUs, focusing on three questions: (1) how to enhance feature representation while keeping latency low, (2) which factors boost accuracy on CPUs, and (3) how to combine design strategies effectively.
The main contribution is a collection of techniques that improve accuracy without increasing inference time, and a set of general principles for designing efficient CNNs on CPUs, providing new insights for NAS researchers.
2. Related Works
Two main streams improve model performance: manually designed CNN architectures and neural architecture search (NAS). Manual designs include VGG, GoogLeNet, MobileNetV1/V2, ShuffleNet, GhostNet, etc. NAS‑based methods such as EfficientNet, MobileNetV3, FBNet, DNANet, OFANet, and MixNet explore automated search spaces, often building on MobileNetV2‑style blocks.
3. Approach
Many lightweight networks perform well on ARM devices but are rarely evaluated on Intel CPUs with MKLDNN. We adopt Depthwise Separable Convolution (DepthSepConv) from MobileNetV1 as the basic module, avoiding shortcuts that hinder CPU speed. Stacking these modules forms a BaseNet, which is then combined with additional techniques to create PP‑LCNet.
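To make the DepthSepConv building block concrete, here is a minimal NumPy sketch of its arithmetic: a per-channel (depthwise) k×k convolution followed by a 1×1 (pointwise) convolution that mixes channels. This is an illustration of the operation, not the paper's PaddlePaddle implementation; batch norm and the activation are omitted for brevity.

```python
import numpy as np

def depth_sep_conv(x, dw_kernel, pw_kernel, stride=1):
    """DepthSepConv sketch: depthwise conv then pointwise 1x1 conv.
    x: (C, H, W), dw_kernel: (C, k, k), pw_kernel: (C_out, C).
    BN and activation omitted."""
    c, h, w = x.shape
    k = dw_kernel.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h_out = (h - 1) // stride + 1
    w_out = (w - 1) // stride + 1
    dw = np.zeros((c, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = xp[:, i * stride:i * stride + k, j * stride:j * stride + k]
            # depthwise: each channel is convolved with its own filter
            dw[:, i, j] = (patch * dw_kernel).sum(axis=(1, 2))
    # pointwise 1x1 conv mixes channels: (C_out, C) x (C, H, W) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', pw_kernel, dw)
```

Because each channel is filtered independently before the 1×1 mix, the cost is far lower than a full k×k convolution over all channel pairs, which is why stacking these blocks keeps the BaseNet fast on CPUs.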
3.1 Better activation function
Replacing ReLU with H‑Swish in the BaseNet yields a large performance boost with negligible impact on inference time.
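H‑Swish is cheap on CPUs because it replaces the sigmoid in Swish with a piecewise-linear ReLU6 approximation, x · ReLU6(x + 3) / 6. A pure-Python sketch:

```python
def relu6(x):
    # Clip to the range [0, 6]
    return min(max(x, 0.0), 6.0)

def h_swish(x):
    # H-Swish: x * ReLU6(x + 3) / 6, a piecewise-linear approximation of Swish
    return x * relu6(x + 3.0) / 6.0

# For x >= 3 it reduces to the identity; for x <= -3 it is exactly zero.
```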
3.2 SE modules at appropriate positions
SE modules improve channel attention but increase CPU latency. Experiments show that placing SE modules only at the network’s tail provides the best accuracy‑speed trade‑off.
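The squeeze-and-excitation operation itself is small: a global average pool, two fully-connected layers with a channel reduction in between, and a gate that rescales each channel. Below is a NumPy sketch assuming a hard-sigmoid gate (as in MobileNetV3-style SE blocks); biases and exact layer shapes are illustrative, not taken from the paper.

```python
import numpy as np

def hard_sigmoid(x):
    # Piecewise-linear sigmoid approximation, cheap on CPUs
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def se_block(x, w1, w2):
    """Squeeze-and-Excitation sketch. x: (C, H, W),
    w1: (C // r, C), w2: (C, C // r) for reduction ratio r. Biases omitted."""
    s = x.mean(axis=(1, 2))            # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)        # excitation FC1 + ReLU
    scale = hard_sigmoid(w2 @ z)       # excitation FC2 + gate -> (C,) in [0, 1]
    return x * scale[:, None, None]    # reweight each channel
```

The per-channel gating is why SE helps accuracy, and the two matmuls plus pooling are why it costs latency; restricting it to the tail, where feature maps are small, keeps that cost low.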
3.3 Larger convolution kernels
Using a single large kernel (5×5) only in the final layers, instead of mixing sizes within a layer, maintains low latency while enhancing accuracy.
3.4 Larger dimensional 1×1 conv layer after GAP
After global average pooling, a 1280‑dimensional 1×1 convolution (equivalent to a fully‑connected layer) stores richer features without significantly increasing inference time.
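Since the feature map is 1×1 after global average pooling, the 1×1 convolution degenerates into a plain matrix-vector product, which is why the 1280-dimensional expansion is nearly free at inference time. A minimal sketch (weight matrix shown is hypothetical):

```python
import numpy as np

def gap_head(features, w):
    """features: (C, H, W); w: (1280, C), the expanded 1x1 conv weights.
    After GAP the spatial dims are 1x1, so the 1x1 conv is just a matmul."""
    pooled = features.mean(axis=(1, 2))  # global average pool -> (C,)
    return w @ pooled                    # -> (1280,), fed to the classifier
```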
4. Experiment
4.1 Implementation Details
We re‑implemented MobileNetV1/V2/V3, ShuffleNetV2, PicoDet, and DeepLabV3+ in PaddlePaddle. Training used 4 V100 GPUs; CPU tests ran on an Intel Xeon Gold 6148 with batch size 1 and MKLDNN enabled.
4.2 Image Classification
PP‑LCNet was trained on ImageNet‑1k (1.28 M training images, 50 k validation images) using SGD (weight decay 3e‑5, momentum 0.9, batch 2048, cosine learning‑rate schedule for 360 epochs, initial LR 0.8). Standard data augmentations were applied. Results (top‑1/top‑5 accuracy and inference time) show PP‑LCNet outperforms other lightweight models; SSLD distillation further improves accuracy.
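The cosine learning-rate schedule decays the rate from the initial value down to zero over the full run. A sketch of the basic formula, using the paper's reported settings (initial LR 0.8, 360 epochs) and ignoring any warmup phase:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.8):
    # Cosine decay: base_lr at step 0, 0 at the final step
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```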
4.3 Object Detection
All models were trained on COCO‑2017 (80 classes, 118 k images) using PicoDet as the baseline. PP‑LCNet as backbone achieved higher mAP and faster inference than MobileNetV3.
4.4 Semantic Segmentation
PP‑LCNet was evaluated on Cityscapes using DeepLabV3+ with output stride 32. Compared to MobileNetV3‑large, PP‑LCNet‑0.5x improves mIoU by 2.94% while reducing inference time by 53 ms; PP‑LCNet‑1x also shows gains.
4.5 Ablation Study
We investigated the impact of SE module placement, large‑kernel positioning, and the cumulative effect of the four techniques. Results confirm that SE modules at the tail, 5×5 kernels at the tail, H‑Swish activation, and a larger post‑GAP fully‑connected layer each contribute to accuracy without notable latency increase.
5. Conclusion and Future Work
We summarized methods for designing lightweight Intel‑CPU networks that improve accuracy without extra latency. PP‑LCNet demonstrates strong performance across classification, detection, and segmentation, and reduces NAS search space. Future work will explore NAS to discover even faster and more powerful models.
Practical Guide
Task : Classify images as "someone" or "nobody" using PP‑LCNet.
Environment Installation
# CPU only
python3 -m pip install paddlepaddle==2.5.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CUDA 10.2
python3 -m pip install paddlepaddle-gpu==2.5.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CUDA 11.2
python3 -m pip install paddlepaddle-gpu==2.5.2.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# CUDA 11.6
python3 -m pip install paddlepaddle-gpu==2.5.2.post116 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# CUDA 11.7
python3 -m pip install paddlepaddle-gpu==2.5.2.post117 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# CUDA 12.0
python3 -m pip install paddlepaddle-gpu==2.5.2.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
Install PaddleClas:
pip install paddleclas
Data
wget https://paddleclas.bj.bcebos.com/data/PULC/person_exists.tar
Labels: 0 – nobody, 1 – someone.
Training
export CUDA_VISIBLE_DEVICES=0,1
python3 -m paddle.distributed.launch \
--gpus="0,1" \
tools/train.py \
-c ./ppcls/configs/PULC/person_exists/PPLCNet_x1_0.yaml
Evaluation
python3 tools/eval.py \
-c ./ppcls/configs/PULC/person_exists/PPLCNet_x1_0.yaml \
-o Global.pretrained_model="output/PPLCNet_x1_0/best_model"
Inference
Run the provided script:
python3 tools/infer.py \
-c ./ppcls/configs/PULC/person_exists/PPLCNet_x1_0.yaml \
-o Global.pretrained_model=output/PPLCNet_x1_0/best_model
Example output:
[{'class_ids': [1], 'scores': [0.9999976], 'label_names': ['someone'], 'file_name': 'deploy/images/PULC/person_exists/objects365_02035329.jpg'}]
Adjust -o Global.pretrained_model to point to other checkpoints, or set -o Infer.infer_imgs=xxx to predict other images. The default binary classification threshold is 0.5; it can be changed via -o Infer.PostProcess.threshold=0.9794 as needed.