
WeChat "Scan" Object Detection: Mobile AI Model Design, Optimization, and Deployment

The paper presents a lightweight, anchor‑free, CenterNet‑based object‑ness detector for WeChat’s Scan feature. Built on a ShuffleNetV2 backbone with enlarged 5×5 depth‑wise convolutions, a streamlined detection head, and a Pyramid Interpolation Module, the model is quantized, converted to ONNX, and deployed with NCNN, yielding a 436 KB model that runs in ~15 ms per frame on an iPhone 8 CPU.


Background – The "Scan" feature in WeChat provides a convenient way for users to recognize objects by simply pointing the camera, which requires efficient mobile‑side object detection.

Problem – General object detection in open environments demands strong generalization and real‑time performance on mobile devices. The authors define the task as object‑ness detection, focusing only on whether an object exists and its location, without classifying its specific category.

Model Selection – After reviewing many detectors, a one‑stage, anchor‑free architecture is chosen. CenterNet is selected because it uses a single‑head output and Gaussian heat‑map regression, eliminating the need for NMS post‑processing.
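As a rough illustration of why no NMS is needed: CenterNet trains against Gaussian heat‑map targets whose local maxima directly mark detections. The sketch below renders a single object‑ness channel; the fixed sigma and the per‑object max‑merge are simplifying assumptions (CenterNet scales the Gaussian radius with box size).

```python
import numpy as np

def gaussian_heatmap(h, w, centers, sigma=2.0):
    """Render CenterNet-style Gaussian peaks at object centers.

    Each object contributes a 2-D Gaussian on one 'object-ness'
    channel; overlapping objects are merged with an element-wise max,
    so at inference local maxima of the predicted map ARE the
    detections and no NMS post-processing is required.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)
    return heat

# peaks of exactly 1.0 at the annotated centers
heat = gaussian_heatmap(32, 32, [(8, 8), (24, 20)])
```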

Optimization – Large Receptive Field – The backbone is switched from ResNet‑18 to ShuffleNetV2. All depth‑wise 3×3 convolutions are replaced with 5×5 depth‑wise convolutions to enlarge the receptive field with minimal extra cost. Zero‑padding is used to adapt pretrained 3×3 weights to 5×5 kernels.
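The zero‑padding trick for reusing pretrained weights can be sketched as follows; the `(C, 1, kH, kW)` layout assumes depth‑wise kernels, and the function name is illustrative. Because the padded border is zero, the enlarged layer initially computes exactly the same output as the pretrained 3×3 layer.

```python
import numpy as np

def pad_3x3_to_5x5(w3):
    """Embed pretrained 3x3 depth-wise kernels in 5x5 kernels.

    w3 has shape (C, 1, 3, 3). The returned 5x5 kernel is zero
    everywhere except its central 3x3 window, so before fine-tuning
    the layer behaves identically to the original 3x3 convolution
    (assuming padding is increased from 1 to 2 to keep spatial size).
    """
    c = w3.shape[0]
    w5 = np.zeros((c, 1, 5, 5), dtype=w3.dtype)
    w5[:, :, 1:4, 1:4] = w3
    return w5
```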

Optimization – Light Head – The original CenterNet detection head (U‑Net style) is modified: ordinary 3×3 convolutions become 5×5 depth‑wise convolutions, deformable convolutions are replaced with depth‑wise deformable versions, and multi‑head residual connections are changed to channel‑concatenation, reducing computational overhead.
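A minimal sketch of one such head block, assuming a depth‑wise 5×5 followed by a point‑wise 1×1, with channel concatenation standing in for the residual add (class name and channel widths are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class DWBlock5x5(nn.Module):
    """5x5 depth-wise conv + 1x1 point-wise conv: the cheap substitute
    for an ordinary 3x3 convolution in the detection head."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # groups=c_in makes the 5x5 convolution depth-wise
        self.dw = nn.Conv2d(c_in, c_in, 5, padding=2, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pw(self.dw(x))

# Channel concatenation in place of a residual add: the two feature
# maps keep separate channels instead of being summed element-wise.
x = torch.randn(1, 32, 40, 40)
y = DWBlock5x5(32, 32)(x)
fused = torch.cat([x, y], dim=1)  # doubles the channel count
```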

Optimization – Pyramid Interpolation Module (PIM) – To replace deformable convolutions, a PIM inspired by PSPNet’s pyramid pooling is introduced. It fuses multi‑scale features through three parallel branches (dilated deconvolution, convolution followed by upsampling, and global average pooling with a fully connected layer) while simultaneously providing 2× upsampling.
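A possible shape of the PIM, under assumed kernel sizes, channel widths, and fusion by summation (the paper’s exact branch configuration may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PIM(nn.Module):
    """Sketch of a Pyramid Interpolation Module: three branches fuse
    multi-scale context and jointly provide 2x upsampling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # branch 1: dilated transposed conv, stride 2 -> 2x upsample
        self.deconv = nn.ConvTranspose2d(c_in, c_out, 4, stride=2,
                                         padding=3, dilation=2,
                                         output_padding=1)
        # branch 2: 3x3 conv, then bilinear upsample to 2x size
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        # branch 3: global average pooling + 1x1 conv acting as an FC,
        # broadcast back over the upsampled spatial grid
        self.fc = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        h, w = x.shape[2] * 2, x.shape[3] * 2
        b1 = self.deconv(x)
        b2 = F.interpolate(self.conv(x), size=(h, w),
                           mode="bilinear", align_corners=False)
        g = F.adaptive_avg_pool2d(x, 1)
        b3 = self.fc(g).expand(-1, -1, h, w)
        return b1 + b2 + b3  # assumed fusion: element-wise sum
```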

Deployment – The final model is trained with MMDetection (PyTorch) and converted to ONNX and then to NCNN for mobile inference, with 16‑bit quantization. Conv‑BN‑Scale layers are fused, reducing parameters by ~5% and speeding up inference by 5‑10%. The resulting model is 436 KB and runs at 15 ms per frame on an iPhone 8 A11 CPU.
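The Conv‑BN‑Scale fusion is a standard algebraic fold: the BN statistics and scale/shift are absorbed into the convolution’s weights and bias, so the BN layers disappear at inference. A NumPy sketch, with tensor shapes as commented:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm (with its scale gamma and shift beta) into the
    preceding convolution.

    BN(conv(x)) = gamma * (W*x + b - mean) / sqrt(var + eps) + beta
    is rewritten as a single convolution with rescaled weights and a
    new bias, removing the BN computation entirely at inference time.
    """
    scale = gamma / np.sqrt(var + eps)          # one factor per out-channel
    w_fused = w * scale[:, None, None, None]    # w: (C_out, C_in, kH, kW)
    b_fused = (b - mean) * scale + beta         # b, mean, var: (C_out,)
    return w_fused, b_fused
```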

Outlook – Future work includes addressing the explosion of detection heads with more categories, finding alternatives to deformable convolutions, and further optimizing the U‑Net‑style upsampling.

Tags: model optimization, object detection, mobile AI, real-time inference, anchor-free, CenterNet, ShuffleNetV2
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
