How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.

AI inferenceINT4INT8

0 likes · 16 min read

How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

DaTaobao Tech

Oct 16, 2024 · Artificial Intelligence

Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

The article details MNN’s CPU backend dynamic quantization for Transformer‑type models, describing runtime int8 conversion, block‑wise matrix‑multiply optimizations using ARM SMMLA/SDOT and AVX‑512 VNNI, weight‑group and batch‑wise quantization techniques, and reports up to three‑fold speed‑ups on Snapdragon 8 Gen 3.

CPU optimizationDynamic QuantizationINT8

0 likes · 19 min read

Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

DaTaobao Tech

Nov 24, 2023 · Artificial Intelligence

Performance Optimization of Depthwise Conv Int8 on ARM CPUs

By converting the input format to a C16 layout and exploiting the ARM V8.2 Sdot instruction, the Int8 depthwise‑convolution operator on ARM CPUs can be accelerated from 4.46 ms to 1.75 ms—a 2.5× speedup—though the required data‑rearrangement overhead prevents it from overtaking FP16 performance.

ARMDepthwiseConvolutionINT8

0 likes · 10 min read

Performance Optimization of Depthwise Conv Int8 on ARM CPUs