
ARMv8.6 Instruction Set Optimization for MNN: Accelerating INT8 and BF16 Matrix Multiplication

This article explains how the SMMLA and BFMMLA matrix multiplication instructions introduced in ARMv8.6 are integrated into MNN to accelerate INT8 and BF16 matrix multiplication, delivering up to roughly 90% speedup over ARMv8.2's SDOT and FP16 FMLA kernels through optimized kernels, loop tiling, and compile-time and runtime compatibility handling.

DaTaobao Tech

This article discusses the optimization of MNN (a mobile neural network inference engine) using ARMv8.6 instruction set extensions. ARMv8.6 introduces new general matrix multiplication (GEMM) instructions and BF16 support, theoretically doubling the throughput of ARMv8.2's SDOT instruction. The article focuses on implementing the ConvInt8 and MatMul operators with these new instructions, achieving up to 90% performance improvement.
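The "theoretically twice" claim follows from per-instruction multiply-accumulate counts, which the arithmetic below sketches: on 128-bit NEON registers, SDOT performs four independent 4-element INT8 dot products, while SMMLA computes a full 2x8 by 8x2 INT8 matrix product.

```python
# Per-instruction INT8 multiply-accumulate (MAC) counts on 128-bit registers.
# SDOT: 4 lanes, each a 4-element int8 dot product -> 16 MACs per instruction.
sdot_macs = 4 * 4
# SMMLA: (2x8) @ (8x2) int8 matrix product -> 2 * 2 * 8 = 32 MACs per instruction.
smmla_macs = 2 * 2 * 8
# Theoretical per-instruction throughput ratio.
speedup = smmla_macs / sdot_macs
print(speedup)  # 2.0
```

The measured gains reported later (~90%) land close to, but below, this 2x ceiling because memory traffic and non-GEMM overhead do not scale with instruction throughput.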

The technical background section explains that MNN supports various data types, including FP32, FP16, BF16, and INT8, to accelerate edge inference and reduce memory usage. ARMv8.6's new instructions significantly improve INT8 and BF16 computation performance. The article details the specific instructions SMMLA and BFMMLA, their formats, and how they execute GEMM operations on INT8 and BF16 matrices respectively.
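As a sketch of the documented SMMLA semantics (a simulation, not MNN's code): each 128-bit source register is viewed as a row-major 2x8 INT8 matrix, and the instruction accumulates the 2x2 INT32 product of the first operand with the transpose of the second into the destination. BFMMLA follows the same pattern with 2x4 BF16 operands and an FP32 accumulator.

```python
def smmla(acc, vn, vm):
    """Simulate AArch64 SMMLA: vn and vm each hold a 2x8 int8 matrix
    (row-major in a 128-bit register); acc is a 2x2 int32 accumulator.
    Computes acc += vn @ vm.T, i.e. acc[i][j] += dot(vn row i, vm row j)."""
    for i in range(2):
        for j in range(2):
            acc[i][j] += sum(vn[i][k] * vm[j][k] for k in range(8))
    return acc

# Toy operands: vn rows of all 1s and 2s, vm rows of all 1s and -1s.
acc = [[0, 0], [0, 0]]
vn = [[1] * 8, [2] * 8]
vm = [[1] * 8, [-1] * 8]
print(smmla(acc, vn, vm))  # [[8, -8], [16, -16]]
```

Note the layout implication: because the second operand supplies rows that act as columns of the product, the B matrix must be packed transposed, which shapes the data-rearrangement step in the kernels described below.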

The implementation section covers user interface design, compilation compatibility (using the `.inst` directive to emit raw instruction encodings on older toolchains), execution compatibility (checking CPU feature flags at runtime), and performance optimization through loop tiling. The article provides detailed code examples for implementing the GEMMINT8 and GEMMBF16 kernels, including register allocation strategies and memory access optimization.
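A minimal sketch of the loop-tiling idea (hypothetical tile sizes, not MNN's actual kernel): the GEMM loops are blocked so the innermost body updates a small output micro-tile shaped like the instruction's 2x2 result, keeping its operands register-resident.

```python
def tiled_matmul(A, B, tm=2, tn=2, tk=8):
    """Blocked GEMM sketch: the (tm x tn x tk) inner block mirrors the
    2x8 by 8x2 shape consumed by one SMMLA instruction. Tile sizes are
    illustrative; real kernels tile further for cache and register reuse."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for j0 in range(0, N, tn):
            for k0 in range(0, K, tk):
                # Micro-kernel: one "instruction-shaped" block update.
                for i in range(i0, min(i0 + tm, M)):
                    for j in range(j0, min(j0 + tn, N)):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + tk, K)))
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(A, B))  # [[58, 64], [139, 154]]
```

In the real kernels, each micro-tile update would be one SMMLA or BFMMLA instruction rather than a scalar loop; the tiling only reorders the iteration space, so the result matches a naive matrix multiply exactly.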

Performance comparisons show that SMMLA achieves an 88.47% improvement over SDOT for large convolutions, while BFMMLA provides a 92.10% improvement over FP16 FMLA for large matrix multiplications. The article concludes with a future outlook, noting that the BF16 backend still uses FP32 for non-GEMM operations and that there is room for further optimization.

Tags: Performance Optimization, Mobile AI, MNN, Matrix Multiplication, ARMv8.6, BF16 Acceleration, INT8 Optimization, Neural Network Inference
Written by DaTaobao Tech, the official account of DaTaobao Technology.