
ARMv8.6 Instruction Set Optimization for MNN: Accelerating INT8 and BF16 Matrix Multiplication

This article explains how the SMMLA and BFMMLA matrix multiplication instructions introduced in ARMv8.6 are integrated into MNN to accelerate INT8 and BF16 matrix multiplication, delivering up to roughly 90% speedup over ARMv8.2's SDOT and FP16 FMLA kernels through optimized kernels, loop tiling, and compile-time and runtime compatibility handling.

DaTaobao Tech

This article discusses the optimization of MNN (a mobile neural network inference engine) using ARMv8.6 instruction set extensions. ARMv8.6 introduces new general matrix multiplication (GEMM) instructions and BF16 support, theoretically doubling the throughput of ARMv8.2's SDOT instruction. The article focuses on implementing the ConvInt8 and MatMul operators with these new instructions, achieving up to 90% performance improvement.
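The "theoretically twice" claim follows from per-instruction multiply-accumulate counts, which the arithmetic below sketches: on 128-bit NEON registers, SDOT performs four independent 4-element INT8 dot products, while SMMLA computes a full 2x8 by 8x2 INT8 matrix product.

```python
# Per-instruction INT8 multiply-accumulate (MAC) counts on 128-bit registers.
# SDOT: 4 lanes, each a 4-element int8 dot product -> 16 MACs per instruction.
sdot_macs = 4 * 4
# SMMLA: (2x8) @ (8x2) int8 matrix product -> 2 * 2 * 8 = 32 MACs per instruction.
smmla_macs = 2 * 2 * 8
# Theoretical per-instruction throughput ratio.
speedup = smmla_macs / sdot_macs
print(speedup)  # 2.0
```

The measured gains reported later (~90%) land close to, but below, this 2x ceiling because memory traffic and non-GEMM overhead do not scale with instruction throughput.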

The technical background section explains that MNN supports various data types, including FP32, FP16, BF16, and INT8, to accelerate edge inference and reduce memory usage. ARMv8.6's new instructions significantly improve INT8 and BF16 computation performance. The article details the specific instructions SMMLA and BFMMLA, their formats, and how they execute GEMM operations on INT8 and BF16 matrices respectively.
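As a sketch of the documented SMMLA semantics (a simulation, not MNN's code): each 128-bit source register is viewed as a row-major 2x8 INT8 matrix, and the instruction accumulates the 2x2 INT32 product of the first operand with the transpose of the second into the destination. BFMMLA follows the same pattern with 2x4 BF16 operands and an FP32 accumulator.

```python
def smmla(acc, vn, vm):
    """Simulate AArch64 SMMLA: vn and vm each hold a 2x8 int8 matrix
    (row-major in a 128-bit register); acc is a 2x2 int32 accumulator.
    Computes acc += vn @ vm.T, i.e. acc[i][j] += dot(vn row i, vm row j)."""
    for i in range(2):
        for j in range(2):
            acc[i][j] += sum(vn[i][k] * vm[j][k] for k in range(8))
    return acc

# Toy operands: vn rows of all 1s and 2s, vm rows of all 1s and -1s.
acc = [[0, 0], [0, 0]]
vn = [[1] * 8, [2] * 8]
vm = [[1] * 8, [-1] * 8]
print(smmla(acc, vn, vm))  # [[8, -8], [16, -16]]
```

Note the layout implication: because the second operand supplies rows that act as columns of the product, the B matrix must be packed transposed, which shapes the data-rearrangement step in the kernels described below.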

The implementation section covers user interface design, compilation compatibility (using the `.inst` directive to emit raw instruction encodings on older toolchains), execution compatibility (checking CPU feature flags at runtime), and performance optimization through loop tiling. The article provides detailed code examples for implementing the GEMMINT8 and GEMMBF16 kernels, including register allocation strategies and memory access optimization.
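A minimal sketch of the loop-tiling idea (hypothetical tile sizes, not MNN's actual kernel): the GEMM loops are blocked so the innermost body updates a small output micro-tile shaped like the instruction's 2x2 result, keeping its operands register-resident.

```python
def tiled_matmul(A, B, tm=2, tn=2, tk=8):
    """Blocked GEMM sketch: the (tm x tn x tk) inner block mirrors the
    2x8 by 8x2 shape consumed by one SMMLA instruction. Tile sizes are
    illustrative; real kernels tile further for cache and register reuse."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for j0 in range(0, N, tn):
            for k0 in range(0, K, tk):
                # Micro-kernel: one "instruction-shaped" block update.
                for i in range(i0, min(i0 + tm, M)):
                    for j in range(j0, min(j0 + tn, N)):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + tk, K)))
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(A, B))  # [[58, 64], [139, 154]]
```

In the real kernels, each micro-tile update would be one SMMLA or BFMMLA instruction rather than a scalar loop; the tiling only reorders the iteration space, so the result matches a naive matrix multiply exactly.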

Performance comparisons show that SMMLA achieves an 88.47% improvement over SDOT for large convolutions, while BFMMLA provides a 92.10% improvement over FP16 FMLA for large matrix multiplications. The article concludes with a future outlook, noting that the BF16 backend still uses FP32 for non-GEMM operations and that there is room for further optimization.

Tags: Performance Optimization, Mobile AI, MNN, Matrix Multiplication, ARMv8.6, BF16 Acceleration, INT8 Optimization, Neural Network Inference
Written by DaTaobao Tech, the official account of DaTaobao Technology.