Quick Guide to ARM Assembly Development: Tips, Bugs, and Performance Optimization

This quick‑start guide walks readers through ARM assembly development. It teaches by way of simple template functions, exposes typical parameter‑passing and register bugs along with debugging tricks, and demonstrates a depthwise convolution written in assembly that runs inference roughly 4.7× faster on a Huawei Mate40 Pro than its C++ counterpart. It also covers ARM32/ARM64 register conventions, vector instructions, and floating‑point handling.

DaTaobao Tech

The article shares practical experience on quickly getting started with ARM assembly development, common bugs and debugging methods, and the performance gains of a Convolution Depthwise operator implemented in assembly compared to its C++ version.

Section 1 – Getting Started: Recommends learning assembly by studying simple, familiar functions (e.g., MaxPooling) and using existing operators as templates. It explains the structure of an assembly function and how parameters are passed, and stresses implementing a correct C++ version first.
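To illustrate the "correct C++ version first" advice, here is a minimal sketch of a scalar max-pooling reference that an assembly kernel could later be checked against. The function name `maxPoolReference`, the single-channel row-major layout, and the no-padding behavior are assumptions made for this example; they are not taken from the article or from MNN.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference kernel (name and layout are assumptions for illustration).
// Input: single-channel, row-major int8 image. No padding is applied.
std::vector<int8_t> maxPoolReference(const std::vector<int8_t>& src,
                                     size_t inputHeight, size_t inputWidth,
                                     size_t kernel, size_t stride) {
    assert(kernel > 0 && stride > 0);
    assert(inputHeight >= kernel && inputWidth >= kernel);
    assert(src.size() == inputHeight * inputWidth);

    const size_t outputHeight = (inputHeight - kernel) / stride + 1;
    const size_t outputWidth  = (inputWidth - kernel) / stride + 1;
    std::vector<int8_t> dst(outputHeight * outputWidth);

    for (size_t oy = 0; oy < outputHeight; ++oy) {
        for (size_t ox = 0; ox < outputWidth; ++ox) {
            int8_t best = INT8_MIN;  // smallest int8 value, so any input replaces it
            for (size_t ky = 0; ky < kernel; ++ky) {
                for (size_t kx = 0; kx < kernel; ++kx) {
                    const size_t iy = oy * stride + ky;
                    const size_t ix = ox * stride + kx;
                    const int8_t v = src[iy * inputWidth + ix];
                    if (v > best) {
                        best = v;
                    }
                }
            }
            dst[oy * outputWidth + ox] = best;
        }
    }
    return dst;
}
```

A simple loop nest like this is slow, but it is easy to reason about, which is what makes it useful as the ground truth when the assembly version starts producing unexpected values.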

Section 2 – Common Bugs and Debugging: Lists typical pitfalls such as incorrect function‑parameter passing, misuse of registers, and size‑type mismatches. Provides debugging tips such as using printf to inspect intermediate values and handling register push/pop carefully.
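One way to apply the printf tip is to compare the assembly kernel's output buffer against a trusted reference and print the first element that differs. The sketch below assumes int8 output buffers; the helper name `firstMismatch` is hypothetical and is not part of the article or of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical debugging helper (not from the article).
// Returns the index of the first differing element, or `count` if the buffers match.
size_t firstMismatch(const int8_t* expected, const int8_t* actual, size_t count) {
    assert(expected != nullptr && actual != nullptr);
    for (size_t i = 0; i < count; ++i) {
        if (expected[i] != actual[i]) {
            // Cast to int so the values print as numbers rather than characters.
            std::printf("mismatch at %zu: expected %d, got %d\n",
                        i, static_cast<int>(expected[i]), static_cast<int>(actual[i]));
            return i;
        }
    }
    return count;
}
```

Knowing the index of the first wrong element often narrows the search: an error that starts at a vector-width boundary points at the loop tail, while an error at index 0 more likely points at parameter loading.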

Section 3 – Assembly Implementation of Convolution Depthwise: Shows how the assembly version of the operator speeds up inference on mobile devices (≈4.7× faster than the C++ version on a Huawei Mate40 Pro). Includes sample assembly code for function definition, parameter loading, register usage, loops, and vector operations.
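For readers unfamiliar with the operator, the following is a minimal scalar sketch of what a depthwise convolution computes: each channel is convolved with its own kernel, with no summation across channels. The function name `depthwiseConvReference`, the float data type, the channel-major layout, and the absence of padding are all assumptions for illustration; the article's actual operator and data layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference (not the article's or MNN's implementation).
// Layout assumption: src is [channels][inH][inW], weight is [channels][kernel][kernel],
// both row-major. No padding is applied.
std::vector<float> depthwiseConvReference(const std::vector<float>& src,
                                          const std::vector<float>& weight,
                                          size_t channels, size_t inH, size_t inW,
                                          size_t kernel, size_t stride) {
    assert(kernel > 0 && stride > 0);
    assert(inH >= kernel && inW >= kernel);
    assert(src.size() == channels * inH * inW);
    assert(weight.size() == channels * kernel * kernel);

    const size_t outH = (inH - kernel) / stride + 1;
    const size_t outW = (inW - kernel) / stride + 1;
    std::vector<float> dst(channels * outH * outW, 0.0f);

    for (size_t c = 0; c < channels; ++c) {
        for (size_t oy = 0; oy < outH; ++oy) {
            for (size_t ox = 0; ox < outW; ++ox) {
                float acc = 0.0f;
                // Each channel only ever reads its own input plane and its own kernel.
                for (size_t ky = 0; ky < kernel; ++ky) {
                    for (size_t kx = 0; kx < kernel; ++kx) {
                        const size_t iy = oy * stride + ky;
                        const size_t ix = ox * stride + kx;
                        acc += src[(c * inH + iy) * inW + ix] *
                               weight[(c * kernel + ky) * kernel + kx];
                    }
                }
                dst[(c * outH + oy) * outW + ox] = acc;
            }
        }
    }
    return dst;
}
```

Because channels are independent, the channel dimension is a natural candidate for vector lanes, which is one reason this operator tends to benefit from hand-written vector code.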

Additional topics cover ARM32 vs. ARM64 register rules, vector register usage, floating‑point conversion, and rounding instructions. The article also offers guidance on finding appropriate instructions via ARM documentation and intrinsic references.
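Rounding mode is an easy place for an assembly kernel and its C++ reference to disagree by one. This summary does not say which rounding instructions the article uses, so the sketch below only illustrates three common float-to-integer behaviors in portable C++ and names the AArch64 conversion instruction with the matching rounding rule. The function names are hypothetical, and inputs are assumed to fit in `int32_t` (the hardware instructions saturate on overflow, whereas an out-of-range C++ cast is undefined behavior).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers for illustration. All assume |x| < 2^31.

// Round to nearest, ties to even (same rounding rule as AArch64 FCVTNS).
// Relies on the default floating-point environment (FE_TONEAREST).
int32_t roundTiesToEven(float x) {
    assert(std::fabs(x) < 2147483648.0f);
    return static_cast<int32_t>(std::nearbyint(x));
}

// Round to nearest, ties away from zero (same rounding rule as AArch64 FCVTAS).
int32_t roundTiesAway(float x) {
    assert(std::fabs(x) < 2147483648.0f);
    return static_cast<int32_t>(std::round(x));
}

// Round toward zero (same rounding rule as AArch64 FCVTZS); this is what a plain C++ cast does.
int32_t roundTowardZero(float x) {
    assert(std::fabs(x) < 2147483648.0f);
    return static_cast<int32_t>(x);
}
```

For an input of 2.5f these three functions return 2, 3, and 2 respectively, so a reference written with `std::round` will not match a kernel that rounds ties to even on exact half values.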

The article's code examples are preserved verbatim, for instance:

asm_function MNNAvgPoolInt8
// void MNNAvgPoolInt8(int8_t* dst, int8_t* src, size_t outputWidth, ...)
// Auto load: x0: dst, x1: src, x2: outputWidth, x3: inputWidth
// Load from sp: w8: factor

The article concludes with performance comparison tables and a brief team introduction.

debugging, mobile, performance, assembly, ARM, Neural Network
Written by DaTaobao Tech, the official account of DaTaobao Technology
