
Edge Deployment and Performance Optimization of Large Language Models with MNN

The upgraded mnn‑llm framework adds a unified llm‑export pipeline, cross‑platform inference with tokenizers and disk‑embedding, and ARM‑focused linear‑layer optimizations—including SIMD, hand‑written assembly and 4‑bit quantization—that dramatically speed up prefilling and achieve real‑time LLM conversation on mobile devices within a 2 GB memory budget, outperforming llama.cpp, fastllm and mlc‑llm.

DaTaobao Tech

Background: The rapid development of large language models (LLMs) has created demand for efficient edge deployment. The original ChatGLM‑MNN project has been upgraded and renamed mnn‑llm, integrated into the MNN framework, and extended to support multiple mainstream open‑source LLMs.

Model Export: A new tool llm‑export provides a unified pipeline to convert models from their native training format to ONNX and then to MNN, simplifying the export process for diverse LLM architectures.
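A minimal sketch of what this two-step conversion could look like in practice. The `MNNConvert` flag names used here (`-f`, `--modelFile`, `--MNNModel`, `--weightQuantBits`) reflect common usage of MNN's converter CLI but should be checked against your installed version; the function name is illustrative, not llm-export's actual API:

```python
import subprocess


def onnx_to_mnn(onnx_path: str, mnn_path: str, quant_bits: int = 4) -> list:
    """Build the MNNConvert command for the second stage (ONNX -> MNN)."""
    return [
        "MNNConvert",
        "-f", "ONNX",                           # source format
        "--modelFile", onnx_path,               # ONNX file from torch.onnx.export
        "--MNNModel", mnn_path,                 # output MNN model
        "--weightQuantBits", str(quant_bits),   # 4-bit weight quantization
    ]


# First stage (model-specific, handled inside llm-export):
#   torch.onnx.export(model, example_inputs, "model.onnx", ...)
# Second stage:
#   subprocess.run(onnx_to_mnn("model.onnx", "model.mnn"), check=True)
```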

Deployment: mnn‑llm offers a cross‑platform inference engine with a simple txt2txt interface. It incorporates popular tokenizers (SentencePiece, Tiktoken), disk‑embedding to reduce RAM usage, and an extensible architecture that allows developers to add new LLMs by subclassing a base class.
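Disk-embedding trades a little lookup latency for a large RAM saving: instead of keeping the full vocab × hidden embedding matrix resident, only the rows for the current tokens are read from disk. A minimal sketch of the idea using a NumPy memmap (the class and file layout here are illustrative, not mnn-llm's actual implementation):

```python
import numpy as np


def build_disk_embedding(weights: np.ndarray, path: str) -> None:
    """Dump the embedding matrix to a raw binary file once, at export time."""
    weights.astype(np.float32).tofile(path)


class DiskEmbedding:
    """Look up embedding rows from disk instead of holding the table in RAM."""

    def __init__(self, path: str, vocab_size: int, hidden: int):
        # memmap keeps only the touched pages resident, not the whole table
        self.table = np.memmap(path, dtype=np.float32, mode="r",
                               shape=(vocab_size, hidden))

    def __call__(self, token_ids):
        return np.asarray(self.table[token_ids])  # copies just the needed rows
```

For a 7B-class model (vocabulary around 150K, hidden size 4096), the embedding table alone is on the order of a gigabyte in fp16, so paging it from disk materially reduces the resident footprint.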

Performance Analysis: LLM inference on ARM CPUs can be broken down into three core operator groups—Linear, MatMul, and Memory. Profiling shows Linear operations dominate (>93% of time in the prefilling stage) while MatMul and Memory occupy a small fraction. In the decode stage, Linear’s share decreases as cache grows, but it remains the primary bottleneck.
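A back-of-the-envelope FLOP count makes this profile plausible. Assuming a 7B-class layer (hidden size 4096, MLP intermediate size 11008 — both assumptions for illustration), the Linear work grows with sequence length, while the attention MatMul work grows with sequence length × context length:

```python
def flops_per_layer(seq: int, ctx: int, hidden: int = 4096, inter: int = 11008):
    """Rough multiply-add count for one transformer layer (toy cost model)."""
    # Linear layers: QKV + output projections (~4*h*h) plus the MLP (~3*h*inter)
    linear = 2 * seq * (4 * hidden * hidden + 3 * hidden * inter)
    # Attention MatMuls: Q @ K^T and scores @ V, each roughly seq * ctx * hidden
    matmul = 2 * 2 * seq * ctx * hidden
    return linear, matmul


# Prefill (seq == ctx): Linear dominates overwhelmingly.
# Decode (seq == 1, ctx grows with the KV cache): Linear's share shrinks
# as the cache grows but stays the largest term.
```

With this toy model, Linear is about 99.7% of the count at a 64-token prefill, consistent with the >93% profiling figure, and falls to roughly 75% at decode with an 8K cache.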

Optimization Strategies: The focus is on accelerating Linear layers. Techniques include higher‑performance SIMD instruction sets, hand‑written assembly kernels, and 4‑bit weight quantization (W4A8) with data reordering for better memory locality. The llm‑export base class and an example kernel using ARM’s smmla instruction are shown below.

import torch

class LLM(torch.nn.Module):
    def __init__(self, args):
        super().__init__()
        # load tokenizer, embed, blocks, lm
        self.load_model(args.path)

    def forward(self, input_ids, attention_mask, position_ids, past_key_values):
        hidden_states = self.embed(input_ids)
        presents = []  # updated KV cache, one entry per transformer block
        for i in range(self.block_nums):
            hidden_states, kv = self.blocks[i](hidden_states, attention_mask,
                                               position_ids, past_key_values[i])
            presents.append(kv)
        token_id = self.lm(hidden_states).view(1)  # LM head -> next token id
        presents = torch.stack(presents)
        return token_id, presents

    def export(self):
        # export llm to onnx and mnn
        ...

class Chatglm2_6b(LLM):
    def load_model(self, model_path: str):
        # chatglm2 load impl
        ...

class Qwen_7b(LLM):
    def load_model(self, model_path: str):
        # qwen load impl
        ...

LoopSz_TILE_2:
    // src    : 1 x [2 x 8] : v4
    // weight : 4 x [2 x 8] : v0-3
    // dst    : 1 x 4 x [4] : v16-19
    ld1 {v0.16b, v1.16b}, [x25], #32    // weight
    // int4 to int8: v0, v1, v2, v3
    ushr v8.16b, v0.16b, #4
    and v9.16b, v0.16b, v14.16b
    sub v8.16b, v8.16b, v15.16b
    sub v9.16b, v9.16b, v15.16b
    ushr v10.16b, v1.16b, #4
    and v11.16b, v1.16b, v14.16b
    sub v10.16b, v10.16b, v15.16b
    sub v11.16b, v11.16b, v15.16b
    zip1 v0.16b, v8.16b, v9.16b
    zip2 v1.16b, v8.16b, v9.16b
    zip1 v2.16b, v10.16b, v11.16b
    zip2 v3.16b, v10.16b, v11.16b
    ld1 {v4.16b}, [x24], x15   // src
    .inst 0x4e80a490 // smmla v16.4s, v4.16b, v0.16b
    .inst 0x4e81a491 // smmla v17.4s, v4.16b, v1.16b
    .inst 0x4e82a492 // smmla v18.4s, v4.16b, v2.16b
    .inst 0x4e83a493 // smmla v19.4s, v4.16b, v3.16b
    subs x26, x26, #1
    bne LoopSz_TILE_2
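The int4-to-int8 expansion in the kernel above (ushr for the high nibble, and with the 0x0F mask in v14, subtraction of the zero point in v15, then zip1/zip2 to interleave) can be mirrored in a few lines of NumPy. This is an illustrative re-derivation, assuming a zero point of 8 and high-nibble-first packing, which is what the register usage suggests:

```python
import numpy as np


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Expand packed 4-bit weights (two per uint8 byte) to int8."""
    hi = (packed >> 4).astype(np.int8) - 8    # ushr v8, v0, #4 ; sub v8, v8, v15
    lo = (packed & 0x0F).astype(np.int8) - 8  # and  v9, v0, v14 ; sub v9, v9, v15
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = hi                            # zip1/zip2 interleave the two halves
    out[1::2] = lo
    return out
```

For example, `unpack_int4(np.array([0x0F, 0x87], dtype=np.uint8))` yields `[-8, 7, 0, -1]`.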

Testing: Benchmarks on 1.8B, 6B, and 7B models (with 4‑bit and 8‑bit quantization) were performed on 4‑thread CPUs and Android devices. Compared with llama.cpp, fastllm, and mlc‑llm, mnn‑llm shows a large advantage in prefilling speed on ARM and competitive decode speed on x86.

Conclusion: mnn‑llm enables real‑time LLM conversation on mobile devices with memory usage under 2 GB (e.g., qwen‑1.8b). Larger models still face memory constraints, and GPU performance on mobile is an ongoing optimization target.

Tags: Performance Optimization, LLM, quantization, MNN, ARM CPU, edge deployment