Taobao Tech
Jan 5, 2024 · Mobile Development
Edge Deployment and Performance Optimization of Large Language Models with MNN
The upgraded mnn-llm framework adds a unified llm-export pipeline, cross-platform inference with built-in tokenizers and disk embedding, and ARM-focused linear-layer optimizations, including SIMD intrinsics, hand-written assembly, and 4-bit quantization. Together these changes dramatically speed up prefill and enable real-time LLM conversation on mobile devices within a 2 GB memory budget, outperforming llama.cpp, fastllm, and mlc-llm.
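The 4-bit quantization mentioned above can be illustrated with a minimal sketch of symmetric per-block weight quantization, where each block of weights is mapped to integers in [-8, 7] with one float scale per block. This is an illustrative assumption about the general technique, not MNN's actual kernel or storage layout:

```python
def quantize_4bit(weights, block=32):
    # Symmetric per-block 4-bit quantization: each block of `block` floats
    # is mapped to integers in [-8, 7] plus one float scale per block.
    qs, scales = [], []
    for i in range(0, len(weights), block):
        blk = weights[i:i + block]
        scale = max(abs(v) for v in blk) / 7.0 or 1.0  # avoid zero scale
        scales.append(scale)
        qs.append([max(-8, min(7, round(v / scale))) for v in blk])
    return qs, scales

def dequantize_4bit(qs, scales):
    # Reconstruct approximate float weights from quantized blocks.
    return [q * s for blk, s in zip(qs, scales) for q in blk]

# Round-trip a small synthetic weight vector.
w = [(i - 16) / 8.0 for i in range(32)]
qs, scales = quantize_4bit(w)
w_hat = dequantize_4bit(qs, scales)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

In practice two 4-bit values are packed per byte, cutting weight storage to roughly a quarter of fp16, which is what makes a multi-billion-parameter model fit in a 2 GB budget.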
ARM CPU · LLM · MNN
17 min read