Tagged articles

20 articles

Page 1 of 1

Aug 15, 2025 · Mobile Development

How to Eliminate Text Lag in iOS LLM Chat Apps with Smart Buffering and Typewriter Animation

This article explains how to eliminate stuttered text output in iOS chat applications powered by local LLMs using the MNN framework, by introducing a three‑layer optimization—smart stream buffering, UI update throttling with batch processing, and a typewriter‑style animation—to achieve smooth, near‑online responsiveness.

CLLMMNN

0 likes · 16 min read

How to Eliminate Text Lag in iOS LLM Chat Apps with Smart Buffering and Typewriter Animation

DaTaobao Tech

Apr 21, 2025 · Artificial Intelligence

How MNN LLM Delivers Fast, Stable On‑Device LLM Inference for Android, iOS, and Desktop

Facing DeepSeek R1 server instability, the open‑source MNN LLM framework offers local, mobile‑friendly deployment with model quantization and hardware‑specific optimizations, dramatically improving inference speed, stability, and download reliability across Android, iOS, and desktop platforms while supporting multimodal inputs.

AndroidLLMMNN

0 likes · 11 min read

How MNN LLM Delivers Fast, Stable On‑Device LLM Inference for Android, iOS, and Desktop

DaTaobao Tech

Nov 20, 2024 · Mobile Development

MNN-Transformer: Efficient On‑Device Large Language and Diffusion Model Deployment

MNN‑Transformer provides an end‑to‑end framework that enables large language and diffusion models to run efficiently on modern smartphones by exporting, quantizing (including dynamic int4/int8 and KV cache compression) and executing via a plugin‑engine runtime, achieving up to 35 tokens/s decoding and 2‑3× faster image generation compared with existing on‑device solutions.

LLMMNNdiffusion

0 likes · 15 min read

MNN-Transformer: Efficient On‑Device Large Language and Diffusion Model Deployment

DaTaobao Tech

Oct 16, 2024 · Artificial Intelligence

Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

The article details MNN’s CPU backend dynamic quantization for Transformer‑type models, describing runtime int8 conversion, block‑wise matrix‑multiply optimizations using ARM SMMLA/SDOT and AVX‑512 VNNI, weight‑group and batch‑wise quantization techniques, and reports up to three‑fold speed‑ups on Snapdragon 8 Gen 3.

CPU optimizationDynamic QuantizationINT8

0 likes · 19 min read

Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

DaTaobao Tech

Oct 14, 2024 · Artificial Intelligence

MNN Stable Diffusion: On‑Device Deployment and Performance Optimizations

The article presents Alibaba’s open‑source MNN inference engine, demonstrating how quantization, operator fusion (including fused multi‑head attention, GroupNorm/SplitGeLU, Winograd convolutions), optimized GEMM and memory‑paging enable on‑device Stable Diffusion with 1‑second‑per‑step performance on Snapdragon 8 Gen3 and Apple M3 GPUs, and outlines future speed‑up directions.

AIMNNStable Diffusion

0 likes · 11 min read

MNN Stable Diffusion: On‑Device Deployment and Performance Optimizations

DaTaobao Tech

Jan 5, 2024 · Mobile Development

Edge Deployment and Performance Optimization of Large Language Models with MNN

The upgraded mnn‑llm framework adds a unified llm‑export pipeline, cross‑platform inference with tokenizers and disk‑embedding, and ARM‑focused linear‑layer optimizations—including SIMD, hand‑written assembly and 4‑bit quantization—that dramatically speed up prefilling and achieve real‑time LLM conversation on mobile devices within a 2 GB memory budget, outperforming llama.cpp, fastllm and mlc‑llm.

ARM CPULLMMNN

0 likes · 17 min read

Edge Deployment and Performance Optimization of Large Language Models with MNN

DataFunSummit

Sep 11, 2023 · Artificial Intelligence

Challenges and Insights for Deploying Large Models on Edge with MNN

The talk presents an overview of the MNN inference engine, outlines the end‑to‑end workflow for deploying large language models on mobile devices, discusses technical challenges and practical solutions, and concludes with future directions for edge AI deployment.

AIInference EngineLarge Models

0 likes · 2 min read

Challenges and Insights for Deploying Large Models on Edge with MNN

DaTaobao Tech

Jul 12, 2023 · Artificial Intelligence

Optimizing ChatGLM-6B Deployment with MNN: Model Conversion, Quantization, and Edge Inference

The article details a workflow that converts the PyTorch ChatGLM‑6B model to MNN, splits and compresses embeddings, applies int4/int8 quantization, supports dynamic shapes, and uses hybrid GPU/CPU or CPU‑only loading to enable low‑memory edge inference on PCs and mobile devices with competitive token‑per‑second performance.

ChatGLMLLMMNN

0 likes · 16 min read

Optimizing ChatGLM-6B Deployment with MNN: Model Conversion, Quantization, and Edge Inference

DaTaobao Tech

Nov 18, 2022 · Artificial Intelligence

ARMv86 Instruction Set Optimization for MNN: Accelerating Int8 and BF16 Matrix Multiplication

The article explains how ARMv86’s new SMMLA and BFMMLA GEMM instructions are integrated into MNN to accelerate INT8 and BF16 matrix multiplication, delivering up to 90% speedup over ARMv82’s SDOT and FP16‑FMLA kernels through optimized kernels, tiling, and compatibility handling.

ARMv86MNNMatrix Multiplication

0 likes · 15 min read

ARMv86 Instruction Set Optimization for MNN: Accelerating Int8 and BF16 Matrix Multiplication

DaTaobao Tech

Jul 18, 2022 · Artificial Intelligence

Walle: An End-to-End, General-Purpose, Large-Scale Device-Cloud Collaborative Machine Learning System

Walle is Alibaba’s first end‑to‑end, general‑purpose, large‑scale device‑cloud collaborative machine‑learning platform that manages billions of mobile devices, provides a full‑stack data and compute pipeline, cuts cloud load by 87 %, reduces latency to ~100 ms, and already powers over a trillion daily ML invocations across dozens of Alibaba apps.

MNNMachine LearningOSDI

0 likes · 11 min read

Walle: An End-to-End, General-Purpose, Large-Scale Device-Cloud Collaborative Machine Learning System

DaTaobao Tech

Jul 13, 2022 · Artificial Intelligence

MNN 2.0: A Unified Edge‑Cloud Deep Learning Framework Overview

MNN 2.0 transforms Alibaba’s lightweight deep‑learning engine into a unified edge‑cloud framework, delivering ultra‑small binaries, broad model‑format support, and aggressive CPU/GPU/DSP/NPU optimizations—including SIMD, Winograd, quantization, and sparse computation—while providing Python‑style APIs for preprocessing, inference, and on‑device training.

MNNdeep learningedge computing

0 likes · 18 min read

MNN 2.0: A Unified Edge‑Cloud Deep Learning Framework Overview

Alibaba Terminal Technology

Apr 28, 2022 · Artificial Intelligence

How MNN’s Sparse Computing Boosts Mobile AI Inference Performance

This article details the design and implementation of sparse computation in Alibaba’s MNN inference engine, covering weight sparsity techniques, block‑sparse layouts, performance benchmarks on MobileNet models versus XNNPack, and real‑world deployment cases that demonstrate significant speedups and memory savings on mobile CPUs.

AI accelerationMNNblock sparsity

0 likes · 16 min read

How MNN’s Sparse Computing Boosts Mobile AI Inference Performance

DaTaobao Tech

Mar 11, 2022 · Artificial Intelligence

How Alibaba’s MNN Engine Achieves 350% CPU Speedup and Sparse Acceleration

Alibaba’s MNN, a lightweight high‑performance deep‑learning inference engine, earned top honors in China’s 2022 “Science & Innovation China” awards, and delivers impressive gains such as 350% speedup on X86 CPUs, 2.1‑2.3× acceleration on ARM with sparse models, plus integrated OpenCV/Numpy functionality for edge AI deployment.

AI deploymentAlibabaInference Engine

0 likes · 4 min read

How Alibaba’s MNN Engine Achieves 350% CPU Speedup and Sparse Acceleration

DaTaobao Tech

Feb 17, 2022 · Artificial Intelligence

Unifying Edge AI Training and Deployment: Inside MNN Workbench’s New Workflow

The article outlines how MNN Workbench, Alibaba’s open‑source edge‑AI platform, integrates professional training capabilities, cloud‑based PAI‑DLC resources, multi‑window debugging, and visual Git Flow to streamline end‑to‑end model development, deployment, and iteration for developers of varying expertise.

DebuggingDeploymentEdge AI

0 likes · 10 min read

Unifying Edge AI Training and Deployment: Inside MNN Workbench’s New Workflow

Amap Tech

Jun 4, 2021 · Artificial Intelligence

Deploying Multiple CNN Models on Low‑End Devices with MNN: Memory Tricks and Error Debugging

This article explains how a high‑traffic map service captures road features using client‑side computer‑vision models, details the deployment of many CNNs with the lightweight MNN engine on memory‑constrained devices, and shares practical memory‑saving techniques, inference scheduling, and error‑analysis methods.

AndroidEdge AIMNN

0 likes · 12 min read

Deploying Multiple CNN Models on Low‑End Devices with MNN: Memory Tricks and Error Debugging

DataFunTalk

Mar 25, 2021 · Artificial Intelligence

Optimizing MNN Mobile Neural Network Inference on GPU with OpenCL: Memory Objects, Work‑Group Tuning, and Auto‑Tuning

This article explains how the MNN deep‑learning framework leverages OpenCL to achieve high‑performance inference on mobile, PC and embedded GPUs by diversifying memory objects, aligning data, using local‑memory reductions, selecting optimal work‑group sizes, applying pre‑inference auto‑tuning, caching compiled programs, and providing practical GPU‑friendly model design guidelines.

GPU optimizationMNNOpenCL

0 likes · 20 min read

Optimizing MNN Mobile Neural Network Inference on GPU with OpenCL: Memory Objects, Work‑Group Tuning, and Auto‑Tuning

Alibaba Terminal Technology

Jun 28, 2020 · Frontend Development

Accelerating Frontend AI: From WebGL to MNN.js and Beyond

This article explores the rise of AI in front‑end development during the pandemic, compares frameworks like TensorFlow.js, ONNX.js and WebNN, presents a performance‑focused case study of MNN.js, and outlines practical acceleration tools for cross‑platform web and mini‑program AI applications.

FrontendMNNWasm

0 likes · 10 min read

Accelerating Frontend AI: From WebGL to MNN.js and Beyond

Alibaba Cloud Developer

Jan 14, 2020 · Artificial Intelligence

How Alibaba’s Mini‑Program Powers AR Makeup Try‑On with MNN & TensorFlow.js

This article explains how Alibaba’s merchant mini‑program platform integrates Modiface’s AR makeup try‑on engine using JavaScript, MNN and TensorFlow.js, detailing the architecture, real‑time camera handling, inference acceleration, rendering capabilities, and future evolution of AR features for brands.

ARMNNMini Program

0 likes · 13 min read

How Alibaba’s Mini‑Program Powers AR Makeup Try‑On with MNN & TensorFlow.js

Alibaba Cloud Developer

Jul 2, 2019 · Artificial Intelligence

How MNN Powers Mobile AI: Inside Alibaba’s Open‑Source Inference Engine

Alibaba’s MNN (Mobile Neural Network) engine, now open‑sourced on GitHub, showcases how a lightweight, end‑side deep‑learning inference framework tackles fragmentation, optimizes model conversion, scheduling, and execution across diverse devices, delivering significant performance gains for mobile and IoT AI applications.

Inference EngineMNNOperator fusion

0 likes · 15 min read

How MNN Powers Mobile AI: Inside Alibaba’s Open‑Source Inference Engine

Alibaba Cloud Developer

May 7, 2019 · Artificial Intelligence

What Makes Alibaba’s MNN Engine a Game-Changer for Mobile AI Inference?

Alibaba’s open‑source MNN is a lightweight, high‑performance deep‑learning inference engine optimized for edge devices, supporting multiple model formats and backends, offering portability across iOS, Android, and IoT, with detailed architecture, performance benchmarks, roadmap, and real‑world application examples.

Edge AIMNNPerformance Optimization

0 likes · 12 min read

What Makes Alibaba’s MNN Engine a Game-Changer for Mobile AI Inference?