EdgeRazor Delivers 15× Faster Decoding on PC & Mobile, Solving Low-Bit Collapse
EdgeRazor, an open‑source framework from Nanjing University and Microsoft AI, uses mixed‑precision quantization‑aware distillation to compress large language models to as low as 1.58‑bit, achieving up to 15× faster decoding on PC and mobile, 10× fewer training tokens, and 7× model size reduction while preserving benchmark performance.
Large language models (LLMs) have grown to hundreds of millions or billions of parameters, causing high memory consumption and compute requirements that prevent deployment on resource‑constrained edge devices such as PCs, smartphones, and IoT hardware. Quantization is the dominant lightweight solution, but it faces an “impossible triangle”: post‑training quantization (PTQ) collapses accuracy at ultra‑low bit‑widths, quantization‑aware training (QAT) demands massive compute, and existing quantization‑aware distillation (QAD) lacks flexibility.
EdgeRazor is an open‑source library jointly released by Nanjing University’s LAMDA institute and Microsoft AI. It introduces mixed‑precision quantization‑aware distillation (MPQAD), a plug‑and‑play framework that supports flexible training‑data ratios. The method is described in the paper “EdgeRazor: A Lightweight Framework for Large Language Models via Mixed‑Precision Quantization‑Aware Distillation” (arXiv:2605.04062). Code is available at https://github.com/zhangsq-nju/EdgeRazor and model collections are hosted on Hugging Face.
Comprehensive evaluation and state‑of‑the‑art performance
EdgeRazor was evaluated on three model families: MobileLLM‑350M (base), Qwen3‑0.6B/1.7B (instruction‑tuned), and Qwen2.5‑Omni‑7B (multimodal). The evaluation covers 16 downstream tasks spanning commonsense reasoning, instruction following, mathematics, code generation, and video understanding. Across all model architectures and bit‑width settings, EdgeRazor consistently outperforms PTQ, QAT, and QAD baselines, setting a new state of the art.
Edge deployment speed
Real‑world CPU demos on a PC and a smartphone show decoding running 15× faster on the PC and 12× faster on the phone than the 16‑bit baseline. Overall end‑to‑end latency improves by 10× and 11× respectively, delivering an “instant‑response” experience.
Breaking the low‑bit collapse
On challenging tasks such as GSM8K (math reasoning) and HumanEval (code generation), existing 2‑bit methods suffer catastrophic performance drops. Even at an extreme 1.88‑bit budget, EdgeRazor maintains robust accuracy and significantly outperforms all 2‑bit competitors.
Training efficiency gains
For MobileLLM‑350M, EdgeRazor surpasses the strongest QAT baseline (ParetoQ) while cutting training‑token consumption from 30 B to as few as 3.1 B, a 75‑90 % reduction.
Quantization coverage and compression
Traditional quantization often leaves embedding layers and LM heads unquantized, covering only 73.89 % of parameters. EdgeRazor reaches 99.99 % coverage and attains a 7.03× compression ratio at 1.58‑bit, far exceeding the 2.94× ceiling of prior methods.
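To see why coverage matters so much, consider an idealized size estimate. The sketch below is not from the paper: it assumes every parameter is stored at exactly its nominal bit‑width and ignores scale factors, metadata, and packing overhead, so it will not reproduce the reported 2.94× and 7.03× figures. The function names are illustrative.

```python
def ideal_compression(coverage, low_bits, full_bits=16.0):
    """Idealized size ratio versus a full-precision model.

    `coverage` is the fraction of parameters quantized to `low_bits`;
    the remainder stays at `full_bits`. Scales and packing are ignored.
    """
    avg_bits = coverage * low_bits + (1.0 - coverage) * full_bits
    return full_bits / avg_bits


def coverage_ceiling(coverage):
    """Upper bound on compression as low_bits approaches 0 (requires coverage < 1).

    The unquantized parameters alone keep (1 - coverage) of the original size.
    """
    return 1.0 / (1.0 - coverage)


if __name__ == "__main__":
    for cov in (0.7389, 0.9999):
        print(cov, round(ideal_compression(cov, 1.58), 2))
```

Under these assumptions, leaving 26.11 % of parameters at 16‑bit caps compression at roughly 3.83× no matter how aggressively the rest is quantized. That is why quantizing the embeddings and the LM head is what opens the door to higher ratios.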
Resource‑light deployment
Running on an Apple M4 Pro CPU, the 1.58‑bit EdgeRazor model occupies ~190 MB on disk (1/5.8 of the 16‑bit model) and reduces peak memory usage to 1/2.9, enabling deployment on phones and IoT devices.
Hundred‑megabyte footprint : the model fits within typical storage limits of edge hardware.
Fifteen‑fold decoding acceleration : autoregressive decoding runs at 15.16× the baseline speed, so responses arrive within seconds.
Developer‑friendly design
EdgeRazor’s modular architecture allows zero‑intrusion integration: a few lines of configuration automatically inject quantization into existing full‑precision training pipelines without code refactoring. Training requires three inputs—16‑bit model, mixed data, and a config file—to produce any target low‑bit model.
Code decoupling, plug‑and‑play : no changes to underlying training code are needed.
Minimal configuration, one‑click start : three inputs generate the desired low‑bit model.
Mixed data, free ratio : supports both human‑annotated and synthetic model‑generated data.
Complex low‑level ops automatically handled : the framework loads configurations, injects quantizers (QAT module), and synchronizes distillation losses (KD module) end‑to‑end.
Reduced compute requirements : unlike ParetoQ, which needs 16 GPUs and 30 B tokens, EdgeRazor trains on a single machine with 8 GPUs using only 3.1 B tokens.
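The idea of injecting quantizers from configuration, without touching the training loop, can be sketched in a few lines. Everything here is hypothetical: the class names, the `bits_by_prefix` key, and the layer names are invented for illustration and are not EdgeRazor's actual API or configuration schema.

```python
from dataclasses import dataclass, field


@dataclass
class Linear:
    """Stand-in for a full-precision layer (hypothetical, not a framework class)."""
    name: str


@dataclass
class QuantizedLinear:
    """Wrapper that would fake-quantize the inner layer's weights during training."""
    inner: Linear
    bits: float


@dataclass
class Model:
    layers: dict = field(default_factory=dict)


def inject_quantizers(model, config):
    """Wrap each Linear whose name matches a configured prefix.

    The model is modified in place; the training loop that later calls
    the layers does not need to change. Returns the wrapped layer names.
    """
    wrapped = []
    for name, layer in list(model.layers.items()):
        if not isinstance(layer, Linear):
            continue
        for prefix, bits in config["bits_by_prefix"].items():
            if name.startswith(prefix):
                model.layers[name] = QuantizedLinear(layer, bits)
                wrapped.append(name)
                break
    return wrapped


if __name__ == "__main__":
    names = ["embed", "blocks.0.attn", "blocks.0.mlp", "lm_head"]
    model = Model({n: Linear(n) for n in names})
    # Hypothetical config: embeddings and LM head are covered too.
    config = {"bits_by_prefix": {"embed": 1.58, "blocks.": 1.88, "lm_head": 1.58}}
    print(inject_quantizers(model, config))
```

Because the wrapping happens on the model object rather than in the training script, the same full‑precision pipeline can be reused for any target bit‑width by swapping the configuration.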
Core architecture: three modules for ultra‑low bit
The backbone consists of three innovations:
Structural Quantization with Mixed Precision (SQMP) : breaks the uniform‑bit assumption by allowing fine‑grained mixing of 4‑bit and 1.58‑bit (or intermediate averages such as 1.88‑bit) across input channels. High‑precision 4‑bit rows act as buffers to absorb activation outliers.
Layer‑Adaptive Feature Distillation (LAFD) : computes cosine similarity between adjacent teacher layers to automatically select the top‑k most critical layers for focused distillation, avoiding blind manual tuning and limiting error amplification.
Entropy‑Aware KL Divergence (EAKLD) : leverages the teacher’s output entropy to dynamically balance forward and reverse KL terms, eliminating dependence on teacher‑generated data and enabling seamless mixing of human‑labeled and synthetic data.
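The mixed‑precision budget in SQMP comes down to simple arithmetic, sketched below. This is a minimal illustration, not the paper's implementation: it treats each row of `weight` as one channel, uses absmean ternary scaling and symmetric 4‑bit rounding as stand‑in quantizers, and picks the 4‑bit channels by a caller‑supplied outlier score. All of those choices are assumptions.

```python
LOW_BITS = 1.58   # ternary {-1, 0, +1}: log2(3) is about 1.58 bits per weight
HIGH_BITS = 4.0


def high_precision_fraction(target_bits):
    """Fraction of channels that must be 4-bit to reach an average bit-width."""
    return (target_bits - LOW_BITS) / (HIGH_BITS - LOW_BITS)


def average_bits(n_channels, n_high):
    """Average bits per weight when n_high of n_channels are 4-bit."""
    return (n_high * HIGH_BITS + (n_channels - n_high) * LOW_BITS) / n_channels


def ternary_quantize(row):
    """Map a channel to {-1, 0, +1} times one absmean scale (assumed scheme)."""
    scale = sum(abs(w) for w in row) / len(row) or 1.0
    return [scale * max(-1, min(1, round(w / scale))) for w in row]


def int4_quantize(row):
    """Symmetric 4-bit quantization: integer levels in [-8, 7] times one scale."""
    scale = max(abs(w) for w in row) / 7 or 1.0
    return [scale * max(-8, min(7, round(w / scale))) for w in row]


def mixed_precision_quantize(weight, outlier_score, target_bits=1.88):
    """Quantize the highest-scoring channels to 4-bit and the rest to ternary.

    Returns the quantized weights and the set of 4-bit channel indices.
    """
    n_high = round(high_precision_fraction(target_bits) * len(weight))
    ranked = sorted(range(len(weight)), key=lambda i: outlier_score[i], reverse=True)
    high = set(ranked[:n_high])
    quantized = [
        int4_quantize(row) if i in high else ternary_quantize(row)
        for i, row in enumerate(weight)
    ]
    return quantized, high
```

For a 1.88‑bit target, about 12.4 % of channels need to be 4‑bit, since (1.88 − 1.58) / (4 − 1.58) ≈ 0.124. With eight channels that means one 4‑bit channel and seven ternary ones, for an average of 1.8825 bits.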
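The two distillation components, LAFD and EAKLD, can be sketched together. The article does not give the exact formulas, so two assumptions are made here and flagged in the comments: a layer counts as "critical" when its output is least similar to its input, and higher teacher entropy shifts weight toward the forward KL term. Function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_layers(hidden_states, k):
    """Pick k teacher layers for feature distillation.

    hidden_states[i] is the teacher's output after layer i (index 0 is the
    embedding output). ASSUMPTION: the lowest adjacent-layer similarity
    marks the layers that transform the representation most.
    """
    sims = {
        i: cosine(hidden_states[i - 1], hidden_states[i])
        for i in range(1, len(hidden_states))
    }
    return sorted(sorted(sims, key=sims.get)[:k])


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def entropy_weighted_kl(teacher_logits, student_logits):
    """Blend forward and reverse KL using normalized teacher entropy.

    ASSUMPTION: alpha = H(teacher) / log(vocab_size) weights the forward
    term and (1 - alpha) the reverse term. Needs at least two logits.
    """
    p, q = softmax(teacher_logits), softmax(student_logits)
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    alpha = entropy / math.log(len(p))
    return alpha * kl(p, q) + (1.0 - alpha) * kl(q, p)
```

Because the weighting depends only on the teacher's output distribution for each token, a loss of this shape applies equally to human‑labeled and model‑generated text, which matches the article's claim about mixing data sources.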
Conclusion
EdgeRazor provides a unified algorithmic stack that transforms diverse, large‑scale LLMs into low‑cost, low‑bit versions deployable on memory‑limited edge devices. By bridging low‑cost quantization, efficient training, and cheap deployment, it enables widespread private AI assistants on phones, PCs, and IoT hardware.
