How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits
MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that cuts the quadratic compute and memory costs of full attention. It matches comparable full‑attention 8B models on short‑context benchmarks, outperforms them on long‑context benchmarks, delivers up to 3.5× faster inference than Qwen3‑8B at 256K tokens, and can process sequences of up to 1 million tokens on consumer‑grade GPUs.
Background
Standard Transformers use full attention, whose compute and memory costs scale as O(N²) with sequence length N. For million‑token contexts this creates both a compute wall and a memory wall. Pure sparse attention reduces compute but still requires a full KV cache; pure linear attention reduces compute to O(N) but loses accuracy on long‑range dependencies.
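To make the O(N²) versus O(N) contrast concrete, here is a minimal pure‑Python sketch of generic kernelized linear attention. It is not Lightning Attention itself: the elu(x) + 1 feature map, the tiny dimensions, and all function names are assumptions chosen for illustration. The point is that the causal result computed pairwise can also be obtained from a fixed‑size running state.

```python
import math
import random

def feature_map(x):
    """elu(x) + 1, a positive feature map often used for linear attention (assumption)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def quadratic_attention(Q, K, V):
    """Causal attention computed pairwise: O(N^2) score evaluations."""
    out = []
    for t in range(len(Q)):
        q = feature_map(Q[t])
        scores = [sum(a * b for a, b in zip(q, feature_map(K[s])))
                  for s in range(t + 1)]
        norm = sum(scores)
        out.append([sum(w * V[s][j] for s, w in enumerate(scores)) / norm
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same result from a fixed-size running state: O(N) in sequence length."""
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    key_sum = [0.0] * d_k                      # running sum of phi(k)
    out = []
    for q_raw, k_raw, v in zip(Q, K, V):
        q, k = feature_map(q_raw), feature_map(k_raw)
        for i in range(d_k):
            key_sum[i] += k[i]
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        norm = sum(q[i] * key_sum[i] for i in range(d_k))
        out.append([sum(q[i] * state[i][j] for i in range(d_k)) / norm
                    for j in range(d_v)])
    return out

random.seed(0)
N, d = 6, 4  # toy sizes, not the model's real dimensions
Q, K, V = ([[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
           for _ in range(3))
slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
print(f"max difference between the two forms: {max_diff:.2e}")
```

The running state has size d_k × d_v regardless of N, which is why a linear layer needs no growing KV cache. The price is that the whole history is compressed into that state, which is consistent with the accuracy loss on long‑range dependencies noted above.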
Hybrid SALA Architecture
MiniCPM‑SALA combines sparse attention (InfLLM‑V2) and linear attention (Lightning Attention) in a single 8B‑parameter model. Twenty‑five percent of the layers use InfLLM‑V2 for high‑fidelity local modeling with a small KV cache, while the remaining seventy‑five percent use Lightning Attention for O(N) global computation. This split of 75% linear to 25% sparse layers empirically yields the best trade‑off between efficiency and semantic precision, enabling context windows of up to 2048K tokens without additional tricks such as YaRN.
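The layer split can be sketched as a simple layout plan. The article does not state the layer count or which layers are sparse, so the 32‑layer depth, the even spacing, and the `plan_layers` helper are all hypothetical. The sketch also assumes that the first and last layers, which the training pipeline leaves unconverted, end up among the sparse layers.

```python
from collections import Counter

def plan_layers(num_layers: int, sparse_fraction: float = 0.25) -> list[str]:
    """Label each layer 'sparse' or 'linear'.

    Hypothetical placement: sparse layers are spread evenly and pinned to
    the first and last positions. The real model may place them differently.
    """
    num_sparse = round(num_layers * sparse_fraction)
    if num_sparse < 2:
        raise ValueError("need at least two sparse layers to pin both ends")
    step = (num_layers - 1) / (num_sparse - 1)
    sparse_idx = {round(i * step) for i in range(num_sparse)}
    return ["sparse" if i in sparse_idx else "linear" for i in range(num_layers)]

layout = plan_layers(32)  # 32 layers is an assumed depth for illustration
counts = Counter(layout)
print(counts["sparse"], "sparse layers,", counts["linear"], "linear layers")
print([i for i, kind in enumerate(layout) if kind == "sparse"])
```

With these assumed numbers the plan yields 8 sparse and 24 linear layers, matching the 25/75 ratio described above.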
Key technical contributions
Mixed attention design (SALA): The first architecture to integrate InfLLM‑V2 and Lightning Attention.
HALO conversion: A lightweight conversion that transforms a pretrained full‑attention Transformer into the mixed architecture, reducing total pre‑training cost to ≈25% of training from scratch.
Hybrid Position Encoding (HyPE): Linear layers retain RoPE, while sparse layers use NoPE, eliminating the long‑range decay of rotary embeddings.
Inference efficiency: A 3.5× speed‑up over Qwen3‑8B on 256K‑token sequences; the model can process up to 1M tokens on consumer‑grade GPUs without running out of memory.
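The HyPE idea can be illustrated with a toy comparison of attention scores under RoPE and NoPE. This is a sketch under simplifying assumptions, not the model's implementation: it uses a single 2‑D vector pair and one rotation frequency (`theta = 0.5`), whereas real RoPE applies many frequencies per head, and the function names are hypothetical.

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotary embedding for one 2-D pair: rotate by position * theta."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 0.0), (1.0, 0.0)  # identical toy query and key content

def rope_score(q_pos, k_pos):
    """Score with RoPE: depends on the offset q_pos - k_pos."""
    return dot(rotate(q, q_pos), rotate(k, k_pos))

def nope_score(q_pos, k_pos):
    """Score with NoPE: positions are ignored entirely."""
    return dot(q, k)

for distance in (0, 1, 3, 100):
    print(distance, round(rope_score(distance, 0), 3), nope_score(distance, 0))
```

Under RoPE the score for the same content changes with the distance between query and key, while under NoPE it stays constant, so a NoPE layer applies no position‑dependent attenuation to distant tokens.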
Training pipeline
Training consists of five stages:
HALO conversion: Convert 75% of the layers to linear attention, keeping the first and last layers unchanged. Trained on 1.3B tokens at sequence length 512.
Stable continued training: 314.6B tokens at sequence length 4K, with sparse attention disabled and a learning rate of 7.5e‑3.
Short‑Decay phase: 1T tokens at sequence length 4K, with exponential learning‑rate decay to 3.75e‑4 and a heavy share of L2‑filtered data and PDF corpora.
Long‑Decay phase: The context window is gradually expanded to 32K, 160K, and then 520K tokens, using 102.2B, 62.9B, and 50.6B tokens respectively; sparse attention is re‑enabled.
Supervised fine‑tuning (SFT): High‑quality reasoning, code, math, and function‑call data, trained on 64K and 140K contexts with 204.5B + 213.3B tokens.
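The token counts quoted for the five stages can be tallied with a few lines of Python. The figures are copied from the stage descriptions above; the dictionary layout and variable names are just one convenient way to organize them.

```python
# Token counts per stage, in billions of tokens, as quoted above.
stages_b_tokens = {
    "HALO conversion": [1.3],
    "Stable continued training": [314.6],
    "Short-Decay": [1000.0],             # 1 T tokens
    "Long-Decay": [102.2, 62.9, 50.6],   # 32K, 160K, 520K contexts
    "SFT": [204.5, 213.3],               # 64K and 140K contexts
}

per_stage = {name: round(sum(parts), 1) for name, parts in stages_b_tokens.items()}
total = round(sum(per_stage.values()), 1)

for name, tokens in per_stage.items():
    print(f"{name:28s} {tokens:8.1f} B  ({tokens / total:6.2%} of total)")
print(f"{'Total':28s} {total:8.1f} B")
```

Summed this way, the pipeline comes to roughly 1,949B tokens, with the Short‑Decay phase accounting for about half of the total.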
Evaluation
On short‑context benchmarks (knowledge QA, math, code generation), MiniCPM‑SALA matches full‑attention 8B models. On long‑context benchmarks it surpasses them, maintaining stable performance up to 2048K tokens without any extra techniques.
Inference speed was measured on an NVIDIA A6000D (96 GB) and an RTX 5090 (32 GB). At 256K tokens, time to first token (TTFT) drops from 180.8 s (Qwen3‑8B) to 51.6 s (MiniCPM‑SALA), a 3.5× speed‑up. MiniCPM‑SALA also avoids the out‑of‑memory failures that Qwen3‑8B runs into, enabling million‑token processing on consumer GPUs.
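The 3.5× figure follows directly from the two TTFT measurements. The short check below uses only the numbers quoted above; the variable names are illustrative.

```python
# TTFT at 256K tokens, in seconds, as reported above.
ttft_seconds = {"Qwen3-8B": 180.8, "MiniCPM-SALA": 51.6}

speedup = ttft_seconds["Qwen3-8B"] / ttft_seconds["MiniCPM-SALA"]
saved = ttft_seconds["Qwen3-8B"] - ttft_seconds["MiniCPM-SALA"]

print(f"speed-up: {speedup:.2f}x, time saved per prompt: {saved:.1f} s")
```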
Resources
GitHub repository: https://github.com/openbmb/minicpm
HuggingFace model page: https://huggingface.co/openbmb/MiniCPM-SALA
ModelScope: https://www.modelscope.cn/models/OpenBMB/MiniCPM-SALA
Technical report PDF: https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf