How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces the quadratic compute and memory costs of full attention, matches full‑attention 8 B models on short‑context benchmarks while surpassing them on long‑context ones, and reaches its first token 3.5× faster than Qwen3‑8B at 256 K tokens while handling sequences of up to 1 million tokens on consumer‑grade GPUs.

PaperAgent

Background

Standard Transformers use full attention, whose compute and memory scale as O(N²) with sequence length N, creating both a compute wall and a memory wall for million‑token contexts. Pure sparse attention reduces compute but still requires a full KV‑cache; pure linear attention reduces compute to O(N) but loses accuracy on long‑range dependencies.
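To make the memory wall concrete, the following back‑of‑the‑envelope sketch estimates how large a full KV‑cache becomes as the context grows. The layer count, number of KV heads, head dimension and 16‑bit storage are illustrative assumptions for a generic 8 B‑class model with grouped‑query attention, not figures taken from the report.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Each layer stores one key and one value vector per KV head per token.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return seq_len * num_layers * per_token_per_layer


# Hypothetical configuration (assumed for illustration, not from the article).
ASSUMED = dict(num_layers=36, num_kv_heads=8, head_dim=128, bytes_per_value=2)

GIB = 2 ** 30
for tokens in (32 * 1024, 256 * 1024, 1024 * 1024):
    size = kv_cache_bytes(tokens, **ASSUMED)
    print(f"{tokens:>9,} tokens -> {size / GIB:6.1f} GiB of KV-cache")
```

Under these assumptions the cache alone grows to 144 GiB at roughly one million tokens, which is more than a single GPU holds. This is why cutting compute without shrinking the cache does not remove the limit.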

Hybrid SALA Architecture

MiniCPM‑SALA combines sparse attention (InfLLM‑V2) and linear attention (Lightning Attention) in a single 8 B parameter model. Twenty‑five percent of the layers use InfLLM‑V2 for high‑fidelity local modeling with a small KV‑cache footprint, while the remaining seventy‑five percent use Lightning Attention for O(N) global computation. This 75/25 split empirically yields the best trade‑off between efficiency and semantic precision, enabling context windows up to 2 048 K tokens without additional tricks such as YaRN.
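The O(N) cost of the linear layers comes from replacing the pairwise score matrix with a fixed‑size running state. The sketch below shows generic kernelized causal linear attention in plain Python and checks it against the equivalent pairwise computation. It is not the Lightning Attention kernel itself; the elu(x) + 1 feature map and the function names are assumptions chosen for illustration.

```python
import math
import random


def phi(x: float) -> float:
    """elu(x) + 1, a positive feature map (an assumed choice)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention_recurrent(q, k, v):
    """Causal linear attention with a fixed-size state: O(N) in sequence length."""
    d, d_v = len(q[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d)]  # running sum of phi(k_t) v_t^T
    norm = [0.0] * d                         # running sum of phi(k_t)
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        fq = [phi(x) for x in q_t]
        fk = [phi(x) for x in k_t]
        for i in range(d):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v_t[j]
        denom = sum(fq[i] * norm[i] for i in range(d))
        out.append([sum(fq[i] * state[i][j] for i in range(d)) / denom
                    for j in range(d_v)])
    return out


def linear_attention_pairwise(q, k, v):
    """Same result computed over every pair (t, s <= t): O(N^2)."""
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        fq = [phi(x) for x in q[t]]
        weights = [sum(a * phi(b) for a, b in zip(fq, k[s])) for s in range(t + 1)]
        total = sum(weights)
        out.append([sum(weights[s] * v[s][j] for s in range(t + 1)) / total
                    for j in range(d_v)])
    return out


random.seed(0)
n, d, d_v = 6, 4, 3
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(n)]
fast = linear_attention_recurrent(q, k, v)
slow = linear_attention_pairwise(q, k, v)
print(max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb)))
```

The recurrent form keeps only a d × d_v state and a length‑d normalizer regardless of how many tokens have been seen, so its memory does not grow with the context.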

Key technical contributions

Mixed attention design (SALA): The first architecture to integrate InfLLM‑V2 and Lightning Attention.

HALO conversion: A lightweight conversion that transforms a pretrained full‑attention Transformer into the mixed architecture, reducing total pre‑training cost to ≈ 25 % of training from scratch.

Hybrid Position Encoding (HyPE): Linear layers retain RoPE while sparse layers use NoPE, eliminating the long‑range decay of rotary embeddings.

Inference efficiency: A 3.5× speed‑up over Qwen3‑8B on 256 K token sequences; the model can process up to 1 M tokens on consumer‑grade GPUs without running out of memory.
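The mixed design and HyPE together amount to a per‑layer plan: each layer gets an attention type and a matching position encoding. The sketch below builds such a plan under stated assumptions. The depth of 32 layers and the placement rule (sparse layers spread evenly, including the first and last layer) are hypothetical, and `LayerSpec` and `build_layer_plan` are illustrative names, not part of any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    index: int
    attention: str          # "sparse" (InfLLM-V2) or "linear" (Lightning Attention)
    position_encoding: str  # "NoPE" for sparse layers, "RoPE" for linear layers


def build_layer_plan(num_layers: int, sparse_fraction: float = 0.25) -> list[LayerSpec]:
    """Assign an attention type and position encoding to every layer.

    Placement is an assumption: sparse layers are spread evenly across the
    stack and include the first and last layer.
    """
    num_sparse = round(num_layers * sparse_fraction)
    if num_sparse < 2:
        raise ValueError("this placement needs at least two sparse layers")
    step = (num_layers - 1) / (num_sparse - 1)
    sparse_ids = {round(i * step) for i in range(num_sparse)}
    return [
        LayerSpec(i, "sparse", "NoPE") if i in sparse_ids
        else LayerSpec(i, "linear", "RoPE")
        for i in range(num_layers)
    ]


plan = build_layer_plan(num_layers=32)  # 32 layers is a hypothetical depth
print("".join("S" if layer.attention == "sparse" else "L" for layer in plan))
```

With 32 layers this yields 8 sparse and 24 linear layers, matching the 25/75 ratio; a different depth or placement rule would change the pattern but not the ratio.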

Training pipeline

Training consists of five stages:

HALO conversion: Convert 75 % of layers to linear attention while keeping the first and last layers unchanged; trained on 1.3 B tokens of length 512.

Stable continued training: 314.6 B tokens of length 4 K, with sparse attention disabled and a learning rate of 7.5e‑3.

Short‑Decay phase: 1 T tokens of length 4 K, with exponential learning‑rate decay to 3.75e‑4 and a data mix weighted heavily toward L2‑filtered data and PDF corpora.

Long‑Decay phase: Context window expanded gradually to 32 K, 160 K, then 520 K tokens, trained on 102.2 B, 62.9 B and 50.6 B tokens respectively, with sparse attention re‑enabled.

Supervised fine‑tuning (SFT): High‑quality reasoning, code, math and function‑call data, trained at 64 K and 140 K context lengths on 204.5 B and 213.3 B tokens respectively.
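The stage budgets above can be collected into a small table to see where the tokens go. The sketch below only restates the figures listed for the five stages and sums them; the `Stage` class is a hypothetical container, and "1 T" is taken as exactly 1000 B tokens, which is an approximation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    context_lengths: tuple[str, ...]
    tokens_billion: tuple[float, ...]  # one entry per context length

    @property
    def total_billion(self) -> float:
        return sum(self.tokens_billion)


STAGES = (
    Stage("HALO conversion", ("512",), (1.3,)),
    Stage("Stable continued training", ("4K",), (314.6,)),
    Stage("Short-Decay", ("4K",), (1000.0,)),  # "1 T" treated as 1000 B
    Stage("Long-Decay", ("32K", "160K", "520K"), (102.2, 62.9, 50.6)),
    Stage("SFT", ("64K", "140K"), (204.5, 213.3)),
)

grand_total = sum(stage.total_billion for stage in STAGES)
for stage in STAGES:
    share = stage.total_billion / grand_total
    print(f"{stage.name:<28}{stage.total_billion:>8.1f} B  {share:6.1%}")
print(f"{'Total':<28}{grand_total:>8.1f} B")
```

Summed this way, the listed stages come to about 1949.4 B tokens, with the HALO conversion step accounting for well under 1 % of the total.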

Evaluation

On short‑context benchmarks (knowledge QA, math, code generation) MiniCPM‑SALA matches full‑attention 8 B models. On long‑context benchmarks it surpasses them, maintaining stable performance up to 2 048 K tokens without any extra techniques.

Inference speed was measured on NVIDIA A6000D (96 GB) and RTX 5090 (32 GB) GPUs. At 256 K tokens, time to first token (TTFT) drops from 180.8 s (Qwen3‑8B) to 51.6 s (MiniCPM‑SALA), a 3.5× speed‑up. MiniCPM‑SALA also avoids out‑of‑memory (OOM) failures at lengths where Qwen3‑8B fails, enabling million‑token processing on consumer GPUs.
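The reported speed‑up follows directly from the two TTFT measurements. A minimal check, using only the figures quoted above (the variable and function names are illustrative):

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new system is than the baseline."""
    return baseline_seconds / new_seconds


TTFT_QWEN3_8B = 180.8      # seconds at 256 K tokens, as reported
TTFT_MINICPM_SALA = 51.6   # seconds at 256 K tokens, as reported

ratio = speedup(TTFT_QWEN3_8B, TTFT_MINICPM_SALA)
saved = TTFT_QWEN3_8B - TTFT_MINICPM_SALA
print(f"speed-up: {ratio:.2f}x, time saved: {saved:.1f} s")
# prints "speed-up: 3.50x, time saved: 129.2 s"
```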

Resources

GitHub repository: https://github.com/openbmb/minicpm

HuggingFace model page: https://huggingface.co/openbmb/MiniCPM-SALA

ModelScope: https://www.modelscope.cn/models/OpenBMB/MiniCPM-SALA

Technical report PDF: https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
