RTPurbo: >97% Sparsity and 9× Faster Long-Context LLM Inference with Minimal Training

The article presents RTPurbo, a lightweight two‑stage training method that converts full‑attention LLMs into highly sparse models with over 97% sparsity, achieving up to 9.36× prefill and 2.01× decode speedups while preserving near‑lossless accuracy across long‑context benchmarks up to 512K tokens.

Machine Learning Algorithms & Natural Language Processing

May 29, 2026

Problem Background

Long‑context inference in large language models (LLMs) is limited by the quadratic complexity of full attention. Existing approaches fall into two camps: training‑free sparse attention, which heuristically prunes the attention matrix and often harms accuracy, and native sparse architectures, which replace full attention with new modules but require costly pre‑training and can be unstable.

Key Observation

Interpretability studies show that full‑attention models already exhibit strong native sparsity: most attention heads focus on local context, while a small subset, called Retrieval Heads, performs long‑range semantic retrieval.

Can a minimal adaptation (Minimal Surgery) convert a full‑attention model into an efficient sparse model while strictly preserving its original capability?

Challenges

Distinguishing heads that rely on local versus global information.

Efficiently identifying important tokens for globally‑dependent heads.

Achieving high sparsity for any query without sacrificing accuracy.

Proposed Solution: RTPurbo

RTPurbo introduces a lightweight training pipeline (~600 steps, ~1M label tokens) that activates the model’s inherent sparsity without native‑sparse pre‑training. It delivers up to 9.36× prefill acceleration and 2.01× decode acceleration while maintaining near‑lossless performance on major long‑text and reasoning benchmarks.

Native Sparse Characteristics

Head‑wise functional differentiation: Retrieval Heads retrieve semantically related distant tokens, unlike the majority of heads that attend locally.

Geometric compressibility under RoPE: Low‑frequency components of the attention spectrum vary smoothly with token distance, enabling accurate reconstruction of Retrieval‑Head attention in a low‑dimensional subspace.

Query‑aware dynamic token budgeting: The optimal token budget depends on query difficulty and task nature, so a static top‑k budget is replaced by a dynamic, query‑aware selection.
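
To make the RoPE point concrete, the sketch below uses the standard RoPE frequency schedule (theta_i = base^(-2i/d)) to compare how quickly a high-frequency and a low-frequency component change as token distance grows. The base of 10000, the head dimension of 128, and the helper names are illustrative assumptions, not values taken from RTPurbo.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE schedule: one rotation frequency per pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def change_per_token(freq, distance):
    """How much cos(freq * distance) moves when the distance grows by one token."""
    return abs(math.cos(freq * (distance + 1)) - math.cos(freq * distance))


freqs = rope_frequencies(head_dim=128)  # head_dim is an assumed value
high, low = freqs[0], freqs[-1]         # fastest and slowest rotating pairs

high_change = change_per_token(high, 1000)
low_change = change_per_token(low, 1000)
print(f"high-frequency change per token: {high_change:.4f}")
print(f"low-frequency change per token:  {low_change:.6f}")
```

The slowly varying low-frequency pairs are the kind of structure a low-dimensional projection can plausibly capture, which is the intuition behind the compressibility claim.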

Methodology

Offline Head‑wise Calibration

Insert identical “Needle” sequences at the start and end of a long document, compute each head's attention from the second copy back to the first, and rank heads by these scores. Heads with high retrieval scores are classified as Retrieval Heads; the rest are treated as local heads.
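
The calibration step can be sketched roughly as follows. This is a minimal illustration, not RTPurbo's code: it assumes the retrieval score is the mean attention mass that the second needle's queries place on the first needle's keys, and the function names, the toy attention matrices, and the fixed number of Retrieval Heads are all hypothetical.

```python
def retrieval_score(attn, needle_src, needle_dst):
    """Mean attention mass from the needle_dst query rows onto the needle_src key columns.

    attn[q][k] is the attention probability of query position q on key position k.
    """
    total = 0.0
    for q in needle_dst:
        total += sum(attn[q][k] for k in needle_src)
    return total / len(needle_dst)


def classify_heads(head_attn, needle_src, needle_dst, num_retrieval):
    """Rank heads by retrieval score; the top num_retrieval become Retrieval Heads."""
    scores = {h: retrieval_score(a, needle_src, needle_dst) for h, a in head_attn.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_retrieval], ranked[num_retrieval:], scores


def causal_uniform(n):
    """Placeholder causal attention: each query spreads mass evenly over its prefix."""
    return [[1.0 / (q + 1) if k <= q else 0.0 for k in range(n)] for q in range(n)]


# Toy 6-token document: needle at positions 0-1, repeated at positions 4-5.
local = causal_uniform(6)
local[4] = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]   # attends to neighbours only
local[5] = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]
retr = causal_uniform(6)
retr[4] = [0.8, 0.1, 0.0, 0.0, 0.1, 0.0]    # looks back at the first needle
retr[5] = [0.1, 0.8, 0.0, 0.0, 0.0, 0.1]

retrieval_heads, local_heads, scores = classify_heads(
    {"local": local, "retr": retr}, needle_src=[0, 1], needle_dst=[4, 5], num_retrieval=1
)
print(retrieval_heads, local_heads)
```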

Dynamic Sparse Attention

Local heads use sliding‑window attention (SWA) with an attention sink throughout inference.
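
As a small illustration of what a local head sees, the sketch below lists the key positions visible to one query under causal sliding-window attention with sink tokens. The window size and the number of sink tokens are assumed values chosen for the example; the source does not state them.

```python
def swa_sink_keys(q_pos, window, num_sink):
    """Key positions a query at q_pos may attend to: the first num_sink tokens
    (the attention sink) plus the most recent `window` tokens, causally."""
    sink = range(min(num_sink, q_pos + 1))
    local = range(max(0, q_pos - window + 1), q_pos + 1)
    return sorted(set(sink) | set(local))


print(swa_sink_keys(q_pos=10, window=4, num_sink=2))  # sink tokens plus the last 4 positions
print(swa_sink_keys(q_pos=2, window=4, num_sink=2))   # early query: sink and window overlap
```

Because the visible set never grows beyond window + num_sink entries, the per-token cost of a local head stays constant regardless of context length.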

Retrieval heads compute full causal attention during the prefilling stage to build the KV cache. During decoding, a query‑aware dynamic sparse selection compresses Q and K with a low‑dimensional projection, reconstructs the attention distribution, and applies a dynamic top‑p mask to the global KV cache.
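
The decode-time selection for a Retrieval Head can be sketched as below, assuming the query and keys have already been projected into a low-dimensional space. This is a sorted reference version of top-p written for readability, and all names and the toy vectors are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top_p_select(probs, p):
    """Smallest set of key indices whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


def select_tokens(q_low, k_low, p):
    """Approximate attention with low-dimensional dot products, then apply top-p."""
    scores = [sum(a * b for a, b in zip(q_low, k)) for k in k_low]
    return top_p_select(softmax(scores), p)


keys = [[3.0, 0.0], [0.0, 1.0], [0.0, -1.0], [0.1, 0.0]]
peaked = select_tokens([4.0, 0.0], keys, p=0.9)  # one key dominates
flat = select_tokens([0.0, 0.0], keys, p=0.9)    # all keys look equally relevant
print(len(peaked), len(flat))
```

The same threshold p yields a small budget when attention is concentrated and a large one when it is diffuse, which is the behaviour a fixed top-k cannot provide.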

Two‑Stage Lightweight Training

Stage 1 – Low‑dimensional projection alignment: Freeze the backbone, train only the projection parameters for Retrieval Heads by minimizing KL divergence between projected and original attention distributions.
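
A minimal sketch of the Stage 1 objective for a single query is shown below. The direction of the divergence, KL(original || projected), is an assumption, since the source does not specify it; the function names and the example scores are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same keys."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(full_scores, projected_scores):
    """Compare the original attention distribution with the one rebuilt
    from low-dimensional scores (assumed direction: original || projected)."""
    return kl_divergence(softmax(full_scores), softmax(projected_scores))


well_aligned = alignment_loss([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])
misaligned = alignment_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(f"{well_aligned:.4f} {misaligned:.4f}")
```

In training, only the projection parameters that produce the projected scores would receive gradients from this loss; the backbone stays frozen.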

Stage 2 – End‑to‑end self‑distillation: Enable sparse mode, treat the original full‑attention model as the teacher, and align the student's top‑10 logits with the teacher's. Convergence is achieved in a few hundred steps.
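
One plausible reading of the top-10 alignment is sketched below: pick the teacher's ten highest-scoring vocabulary entries, renormalize both models over that subset, and penalize the KL divergence. The source does not spell out the exact loss, so this formulation and every name in it are assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def topk_distill_loss(teacher_logits, student_logits, k=10, eps=1e-12):
    """KL(teacher || student), restricted to the teacher's top-k vocabulary entries."""
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    t = softmax([teacher_logits[i] for i in top])
    s = softmax([student_logits[i] for i in top])
    return sum(ti * math.log((ti + eps) / (si + eps)) for ti, si in zip(t, s))


teacher = [float(i) for i in range(20)]       # toy 20-word vocabulary
student_same = list(teacher)                  # sparse student that matches the teacher
student_off = list(reversed(teacher))         # sparse student that disagrees

loss_same = topk_distill_loss(teacher, student_same)
loss_off = topk_distill_loss(teacher, student_off)
print(loss_same, loss_off)
```

Restricting the loss to a handful of logits keeps the supervision signal small, which fits the article's figure of roughly 1M label tokens.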

Kernel Optimizations

RTPurbo replaces the costly sorting step of conventional top‑p with a histogram‑based, sorting‑free implementation: each compute thread scores a block, atomically updates a 256‑bin global histogram, and a final thread scans the histogram to determine the threshold and generate a block mask, all within a single kernel launch.
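
The histogram idea can be simulated on the CPU as follows. This sketch assumes the histogram accumulates score mass per bin and that the threshold is the lower edge of the bin where the cumulative mass first reaches p; the real kernel may differ, and the names here are hypothetical. On a GPU, the first loop would correspond to per-thread atomic adds and the second to the final scanning thread.

```python
NUM_BINS = 256


def histogram_top_p_mask(block_scores, p):
    """Sorting-free top-p over non-negative block scores; returns a keep-mask."""
    hi = max(block_scores)
    if hi <= 0.0:
        return [True] * len(block_scores)

    def bin_of(score):
        return min(NUM_BINS - 1, int(score / hi * NUM_BINS))

    # Pass 1: every block adds its score mass to one bin (atomicAdd on a GPU).
    hist = [0.0] * NUM_BINS
    for s in block_scores:
        hist[bin_of(s)] += s

    # Pass 2: scan from the highest bin down until the target mass is covered.
    target = p * sum(block_scores)
    mass, cut = 0.0, 0  # cut = 0 keeps everything if the target is never reached
    for b in range(NUM_BINS - 1, -1, -1):
        mass += hist[b]
        if mass >= target:
            cut = b
            break

    # Blocks in or above the cut bin are kept; this may slightly over-select.
    return [bin_of(s) >= cut for s in block_scores]


scores = [0.70, 0.20, 0.05, 0.03, 0.02]
mask = histogram_top_p_mask(scores, p=0.85)
print(mask)
```

The trade-off is precision for speed: the whole cut bin is kept, so the selection can be slightly larger than an exact sorted top-p, but the cost is linear in the number of blocks with a fixed 256-step scan.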

Additional bandwidth optimizations include a warp‑level CTA (cooperative thread array) design that keeps intermediate states in registers, and vectorized loads (e.g., half2) that help overlap computation with memory latency.

Experimental Evaluation

RTPurbo was evaluated on Qwen3‑Coder‑30B‑A3B (long‑text understanding) and Qwen3‑30B‑A3B‑Think (long chain‑of‑thought reasoning) against state‑of‑the‑art baselines.

Accuracy on Long‑Context Benchmarks

On LongBench and RULER, RTPurbo matches or exceeds full‑attention accuracy, while static sparse methods suffer significant drops, especially under 64K context. On chain‑of‑thought tasks (AIME24/25, MMLU‑PRO), RTPurbo matches the full‑attention teacher's score of 86.67, indicating negligible accuracy loss even when decoding up to 32K tokens.

Sparsity and Speedup

Dynamic token budgeting adapts to task difficulty: for a “needle‑in‑a‑haystack” task, RTPurbo retains 469 active tokens on average (≈0.09% of a 512K context); for complex multi‑key tasks the budget expands to ~2.5K tokens, roughly a 5× range. Across contexts up to 512K tokens, sparsity exceeds 97% and accuracy remains robust.
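
The figures above can be checked with a few lines of arithmetic. The sketch assumes 512K means 512 × 1024 tokens and that ~2.5K means 2,500 tokens.

```python
def sparsity(active_tokens, context_len):
    """Fraction of the context that is skipped."""
    return 1.0 - active_tokens / context_len


ctx = 512 * 1024                     # assumed meaning of "512K"
needle_fraction = 469 / ctx * 100    # percent of the context kept for the needle task
multi_key_sparsity = sparsity(2500, ctx)
budget_ratio = 2500 / 469            # how far the budget stretches between tasks

print(f"needle task keeps {needle_fraction:.2f}% of the context")
print(f"multi-key sparsity: {multi_key_sparsity:.4f}")
print(f"budget ratio: {budget_ratio:.1f}x")
```

Even the larger multi-key budget leaves well over 99% of a 512K context untouched, so the reported >97% sparsity is comfortably met in both cases.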

Prefill acceleration grows from 2.83× at 32K tokens to 9.36× at 1M tokens; decode acceleration rises from 1.47× to 2.01× over the same range, outperforming FlashAttention 2 and other sparse strategies.

Sparse‑to‑Dense Trade‑offs

In high‑precision retrieval tasks (e.g., “Galápagos governance”), up to 8K tokens are needed to preserve 90% of the attention mass, whereas in “needle‑in‑a‑haystack” scenarios just 2 tokens cover more than 96% of it. RTPurbo’s query‑aware dynamic budgeting automatically selects the appropriate token budget, avoiding both under‑allocation (accuracy loss) and over‑allocation (redundant computation).

Conclusion

RTPurbo demonstrates that costly native‑sparse pre‑training is not required for efficient long‑context inference. By minimally adapting full‑attention models through lightweight training and dynamic sparsity mechanisms, high sparsity, substantial speedups, and near‑lossless accuracy are achieved, providing a practical route for deploying large LLMs in resource‑constrained environments.

https://github.com/alibaba/rtp-llm


