Beyond TurboQuant: Introducing a True 2‑bit KV Quantization for Long‑Context LLM Inference

OSCAR, a new attention‑aware 2‑bit KV cache quantization method, cuts KV memory by up to 8×, delivers up to 3× decode speedup and 7× throughput gain, and matches BF16 accuracy across 4B‑32B models on diverse long‑context reasoning tasks, surpassing TurboQuant.

Machine Heart

As context windows in large language models grow, the inference bottleneck shifts from compute to the KV cache: every generated token must read an ever‑longer history of keys and values. Compressing that history from 16‑bit BF16 to 2 bits could in theory cut its memory footprint about eightfold, but preserving inference quality and integrating the technique into a real serving stack are both hard.

OSCAR (Offline Spectral Covariance‑Aware Rotation) tackles these challenges by redefining the quantization objective: instead of reconstructing the original K/V vectors, it preserves the directions that the attention mechanism actually consumes. For keys, the rotation target is derived from the query covariance (QᵀQ); for values, it uses a score‑weighted value covariance (VᵀSᵀSV). Offline, a small calibration set estimates these attention‑aware covariances, producing a fixed rotation matrix R = U·Hadamard·bit‑reversal for each layer and head. The Hadamard component spreads out outliers, while the bit‑reversal balances INT2 groups.
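To make the Hadamard and bit‑reversal factors concrete, here is a minimal pure‑Python sketch. It covers only those two factors: the data‑dependent basis U needs calibration data and is left out. The Sylvester construction, the function names, and the order in which the two steps are applied are assumptions for illustration, not details confirmed by the source.

```python
import math

def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        # Block form [[H, H], [H, -H]] doubles the size each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]

def bit_reversal_permutation(n):
    """Index i maps to the integer whose log2(n)-bit pattern is i reversed."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]

def rotate(x, h, perm):
    """Apply the Hadamard transform, then reorder channels (assumed order)."""
    y = [sum(h_ij * x_j for h_ij, x_j in zip(row, x)) for row in h]
    return [y[p] for p in perm]

# A vector with a single large outlier channel.
x = [8.0] + [0.0] * 7
h = hadamard(8)
perm = bit_reversal_permutation(8)
y = rotate(x, h, perm)

print("permutation:", perm)
print("max |x| =", max(abs(v) for v in x))
print("max |y| =", max(abs(v) for v in y))
print("norm preserved:", math.isclose(sum(v * v for v in x), sum(v * v for v in y)))
```

Both factors are orthogonal, so the vector's norm is unchanged, while the outlier's energy is spread evenly over all eight channels. That is the property that makes a coarse four‑level grid workable after rotation.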

OSCAR is not merely a quantization study; it is fully integrated into the SGLang serving framework. The KV cache is split into three segments: a BF16 sink (64 tokens), an INT2‑compressed history (~2.28 bits per element), and a BF16 recent window (256 tokens). New tokens are first written to the recent window; as decoding proceeds, the oldest recent tokens are rotated, clipped, quantized, and demoted into the INT2 history, with four 2‑bit values packed per byte. During decode, separate kernels handle the BF16 and INT2 segments, and an online softmax merge combines their results. The design remains compatible with paged KV, the radix‑prefix cache, and SGLang’s fused pipeline.
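The clip, quantize, and pack steps of that demotion path can be sketched as follows. This is an illustration under stated assumptions, not the SGLang kernel: the symmetric four‑level grid, the single hypothetical `clip` threshold, the bit order within a byte, and all function names are invented here. The rotation step and whatever metadata accounts for the ~2.28 bits‑per‑element figure are omitted.

```python
def quantize_int2(values, clip):
    """Clip to [-clip, clip] and map to 4 uniform levels (codes 0..3)."""
    scale = 2.0 * clip / 3.0  # spacing between adjacent levels
    codes = []
    for v in values:
        v = max(-clip, min(clip, v))
        codes.append(int(round((v + clip) / scale)))
    return codes, scale

def dequantize_int2(codes, scale, clip):
    """Map codes back to the level values they represent."""
    return [c * scale - clip for c in codes]

def pack_int2(codes):
    """Pack four 2-bit codes per byte, lowest bits first (assumed layout)."""
    assert len(codes) % 4 == 0, "pad to a multiple of four before packing"
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)

def unpack_int2(packed):
    """Inverse of pack_int2."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes

# Eight already-rotated channel values from one demoted token (made-up data).
values = [-1.5, -0.2, 0.3, 2.0, 0.9, -0.9, 0.5, 1.0]
codes, scale = quantize_int2(values, clip=1.0)
packed = pack_int2(codes)

print("codes :", codes)
print("bytes :", packed.hex())
print("approx:", [round(v, 3) for v in dequantize_int2(unpack_int2(packed), scale, 1.0)])
```

Eight values occupy two bytes after packing, against sixteen bytes in BF16. Values outside the clip range saturate to the outermost levels, which is why a rotation that tames outliers matters before this step.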

The evaluation covers Qwen3‑4B‑Thinking, Qwen3‑8B, Qwen3‑32B, and GLM‑4.7‑FP8 on GPQA, HumanEval, LiveCodeBench v6, AIME25, and MATH500, with up to 32K generated tokens. At an effective 2.28 bits per KV element, OSCAR stays within 3.78–1.42 points of BF16 and outperforms TurboQuant by up to 40.1 points on Qwen3‑4B‑Thinking. On 128K long‑context RULER‑NIAH tests, OSCAR maintains stable retrieval performance, confirming its robustness for very long histories.

System‑level gains are equally striking: compared with BF16 history storage, OSCAR reduces KV memory by ~8×, delivers up to 3× decode acceleration in a batch‑size‑1, full‑prefix‑cache‑hit scenario, and boosts job‑level throughput by up to 7× when batch size grows under a fixed memory budget. Higher prefix‑cache hit ratios further amplify these benefits, making OSCAR especially valuable for shared‑prompt agents, multi‑turn dialogues, and tool‑calling loops.
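A back‑of‑the‑envelope calculation shows why a smaller cache admits a larger batch under a fixed memory budget. The model shape, context length, and budget below are hypothetical values chosen for illustration; only the 2.28 bits per element and the 64‑token sink and 256‑token recent window come from the text. The sketch counts KV bytes only and does not try to reproduce the reported speedup or throughput numbers.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits_per_element):
    """Bytes of KV cache one token occupies (factor 2 = keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bits_per_element / 8

# Hypothetical model shape and serving setup, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT = 32 * 1024        # tokens per sequence
BUDGET = 16 * 1024**3      # bytes reserved for KV cache

# Segment sizes and effective bit rate as stated in the article.
SINK, RECENT = 64, 256
HISTORY_BITS = 2.28

bf16_tok = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 16)
int2_tok = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, HISTORY_BITS)

# Per-sequence footprint: all BF16 versus sink + INT2 history + recent window.
bf16_seq = CONTEXT * bf16_tok
tiered_seq = (SINK + RECENT) * bf16_tok + (CONTEXT - SINK - RECENT) * int2_tok

history_ratio = bf16_tok / int2_tok
bf16_batch = int(BUDGET // bf16_seq)
tiered_batch = int(BUDGET // tiered_seq)

print(f"history compression: {history_ratio:.2f}x")
print(f"per-sequence compression: {bf16_seq / tiered_seq:.2f}x")
print(f"sequences that fit: BF16={bf16_batch}, tiered={tiered_batch}")
```

The ideal 16‑to‑2‑bit ratio is 8×. At 2.28 effective bits the history alone compresses by 16 / 2.28, and the BF16 sink and recent window pull the per‑sequence ratio slightly lower still. Their fixed size means that overhead shrinks as the context gets longer.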

Unlike many low‑bit methods that rely on mixed‑precision tricks for a few sensitive layers, OSCAR keeps the entire historical KV uniformly in INT2, preserving BF16 only for the sink and recent windows. This uniformity simplifies integration with existing paged‑cache and scheduling infrastructures while still delivering near‑BF16 accuracy across models and tasks.
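Because the cache mixes BF16 segments with an INT2 history, attention computed per segment has to be combined into a single result. The sketch below shows the standard online softmax merge in pure Python: each segment returns its running maximum, its sum of exponentials, and an unnormalized output, and the merge rescales them to a common maximum. Function names and data are hypothetical, the INT2 segment is represented by already‑dequantized values, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math

def partial_attention(q, keys, values):
    """Attention over one segment: (max score, sum of exp, unnormalized output)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # stabilized by the local max
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), out

def merge(parts):
    """Combine per-segment partials into one softmax-weighted output."""
    m = max(p[0] for p in parts)  # global max across segments
    denom = 0.0
    out = [0.0] * len(parts[0][2])
    for seg_max, seg_sum, seg_out in parts:
        c = math.exp(seg_max - m)  # rescale from local to global max
        denom += c * seg_sum
        out = [o + c * x for o, x in zip(out, seg_out)]
    return [o / denom for o in out]

q = [0.5, -1.0]
# BF16 segment (stands in for sink plus recent window).
k_bf16, v_bf16 = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
# INT2 history, shown here after dequantization.
k_int2, v_int2 = [[1.0, 1.0], [-1.0, 0.5]], [[5.0, 6.0], [7.0, 8.0]]

merged = merge([partial_attention(q, k_bf16, v_bf16),
                partial_attention(q, k_int2, v_int2)])
# Reference: one softmax over all keys at once.
reference = merge([partial_attention(q, k_bf16 + k_int2, v_bf16 + v_int2)])

print("merged   :", merged)
print("reference:", reference)
```

The merged output matches a single softmax over the concatenated keys up to floating‑point rounding, which is what lets each segment run in its own kernel with its own storage format.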

In summary, OSCAR demonstrates that a true 2‑bit KV cache can be both memory‑efficient and production‑ready, offering a practical path to scale long‑context LLM services without sacrificing reasoning quality.
