
LU‑KV Sets New SOTA at ICML 2026 by Redefining KV Cache Eviction

A joint effort by Baidu Baige and Fudan University introduces the LU‑KV framework, which treats KV‑cache budget allocation as a global combinatorial optimization problem, achieving only 0.52% relative performance loss at 80% compression and establishing a new efficiency‑accuracy SOTA on LongBench.

Baidu Intelligent Cloud Tech Hub

Jun 10, 2026

As large language model (LLM) context windows grow, the key‑value (KV) cache grows linearly with sequence length, making it a primary bottleneck for GPU memory usage, inference throughput, and deployment cost.
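To make the linear growth concrete, here is a rough back-of-the-envelope sketch in Python. It uses the standard size estimate for a KV cache (one key tensor and one value tensor per layer); the model shape in `EXAMPLE_SHAPE` is a hypothetical configuration chosen for illustration, not a figure from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return (2 * batch_size * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration (16-bit elements).
EXAMPLE_SHAPE = dict(n_layers=64, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len, **EXAMPLE_SHAPE) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
```

Under these assumed values each cached token costs 256 KiB, so quadrupling the context length quadruples the cache, per sequence in the batch.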

Existing KV‑cache eviction methods rank tokens by instantaneous attention scores or key‑vector similarity and assume that scores from different attention heads are directly comparable. The authors observe that this “compare current scores” logic ignores how much heads differ in long‑term semantic retention, so cache budget goes to tokens with high short‑term scores but limited long‑range contribution.

The proposed Long‑horizon Utility KV (LU‑KV) framework formulates head‑level KV‑cache budget allocation as a global combinatorial optimization problem that maximizes long‑horizon marginal utility. LU‑KV first performs offline profiling to estimate marginal‑contribution curves for each head under a given compression target. It then applies a convex‑hull relaxation and a marginal‑utility‑based greedy solver to obtain near‑optimal global budget configurations with low computational overhead.
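The allocation step can be sketched roughly as follows. This is a minimal illustration of the general idea (upper concave envelope per head, then a greedy pass over marginal utility per cache slot), not the authors' implementation. The function names, the curve format (a list of `(budget, utility)` points per head), and the numbers are all hypothetical.

```python
import heapq


def upper_concave_hull(points):
    """Keep only the points on the upper concave envelope.

    points: (budget, utility) pairs sorted by strictly increasing budget.
    """
    hull = []
    for b, u in points:
        while len(hull) >= 2:
            (b1, u1), (b2, u2) = hull[-2], hull[-1]
            # Drop the middle point if it lies on or below the chord
            # from hull[-2] to the new point.
            if (u2 - u1) * (b - b1) <= (u - u1) * (b2 - b1):
                hull.pop()
            else:
                break
        hull.append((b, u))
    return hull


def allocate_budgets(curves, total_budget):
    """Greedily spend total_budget on the steepest remaining hull segment."""
    hulls = {h: upper_concave_hull(sorted(pts)) for h, pts in curves.items()}
    alloc = {h: hull[0][0] for h, hull in hulls.items()}
    position = {h: 0 for h in hulls}
    remaining = total_budget - sum(alloc.values())
    if remaining < 0:
        raise ValueError("total_budget is below the minimum per-head budgets")

    heap = []

    def push_next_step(h):
        i, hull = position[h], hulls[h]
        if i + 1 < len(hull):
            (b0, u0), (b1, u1) = hull[i], hull[i + 1]
            gain_per_slot = (u1 - u0) / (b1 - b0)
            if gain_per_slot > 0:
                # heapq is a min-heap, so store the negated gain.
                heapq.heappush(heap, (-gain_per_slot, h, b1 - b0))

    for h in hulls:
        push_next_step(h)

    while heap:
        _, h, cost = heapq.heappop(heap)
        if cost > remaining:
            continue  # Step does not fit; this head stays where it is.
        alloc[h] += cost
        remaining -= cost
        position[h] += 1
        push_next_step(h)
    return alloc


# Hypothetical profiled curves: head 0 saturates early, head 1 keeps gaining.
curves = {
    0: [(0, 0.0), (64, 0.50), (128, 0.60), (256, 0.65)],
    1: [(0, 0.0), (64, 0.10), (128, 0.40), (256, 0.70)],
}
print(allocate_budgets(curves, total_budget=320))  # {0: 64, 1: 256}
```

Because each hull has non-increasing slopes, always taking the steepest available segment is the natural greedy rule for the relaxed problem. In this toy case the early-saturating head receives 64 slots and the head that keeps benefiting receives 256.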

LU‑KV does not replace underlying compression metrics; it acts as a universal budget allocator compatible with methods such as SnapKV and KeyDiff.
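One way to picture this separation of concerns is the sketch below, where per-head budgets are applied on top of whatever token scores an existing method produces. The scores here are made-up placeholders standing in for any metric; `select_kept_tokens` is a hypothetical helper and does not reproduce the SnapKV or KeyDiff scoring rules.

```python
def select_kept_tokens(scores_per_head, budgets):
    """Keep the top-scoring token positions per head, up to that head's budget."""
    kept = {}
    for head, scores in scores_per_head.items():
        k = min(budgets[head], len(scores))
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept[head] = sorted(ranked[:k])  # Restore positional order.
    return kept


# Placeholder scores from some underlying metric (hypothetical values).
scores_per_head = {
    0: [0.9, 0.1, 0.4, 0.3],
    1: [0.2, 0.8, 0.7, 0.6],
}
# Non-uniform budgets, as a budget allocator might produce.
budgets = {0: 1, 1: 3}

print(select_kept_tokens(scores_per_head, budgets))  # {0: [0], 1: [1, 2, 3]}
```

The scoring function and the budget allocator only meet at the `budgets` argument, which is why the allocator can be swapped in without touching the metric.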

Experiments on the long‑context benchmarks LongBench and RULER show stable gains. At an 80% KV‑cache compression ratio, LU‑KV reduces memory usage and inference latency while incurring minimal performance loss. With Qwen2.5‑32B on LongBench, the relative performance drop is only 0.52%, a new state‑of‑the‑art point on the efficiency‑accuracy trade‑off curve.
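For readers unfamiliar with these two quantities, the following sketch shows how they are commonly computed. It assumes that the compression ratio is the fraction of cache entries evicted and that the relative drop is measured against the uncompressed baseline; the article does not spell out either definition, and the scores below are made up and are not results from the paper.

```python
def retained_entries(seq_len, compression_ratio):
    """Cache entries kept, assuming the ratio is the fraction evicted."""
    return round(seq_len * (1 - compression_ratio))


def relative_drop_pct(baseline_score, compressed_score):
    """Performance loss as a percentage of the uncompressed baseline score."""
    return 100 * (baseline_score - compressed_score) / baseline_score


print(retained_entries(10_000, 0.8))   # 2000 entries kept out of 10000
print(relative_drop_pct(40.0, 39.0))   # 2.5 (made-up scores)
```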

Paper authors: Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen

Paper link: https://icml.cc/virtual/2026/poster/65241

Project homepage: https://github.com/baidu-baige/LU-KV
