Open‑Source Kwai Summary Attention (KSA): A Sequence‑Compression Mechanism for Long‑Context Inference

KSA inserts learnable summary tokens to compress KV cache by a factor of eight, enabling accurate long‑context retrieval with far lower memory and compute costs, and it consistently outperforms full‑attention and other hybrid methods on large‑scale benchmarks.

Kuaishou Tech

May 14, 2026

Long‑sequence modeling in large language models faces two costs: KV cache memory that grows linearly with sequence length, and attention compute that grows quadratically. Together they make inference on very long contexts prohibitively expensive.

Existing solutions either compress each token’s KV representation (e.g., GQA, MLA) or replace attention with efficient variants (Hybrid‑SWA, Hybrid‑GDN/Linear), but each incurs trade‑offs such as loss of distant information or reduced modeling capacity.

The proposed Kwai Summary Attention (KSA) appends a learnable Summary token to the end of every chunk of k tokens. Each Summary token attends only to its own chunk and aggregates that chunk’s semantics, while Text tokens see recent chunks directly and distant chunks only through their Summary tokens. The KV cache held for distant context therefore grows as O(N/k) instead of O(N); with the default k=8, cache usage drops to one‑eighth of the original.
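The chunk layout described above can be sketched in a few lines. This is an illustrative sketch, not the released implementation: the function names `interleave_summaries` and `summary_entries` are hypothetical, and the count covers only the summary entries kept for out‑of‑window context.

```python
def interleave_summaries(num_text_tokens: int, k: int = 8) -> list[str]:
    """Return a token layout with one summary slot ("S") closing each full chunk.

    "T" marks a text token. A trailing partial chunk gets no summary yet.
    """
    layout = []
    for i in range(num_text_tokens):
        layout.append("T")
        if (i + 1) % k == 0:  # chunk of k text tokens just completed
            layout.append("S")
    return layout


def summary_entries(num_text_tokens: int, k: int = 8) -> int:
    """Number of summary KV entries produced for a sequence: one per full chunk."""
    return num_text_tokens // k


if __name__ == "__main__":
    print("".join(interleave_summaries(16, k=8)))  # TTTTTTTTSTTTTTTTTS
    # A 128K-token context (131072 tokens) yields 16384 summary entries at k=8.
    print(summary_entries(131072, k=8))  # 16384
```

With k=8, distant context is represented by one cached entry per eight text tokens, which is where the one‑eighth figure comes from.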

To avoid the partial‑coverage problems of Sliding Window Attention, KSA employs Sliding Chunk Attention (SCA), which moves the attention granularity from token to chunk. Each historical chunk is either fully inside the attention window (original text fully visible) or fully outside (accessed only via its summary), eliminating ambiguous overlap.
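The either/or rule can be made concrete with a small sketch for Text‑token queries. The window size in chunks (`window_chunks`) and the function names are assumptions for illustration; the source does not state the window size or how the released code expresses the rule.

```python
def sca_access(query_pos: int, key_chunk: int, k: int, window_chunks: int) -> str:
    """How a text token at `query_pos` reaches `key_chunk` under chunk-level windowing.

    Returns "text" (raw tokens visible; causal within the query's own chunk),
    "summary" (only the chunk's summary token visible), or "masked" (future chunk).
    """
    query_chunk = query_pos // k
    if key_chunk > query_chunk:
        return "masked"
    if query_chunk - key_chunk <= window_chunks:
        return "text"
    return "summary"


def swa_split_chunk(query_pos: int, window_tokens: int, k: int):
    """Chunk that a token-level sliding window cuts through, or None if aligned."""
    start = max(0, query_pos - window_tokens + 1)
    if start % k == 0:
        return None  # window starts exactly on a chunk boundary
    return start // k


if __name__ == "__main__":
    k, window_chunks = 8, 2
    # Query at position 50 sits in chunk 6.
    print([sca_access(50, c, k, window_chunks) for c in range(8)])
    # Token-level window of 20 tokens starts at position 31, inside chunk 3.
    print(swa_split_chunk(50, window_tokens=20, k=k))
```

In this example chunks 4 to 6 are read as raw text and chunks 0 to 3 only through their summaries, whereas the token‑level window covers a single token of chunk 3 and leaves the rest of that chunk out.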

The KV cache is organized into three contiguous buffers—Current Chunk, Sliding Chunk Text, and Summary Token Buffer—so that decoding requires a single continuous slice, preserving visibility rules without extra concatenation, gathering, or dynamic masking.
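A minimal bookkeeping sketch of the three buffers follows. It models the regions as Python containers rather than one preallocated contiguous tensor, so it shows which entries live where, not the memory layout. The class name, the `make_summary` callback, and the choice to move a chunk's summary into the summary buffer only when the chunk leaves the window are all assumptions, not details confirmed by the source.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ThreeBufferCache:
    k: int = 8                 # chunk size
    window_chunks: int = 2     # closed chunks kept as raw text (assumed parameter)
    summary_buffer: list = field(default_factory=list)  # summaries of out-of-window chunks
    sliding_text: deque = field(default_factory=deque)  # (chunk_kvs, summary) per in-window chunk
    current_chunk: list = field(default_factory=list)   # tokens of the unfinished chunk

    def append(self, token_kv, make_summary):
        """Add one text token; close the chunk and slide the window when it fills."""
        self.current_chunk.append(token_kv)
        if len(self.current_chunk) == self.k:
            closed = self.current_chunk
            self.sliding_text.append((closed, make_summary(closed)))
            self.current_chunk = []
            if len(self.sliding_text) > self.window_chunks:
                # Oldest chunk leaves the window: drop its text, keep its summary.
                _, summary = self.sliding_text.popleft()
                self.summary_buffer.append(summary)

    def visible(self) -> list:
        """Entries a decoding step reads, in order: summaries, window text, current chunk."""
        text = [kv for chunk, _ in self.sliding_text for kv in chunk]
        return self.summary_buffer + text + self.current_chunk


if __name__ == "__main__":
    cache = ThreeBufferCache(k=4, window_chunks=2)
    for t in range(18):
        # Placeholder summary: a tuple naming the chunk's first and last token.
        cache.append(t, make_summary=lambda c: ("S", c[0], c[-1]))
    print(cache.visible())
```

After 18 tokens with k=4, the first two chunks survive only as summaries, the next two as raw text, and tokens 16 and 17 sit in the current chunk.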

KSA’s token‑count compression is orthogonal to GQA’s head reduction and MLA’s embedding reduction; their compression factors multiply. Combined with GQA, KV cache can be reduced to 0.78% of full attention, and with MLA to 0.22%.
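The multiplication can be checked with a short calculation. The per‑token factors below (16x for GQA, roughly 57x for MLA) are assumptions back‑solved from the quoted percentages; the source does not state which head or latent configurations were used.

```python
def combined_cache_fraction(k: int, per_token_factor: float) -> float:
    """Fraction of the full-attention KV cache left after both compressions.

    k: token-count compression from KSA (one summary per k tokens).
    per_token_factor: how many times smaller each cached entry is (GQA or MLA).
    """
    return 1.0 / (k * per_token_factor)


if __name__ == "__main__":
    # Assumed GQA factor of 16 (for example, 16 query heads per KV head).
    print(f"{combined_cache_fraction(8, 16):.2%}")  # 0.78%
    # Assumed MLA factor of about 57.
    print(f"{combined_cache_fraction(8, 57):.2%}")  # 0.22%
```

Because KSA shrinks the number of cached entries and GQA or MLA shrinks the size of each entry, the two reductions compose as a simple product.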

Training uses a three‑stage continued pre‑training (CPT) recipe: (1) attention distillation, which aligns the Q/K/V projections of the summary branch with a full‑attention teacher; (2) parameter annealing, in which a linearly decayed coefficient λ smooths the summary weights into the main model; and (3) progressive length expansion (32K → 64K → 128K), which adapts the summary mechanism to longer contexts.

On the RULER long‑retrieval benchmark (4K‑128K), Hybrid‑KSA surpasses Full Attention by 16.60 points on RULER‑128K when trained from scratch and by 5.81 points under CPT of Qwen3‑4B‑Base, and it also beats Hybrid‑GDN. On general capabilities, Hybrid‑KSA matches or exceeds Full Attention, with top scores on MBPP (62.20) and HumanEval (62.50) and gains on MATH (+13.54) and GSM8K (+10.85). These results suggest that the summary mechanism provides a beneficial inductive bias for tasks requiring long‑range reasoning.

In summary, KSA offers a middle‑ground solution that retains full long‑context recall while dramatically cutting KV memory and compute, and it integrates seamlessly with other compression techniques. Future work will deepen the convergence of recommendation‑system efficiency ideas and LLM architectures.

Technical report: https://arxiv.org/abs/2604.24432

Open‑source code: https://github.com/Kuaishou-OneRec/KSA
