
Architects' Tech Alliance
Feb 24, 2025 · Artificial Intelligence

NSA: Hardware‑Optimized Sparse Attention Mechanism from DeepSeek, Peking University and University of Washington

NSA (Native Sparse Attention) introduces a hardware-optimized sparse attention architecture built from three branches (token compression, token selection, and a sliding window), combined through learnable gating to balance global and local context. The result is substantially faster, more efficient inference for long-context large language models.
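To make the gating idea concrete, here is a minimal sketch, assuming PyTorch, of the mixing step alone: the three branch outputs are taken as given inputs, and per-token sigmoid gates weight each branch before summing. The class and argument names (`attn_cmp`, `attn_sel`, `attn_win`) are illustrative stand-ins, not the paper's actual kernels or API.

```python
import torch
import torch.nn as nn

class GatedThreeBranchMix(nn.Module):
    """Illustrative NSA-style gated combination of three attention branches.
    A sketch of the mixing step only; branch computation is not shown."""

    def __init__(self, d_model: int):
        super().__init__()
        # One gate score per branch, predicted from the query representation.
        self.gate = nn.Linear(d_model, 3)

    def forward(self, q, attn_cmp, attn_sel, attn_win):
        # q: (batch, seq, d_model) query-side features driving the gates.
        # attn_*: (batch, seq, d_model) outputs of the compression,
        # selection, and sliding-window branches, respectively.
        g = torch.sigmoid(self.gate(q))                        # (B, S, 3)
        branches = torch.stack([attn_cmp, attn_sel, attn_win], dim=-1)
        # Broadcast gates over the feature dimension and sum the branches.
        return (branches * g.unsqueeze(-2)).sum(dim=-1)        # (B, S, d_model)
```

Sigmoid gates (rather than a softmax over branches) let the model weight each branch independently per token, so global and local context can both contribute at full strength when useful.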

AI architecture · DeepSeek · Large Language Models
5 min read