Tagged articles
2 articles
AI2ML AI to Machine Learning
Dec 22, 2025 · Artificial Intelligence

The Core Ideas Behind Paged Attention for KV‑Caching

This article explains how Paged Attention, introduced by the vLLM team, applies virtual‑memory techniques, non‑contiguous block mapping, copy‑on‑write reuse, distributed scheduling, and hardware‑level optimizations to improve KV‑cache efficiency and reduce memory fragmentation in large language model serving.

Copy-on-Write · Distributed Scheduling · GPU Memory Management
0 likes · 6 min read
Ops Development & AI Practice
Apr 2, 2025 · Artificial Intelligence

How Cache‑Augmented Generation (CAG) Supercharges LLM Inference

Cache‑Augmented Generation (CAG) speeds up large language model text generation by caching the key‑value states of the Transformer attention layers, so that keys and values for earlier tokens are not recomputed at every step of autoregressive decoding. This cuts the otherwise quadratic compute cost while leaving the model’s knowledge unchanged.

AI Performance · CAG · Cache‑augmented generation
0 likes · 9 min read