Tagged articles
4 articles
Page 1 of 1
Tencent Technical Engineering
Tencent Technical Engineering
May 25, 2026 · Artificial Intelligence

vLLM Deep Dive: Continuous Batching and Paged Attention for Fast LLM Inference

This article walks through a two‑month source‑code study of vLLM, explaining how token‑level scheduling, continuous batching, and the Paged Attention mechanism reshape tensor dimensions to turn large‑model inference into a compute‑bound, high‑throughput process while managing GPU memory efficiently.

FlashAttentionGPU optimizationLLM inference
0 likes · 29 min read
vLLM Deep Dive: Continuous Batching and Paged Attention for Fast LLM Inference
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Dec 22, 2025 · Artificial Intelligence

The Core Ideas Behind Paged Attention for KV‑Caching

This article explains how Paged Attention, introduced by the vLLM team, applies virtual‑memory techniques, non‑contiguous block mapping, copy‑on‑write reuse, distributed scheduling, and hardware‑level optimizations to improve KV‑cache efficiency and reduce memory fragmentation in large language model serving.

Copy-on-WriteDistributed SchedulingGPU Memory Management
0 likes · 6 min read
The Core Ideas Behind Paged Attention for KV‑Caching
Architect
Architect
Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.

LLM inferencePerformance OptimizationSpeculative Decoding
0 likes · 23 min read
How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism