Tagged articles
6 articles
Page 1 of 1
DeepHub IMBA
DeepHub IMBA
Mar 3, 2026 · Artificial Intelligence

The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture

The article traces five eras of KV cache management for LLM inference—from its absence before Transformers to the emerging unified hybrid memory architecture—comparing vLLM, SGLang, and TensorRT‑LLM and offering a decision framework for selecting the right solution in various deployment scenarios.

KV CacheLLM inferenceMemory Management
0 likes · 16 min read
The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture
Ops Community
Ops Community
Jan 18, 2026 · Artificial Intelligence

How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.

GPU optimizationOpenAI API CompatibilityPagedAttention
0 likes · 61 min read
How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching
Infra Learning Club
Infra Learning Club
Apr 4, 2025 · Artificial Intelligence

Testing Augment Code: A Powerful New Rival to Cursor

The article evaluates Augment Code, an AI‑powered coding assistant with 200K token context, persistent memory, multimodal input, and top SWE‑bench scores, walks through its installation, explores its use on vllm and PagedAttention, demonstrates adding a new model and auto‑generating a WeChat mini‑program, and compares its capabilities and speed to Cursor.

AI coding assistantAugment CodeCursor
0 likes · 8 min read
Testing Augment Code: A Powerful New Rival to Cursor
Baobao Algorithm Notes
Baobao Algorithm Notes
Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU MemoryLLM inferencePagedAttention
0 likes · 25 min read
How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference