Large Language Model Inference Overview and Performance Optimizations
This article presents a comprehensive overview of large language model inference. It describes the prefill and decoding stages; key performance metrics such as throughput, latency, and QPS; and a series of system-level optimizations, including CPU/GPU pipelining, dynamic batching, and KV-cache quantization, along with hardware considerations, that together significantly improve inference efficiency on modern GPUs.
The presentation begins with an introduction to large language model (LLM) inference, explaining that each request consists of a prefill phase that processes the entire user input and builds a KV cache, followed by many decoding steps that generate output tokens one by one. Prefill accounts for less than 10% of total latency, while decoding dominates with over 90% of the time.
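The two phases can be sketched in a few lines of Python. This is a toy stand-in, not a real model: `toy_forward` is a hypothetical placeholder for one transformer step, and the "logits" are faked, but it shows how prefill fills the KV cache in one pass while decoding appends one cache entry per generated token.

```python
# Minimal sketch of the two inference phases; `toy_forward` is an
# illustrative stand-in for a transformer forward step, not a real API.

def toy_forward(token, kv_cache):
    """Pretend transformer step: attends over the cache, emits next token."""
    kv_cache.append(token)               # each step adds one K/V entry
    return (sum(kv_cache) + 1) % 50      # dummy "logits -> argmax"

def generate(prompt_tokens, max_new_tokens):
    kv_cache = []
    # Prefill: the whole prompt is processed, filling the KV cache.
    for tok in prompt_tokens:
        next_tok = toy_forward(tok, kv_cache)
    output = [next_tok]                  # first token ends the prefill phase
    # Decoding: tokens are generated one at a time, each reusing the cache.
    for _ in range(max_new_tokens - 1):
        next_tok = toy_forward(next_tok, kv_cache)
        output.append(next_tok)
    return output, len(kv_cache)

out, cache_len = generate([1, 2, 3], 4)
print(cache_len)  # 3 prompt tokens + 3 decode steps = 6 cache entries
```

Because decoding runs one such step per output token, a response of hundreds of tokens spends almost all of its time in this loop, which is why decoding dominates total latency.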
Four primary performance metrics are defined: Throughput (number of decoding steps per second), First Token Latency (time to complete the prefill and produce the first token), Latency (time per decoding step), and QPS (queries per second). The article discusses how these metrics are measured and why they matter for real‑world LLM services.
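The relationships between these four metrics can be made concrete with a small calculation. The timing numbers below are illustrative, not measurements from the talk:

```python
# Hypothetical timings for one request: how the four metrics relate.
prefill_time = 0.08                      # seconds to process the whole prompt
decode_step_times = [0.03] * 20          # per-token decoding latencies

first_token_latency = prefill_time                       # First Token Latency
per_step_latency = sum(decode_step_times) / len(decode_step_times)  # Latency
throughput = len(decode_step_times) / sum(decode_step_times)  # decode steps/s
total_time = prefill_time + sum(decode_step_times)
qps = 1.0 / total_time       # one request served end-to-end per total_time

print(round(throughput, 2), round(qps, 2))  # 33.33 1.47
```

Note that throughput and per-step latency are reciprocal for a single stream, while QPS also folds in prefill time and, on a real server, batching across concurrent requests.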
Several optimizations are then described:
Pipeline and high-performance sampling: Separate thread pools handle tokenization, fast sampling, and GPU computation, overlapping CPU and GPU work for a 10-20% QPS improvement.
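A minimal sketch of this overlap, assuming one CPU pool and one GPU pool (the function names and sleep-based workloads are illustrative): while the "GPU" processes request *i*, the CPU pool already tokenizes request *i+1*.

```python
# Sketch of overlapping CPU-side work (tokenization) with GPU compute
# using separate thread pools; workloads are simulated with sleeps.
from concurrent.futures import ThreadPoolExecutor
import time

def tokenize(text):          # CPU-bound preprocessing
    time.sleep(0.01)
    return text.split()

def gpu_step(tokens):        # stand-in for a GPU forward pass
    time.sleep(0.01)
    return len(tokens)

cpu_pool = ThreadPoolExecutor(max_workers=1)
gpu_pool = ThreadPoolExecutor(max_workers=1)

requests = ["a b c", "d e", "f g h i"]
results = []
pending_tok = cpu_pool.submit(tokenize, requests[0])
for i in range(len(requests)):
    tokens = pending_tok.result()
    # Kick off tokenization of the next request while the GPU runs this one.
    if i + 1 < len(requests):
        pending_tok = cpu_pool.submit(tokenize, requests[i + 1])
    results.append(gpu_pool.submit(gpu_step, tokens).result())
print(results)  # token counts per request: [3, 2, 4]
```

The gain comes purely from hiding CPU latency behind GPU latency; no individual step gets faster.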
Dynamic batching (Merge Step): Incoming requests are merged with ongoing decoding tasks, allowing simultaneous prefill and decoding on the same GPU batch, potentially doubling QPS.
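The core of the merge step is batch construction: one fused batch holds many length-1 decoding segments plus one long prefill segment. A minimal sketch, with illustrative names and request shapes:

```python
# "Merge step" sketch: fuse ongoing decode requests (1 token each) with a
# newly arrived prompt (many tokens) into a single variable-length batch.

def merge_step(decoding_reqs, new_req=None):
    """Build the token list and per-request segment lengths for one step."""
    tokens, seg_lens = [], []
    for req in decoding_reqs:      # each ongoing request contributes 1 token
        tokens.append(req["last_token"])
        seg_lens.append(1)
    if new_req is not None:        # the new prompt prefills in the same batch
        tokens.extend(new_req["prompt"])
        seg_lens.append(len(new_req["prompt"]))
    return tokens, seg_lens

decoding = [{"last_token": 7}, {"last_token": 9}]
tokens, seg_lens = merge_step(decoding, {"prompt": [1, 2, 3, 4]})
print(seg_lens)  # [1, 1, 4]: two decode steps fused with one prefill
```

The attention kernel then uses `seg_lens` to keep each request attending only to its own KV cache, so decode requests are not stalled while a new prompt prefills.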
Decoding Attention: A custom CUDA kernel that processes decoding attention with a fixed query length of 1, offering faster execution than Flash Attention for decoding.
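The computation such a kernel specializes is ordinary attention with a single query row. A pure-Python sketch of that math (a real implementation fuses this into one CUDA kernel; the helper name is illustrative):

```python
# Single-query attention as computed in one decode step (query length 1).
import math

def decode_attention(q, K, V):
    """q: [d]; K, V: [seq, d] -> attention output [d] for one decode step."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                                   # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)]

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = decode_attention(q, K, V)
print([round(x, 3) for x in out])
```

Because the query length is fixed at 1, the kernel's tiling and parallelization can be tuned for reading a long KV cache with a single query row, which is where it can beat a general-purpose kernel like Flash Attention in the decode phase.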
KV‑cache quantization: Group‑wise INT8 quantization reduces KV memory by ~50%, increasing concurrent request capacity by 100%.
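Group-wise quantization stores one float scale per small group of values, so each cached value drops from 2 bytes (FP16) to 1 byte plus a small per-group overhead. A minimal sketch, with an illustrative group size:

```python
# Group-wise INT8 quantization sketch: each group of values shares one
# float scale; values are stored as signed 8-bit integers.

def quantize_groupwise(values, group_size=8):
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    q, scales = [], []
    for g in groups:
        scale = max(abs(v) for v in g) / 127 or 1.0   # avoid scale 0
        scales.append(scale)
        q.append([max(-127, min(127, round(v / scale))) for v in g])
    return q, scales

def dequantize_groupwise(q, scales):
    out = []
    for group, scale in zip(q, scales):
        out.extend(x * scale for x in group)
    return out

kv = [0.5, -1.25, 3.0, 0.0, 2.0, -0.75, 1.5, 0.25]
q, scales = quantize_groupwise(kv)
restored = dequantize_groupwise(q, scales)
print(max(abs(a - b) for a, b in zip(kv, restored)) < 0.02)  # True
```

Halving KV-cache bytes per token directly doubles how many concurrent sequences fit in the same VRAM budget, which is where the 100% capacity gain comes from.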
Matrix‑multiplication quantization: Per‑channel/per‑token mixed INT8 quantization accelerates GEMM operations by up to 100% while preserving accuracy.
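The mixed scheme gives activations one scale per token (row) and weights one scale per output channel, so the integer product is rescaled by the product of the two scales. A small sketch under those assumptions (the helper names are illustrative, and the weight matrix is stored with one output channel per row):

```python
# Per-token / per-channel INT8 GEMM sketch: integer dot products are
# accumulated, then rescaled by row scale * channel scale.

def quantize_rows(mat):
    scales = [max(abs(v) for v in row) / 127 or 1.0 for row in mat]
    q = [[round(v / s) for v in row] for row, s in zip(mat, scales)]
    return q, scales

def int8_gemm(a_q, a_scales, b_q, b_scales):
    """a_q: [m, k], one scale per token row; b_q: [n, k], one per channel."""
    m, n = len(a_q), len(b_q)
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(x * y for x, y in zip(a_q[i], b_q[j]))  # INT32 accumulate
            out[i][j] = acc * a_scales[i] * b_scales[j]       # dequantize
    return out

a = [[0.5, -1.0], [2.0, 0.25]]         # activations: one scale per token
w = [[1.0, 0.5], [-0.5, 2.0]]          # weights: each row is one channel
a_q, a_s = quantize_rows(a)
w_q, w_s = quantize_rows(w)
approx = int8_gemm(a_q, a_s, w_q, w_s)
exact = [[0.5 * 1.0 - 1.0 * 0.5, 0.5 * -0.5 - 1.0 * 2.0],
         [2.0 * 1.0 + 0.25 * 0.5, 2.0 * -0.5 + 0.25 * 2.0]]
err = max(abs(x - y) for r1, r2 in zip(approx, exact) for x, y in zip(r1, r2))
print(err < 0.05)  # True: quantized result tracks the FP result closely
```

Finer scale granularity (per token, per channel) is what preserves accuracy; the speed comes from the hardware's INT8 tensor-core throughput, not from this Python loop.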
INT8 vs INT4 vs FP8: The trade‑offs between different quantization precisions are examined, showing INT8’s superior performance for server‑side workloads.
Virtual Memory allocator: A page‑style memory manager dynamically expands KV cache as needed, avoiding over‑allocation and improving QPS by ~200%.
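The allocator's key idea can be sketched with a page table per request: pages are granted lazily as the sequence grows and returned when the request finishes. Names and the page size below are illustrative:

```python
# Page-style KV allocator sketch: cache memory is handed out in fixed-size
# pages as a sequence grows, instead of reserving the max length up front.

PAGE_TOKENS = 16   # tokens per page (illustrative)

class PagedKVAllocator:
    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))
        self.page_table = {}              # request id -> list of page ids

    def append_token(self, req_id, pos):
        """Ensure the page holding token `pos` exists for this request."""
        pages = self.page_table.setdefault(req_id, [])
        if pos // PAGE_TOKENS >= len(pages):
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")
            pages.append(self.free_pages.pop())
        return pages[pos // PAGE_TOKENS]

    def release(self, req_id):
        self.free_pages.extend(self.page_table.pop(req_id, []))

alloc = PagedKVAllocator(total_pages=4)
for pos in range(40):                     # 40 tokens need ceil(40/16) = 3 pages
    alloc.append_token("req-1", pos)
print(len(alloc.page_table["req-1"]), len(alloc.free_pages))  # 3 1
```

Since a request only ever holds pages for tokens it has actually produced, memory that a fixed max-length reservation would waste stays in the free pool and admits more concurrent requests.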
The hardware section emphasizes that LLM inference is memory‑bound, recommending GPUs with high bandwidth and large VRAM (e.g., A100, H100) to maximize throughput and minimize latency.
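Why memory bandwidth is the limit can be seen from a back-of-the-envelope bound: every decode step must stream all weights (plus the KV cache) through memory, so bytes moved divided by bandwidth lower-bounds per-step latency. The figures below are illustrative (a 7B-parameter model in FP16 on roughly A100-class HBM):

```python
# Roofline-style bound: single-stream decode speed is limited by how fast
# the weights can be read from GPU memory, not by FLOPs.
params = 7e9
bytes_per_param = 2                 # FP16 weights
bandwidth = 2.0e12                  # ~2 TB/s HBM bandwidth (illustrative)

weight_bytes = params * bytes_per_param
min_step_latency = weight_bytes / bandwidth        # seconds per decode step
max_tokens_per_s = 1.0 / min_step_latency          # single-stream upper bound

print(round(min_step_latency * 1000, 1), "ms/step,",
      round(max_tokens_per_s), "tokens/s upper bound")
```

The bound is independent of batch size on the weight-read side, which is also why batching raises throughput: the same weight traffic is amortized over many concurrent sequences.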
The article concludes with a Q&A session addressing specific questions about Flash Attention, INT4 weight‑only quantization, and KV‑cache de‑quantization, followed by references to open‑source code and additional resources.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.