
Large Language Model Inference Overview and Performance Optimizations

This article presents an overview of large language model inference. It describes the prefill and decoding stages, defines the key performance metrics (throughput, latency, first token latency, and QPS), and details a series of system-level optimizations, including pipelining, dynamic batching, KV-cache quantization, and hardware selection, that significantly improve inference efficiency on modern GPUs.


The presentation begins with an introduction to large language model (LLM) inference: each request consists of a prefill phase, which processes the entire user input in one pass and builds the KV cache, followed by many decoding steps that generate output tokens one at a time. Prefill typically accounts for less than 10% of total latency, while decoding dominates with over 90% of the time.
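
The two phases described above can be sketched as a toy generation loop. Everything here is illustrative: `step_fn` stands in for a model forward pass, and the "cache" stores tokens rather than real key/value tensors.

```python
# Toy sketch of the two-phase generation loop (illustrative only):
# prefill runs once over the whole prompt and fills a KV cache;
# decode then appends one token per step, reusing the cache.

def generate(prompt_ids, step_fn, max_new_tokens):
    kv_cache = []                 # stand-in for the per-layer KV tensors
    # Prefill: process all prompt tokens in one pass, caching their states.
    for tok in prompt_ids:
        kv_cache.append(tok)      # a real cache stores keys/values, not tokens
    next_tok = step_fn(kv_cache)  # the first generated token ends prefill

    out = [next_tok]
    # Decode: each step attends over the cache and emits exactly one token.
    for _ in range(max_new_tokens - 1):
        kv_cache.append(next_tok)
        next_tok = step_fn(kv_cache)
        out.append(next_tok)
    return out

# A trivial "model" that just returns the cache length, to make this runnable.
print(generate([10, 11, 12], len, 4))  # [3, 4, 5, 6]
```

The key property is that the prompt is consumed once (one expensive, parallel pass), while every subsequent token costs a full, mostly sequential decoding step; this is why decoding dominates total latency.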

Four primary performance metrics are defined: Throughput (number of decoding steps per second), First Token Latency (time to complete the prefill and produce the first token), Latency (time per decoding step), and QPS (queries per second). The article discusses how these metrics are measured and why they matter for real‑world LLM services.
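
A minimal sketch of how these metrics fall out of request timestamps. The names (`t_arrive`, `t_first`, `t_done`) are illustrative, not from the talk.

```python
# First Token Latency = prefill time; per-step latency = decode time spread
# over the generated tokens. Throughput and QPS are service-level rates.

def request_metrics(t_arrive, t_first, t_done, n_output_tokens):
    first_token_latency = t_first - t_arrive          # prefill cost
    decode_time = t_done - t_first
    per_step_latency = decode_time / max(n_output_tokens - 1, 1)
    return first_token_latency, per_step_latency

# One request: arrives at t=0.0 s, first token at 0.2 s, done at 2.2 s,
# 21 output tokens -> 0.2 s prefill and 0.1 s per decoding step.
ftl, step = request_metrics(0.0, 0.2, 2.2, 21)
print(ftl, round(step, 3))

def service_rates(total_decode_steps, total_requests, window_seconds):
    throughput = total_decode_steps / window_seconds  # decoding steps / s
    qps = total_requests / window_seconds             # completed queries / s
    return throughput, qps

print(service_rates(5000, 100, 10.0))  # (500.0, 10.0)
```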

Several optimizations are then described:

Pipelining and high-performance sampling: Separate thread pools handle tokenization, fast sampling, and GPU computation, overlapping CPU and GPU work for a 10-20% QPS improvement.
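
A hedged sketch of the pipelining idea using Python thread pools: CPU-side stages (tokenization, sampling) run in one pool and the model forward pass in another, so with many in-flight requests the stages overlap. The stage functions are placeholders, not the talk's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

cpu_pool = ThreadPoolExecutor(max_workers=2)   # tokenize + sample
gpu_pool = ThreadPoolExecutor(max_workers=1)   # model forward passes

def tokenize(text):        # CPU stage 1 (placeholder)
    return text.split()

def forward(tokens):       # "GPU" stage (placeholder for the model call)
    return len(tokens)

def sample(logits):        # CPU stage 2 (placeholder)
    return logits * 2

def serve(text):
    # Each stage is submitted to its own pool; while one request waits on
    # the GPU pool, another request's tokenization or sampling can proceed.
    toks = cpu_pool.submit(tokenize, text).result()
    logits = gpu_pool.submit(forward, toks).result()
    return cpu_pool.submit(sample, logits).result()

print(serve("a b c"))  # 6
```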

Dynamic batching (Merge Step): Incoming requests are merged with ongoing decoding tasks, allowing simultaneous prefill and decoding on the same GPU batch, potentially doubling QPS.
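
The merge step above can be sketched as a small scheduling routine. The request structure and field names are assumptions for illustration.

```python
# At each iteration, newly arrived prompts (needing prefill) are folded into
# the batch of requests that are already decoding, so a single GPU pass
# serves both phases at once.

def merge_step(decoding_batch, waiting_queue, max_batch):
    merged = list(decoding_batch)
    while waiting_queue and len(merged) < max_batch:
        req = waiting_queue.pop(0)
        req["phase"] = "prefill"      # runs its full prompt this step
        merged.append(req)
    return merged

decoding = [{"id": 1, "phase": "decode"}, {"id": 2, "phase": "decode"}]
waiting = [{"id": 3}, {"id": 4}, {"id": 5}]
batch = merge_step(decoding, waiting, max_batch=4)
print([(r["id"], r["phase"]) for r in batch])
# [(1, 'decode'), (2, 'decode'), (3, 'prefill'), (4, 'prefill')]
```

Without merging, a new prompt would wait for the decoding batch to drain (or force it to pause); merging keeps the GPU saturated, which is where the potential QPS doubling comes from.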

Decoding Attention: A custom CUDA kernel specialized for the decoding phase, where each request contributes a query of length exactly 1; in this regime it executes faster than Flash Attention.
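
What "query length 1" means computationally: a single query vector attends over all cached keys and values. A pure-Python sketch of that math (the talk's version is a fused CUDA kernel; this is only the reference computation):

```python
import math

def decode_attention(q, keys, values):
    # q: [d]; keys, values: [t][d] read from the KV cache.
    d = len(q)
    scores = [sum(q[i] * k[i] for i in range(d)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of values -> one output vector for the one query token.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
print([round(x, 3) for x in decode_attention(q, keys, values)])
```

Because the query dimension is 1, there is no query-side tiling to exploit; the kernel's job reduces to streaming the KV cache through the GPU as fast as possible, which is why a specialized kernel can beat a general one here.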

KV‑cache quantization: Group‑wise INT8 quantization reduces KV memory by ~50%, increasing concurrent request capacity by 100%.
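
A minimal sketch of group-wise INT8 quantization: each group of consecutive values shares one floating-point scale, so values are stored in 8 bits instead of 16, roughly halving KV memory. Group size and layout here are illustrative assumptions.

```python
def quantize_groupwise(x, group_size=4):
    q, scales = [], []
    for g in range(0, len(x), group_size):
        group = x[g:g + group_size]
        # One scale per group, chosen so the largest magnitude maps to 127.
        scale = max(abs(v) for v in group) / 127.0 or 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(v / scale))) for v in group)
    return q, scales

def dequantize_groupwise(q, scales, group_size=4):
    return [q[i] * scales[i // group_size] for i in range(len(q))]

x = [0.5, -1.0, 0.25, 0.75, 10.0, -20.0, 5.0, 0.0]
q, s = quantize_groupwise(x)
x_hat = dequantize_groupwise(q, s)
print(max(abs(a - b) for a, b in zip(x, x_hat)))  # small per-group error
```

Grouping matters because KV values in different regions have very different magnitudes; a single tensor-wide scale would crush the small groups, while per-group scales keep the error local.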

Matrix‑multiplication quantization: Per‑channel/per‑token mixed INT8 quantization accelerates GEMM operations by up to 100% while preserving accuracy.
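
The per-token/per-channel scheme can be sketched as follows: each activation row (token) and each weight column (channel) gets its own scale, the multiply runs on INT8 values, and the product is rescaled back to float. A toy 2x2 example under those assumptions, not the talk's kernel:

```python
def q_row(row):
    # Symmetric INT8 quantization of one vector with its own scale.
    s = max(abs(v) for v in row) / 127.0 or 1.0
    return [round(v / s) for v in row], s

def int8_gemm(A, B):
    cols = list(zip(*B))
    col_q = [q_row(list(c)) for c in cols]   # per-channel scales for weights
    out = []
    for row in A:
        qa, sa = q_row(row)                  # per-token scale for activations
        # Integer dot product, then rescale by the two scales.
        out.append([sa * sc * sum(a * b for a, b in zip(qa, qb))
                    for qb, sc in col_q])
    return out

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, so the output is close to A
print(int8_gemm(A, B))
```

The mixed granularity is the accuracy-preserving part: activations vary per token and weights vary per output channel, so scaling along both axes keeps quantization error small while the heavy inner loop stays in INT8.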

INT8 vs INT4 vs FP8: The trade‑offs between different quantization precisions are examined, showing INT8’s superior performance for server‑side workloads.

Virtual Memory allocator: A page‑style memory manager dynamically expands KV cache as needed, avoiding over‑allocation and improving QPS by ~200%.
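
A toy sketch of the page-style idea: the KV cache grows in fixed-size pages as a sequence lengthens, instead of reserving space for the maximum sequence length up front. Page size and the API below are assumptions for illustration.

```python
class PagedKVCache:
    def __init__(self, page_size=16, total_pages=1024):
        self.page_size = page_size
        self.pages = {}                       # seq_id -> list of page ids
        self.free_pages = list(range(total_pages))

    def append_token(self, seq_id, pos):
        # Allocate a new page only when the sequence crosses a page boundary.
        if pos % self.page_size == 0:
            self.pages.setdefault(seq_id, []).append(self.free_pages.pop())

    def release(self, seq_id):
        # Finished sequences return their pages to the shared free pool.
        self.free_pages.extend(self.pages.pop(seq_id, []))

cache = PagedKVCache(page_size=16)
for pos in range(40):                # a 40-token sequence
    cache.append_token("req-1", pos)
print(len(cache.pages["req-1"]))     # 3 pages cover 40 tokens
cache.release("req-1")
```

Because no request holds memory it is not actually using, far more requests fit in the same VRAM, which is where the large QPS gain comes from.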

The hardware section emphasizes that LLM inference is memory‑bound, recommending GPUs with high bandwidth and large VRAM (e.g., A100, H100) to maximize throughput and minimize latency.
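
A back-of-envelope check of the memory-bound claim: each decoding step must stream the full weight set through the GPU, so a hard floor on step latency is (bytes of weights) / (memory bandwidth). The numbers below are illustrative, not from the talk.

```python
def min_step_latency_ms(n_params_billion, bytes_per_param, bandwidth_gb_s):
    # Bandwidth floor: every weight byte is read once per decoding step.
    weight_bytes = n_params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3

# A 7B-parameter model in FP16 on roughly 2 TB/s of HBM bandwidth
# (A100-class, approximate) -> about 7 ms per decoding step at batch 1.
print(round(min_step_latency_ms(7, 2, 2000), 2))  # 7.0
```

This is why bandwidth, not raw FLOPS, sets the decoding speed, and why halving bytes per parameter (quantization) or doubling bandwidth (newer HBM) translates almost directly into faster steps.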

The article concludes with a Q&A session addressing specific questions about Flash Attention, INT4 weight‑only quantization, and KV‑cache de‑quantization, followed by references to open‑source code and additional resources.

Tags: performance optimization, quantization, latency, throughput, GPU, inference
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
