Large Language Model Inference Overview and Performance Optimizations
This article presents a comprehensive overview of large language model inference. It describes the prefill and decoding stages; key performance metrics such as throughput, latency, and QPS; and a series of system-level optimizations, including CPU/GPU pipelining, dynamic batching, and KV-cache quantization, along with hardware considerations, that together significantly improve inference efficiency on modern GPUs.
The presentation begins with an introduction to large language model (LLM) inference, explaining that each request consists of a prefill phase that processes the entire user input and builds a KV cache, followed by many decoding steps that generate output tokens one by one. Prefill accounts for less than 10% of total latency, while decoding dominates with over 90% of the time.
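The two phases can be sketched in a few lines of Python. This is a toy stand-in, not a real model: `toy_forward` is a hypothetical placeholder for one transformer step, and the "logits" are faked, but it shows how prefill fills the KV cache in one pass while decoding appends one cache entry per generated token.

```python
# Minimal sketch of the two inference phases; `toy_forward` is an
# illustrative stand-in for a transformer forward step, not a real API.

def toy_forward(token, kv_cache):
    """Pretend transformer step: attends over the cache, emits next token."""
    kv_cache.append(token)               # each step adds one K/V entry
    return (sum(kv_cache) + 1) % 50      # dummy "logits -> argmax"

def generate(prompt_tokens, max_new_tokens):
    kv_cache = []
    # Prefill: the whole prompt is processed, filling the KV cache.
    for tok in prompt_tokens:
        next_tok = toy_forward(tok, kv_cache)
    output = [next_tok]                  # first token ends the prefill phase
    # Decoding: tokens are generated one at a time, each reusing the cache.
    for _ in range(max_new_tokens - 1):
        next_tok = toy_forward(next_tok, kv_cache)
        output.append(next_tok)
    return output, len(kv_cache)

out, cache_len = generate([1, 2, 3], 4)
print(cache_len)  # 3 prompt tokens + 3 decode steps = 6 cache entries
```

Because decoding runs one such step per output token, a response of hundreds of tokens spends almost all of its time in this loop, which is why decoding dominates total latency.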
Four primary performance metrics are defined: Throughput (number of decoding steps per second), First Token Latency (time to complete the prefill and produce the first token), Latency (time per decoding step), and QPS (queries per second). The article discusses how these metrics are measured and why they matter for real‑world LLM services.
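The relationships between these four metrics can be made concrete with a small calculation. The timing numbers below are illustrative, not measurements from the talk:

```python
# Hypothetical timings for one request: how the four metrics relate.
prefill_time = 0.08                      # seconds to process the whole prompt
decode_step_times = [0.03] * 20          # per-token decoding latencies

first_token_latency = prefill_time                       # First Token Latency
per_step_latency = sum(decode_step_times) / len(decode_step_times)  # Latency
throughput = len(decode_step_times) / sum(decode_step_times)  # decode steps/s
total_time = prefill_time + sum(decode_step_times)
qps = 1.0 / total_time       # one request served end-to-end per total_time

print(round(throughput, 2), round(qps, 2))  # 33.33 1.47
```

Note that throughput and per-step latency are reciprocal for a single stream, while QPS also folds in prefill time and, on a real server, batching across concurrent requests.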
Several optimizations are then described:
Pipeline and high-performance sampling: Separate thread pools handle tokenization, fast sampling, and GPU computation, overlapping CPU and GPU work for a 10-20% QPS improvement.
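A minimal sketch of this overlap, assuming one CPU pool and one GPU pool (the function names and sleep-based workloads are illustrative): while the "GPU" processes request *i*, the CPU pool already tokenizes request *i+1*.

```python
# Sketch of overlapping CPU-side work (tokenization) with GPU compute
# using separate thread pools; workloads are simulated with sleeps.
from concurrent.futures import ThreadPoolExecutor
import time

def tokenize(text):          # CPU-bound preprocessing
    time.sleep(0.01)
    return text.split()

def gpu_step(tokens):        # stand-in for a GPU forward pass
    time.sleep(0.01)
    return len(tokens)

cpu_pool = ThreadPoolExecutor(max_workers=1)
gpu_pool = ThreadPoolExecutor(max_workers=1)

requests = ["a b c", "d e", "f g h i"]
results = []
pending_tok = cpu_pool.submit(tokenize, requests[0])
for i in range(len(requests)):
    tokens = pending_tok.result()
    # Kick off tokenization of the next request while the GPU runs this one.
    if i + 1 < len(requests):
        pending_tok = cpu_pool.submit(tokenize, requests[i + 1])
    results.append(gpu_pool.submit(gpu_step, tokens).result())
print(results)  # token counts per request: [3, 2, 4]
```

The gain comes purely from hiding CPU latency behind GPU latency; no individual step gets faster.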
Dynamic batching (Merge Step): Incoming requests are merged with ongoing decoding tasks, allowing simultaneous prefill and decoding on the same GPU batch, potentially doubling QPS.
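The core of the merge step is batch construction: one fused batch holds many length-1 decoding segments plus one long prefill segment. A minimal sketch, with illustrative names and request shapes:

```python
# "Merge step" sketch: fuse ongoing decode requests (1 token each) with a
# newly arrived prompt (many tokens) into a single variable-length batch.

def merge_step(decoding_reqs, new_req=None):
    """Build the token list and per-request segment lengths for one step."""
    tokens, seg_lens = [], []
    for req in decoding_reqs:      # each ongoing request contributes 1 token
        tokens.append(req["last_token"])
        seg_lens.append(1)
    if new_req is not None:        # the new prompt prefills in the same batch
        tokens.extend(new_req["prompt"])
        seg_lens.append(len(new_req["prompt"]))
    return tokens, seg_lens

decoding = [{"last_token": 7}, {"last_token": 9}]
tokens, seg_lens = merge_step(decoding, {"prompt": [1, 2, 3, 4]})
print(seg_lens)  # [1, 1, 4]: two decode steps fused with one prefill
```

The attention kernel then uses `seg_lens` to keep each request attending only to its own KV cache, so decode requests are not stalled while a new prompt prefills.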
Decoding Attention: A custom CUDA kernel that processes decoding attention with a fixed query length of 1, offering faster execution than Flash Attention for decoding.
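The computation such a kernel specializes is ordinary attention with a single query row. A pure-Python sketch of that math (a real implementation fuses this into one CUDA kernel; the helper name is illustrative):

```python
# Single-query attention as computed in one decode step (query length 1).
import math

def decode_attention(q, K, V):
    """q: [d]; K, V: [seq, d] -> attention output [d] for one decode step."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                                   # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)]

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = decode_attention(q, K, V)
print([round(x, 3) for x in out])
```

Because the query length is fixed at 1, the kernel's tiling and parallelization can be tuned for reading a long KV cache with a single query row, which is where it can beat a general-purpose kernel like Flash Attention in the decode phase.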
KV‑cache quantization: Group‑wise INT8 quantization reduces KV memory by ~50%, increasing concurrent request capacity by 100%.
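Group-wise quantization stores one float scale per small group of values, so each cached value drops from 2 bytes (FP16) to 1 byte plus a small per-group overhead. A minimal sketch, with an illustrative group size:

```python
# Group-wise INT8 quantization sketch: each group of values shares one
# float scale; values are stored as signed 8-bit integers.

def quantize_groupwise(values, group_size=8):
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    q, scales = [], []
    for g in groups:
        scale = max(abs(v) for v in g) / 127 or 1.0   # avoid scale 0
        scales.append(scale)
        q.append([max(-127, min(127, round(v / scale))) for v in g])
    return q, scales

def dequantize_groupwise(q, scales):
    out = []
    for group, scale in zip(q, scales):
        out.extend(x * scale for x in group)
    return out

kv = [0.5, -1.25, 3.0, 0.0, 2.0, -0.75, 1.5, 0.25]
q, scales = quantize_groupwise(kv)
restored = dequantize_groupwise(q, scales)
print(max(abs(a - b) for a, b in zip(kv, restored)) < 0.02)  # True
```

Halving KV-cache bytes per token directly doubles how many concurrent sequences fit in the same VRAM budget, which is where the 100% capacity gain comes from.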
Matrix‑multiplication quantization: Per‑channel/per‑token mixed INT8 quantization accelerates GEMM operations by up to 100% while preserving accuracy.
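The mixed scheme gives activations one scale per token (row) and weights one scale per output channel, so the integer product is rescaled by the product of the two scales. A small sketch under those assumptions (the helper names are illustrative, and the weight matrix is stored with one output channel per row):

```python
# Per-token / per-channel INT8 GEMM sketch: integer dot products are
# accumulated, then rescaled by row scale * channel scale.

def quantize_rows(mat):
    scales = [max(abs(v) for v in row) / 127 or 1.0 for row in mat]
    q = [[round(v / s) for v in row] for row, s in zip(mat, scales)]
    return q, scales

def int8_gemm(a_q, a_scales, b_q, b_scales):
    """a_q: [m, k], one scale per token row; b_q: [n, k], one per channel."""
    m, n = len(a_q), len(b_q)
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(x * y for x, y in zip(a_q[i], b_q[j]))  # INT32 accumulate
            out[i][j] = acc * a_scales[i] * b_scales[j]       # dequantize
    return out

a = [[0.5, -1.0], [2.0, 0.25]]         # activations: one scale per token
w = [[1.0, 0.5], [-0.5, 2.0]]          # weights: each row is one channel
a_q, a_s = quantize_rows(a)
w_q, w_s = quantize_rows(w)
approx = int8_gemm(a_q, a_s, w_q, w_s)
exact = [[0.5 * 1.0 - 1.0 * 0.5, 0.5 * -0.5 - 1.0 * 2.0],
         [2.0 * 1.0 + 0.25 * 0.5, 2.0 * -0.5 + 0.25 * 2.0]]
err = max(abs(x - y) for r1, r2 in zip(approx, exact) for x, y in zip(r1, r2))
print(err < 0.05)  # True: quantized result tracks the FP result closely
```

Finer scale granularity (per token, per channel) is what preserves accuracy; the speed comes from the hardware's INT8 tensor-core throughput, not from this Python loop.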
INT8 vs INT4 vs FP8: The trade‑offs between different quantization precisions are examined, showing INT8’s superior performance for server‑side workloads.
Virtual Memory allocator: A page‑style memory manager dynamically expands KV cache as needed, avoiding over‑allocation and improving QPS by ~200%.
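The allocator's key idea can be sketched with a page table per request: pages are granted lazily as the sequence grows and returned when the request finishes. Names and the page size below are illustrative:

```python
# Page-style KV allocator sketch: cache memory is handed out in fixed-size
# pages as a sequence grows, instead of reserving the max length up front.

PAGE_TOKENS = 16   # tokens per page (illustrative)

class PagedKVAllocator:
    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))
        self.page_table = {}              # request id -> list of page ids

    def append_token(self, req_id, pos):
        """Ensure the page holding token `pos` exists for this request."""
        pages = self.page_table.setdefault(req_id, [])
        if pos // PAGE_TOKENS >= len(pages):
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")
            pages.append(self.free_pages.pop())
        return pages[pos // PAGE_TOKENS]

    def release(self, req_id):
        self.free_pages.extend(self.page_table.pop(req_id, []))

alloc = PagedKVAllocator(total_pages=4)
for pos in range(40):                     # 40 tokens need ceil(40/16) = 3 pages
    alloc.append_token("req-1", pos)
print(len(alloc.page_table["req-1"]), len(alloc.free_pages))  # 3 1
```

Since a request only ever holds pages for tokens it has actually produced, memory that a fixed max-length reservation would waste stays in the free pool and admits more concurrent requests.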
The hardware section emphasizes that LLM inference is memory‑bound, recommending GPUs with high bandwidth and large VRAM (e.g., A100, H100) to maximize throughput and minimize latency.
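Why memory bandwidth is the limit can be seen from a back-of-the-envelope bound: every decode step must stream all weights (plus the KV cache) through memory, so bytes moved divided by bandwidth lower-bounds per-step latency. The figures below are illustrative (a 7B-parameter model in FP16 on roughly A100-class HBM):

```python
# Roofline-style bound: single-stream decode speed is limited by how fast
# the weights can be read from GPU memory, not by FLOPs.
params = 7e9
bytes_per_param = 2                 # FP16 weights
bandwidth = 2.0e12                  # ~2 TB/s HBM bandwidth (illustrative)

weight_bytes = params * bytes_per_param
min_step_latency = weight_bytes / bandwidth        # seconds per decode step
max_tokens_per_s = 1.0 / min_step_latency          # single-stream upper bound

print(round(min_step_latency * 1000, 1), "ms/step,",
      round(max_tokens_per_s), "tokens/s upper bound")
```

The bound is independent of batch size on the weight-read side, which is also why batching raises throughput: the same weight traffic is amortized over many concurrent sequences.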
The article concludes with a Q&A session addressing specific questions about Flash Attention, INT4 weight‑only quantization, and KV‑cache de‑quantization, followed by references to open‑source code and additional resources.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.