
Large Language Model Inference Overview and Performance Optimizations

This article presents a comprehensive overview of large language model inference, detailing the prefill and decoding stages; key performance metrics such as throughput, latency, and QPS; and a series of system-level optimizations, including pipeline parallelism, dynamic batching, specialized attention kernels, virtual-memory KV-cache allocation, KV-cache quantization, and mixed-precision strategies, that improve GPU utilization and overall inference efficiency.

DataFunTalk

Inference Overview: Large language model (LLM) inference consists of a prefill phase that processes the entire user prompt and builds a KV cache, followed by many decoding steps that generate tokens one by one. Prefill accounts for less than 10% of total latency, while decoding dominates (>90%).
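The two phases above can be sketched in a few lines of Python. Here `fake_forward` is a hypothetical stand-in for one transformer pass, used only to exercise the prefill/decode control flow; a real model would consume and return tensors:

```python
# Toy stand-in for one transformer forward pass: takes new token ids plus
# cached state, returns (logits over a 5-token vocab, updated KV cache).
def fake_forward(tokens, kv_cache=None):
    cache = (kv_cache or []) + list(tokens)
    logits = [0.0] * 5
    logits[(cache[-1] + 1) % 5] = 1.0   # "predict" the next id mod 5
    return logits, cache

def generate(model_forward, prompt_ids, max_new_tokens):
    # Prefill: one pass over the whole prompt builds the KV cache
    # and yields the first generated token.
    logits, kv_cache = model_forward(prompt_ids)
    next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy
    output = [next_token]
    # Decoding: one token per step, reusing and extending the cache.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model_forward([next_token], kv_cache)
        next_token = max(range(len(logits)), key=logits.__getitem__)
        output.append(next_token)
    return output
```

The structure makes the cost asymmetry visible: prefill is one wide, compute-heavy pass, while decoding is a long sequence of tiny, memory-bound passes.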

Key Metrics: Performance is measured by Throughput (tokens per second at max load), First Token Latency (time to complete prefill and produce the first token), per‑token Latency (time between successive tokens), and QPS (requests per second). These metrics capture system capacity from different angles.
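As a rough sketch, the per-request metrics can be derived from three timestamps; the function and names below are illustrative, not from the article:

```python
def summarize_request(t_start, t_first_token, t_done, n_generated):
    """Derive per-request metrics from three wall-clock timestamps."""
    first_token_latency = t_first_token - t_start        # prefill time
    decode_time = t_done - t_first_token                 # remaining tokens
    per_token_latency = decode_time / max(n_generated - 1, 1)
    throughput = n_generated / (t_done - t_start)        # tokens per second
    return first_token_latency, per_token_latency, throughput

# Illustrative request: prefill takes 0.5 s, then 100 more tokens in 10 s.
ftl, ptl, tput = summarize_request(0.0, 0.5, 10.5, 101)
```

QPS is then an aggregate over many such requests rather than a per-request quantity, which is why the article lists it separately.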

Hardware Considerations: The choice of GPU (e.g., A100, H100) depends on bandwidth, memory size, and compute power. Since LLM inference is largely memory‑bound, devices with high bandwidth and large VRAM are preferred to maximize throughput.
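A back-of-envelope roofline estimate illustrates why bandwidth dominates: at batch size 1, every weight must be streamed from VRAM once per generated token, so bandwidth caps the token rate. The numbers below (a 7B-parameter FP16 model, ~2 TB/s of bandwidth) are illustrative assumptions, not measurements:

```python
def decode_tokens_per_sec_upper_bound(n_params, bytes_per_param, bandwidth_gbs):
    """Batch-1 decoding must read every weight once per token, so memory
    bandwidth bounds the token rate at bandwidth / model size."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_gbs * 1e9 / model_bytes

# 7B params * 2 bytes (FP16) = 14 GB per decode step; ~2000 GB/s bandwidth.
bound = decode_tokens_per_sec_upper_bound(7e9, 2, 2000)   # ~143 tokens/s
```

The same estimate shows why quantization helps: halving `bytes_per_param` roughly doubles the bound.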

Optimization 1 – Pipeline & High‑Performance Sampling: Three thread pools separate tokenization, computation, and post‑processing (fast sampling, detokenization). Overlapping these stages hides latency and yields a 10‑20% QPS boost.
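A minimal sketch of the three-pool idea using Python's standard `concurrent.futures`; the stage functions are toy stand-ins, and a real server would drive the GPU stage quite differently:

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools per stage: while one request occupies the (serialized)
# compute pool, other requests can tokenize or detokenize concurrently.
tokenize_pool = ThreadPoolExecutor(max_workers=2)
compute_pool = ThreadPoolExecutor(max_workers=1)   # GPU work is serialized
postproc_pool = ThreadPoolExecutor(max_workers=2)

def tokenize(text):        return [ord(c) % 101 for c in text]   # toy stage
def compute(token_ids):    return [t + 1 for t in token_ids]     # toy "model"
def detokenize(token_ids): return "-".join(map(str, token_ids))  # toy stage

def serve(text):
    ids = tokenize_pool.submit(tokenize, text).result()
    out = compute_pool.submit(compute, ids).result()
    return postproc_pool.submit(detokenize, out).result()

# Requests run concurrently, so their stages overlap across the pools.
with ThreadPoolExecutor(max_workers=4) as request_pool:
    results = list(request_pool.map(serve, ["ab", "cd", "ef"]))
```

The overlap is between requests: request B's tokenization proceeds while request A holds the compute pool, which is where the reported QPS gain comes from.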

Optimization 2 – Dynamic Batching (Merge Step): When a new request arrives during an ongoing decoding, its prefill is merged with the current decoding batch, creating a combined step that processes both simultaneously, effectively doubling QPS.
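One way to picture the merge step is as packing per-request inputs and sequence lengths into a single forward call; the scheduler shape below is a hypothetical illustration, not the article's actual API:

```python
# Sketch of a "merge step": the new prompt's prefill tokens are packed into
# the same forward pass as the one-token inputs of requests already decoding.
def build_merged_step(decoding_reqs, new_req=None):
    input_ids, seq_lens = [], []
    for req in decoding_reqs:            # each contributes its last token
        input_ids.append(req["last_token"])
        seq_lens.append(1)
    if new_req is not None:              # prefill contributes the full prompt
        input_ids.extend(new_req["prompt_ids"])
        seq_lens.append(len(new_req["prompt_ids"]))
    return input_ids, seq_lens

ids, lens = build_merged_step([{"last_token": 7}, {"last_token": 9}],
                              {"prompt_ids": [1, 2, 3]})
```

The kernel then uses `seq_lens` to attend each segment to its own KV cache, so the prefill rides along with decoding instead of stalling it.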

Optimization 3 – Decoding Attention: A custom CUDA kernel specialized for the decode phase, where the query length is 1, avoids the tiled-softmax bookkeeping and extra memory traffic that Flash Attention needs for long queries, providing faster per‑token processing.
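A pure-Python reference of single-query attention shows why the decode case is simple: with one query row, the softmax is a single small vector and needs none of Flash Attention's tiling. This is a sketch of the math, not the CUDA kernel itself:

```python
import math

def decoding_attention(q, keys, values):
    """Attention for one query vector (decode step: query length 1).
    The whole score row fits in fast memory, so softmax is computed
    directly with the usual max-subtraction for numerical stability."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the cached value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]
```

In the real kernel, `keys` and `values` come straight from the KV cache, and the single-row softmax is what removes the memory traffic mentioned above.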

Optimization 4 – Virtual‑Memory KV Cache: Instead of allocating a fixed large KV buffer per request, a page‑style allocator grows the cache on‑the‑fly, reducing wasted VRAM and improving QPS by ~200%.
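A toy page-style allocator, with hypothetical names, illustrates the idea: VRAM is modeled as a pool of fixed-size pages, and a request only claims a new page when its current one fills:

```python
PAGE_SIZE = 16  # tokens per KV-cache page (illustrative granularity)

class PagedKVCache:
    """Minimal sketch of on-demand, page-granular KV-cache allocation."""
    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))
        self.page_table = {}   # request id -> list of page indices
        self.lengths = {}      # request id -> tokens cached so far

    def append_token(self, req_id):
        pages = self.page_table.setdefault(req_id, [])
        used = self.lengths.get(req_id, 0)
        if used % PAGE_SIZE == 0:        # current page full (or none yet)
            pages.append(self.free_pages.pop())
        self.lengths[req_id] = used + 1

    def release(self, req_id):
        # Finished requests return their pages to the shared pool.
        self.free_pages.extend(self.page_table.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

Because each live request holds only the pages it actually uses, the VRAM freed from pessimistic fixed buffers can admit more concurrent requests, which is the source of the QPS gain.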

Optimization 5 – KV‑Cache Quantization: Group‑wise INT8 quantization compresses KV data by ~50%, allowing twice as many concurrent requests.
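A sketch of symmetric group-wise INT8 quantization; the group size and rounding details are assumptions for illustration, and real kernels operate on GPU tensors rather than Python lists:

```python
def quantize_groupwise_int8(values, group_size=4):
    """Each group of `group_size` floats shares one FP scale, so the cache
    shrinks to ~1 byte per value plus a small per-group scale overhead."""
    quantized, scales = [], []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        scale = max(abs(v) for v in group) / 127 or 1.0  # avoid div-by-zero
        scales.append(scale)
        quantized.append([round(v / scale) for v in group])
    return quantized, scales

def dequantize_groupwise_int8(quantized, scales):
    return [q * s for grp, s in zip(quantized, scales) for q in grp]
```

Smaller groups track local value ranges more tightly (less error) at the cost of storing more scales, which is the usual tuning knob in group-wise schemes.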

Optimization 6 – Matrix‑Multiplication Quantization: A mixed per‑channel/per‑token quantization pipeline converts FP16 activations to INT8 before GEMM, achieving up to 100% speedup while preserving accuracy.
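The mixed scheme can be sketched as per-row (per-token) scales for activations and per-column (per-channel) scales for weights, with an integer matmul rescaled back to float at the end. This toy version uses Python lists in place of GPU tensors and is an illustration, not the article's pipeline:

```python
def quantize_rows(mat):
    """Per-token: one scale per activation row."""
    scales = [max(abs(v) for v in row) / 127 or 1.0 for row in mat]
    return [[round(v / s) for v in row] for row, s in zip(mat, scales)], scales

def quantize_cols(mat):
    """Per-channel: one scale per weight column."""
    scales = [max(abs(v) for v in col) / 127 or 1.0 for col in zip(*mat)]
    return [[round(v / scales[j]) for j, v in enumerate(row)]
            for row in mat], scales

def int8_gemm(a, a_scales, b, b_scales):
    """Integer matmul; each output element is rescaled by its row and
    column scales, so accumulation stays in integers."""
    return [[sum(x * y for x, y in zip(row, col)) * a_scales[i] * b_scales[j]
             for j, col in enumerate(zip(*b))]
            for i, row in enumerate(a)]
```

Per-token scales absorb activation outliers row by row, and per-channel scales do the same for weights, which is why the mix preserves accuracy better than a single tensor-wide scale.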

Optimization 7 – INT8 vs. INT4 vs. FP8: INT8 offers a 2× reduction in weight loading and compute time without de‑quantization overhead, outperforming INT4 at typical server batch sizes, since INT4 weights must be de‑quantized before each GEMM and that cost grows with batch size. FP8 is supported on newer GPUs (e.g., H100) and provides higher throughput with modest accuracy loss.

Optimization 8 – Non‑Linear Quantization (INT4‑based NF4): Future work targets weight‑only non‑linear quantization for edge devices where batch size = 1.

Q&A Highlights: Answers clarify why Decoding Attention avoids the softmax bottleneck, how INT4 de‑quantization cost scales with batch size, and that KV‑cache de‑quantization is effectively hidden by memory latency.

All referenced code and data are open‑sourced on GitHub and a shared drive.

Tags: performance optimization, LLM, quantization, latency, throughput, GPU, inference
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
