Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies
The article outlines how exploding LLM sizes create compute, memory, and latency bottlenecks, and it proposes a full-stack response (operator fusion, high-performance libraries, quantization, speculative decoding, sharding, continuous batching, PagedAttention, and specialized frameworks such as MindIE-LLM) to boost inference throughput and reduce latency, closing with future directions in ultra-low-bit quantization and heterogeneous hardware.
At the 2024 Global Machine Learning Conference, the rapid growth of large language models (LLMs) and the resulting inference bottlenecks attracted widespread attention. The increasing number of parameters and the complexity of LLMs lead to high computational cost, large memory consumption, and latency issues during the inference phase, especially under limited hardware resources.
The challenges of LLM inference are multi-faceted: (1) high compute and memory demand, exemplified by models such as LLaMA-2-70B, whose weights alone exceed the memory of a single accelerator and require multiple high-end GPUs (a rough estimate follows below); (2) a trade-off between latency and throughput, driven by the imbalance between the compute-bound Prefill stage and the memory-bound Decode stage; (3) additional cost when extending from single-modal to multi-modal tasks (e.g., video or audio processing); and (4) low utilization of compute resources during auto-regressive decoding, which produces only one token per forward pass.
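To make the first challenge concrete, here is a back-of-the-envelope estimate (not from the article) of the memory needed just to hold LLaMA-2-70B weights in FP16, plus a modest KV cache; exact figures depend on sequence length, batch size, and the serving stack.

```python
# Approximate memory footprint for serving LLaMA-2-70B in FP16.
params = 70e9                        # parameter count
weight_bytes = params * 2            # FP16: 2 bytes per parameter
print(f"weights: {weight_bytes / 2**30:.0f} GiB")   # ~130 GiB, i.e. more than one GPU

# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes.
# LLaMA-2-70B uses 80 layers, 8 KV heads (GQA), head_dim 128.
kv_per_token = 2 * 80 * 8 * 128 * 2
seq_len, batch = 4096, 16
kv_bytes = kv_per_token * seq_len * batch
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")       # ~20 GiB at this batch and length
```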
To address these challenges, researchers propose optimizations at three levels of the stack:
1. Operator‑level optimizations
• Operator Fusion : fusing adjacent operators (e.g., attention and softmax as in FlashAttention, KV-cache updates, LayerNorm) into single kernels to cut redundant memory accesses (a short sketch follows this list).
• High‑Performance Acceleration Libraries : using ONNX Runtime, TVM, cuBLAS, FasterTransformer, etc., to speed up common neural‑network kernels.
• Layer Fusion : combining all the operations of an attention layer (multi-head, Grouped-Query, or Multi-Query attention) into a single kernel to cut data movement.
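As an illustration of the kind of fusion described above, the sketch below contrasts a naive attention built from separate matmul and softmax kernels, which materializes the full score matrix in memory, with PyTorch's fused scaled_dot_product_attention kernel (FlashAttention-style). PyTorch 2.x is assumed; the example is illustrative rather than taken from the article.

```python
import math
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# Unfused: three separate ops, with the (seq_len x seq_len) score matrix
# written to and read back from memory between kernels.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
naive = torch.softmax(scores, dim=-1) @ v

# Fused: one kernel, no materialized score matrix.
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))   # same result, far less memory traffic
```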
2. Algorithm‑level optimizations
• Quantization : applying SmoothQuant, AWQ, GPTQ, etc., to reduce weight/activation precision to 8‑bit or lower.
• Speculative Decoding : using a lightweight draft model or auxiliary heads (e.g., EAGLE, Medusa) to propose several tokens that the large model then verifies in a single forward pass, reducing the number of expensive decode steps (see the sketch after this list).
• Sharding : partitioning the model across multiple devices to alleviate memory pressure.
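The control flow of speculative decoding can be sketched as follows. Here draft_model and target_model are hypothetical callables that return next-token logits, and greedy acceptance is used for simplicity; real systems such as EAGLE or Medusa add probabilistic verification and score all draft positions in one batched forward pass of the large model.

```python
import numpy as np

def speculative_step(tokens, draft_model, target_model, k=4):
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft = list(tokens)
    proposed = []
    for _ in range(k):
        nxt = int(np.argmax(draft_model(draft)))
        proposed.append(nxt)
        draft.append(nxt)

    # 2. The large model checks the proposals (shown sequentially here;
    #    a real implementation scores all k positions in one forward pass).
    accepted, context = [], list(tokens)
    for tok in proposed:
        target_choice = int(np.argmax(target_model(context)))
        if target_choice != tok:
            accepted.append(target_choice)   # take the target's own token and stop
            break
        accepted.append(tok)
        context.append(tok)
    return tokens + accepted

# Toy demo: both models predict (last_token + 1) mod 10, so every draft is accepted.
toy = lambda seq: np.eye(10)[(seq[-1] + 1) % 10]
print(speculative_step([1, 2, 3], toy, toy))   # -> [1, 2, 3, 4, 5, 6, 7]
```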
3. Framework‑level optimizations
• Continuous Batching : admitting new requests into the running batch as soon as earlier ones finish, rather than waiting for a whole static batch to complete, which keeps the accelerator busy.
• PagedAttention : mapping logical KV-cache blocks to physical memory blocks, analogous to virtual-memory paging, to reduce memory fragmentation (see the sketch after this list).
• TensorRT‑LLM and MindIE‑LLM : supporting various attention types (MHA, MQA, GQA) and pipeline/layer parallelism.
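A minimal sketch of the block-table idea behind PagedAttention, assuming a fixed block size and a shared pool of physical blocks; the names are illustrative and do not correspond to any particular framework's API.

```python
BLOCK_SIZE = 16  # tokens per KV block

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks        # pool of physical block ids
        self.table = {}                       # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a new block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full (or first token)
            self.table.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def locate(self, seq_id, token_idx):
        """Translate a logical token position into (physical_block, offset)."""
        return self.table[seq_id][token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

bt = BlockTable(free_blocks=list(range(64)))
for _ in range(20):
    bt.append_token("req-0")
print(bt.locate("req-0", 17))   # token 17 lives in the sequence's second block, offset 1
```

Because blocks are allocated on demand and returned as soon as a sequence finishes, unused capacity is never reserved for the worst-case sequence length, which is where the fragmentation savings come from.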
Case Study – MindIE‑LLM
The MindIE‑LLM framework (Huawei Ascend) provides a Python API and a C++ scheduler, integrating Continuous Batching, PagedAttention, FlashAttention, FlashDecoding, SplitFuse, Prefill/Decode (PD) separation, and multi‑node communication‑computation fusion. Reported results include a 3‑4× throughput improvement from Continuous Batching, a significant latency reduction from SplitFuse, and up to 80% of communication time saved in multi‑node setups.
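For intuition on where the Continuous Batching gains come from, the following is a generic iteration-level scheduling loop; it illustrates the concept only and is not MindIE-LLM's API.

```python
from collections import deque

def serve(requests, step_fn, max_batch=8):
    """`requests` are dicts with a 'remaining' token budget;
    `step_fn(batch)` decodes one token for every request in the batch."""
    waiting, running, finished = deque(requests), [], []
    while waiting or running:
        # Admit waiting requests into any free slots before the next decode step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        step_fn(running)                      # one decode iteration for the whole batch
        for req in running:
            req["remaining"] -= 1
        # Evict finished requests immediately; their slots are reused next iteration.
        finished += [r for r in running if r["remaining"] == 0]
        running = [r for r in running if r["remaining"] > 0]
    return finished

reqs = [{"id": i, "remaining": 3 + i % 5} for i in range(20)]
done = serve(reqs, step_fn=lambda batch: None)   # placeholder decode step
print(len(done))   # 20
```

Because finished requests free their slots at every iteration, short requests no longer wait for the longest request in a static batch, which is what drives the reported throughput improvement.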
Summary & Outlook
LLM inference acceleration is a full‑stack system engineering problem that requires coordinated optimizations across operators, algorithms, frameworks, and hardware. Future directions include ultra‑low‑bit quantization (4‑bit or lower), dedicated hardware for Prefill and Decode, heterogeneous acceleration (CPU, GPU, FPGA, TPU), intelligent dynamic scheduling, and distributed/edge inference to further lower cost and latency.