Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM: Challenges, Techniques, and Optimizations
The talk outlines how Baidu’s Baige AIAK‑LLM suite tackles the exploding compute demands of trillion‑parameter models by raising Model FLOPS Utilization through advanced parallelism, memory‑saving recompute, ZeRO‑Offload, adaptive scheduling, and cross‑chip orchestration, delivering speedups of more than 30% for training and 60% for inference alongside a unified cloud product.
This document summarizes the public talk "Baige AIAK‑LLM: Large‑Model Training and Inference Acceleration Practice" from the 2024 Baidu Create Conference (April 16). The presentation is divided into four parts: (1) challenges that large models pose to AI infrastructure, (2) the key performance metric Model FLOPS Utilization (MFU) and industry techniques to improve it, (3) Baidu Baige’s real‑world case studies that raise MFU to a high level, and (4) a product‑level overview of the AIAK‑LLM acceleration suite.
Background and demand: Model sizes double roughly every 1‑2 years, with upcoming models (e.g., GPT‑5) projected to reach 100 trillion parameters. Data volume grows proportionally, leading to massive compute requirements. Training costs scale with model size, data amount, effective compute efficiency, and hardware cost. Large‑scale, stable, and efficient AI clusters are required to support continuous training and iteration.
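As a rough sanity check on these scaling claims, the widely used C ≈ 6·N·D approximation (a standard rule of thumb, not from the talk) relates parameter count N and token count D to total training FLOPs. The cluster numbers below are hypothetical:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute with the common
    C ~= 6 * N * D rule of thumb (forward + backward pass)."""
    return 6.0 * n_params * n_tokens

def training_days(n_params, n_tokens, n_gpus, peak_flops, mfu):
    """Wall-clock days at a given cluster size and Model FLOPS Utilization."""
    achieved = n_gpus * peak_flops * mfu          # sustained cluster FLOPS
    return training_flops(n_params, n_tokens) / achieved / 86_400

# Hypothetical run: 175B params, 1.8T tokens, 1024 GPUs at
# 315 TFLOPS peak (A800), 50% MFU -> ~136 days
days = training_days(175e9, 1.8e12, 1024, 315e12, 0.5)
```

Doubling the MFU halves the wall‑clock time at fixed hardware cost, which is why the rest of the talk centers on this one metric.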
MFU (Model FLOPS Utilization) is defined as the ratio of actual FLOPS achieved to the theoretical peak FLOPS of a chip. For example, on an A800 GPU, a throughput of 100 TFLOPS against a peak of 315 TFLOPS yields an MFU of ~32%. Ideal MFU values are around 75%+ for training and 30%+ for inference on A800, based on GEMM‑only performance.
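The definition is a single ratio; the A800 figures below are the ones quoted above:

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPS Utilization: achieved throughput / theoretical peak."""
    return achieved_tflops / peak_tflops

# The A800 example from the talk: 100 TFLOPS achieved vs 315 TFLOPS peak
print(f"{mfu(100, 315):.1%}")  # -> 31.7%
```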
Industry parallelism strategies are described using the BSHL dimensions:
BatchSize (B) – data parallelism.
Sequence length (S) – sequence parallelism.
HiddenSize (H) – tensor parallelism.
Layer count (L) – pipeline parallelism.
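These four degrees compose multiplicatively: their product must equal the number of GPUs. A minimal sketch with hypothetical degrees (treating sequence parallelism as an independent context‑parallel degree; Megatron‑style sequence parallelism instead reuses the tensor‑parallel group):

```python
# Each BSHL dimension maps to one parallel degree; the degrees
# multiply together to give the total GPU count (world size).
def world_size(dp: int, sp: int, tp: int, pp: int) -> int:
    """dp: data (B), sp: sequence (S), tp: tensor (H), pp: pipeline (L)."""
    return dp * sp * tp * pp

# Hypothetical 256-GPU layout: 8-way data x 1 sequence x 8 tensor x 4 pipeline
assert world_size(8, 1, 8, 4) == 256
```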
Advanced tensor‑parallel schemes (2D, 3D) and pipeline variants (1F1B, interleaved PP, zero‑bubble) are also mentioned, as are memory‑optimisation techniques such as ZeRO‑1/2/3, recompute (full, block, selective), and ZeRO‑Offload (moving optimizer state, parameters, and gradients to CPU memory).
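The ZeRO stages shard, successively, optimizer state, gradients, and parameters across the N data‑parallel ranks. A sketch of the per‑GPU memory model from the ZeRO paper, assuming mixed‑precision Adam (2 bytes of fp16 parameters + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer state per parameter):

```python
def zero_bytes_per_param(stage: int, n: int) -> float:
    """Per-GPU model-state memory, in bytes per parameter, for
    mixed-precision Adam sharded across n data-parallel ranks."""
    params, grads, optim = 2.0, 2.0, 12.0
    if stage == 0:                       # plain data parallelism, no sharding
        return params + grads + optim
    if stage == 1:                       # ZeRO-1: shard optimizer state
        return params + grads + optim / n
    if stage == 2:                       # ZeRO-2: also shard gradients
        return params + grads / n + optim / n
    if stage == 3:                       # ZeRO-3: also shard parameters
        return (params + grads + optim) / n
    raise ValueError(f"unknown ZeRO stage {stage}")
```

For a 7B‑parameter model on 64 GPUs this gives 112 GB of model state per GPU under plain data parallelism versus ~1.75 GB under ZeRO‑3, which is why stage 3 (and CPU offload on top of it) enables much larger models per device.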
Training optimizations highlighted include:
Overlapping computation and communication for Tensor‑Parallel (TP) to reduce TP communication overhead from ~10% to ~2%.
Hybrid recompute strategies that combine full‑block and selective recompute to balance memory saving and extra compute.
ZeRO‑Offload to move optimizer state to host memory, reducing the need for recompute and freeing GPU memory for larger TP/PP configurations.
An adaptive configuration tool that enumerates all possible parallel strategies, predicts compute, memory, and communication costs, and selects the optimal setting within minutes.
Multi‑chip (GPU, Kunlun, Ascend) unified scheduling and accelerator abstraction to achieve high MFU across heterogeneous hardware.
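The adaptive configuration tool described above can be sketched as a brute‑force search: enumerate every (data, tensor, pipeline) factorization of the cluster, filter by a memory model, and rank by a predicted step cost. The cost and memory functions below are illustrative stand‑ins, not Baige's actual predictors:

```python
def candidate_configs(n_gpus: int):
    """All (dp, tp, pp) factorizations of the cluster size."""
    for tp in (1, 2, 4, 8):
        for pp in range(1, n_gpus + 1):
            if n_gpus % (tp * pp) == 0:
                yield n_gpus // (tp * pp), tp, pp

def predicted_cost(dp, tp, pp):
    """Toy stand-in cost model: parallel compute time plus
    tensor-parallel communication and pipeline-bubble overheads."""
    compute = 1.0 / (dp * tp * pp)       # perfectly parallel compute
    tp_comm = 0.03 * (tp - 1)            # TP all-reduce overhead
    bubble = 0.01 * (pp - 1)             # pipeline bubble overhead
    return compute + tp_comm + bubble

def fits_memory(dp, tp, pp, mem_per_gpu_gb=80, model_gb=320):
    """Model state must fit once split across the tp * pp model ranks."""
    return model_gb / (tp * pp) <= mem_per_gpu_gb

def best_config(n_gpus):
    feasible = [c for c in candidate_configs(n_gpus) if fits_memory(*c)]
    return min(feasible, key=lambda c: predicted_cost(*c))
```

With real cost predictors calibrated per chip and per model, the same enumerate‑predict‑select loop finishes in minutes, which matches the tool's description in the talk.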
Experimental results show training MFU reaching ~60% on 32‑ to 256‑card clusters (versus a ~30% baseline), and inference MFU rising from ~10% to more than 30% after optimization. Overall speedups of more than 30% for training and 60% for inference are reported across various model sizes.
Inference optimizations address two main issues: long gaps between generated tokens (inter‑token latency) and low GEMM MFU. Solutions include:
Moving sampling and other lightweight operations to GPU.
Parallelizing post‑processing steps (to‑text, to‑client) and redesigning the scheduler for concurrent execution.
Re‑implementing the scheduler in C++ to enable slot‑level parallelism.
Improving GEMM efficiency by handling small‑m dimensions and using a “small‑model‑first” approach to generate multiple tokens in parallel, achieving up to 60 % latency reduction in low‑latency scenarios.
Eliminating padding overhead via 1‑D sequence layout (sequence expansion) and providing extensible hooks for tokenization, preprocessing, and post‑processing.
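The padding‑free layout in the last point can be sketched as packing variable‑length sequences into one flat token buffer with cumulative‑length offsets (the `cu_seqlens` convention used by FlashAttention‑style variable‑length kernels); the batch contents here are hypothetical:

```python
def pack_sequences(seqs):
    """Concatenate variable-length token sequences into one 1-D buffer
    plus cumulative-length offsets, so no padding tokens are computed."""
    flat, cu_seqlens = [], [0]
    for s in seqs:
        flat.extend(s)
        cu_seqlens.append(cu_seqlens[-1] + len(s))
    return flat, cu_seqlens

def unpack(flat, cu_seqlens):
    """Recover the original sequences from the packed layout."""
    return [flat[a:b] for a, b in zip(cu_seqlens, cu_seqlens[1:])]

batch = [[1, 2, 3], [4], [5, 6]]
flat, offs = pack_sequences(batch)
# flat == [1, 2, 3, 4, 5, 6]; offs == [0, 3, 4, 6]
assert unpack(flat, offs) == batch
```

With padding, this three‑sequence batch would occupy 3 × 3 = 9 slots; packed, it occupies 6, and the kernel uses the offsets to keep attention within each sequence.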
The presentation concludes with a product overview: Baidu Baige AIAK‑LLM is integrated into Baidu Intelligent Cloud (version 3.0) and offers three layers – resource, component, and model acceleration. It provides seamless integration with Hugging Face, checkpoint conversion tools, precision‑alignment utilities, and performance analysis dashboards, enabling customers to achieve “AI for all” with cost‑effective compute.
Baidu Geek Talk