Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM: Challenges, Techniques, and Optimizations
The talk outlines how Baidu’s Baige AIAK‑LLM suite tackles the exploding compute demands of trillion‑parameter models by raising Model FLOPS Utilization through advanced parallelism, memory‑saving recompute, ZeRO‑Offload, adaptive scheduling, and cross‑chip orchestration, delivering speedups of more than 30% for training and 60% for inference alongside a unified cloud product.
This document summarizes the public talk "Baige AIAK‑LLM: Large‑Model Training and Inference Acceleration Practice" from the 2024 Baidu Create Conference (April 16). The presentation is divided into four parts: (1) challenges that large models pose to AI infrastructure, (2) the key performance metric Model FLOPS Utilization (MFU) and industry techniques to improve it, (3) Baidu Baige’s real‑world case studies that raise MFU to a high level, and (4) a product‑level overview of the AIAK‑LLM acceleration suite.
Background and demand: Model sizes double roughly every 1‑2 years, with upcoming models (e.g., GPT‑5) projected to reach 100 trillion parameters. Data volume grows proportionally, leading to massive compute requirements. Training costs scale with model size, data amount, effective compute efficiency, and hardware cost. Large‑scale, stable, and efficient AI clusters are required to support continuous training and iteration.
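As a rough sanity check on these scaling claims, the widely used C ≈ 6·N·D approximation (a standard rule of thumb, not from the talk) relates parameter count N and token count D to total training FLOPs. The cluster numbers below are hypothetical:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute with the common
    C ~= 6 * N * D rule of thumb (forward + backward pass)."""
    return 6.0 * n_params * n_tokens

def training_days(n_params, n_tokens, n_gpus, peak_flops, mfu):
    """Wall-clock days at a given cluster size and Model FLOPS Utilization."""
    achieved = n_gpus * peak_flops * mfu          # sustained cluster FLOPS
    return training_flops(n_params, n_tokens) / achieved / 86_400

# Hypothetical run: 175B params, 1.8T tokens, 1024 GPUs at
# 315 TFLOPS peak (A800), 50% MFU -> ~136 days
days = training_days(175e9, 1.8e12, 1024, 315e12, 0.5)
```

Doubling the MFU halves the wall‑clock time at fixed hardware cost, which is why the rest of the talk centers on this one metric.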
MFU (Model FLOPS Utilization) is defined as the ratio of actual FLOPS achieved to the theoretical peak FLOPS of a chip. For example, on an A800 GPU, a throughput of 100 TFLOPS against a peak of 315 TFLOPS yields an MFU of ~32%. Ideal MFU values are around 75%+ for training and 30%+ for inference on A800, based on GEMM‑only performance.
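The definition is a single ratio; the A800 figures below are the ones quoted above:

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPS Utilization: achieved throughput / theoretical peak."""
    return achieved_tflops / peak_tflops

# The A800 example from the talk: 100 TFLOPS achieved vs 315 TFLOPS peak
print(f"{mfu(100, 315):.1%}")  # -> 31.7%
```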
Industry parallelism strategies are described using the BSHL dimensions:
BatchSize (B) – data parallelism.
Sequence length (S) – sequence parallelism.
HiddenSize (H) – tensor parallelism.
Layer count (L) – pipeline parallelism.
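These four degrees compose multiplicatively: their product must equal the number of GPUs. A minimal sketch with hypothetical degrees (treating sequence parallelism as an independent context‑parallel degree; Megatron‑style sequence parallelism instead reuses the tensor‑parallel group):

```python
# Each BSHL dimension maps to one parallel degree; the degrees
# multiply together to give the total GPU count (world size).
def world_size(dp: int, sp: int, tp: int, pp: int) -> int:
    """dp: data (B), sp: sequence (S), tp: tensor (H), pp: pipeline (L)."""
    return dp * sp * tp * pp

# Hypothetical 256-GPU layout: 8-way data x 1 sequence x 8 tensor x 4 pipeline
assert world_size(8, 1, 8, 4) == 256
```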
Advanced tensor‑parallel schemes (2D, 3D) and pipeline variants (1F1B, interleaved PP, zero‑bubble) are also mentioned, as are memory‑optimisation techniques such as ZeRO‑1/2/3, recompute (full, block, selective), and ZeRO‑Offload (moving optimizer state, parameters, and gradients to CPU memory).
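The ZeRO stages shard, successively, optimizer state, gradients, and parameters across the N data‑parallel ranks. A sketch of the per‑GPU memory model from the ZeRO paper, assuming mixed‑precision Adam (2 bytes of fp16 parameters + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer state per parameter):

```python
def zero_bytes_per_param(stage: int, n: int) -> float:
    """Per-GPU model-state memory, in bytes per parameter, for
    mixed-precision Adam sharded across n data-parallel ranks."""
    params, grads, optim = 2.0, 2.0, 12.0
    if stage == 0:                       # plain data parallelism, no sharding
        return params + grads + optim
    if stage == 1:                       # ZeRO-1: shard optimizer state
        return params + grads + optim / n
    if stage == 2:                       # ZeRO-2: also shard gradients
        return params + grads / n + optim / n
    if stage == 3:                       # ZeRO-3: also shard parameters
        return (params + grads + optim) / n
    raise ValueError(f"unknown ZeRO stage {stage}")
```

For a 7B‑parameter model on 64 GPUs this gives 112 GB of model state per GPU under plain data parallelism versus ~1.75 GB under ZeRO‑3, which is why stage 3 (and CPU offload on top of it) enables much larger models per device.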
Training optimizations highlighted include:
Overlapping computation and communication for Tensor‑Parallel (TP) to reduce TP communication overhead from ~10% to ~2%.
Hybrid recompute strategies that combine full‑block and selective recompute to balance memory saving and extra compute.
ZeRO‑Offload to move optimizer state to host memory, reducing the need for recompute and freeing GPU memory for larger TP/PP configurations.
An adaptive configuration tool that enumerates all possible parallel strategies, predicts compute, memory, and communication costs, and selects the optimal setting within minutes.
Multi‑chip (GPU, Kunlun, Ascend) unified scheduling and accelerator abstraction to achieve high MFU across heterogeneous hardware.
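The adaptive configuration tool described above can be sketched as a brute‑force search: enumerate every (data, tensor, pipeline) factorization of the cluster, filter by a memory model, and rank by a predicted step cost. The cost and memory functions below are illustrative stand‑ins, not Baige's actual predictors:

```python
def candidate_configs(n_gpus: int):
    """All (dp, tp, pp) factorizations of the cluster size."""
    for tp in (1, 2, 4, 8):
        for pp in range(1, n_gpus + 1):
            if n_gpus % (tp * pp) == 0:
                yield n_gpus // (tp * pp), tp, pp

def predicted_cost(dp, tp, pp):
    """Toy stand-in cost model: parallel compute time plus
    tensor-parallel communication and pipeline-bubble overheads."""
    compute = 1.0 / (dp * tp * pp)       # perfectly parallel compute
    tp_comm = 0.03 * (tp - 1)            # TP all-reduce overhead
    bubble = 0.01 * (pp - 1)             # pipeline bubble overhead
    return compute + tp_comm + bubble

def fits_memory(dp, tp, pp, mem_per_gpu_gb=80, model_gb=320):
    """Model state must fit once split across the tp * pp model ranks."""
    return model_gb / (tp * pp) <= mem_per_gpu_gb

def best_config(n_gpus):
    feasible = [c for c in candidate_configs(n_gpus) if fits_memory(*c)]
    return min(feasible, key=lambda c: predicted_cost(*c))
```

With real cost predictors calibrated per chip and per model, the same enumerate‑predict‑select loop finishes in minutes, which matches the tool's description in the talk.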
Experimental results show training MFU reaching ~60% on 32‑ to 256‑card clusters (versus a ~30% baseline), and inference MFU rising from ~10% to more than 30% after optimization. Overall speedups of more than 30% for training and 60% for inference are reported across various model sizes.
Inference optimizations address two main issues: long gaps between generated tokens (inter‑token latency) and low GEMM MFU. Solutions include:
Moving sampling and other lightweight operations to GPU.
Parallelizing post‑processing steps (to‑text, to‑client) and redesigning the scheduler for concurrent execution.
Re‑implementing the scheduler in C++ to enable slot‑level parallelism.
Improving GEMM efficiency by handling small‑m dimensions and using a “small‑model‑first” approach to generate multiple tokens in parallel, achieving up to 60 % latency reduction in low‑latency scenarios.
Eliminating padding overhead via 1‑D sequence layout (sequence expansion) and providing extensible hooks for tokenization, preprocessing, and post‑processing.
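The padding‑free layout in the last point can be sketched as packing variable‑length sequences into one flat token buffer with cumulative‑length offsets (the `cu_seqlens` convention used by FlashAttention‑style variable‑length kernels); the batch contents here are hypothetical:

```python
def pack_sequences(seqs):
    """Concatenate variable-length token sequences into one 1-D buffer
    plus cumulative-length offsets, so no padding tokens are computed."""
    flat, cu_seqlens = [], [0]
    for s in seqs:
        flat.extend(s)
        cu_seqlens.append(cu_seqlens[-1] + len(s))
    return flat, cu_seqlens

def unpack(flat, cu_seqlens):
    """Recover the original sequences from the packed layout."""
    return [flat[a:b] for a, b in zip(cu_seqlens, cu_seqlens[1:])]

batch = [[1, 2, 3], [4], [5, 6]]
flat, offs = pack_sequences(batch)
# flat == [1, 2, 3, 4, 5, 6]; offs == [0, 3, 4, 6]
assert unpack(flat, offs) == batch
```

With padding, this three‑sequence batch would occupy 3 × 3 = 9 slots; packed, it occupies 6, and the kernel uses the offsets to keep attention within each sequence.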
The presentation concludes with a product overview: Baidu Baige AIAK‑LLM is integrated into Baidu Intelligent Cloud (version 3.0) and offers three layers – resource, component, and model acceleration. It provides seamless integration with Hugging Face, checkpoint conversion tools, precision‑alignment utilities, and performance analysis dashboards, enabling customers to achieve “AI for all” with cost‑effective compute.
Baidu Geek Talk