Infrastructure Challenges and Solutions for Large‑Scale AI Model Training
The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.
This article, derived from the AI Infrastructure session of QCon 2023 (Beijing), discusses how the emergence of large language models (LLMs) such as GPT‑3 (175 billion parameters) and Baidu’s Wenxin model (260 billion parameters) creates unprecedented challenges for computing and storage infrastructure.
Key challenges highlighted include the "compute wall" (e.g., training GPT‑3 requires roughly 314 ZFLOPs of total compute, while a single NVIDIA A100 sustains at most 312 TFLOPS of peak throughput) and the "storage wall" (model parameters and intermediate states can require terabytes of memory, far beyond a single GPU's 80 GB VRAM).
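The scale of the compute wall can be checked with back-of-envelope arithmetic, dividing total training FLOPs by a single GPU's peak rate (a rough sketch; real sustained utilization is well below peak, so the true number is even larger):

```python
# Back-of-envelope estimate: time to train GPT-3 on one A100 at peak speed.
TOTAL_TRAIN_FLOPS = 314e21   # ~314 ZFLOPs for one GPT-3 training run
A100_PEAK_FLOPS = 312e12     # 312 TFLOPS (A100 dense FP16/BF16 tensor-core peak)

seconds = TOTAL_TRAIN_FLOPS / A100_PEAK_FLOPS
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")  # roughly 32 years, even assuming 100% of peak
```

At about 32 years for a single GPU, the only way through the wall is massive parallelism across thousands of devices.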
The talk presents Baidu Intelligent Cloud’s full‑stack infrastructure, organized into four layers: model layer (frameworks such as PaddlePaddle, Fleet, PyTorch + DeepSpeed/Megatron), acceleration libraries (AI operators, communication kernels), resource/cluster management, and hardware resources (GPUs, interconnects).
To break the compute and storage walls, several parallelism strategies are described:
Pipeline parallelism – splitting model layers across GPUs to form a pipeline.
Tensor (model) parallelism – partitioning large matrix multiplications across GPUs.
Grouped parameter sharding (ZeRO‑style) – partitioning optimizer states, gradients, and parameters across data‑parallel ranks to eliminate per‑GPU redundancy.
Conditional computation (gating) – activating only a subset of expert sub‑networks.
Mixture‑of‑Experts (MoE) – routing tokens to different expert models, which introduces All2All communication.
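The MoE routing step in the last two bullets can be sketched in a few lines of NumPy. This is a toy top‑1 gate on a single device; the shapes and the per‑expert weights are illustrative assumptions, and the grouping step marked below is what becomes an All2All exchange when experts live on different GPUs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 8, 16, 4

tokens = rng.standard_normal((num_tokens, d_model))
w_gate = rng.standard_normal((d_model, num_experts))
# One tiny linear "expert" per slot (hypothetical shapes, illustration only).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]

# Gating (conditional computation): each token activates only its top-1 expert.
logits = tokens @ w_gate
assignment = logits.argmax(axis=1)

# Dispatch: grouping tokens by expert. In a distributed MoE the experts sit
# on different GPUs, so this step is an All2All exchange across devices.
output = np.empty_like(tokens)
for e in range(num_experts):
    idx = np.where(assignment == e)[0]
    if idx.size:
        output[idx] = tokens[idx] @ experts[e]
```

Note that each token runs through exactly one expert's weights, which is why MoE grows parameter count much faster than per‑token compute.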
Real‑world practice is illustrated with a 260 billion‑parameter Transformer trained on Baidu’s platform using a 4‑D hybrid parallelism (pipeline + tensor + data + group‑parameter slicing). The configuration employs eight A100 80 GB GPUs per node, multi‑node CLOS network topology, and careful PCIe/NVLink placement to minimize latency.
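One way to picture 4‑D hybrid parallelism is as a mapping from each global rank to a coordinate along the four axes. The sketch below uses an illustrative axis ordering and illustrative degrees (tp=8, sh=2, dp=2, pp=4 over 128 GPUs); real frameworks choose their own ordering, but the common convention of placing tensor parallelism innermost keeps those heavily communicating peers on one node's NVLink, matching the PCIe/NVLink placement concern above:

```python
def rank_to_coords(rank, tp=8, sh=2, dp=2, pp=4):
    """Map a global rank to hypothetical 4-D parallel coordinates.

    Axis order (tensor innermost, then sharding, then data, then
    pipeline) is an illustrative choice; frameworks differ.
    """
    tp_rank = rank % tp                  # tensor-parallel peer within a node
    sh_rank = (rank // tp) % sh          # grouped-parameter-sharding group
    dp_rank = (rank // (tp * sh)) % dp   # data-parallel replica
    pp_rank = rank // (tp * sh * dp)     # pipeline stage
    return pp_rank, dp_rank, sh_rank, tp_rank

# Ranks 0..7 differ only in tp_rank, so with 8 GPUs per node they
# share NVLink for the most bandwidth-hungry (tensor-parallel) traffic.
print(rank_to_coords(0), rank_to_coords(7), rank_to_coords(127))
```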
Network design details include a three‑tier CLOS architecture (Unit, Leaf, Spine) that reduces hop count for same‑rank GPU communication and supports up to 3,200 GPUs (or 16,000 with InfiniBand). Optimizations for AllReduce and All2All traffic, such as source‑port hashing to avoid RoCE hash collisions and leveraging NVLink‑based Rail‑Local All2All, are discussed.
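The RoCE hash-collision problem comes from ECMP: switches pick an uplink by hashing a flow's 5‑tuple, and RoCE v2 traffic all targets UDP port 4791, so flows between the same host pair can land on the same link. The toy model below uses SHA‑1 as a stand‑in for a switch's hash (real hardware uses CRC/XOR‑style hashes, and the addresses here are invented):

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, num_paths=8):
    """Toy ECMP: pick an uplink by hashing the flow 5-tuple.
    SHA-1 stands in for a switch's proprietary hash function."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/udp".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % num_paths

# Identical 5-tuples: every flow between this host pair collides onto
# one path, no matter how many queue pairs carry traffic.
same_pair = {ecmp_path("10.0.0.1", "10.0.1.1", 4791, 4791) for _ in range(8)}

# Varying the source port per flow re-rolls the hash, letting ECMP
# spread the flows across uplinks -- the source-port-hashing fix above.
varied = {ecmp_path("10.0.0.1", "10.0.1.1", 50000 + i, 4791) for i in range(8)}
print(len(same_pair), len(varied))
```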
On the software side, the article covers static‑graph capture, AST‑based code replacement, and tracing (TorchDynamo) to convert dynamic Python models into optimized static graphs. Operator‑level optimizations include hand‑written kernels (cuBLAS/cuDNN), template libraries (CUTLASS), and compiler‑based approaches (Halide, TVM). Operator fusion (e.g., merging GEMM with elementwise ops) and kernel density improvements are emphasized.
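The AST‑based code replacement idea can be illustrated with Python's standard `ast` module. This toy pass rewrites `relu(a + b)` into a call to a hypothetical fused kernel, `fused_add_relu` (both function names are assumptions for illustration; it is not PaddlePaddle's or TorchDynamo's actual transform, but it shows the mechanism of rewriting source-level patterns into optimized operators):

```python
import ast

class FuseAddRelu(ast.NodeTransformer):
    """Toy compiler pass: rewrite relu(a + b) into a single call to a
    hypothetical fused kernel, fused_add_relu(a, b)."""
    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested expressions first
        if (isinstance(node.func, ast.Name) and node.func.id == "relu"
                and len(node.args) == 1
                and isinstance(node.args[0], ast.BinOp)
                and isinstance(node.args[0].op, ast.Add)):
            return ast.Call(
                func=ast.Name(id="fused_add_relu", ctx=ast.Load()),
                args=[node.args[0].left, node.args[0].right],
                keywords=[],
            )
        return node

src = "y = relu(x + b)"
tree = ast.fix_missing_locations(FuseAddRelu().visit(ast.parse(src)))
print(ast.unparse(tree))  # y = fused_add_relu(x, b)
```

The payoff of such a fusion is fewer kernel launches and one less round trip of the intermediate `x + b` through GPU memory, which is exactly the kernel-density improvement the article emphasizes.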
Communication optimizations extend beyond AllReduce to All2All, using NCCL’s Rail‑Local All2All and PXN to shift traffic to intra‑node NVLink, as well as enabling InfiniBand’s SHARP offload.
A cost‑model‑driven placement framework is introduced to map split model components to the most suitable hardware, achieving up to 2.1× performance gains. Future directions note continued parameter scaling (up to trillions), multimodal training, and heterogeneous accelerators, requiring unified end‑to‑end cost models and elastic scheduling.
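A minimal sketch of the cost‑model‑driven idea: enumerate candidate (tensor, pipeline, data) splits for a fixed GPU budget, score each with a cost function, and keep the cheapest. The weights below are invented placeholders, not Baidu's actual model, which would account for real interconnect bandwidths and operator profiles:

```python
from itertools import product

GPUS = 32

def cost(tp, pp, dp, comm_tp=4.0, bubble_pp=1.5, allreduce_dp=1.0):
    """Toy cost model (hypothetical weights): tensor parallelism adds heavy
    intra-layer communication, pipeline parallelism adds bubble overhead,
    data parallelism adds gradient-AllReduce volume."""
    return comm_tp * (tp - 1) + bubble_pp * (pp - 1) + allreduce_dp * (dp - 1)

# Enumerate all degree combinations that exactly fill the GPU budget.
candidates = [(tp, pp, dp)
              for tp, pp, dp in product([1, 2, 4, 8], repeat=3)
              if tp * pp * dp == GPUS]
best = min(candidates, key=lambda c: cost(*c))
print(best, cost(*best))
```

A production framework searches a far larger space (including sharding degree, micro-batch size, and device placement), but the structure — candidate generation plus a cost model plus argmin — is the same.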
All of these capabilities are integrated into Baidu's Baige (百度百舸) AI heterogeneous computing platform.
Baidu Geek Talk