Infrastructure Challenges and Solutions for Large‑Scale AI Model Training
The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.
This article, derived from the AI Infrastructure session of QCon 2023 (Beijing), discusses how the emergence of large language models (LLMs) such as GPT‑3 (175 billion parameters) and Baidu’s Wenxin model (260 billion parameters) creates unprecedented challenges for computing and storage infrastructure.
Key challenges highlighted include the "compute wall" (e.g., training GPT‑3 requires roughly 314 ZFLOPs of total compute, while a single NVIDIA A100 sustains at most 312 TFLOPS of peak throughput) and the "storage wall" (model parameters and intermediate states can require terabytes of memory, far beyond a single GPU's 80 GB VRAM).
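The scale of the compute wall can be checked with back-of-envelope arithmetic, dividing total training FLOPs by a single GPU's peak rate (a rough sketch; real sustained utilization is well below peak, so the true number is even larger):

```python
# Back-of-envelope estimate: time to train GPT-3 on one A100 at peak speed.
TOTAL_TRAIN_FLOPS = 314e21   # ~314 ZFLOPs for one GPT-3 training run
A100_PEAK_FLOPS = 312e12     # 312 TFLOPS (A100 dense FP16/BF16 tensor-core peak)

seconds = TOTAL_TRAIN_FLOPS / A100_PEAK_FLOPS
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")  # roughly 32 years, even assuming 100% of peak
```

At about 32 years for a single GPU, the only way through the wall is massive parallelism across thousands of devices.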
The talk presents Baidu Intelligent Cloud’s full‑stack infrastructure, organized into four layers: model layer (frameworks such as PaddlePaddle, Fleet, PyTorch + DeepSpeed/Megatron), acceleration libraries (AI operators, communication kernels), resource/cluster management, and hardware resources (GPUs, interconnects).
To break the compute and storage walls, several parallelism strategies are described:
Pipeline parallelism – splitting model layers across GPUs to form a pipeline.
Tensor (model) parallelism – partitioning large matrix multiplications across GPUs.
Grouped parameter sharding (ZeRO‑style) – partitioning optimizer states, gradients, and parameters across data‑parallel ranks to eliminate per‑GPU redundancy.
Conditional computation (gating) – activating only a subset of expert sub‑networks.
Mixture‑of‑Experts (MoE) – routing tokens to different expert models, which introduces All2All communication.
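The MoE routing step in the last two bullets can be sketched in a few lines of NumPy. This is a toy top‑1 gate on a single device; the shapes and the per‑expert weights are illustrative assumptions, and the grouping step marked below is what becomes an All2All exchange when experts live on different GPUs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 8, 16, 4

tokens = rng.standard_normal((num_tokens, d_model))
w_gate = rng.standard_normal((d_model, num_experts))
# One tiny linear "expert" per slot (hypothetical shapes, illustration only).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]

# Gating (conditional computation): each token activates only its top-1 expert.
logits = tokens @ w_gate
assignment = logits.argmax(axis=1)

# Dispatch: grouping tokens by expert. In a distributed MoE the experts sit
# on different GPUs, so this step is an All2All exchange across devices.
output = np.empty_like(tokens)
for e in range(num_experts):
    idx = np.where(assignment == e)[0]
    if idx.size:
        output[idx] = tokens[idx] @ experts[e]
```

Note that each token runs through exactly one expert's weights, which is why MoE grows parameter count much faster than per‑token compute.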
Real‑world practice is illustrated with a 260 billion‑parameter Transformer trained on Baidu’s platform using a 4‑D hybrid parallelism (pipeline + tensor + data + group‑parameter slicing). The configuration employs eight A100 80 GB GPUs per node, multi‑node CLOS network topology, and careful PCIe/NVLink placement to minimize latency.
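One way to picture 4‑D hybrid parallelism is as a mapping from each global rank to a coordinate along the four axes. The sketch below uses an illustrative axis ordering and illustrative degrees (tp=8, sh=2, dp=2, pp=4 over 128 GPUs); real frameworks choose their own ordering, but the common convention of placing tensor parallelism innermost keeps those heavily communicating peers on one node's NVLink, matching the PCIe/NVLink placement concern above:

```python
def rank_to_coords(rank, tp=8, sh=2, dp=2, pp=4):
    """Map a global rank to hypothetical 4-D parallel coordinates.

    Axis order (tensor innermost, then sharding, then data, then
    pipeline) is an illustrative choice; frameworks differ.
    """
    tp_rank = rank % tp                  # tensor-parallel peer within a node
    sh_rank = (rank // tp) % sh          # grouped-parameter-sharding group
    dp_rank = (rank // (tp * sh)) % dp   # data-parallel replica
    pp_rank = rank // (tp * sh * dp)     # pipeline stage
    return pp_rank, dp_rank, sh_rank, tp_rank

# Ranks 0..7 differ only in tp_rank, so with 8 GPUs per node they
# share NVLink for the most bandwidth-hungry (tensor-parallel) traffic.
print(rank_to_coords(0), rank_to_coords(7), rank_to_coords(127))
```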
Network design details include a three‑tier CLOS architecture (Unit, Leaf, Spine) that reduces hop count for same‑rank GPU communication and supports up to 3,200 GPUs (or 16,000 with InfiniBand). Optimizations for AllReduce and All2All traffic, such as source‑port hashing to avoid RoCE hash collisions and leveraging NVLink‑based Rail‑Local All2All, are discussed.
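The RoCE hash-collision problem comes from ECMP: switches pick an uplink by hashing a flow's 5‑tuple, and RoCE v2 traffic all targets UDP port 4791, so flows between the same host pair can land on the same link. The toy model below uses SHA‑1 as a stand‑in for a switch's hash (real hardware uses CRC/XOR‑style hashes, and the addresses here are invented):

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, num_paths=8):
    """Toy ECMP: pick an uplink by hashing the flow 5-tuple.
    SHA-1 stands in for a switch's proprietary hash function."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/udp".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % num_paths

# Identical 5-tuples: every flow between this host pair collides onto
# one path, no matter how many queue pairs carry traffic.
same_pair = {ecmp_path("10.0.0.1", "10.0.1.1", 4791, 4791) for _ in range(8)}

# Varying the source port per flow re-rolls the hash, letting ECMP
# spread the flows across uplinks -- the source-port-hashing fix above.
varied = {ecmp_path("10.0.0.1", "10.0.1.1", 50000 + i, 4791) for i in range(8)}
print(len(same_pair), len(varied))
```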
On the software side, the article covers static‑graph capture, AST‑based code replacement, and tracing (TorchDynamo) to convert dynamic Python models into optimized static graphs. Operator‑level optimizations include hand‑written kernels (cuBLAS/cuDNN), template libraries (CUTLASS), and compiler‑based approaches (Halide, TVM). Operator fusion (e.g., merging GEMM with elementwise ops) and kernel density improvements are emphasized.
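The AST‑based code replacement idea can be illustrated with Python's standard `ast` module. This toy pass rewrites `relu(a + b)` into a call to a hypothetical fused kernel, `fused_add_relu` (both function names are assumptions for illustration; it is not PaddlePaddle's or TorchDynamo's actual transform, but it shows the mechanism of rewriting source-level patterns into optimized operators):

```python
import ast

class FuseAddRelu(ast.NodeTransformer):
    """Toy compiler pass: rewrite relu(a + b) into a single call to a
    hypothetical fused kernel, fused_add_relu(a, b)."""
    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested expressions first
        if (isinstance(node.func, ast.Name) and node.func.id == "relu"
                and len(node.args) == 1
                and isinstance(node.args[0], ast.BinOp)
                and isinstance(node.args[0].op, ast.Add)):
            return ast.Call(
                func=ast.Name(id="fused_add_relu", ctx=ast.Load()),
                args=[node.args[0].left, node.args[0].right],
                keywords=[],
            )
        return node

src = "y = relu(x + b)"
tree = ast.fix_missing_locations(FuseAddRelu().visit(ast.parse(src)))
print(ast.unparse(tree))  # y = fused_add_relu(x, b)
```

The payoff of such a fusion is fewer kernel launches and one less round trip of the intermediate `x + b` through GPU memory, which is exactly the kernel-density improvement the article emphasizes.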
Communication optimizations extend beyond AllReduce to All2All, using NCCL’s Rail‑Local All2All and PXN to shift traffic to intra‑node NVLink, as well as enabling InfiniBand’s SHARP offload.
A cost‑model‑driven placement framework is introduced to map split model components to the most suitable hardware, achieving up to 2.1× performance gains. Future directions note continued parameter scaling (up to trillions), multimodal training, and heterogeneous accelerators, requiring unified end‑to‑end cost models and elastic scheduling.
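A minimal sketch of the cost‑model‑driven idea: enumerate candidate (tensor, pipeline, data) splits for a fixed GPU budget, score each with a cost function, and keep the cheapest. The weights below are invented placeholders, not Baidu's actual model, which would account for real interconnect bandwidths and operator profiles:

```python
from itertools import product

GPUS = 32

def cost(tp, pp, dp, comm_tp=4.0, bubble_pp=1.5, allreduce_dp=1.0):
    """Toy cost model (hypothetical weights): tensor parallelism adds heavy
    intra-layer communication, pipeline parallelism adds bubble overhead,
    data parallelism adds gradient-AllReduce volume."""
    return comm_tp * (tp - 1) + bubble_pp * (pp - 1) + allreduce_dp * (dp - 1)

# Enumerate all degree combinations that exactly fill the GPU budget.
candidates = [(tp, pp, dp)
              for tp, pp, dp in product([1, 2, 4, 8], repeat=3)
              if tp * pp * dp == GPUS]
best = min(candidates, key=lambda c: cost(*c))
print(best, cost(*best))
```

A production framework searches a far larger space (including sharding degree, micro-batch size, and device placement), but the structure — candidate generation plus a cost model plus argmin — is the same.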
All of these capabilities are integrated into Baidu's Baige (百度百舸) AI heterogeneous computing platform.
Baidu Geek Talk