Design and Architecture of Multi‑Million GPU Clusters for Large‑Scale AI Model Training
This article surveys the network architectures and congestion-control techniques used in massive GPU clusters—ByteDance's MegaScale, Baidu's HPN, Alibaba's HPN7, and Tencent's Xingmai 2.0—highlighting how high-bandwidth, low-latency designs and advanced RDMA technologies enable the training of trillion-parameter multimodal AI models.
As large‑scale multimodal models evolve from hundred‑billion‑parameter language models to trillion‑parameter systems, clusters with tens of thousands of GPUs require a comprehensive upgrade of underlying compute and networking capabilities.
ByteDance MegaScale network: Implements a three-tier CLOS-like architecture spanning over 10,000 GPUs, built on Broadcom Tomahawk 5 ASICs (51.2 Tbps per chip, 64 × 800 Gbps ports). Each server has eight 400 Gbps NICs connected to eight independent ToR switches via AOC, maintaining a 1:1 convergence ratio across the spine and core layers. Fine-grained routing and traffic scheduling reduce ECMP hash collisions, and data-intensive tasks are colocated under the same ToR to minimize hop count.
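To see why fine-grained scheduling helps, consider how static ECMP works: each switch hashes a flow's five-tuple onto one of its uplinks, so a handful of elephant flows (typical of collective-communication traffic) can land on the same uplink by chance. The sketch below is illustrative only—the hash function, flow tuples, and round-robin "scheduler" are assumptions for the example, not ByteDance's actual mechanism:

```python
import hashlib
import random
from collections import Counter

def ecmp_path(five_tuple, n_paths):
    """Static ECMP: hash the flow five-tuple onto one of n_paths uplinks.
    Two distinct flows may hash to the same uplink (a collision)."""
    digest = hashlib.md5(str(five_tuple).encode()).digest()
    return digest[0] % n_paths

random.seed(0)
N_PATHS = 8
# Eight elephant flows (src, dst, proto, sport, dport=RoCEv2 UDP 4791).
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 17, random.randint(1024, 65535), 4791)
         for i in range(8)]

# Static hashing may concentrate several flows on one uplink.
load = Counter(ecmp_path(f, N_PATHS) for f in flows)
print("static ECMP load per path:", dict(load))

# Explicit scheduling can instead spread flows one-per-uplink,
# so no link carries more than its fair share of elephant flows.
scheduled = Counter(i % N_PATHS for i, _ in enumerate(flows))
print("scheduled load per path:", dict(scheduled))
```

With eight flows and eight uplinks, the scheduled placement is perfectly even, whereas the hashed placement is only even by luck.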
Baidu HPN network: Employs an HPN-AIPod three-level CLOS topology with 1:1 convergence and eight 400 Gbps NICs per server, each linked to one of eight ToR switches, supporting up to 16,000 GPUs. A SuperSpine interconnects spine switches across pods, while joint-affinity scheduling and dynamic load balancing (DLB) mitigate hash conflicts and keep latency low.
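The core idea behind DLB is that a switch picks the next hop from live egress state (e.g. queue depth) rather than from a static hash. The toy model below illustrates that principle under stated assumptions—the class, its queue model, and the 4 KiB flowlet size are all hypothetical, not Baidu's implementation:

```python
import random

class DlbSwitch:
    """Toy dynamic load balancing: per-flowlet path choice based on
    current egress queue depth instead of a static five-tuple hash."""
    def __init__(self, n_ports):
        self.queues = [0] * n_ports   # outstanding bytes per uplink

    def pick_port(self):
        # Send on the least-loaded uplink, breaking ties randomly.
        least = min(self.queues)
        return random.choice(
            [i for i, q in enumerate(self.queues) if q == least])

    def send(self, nbytes):
        port = self.pick_port()
        self.queues[port] += nbytes
        return port

    def drain(self, nbytes):
        # All links transmit concurrently, draining their queues.
        self.queues = [max(0, q - nbytes) for q in self.queues]

random.seed(1)
sw = DlbSwitch(n_ports=8)
for _ in range(800):
    sw.send(4096)   # enqueue a 4 KiB flowlet
    sw.drain(512)
print("queue imbalance (bytes):", max(sw.queues) - min(sw.queues))
```

Because every flowlet goes to the currently least-loaded port, the imbalance between any two queues never exceeds one flowlet (4096 bytes), whereas a static hash can let one queue grow unboundedly ahead of the others.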
Alibaba HPN7 network: Features a dual-plane architecture that abandons the traditional three-tier CLOS, achieving a near-1:1 oversubscription ratio (1.067:1) with 51.2 Tbps Ethernet ASICs. Each server hosts nine NICs (one front-end, eight back-end), each 2 × 200 Gbps, connected to 16 ToR switches. A core layer (15:1 convergence) links multiple pods, and a host-switch collaboration system provides up-to-date link state for non-overlapping path selection, complemented by an application-layer load-balancing scheme that monitors WQE bytes to react to congestion.
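Monitoring WQE bytes amounts to tracking, per RDMA path, how much posted work has not yet completed, and steering the next transfer to the path with the least in-flight data. The following sketch models that bookkeeping; the class name, method names, and transfer sizes are illustrative assumptions, not Alibaba's API:

```python
class PathBalancer:
    """Toy application-layer balancer: track outstanding bytes per RDMA
    path (approximating posted-WQE bytes not yet completed) and place
    each new transfer on the least-loaded path."""
    def __init__(self, n_paths):
        self.inflight = [0] * n_paths

    def post(self, nbytes):
        # Steer the transfer to the path with the least in-flight data.
        path = min(range(len(self.inflight)), key=self.inflight.__getitem__)
        self.inflight[path] += nbytes
        return path

    def complete(self, path, nbytes):
        # A completion (CQE) retires nbytes of in-flight work.
        self.inflight[path] -= nbytes

lb = PathBalancer(n_paths=4)
p0 = lb.post(1 << 20)        # first 1 MiB transfer picks an empty path
p1 = lb.post(1 << 20)        # next 1 MiB goes to a different empty path
lb.complete(p0, 1 << 20)     # completion frees the first path
p2 = lb.post(64 << 10)       # a 64 KiB transfer reuses the freed path
print("paths chosen:", p0, p1, p2)
```

A congested path completes work slowly, so its in-flight byte count stays high and new transfers naturally avoid it—congestion avoidance without switch modifications.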
Tencent Xingmai 2.0 network: Built on RoCE, this architecture introduces the TiTa and TCCL protocols and supports over 100,000 GPUs. It uses a Fat-Tree topology with 1.6 × 10⁴ compute nodes per cluster, each node equipped with eight 400 Gbps NICs (3.2 Tbps total per node). The design is organized into a Block-Pod-Cluster hierarchy, and TiTa 2.0 provides end-to-end active congestion control and rapid self-healing when congestion occurs.
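End-to-end active congestion control generally means the sender adjusts its rate from explicit feedback (e.g. ECN marks) rather than waiting for loss. The sketch below shows a generic AIMD step in the spirit of DCQCN-style RoCE schemes—it is not Tencent's actual TiTa algorithm, and the rate constants are arbitrary assumptions:

```python
def adjust_rate(rate_gbps, ecn_marked, line_rate=400.0,
                alpha=0.5, step=5.0):
    """One AIMD congestion-control step: back off multiplicatively on
    an ECN congestion signal, otherwise probe additively toward line
    rate. (Generic sketch, not the TiTa 2.0 algorithm.)"""
    if ecn_marked:
        return rate_gbps * (1 - alpha)       # multiplicative decrease
    return min(line_rate, rate_gbps + step)  # additive increase, capped

# Replay a short feedback trace on a 400 Gbps NIC.
rate = 400.0
trace = [False, False, True, False, True, False, False]
for mark in trace:
    rate = adjust_rate(rate, mark)
print(f"final rate: {rate:.1f} Gbps")
```

The multiplicative decrease drains congested queues quickly, while the gentle additive increase reclaims bandwidth without immediately re-triggering congestion.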
The article also references numerous external resources covering InfiniBand vs. Ethernet, RoCE in HPC, and detailed analyses of high‑performance networking technologies.
Architects' Tech Alliance
Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.