Design and Architecture of Multi‑Million GPU Clusters for Large‑Scale AI Model Training
This article surveys the network architectures and congestion-control techniques used in massive GPU clusters—ByteDance's MegaScale, Baidu's HPN, Alibaba's HPN7, and Tencent's Xingmai 2.0—highlighting how high-bandwidth, low-latency designs and advanced RDMA technologies enable the training of trillion-parameter multimodal AI models.
As large‑scale multimodal models evolve from hundred‑billion‑parameter language models to trillion‑parameter systems, clusters with tens of thousands of GPUs require a comprehensive upgrade of underlying compute and networking capabilities.
ByteDance MegaScale network: Implements a three-tier CLOS-like architecture spanning over 10,000 GPUs, built on Broadcom Tomahawk 5 ASICs (51.2 Tbps per chip, 64 × 800 Gbps ports). Each server has eight 400 Gbps NICs connected to eight independent ToR switches via AOC, maintaining a 1:1 convergence ratio across the spine and core layers. Fine-grained routing and traffic scheduling reduce ECMP hash collisions, and data-intensive tasks are colocated under the same ToR to minimize hop count.
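To see why fine-grained scheduling helps, consider how static ECMP works: each switch hashes a flow's five-tuple onto one of its uplinks, so a handful of elephant flows (typical of collective-communication traffic) can land on the same uplink by chance. The sketch below is illustrative only—the hash function, flow tuples, and round-robin "scheduler" are assumptions for the example, not ByteDance's actual mechanism:

```python
import hashlib
import random
from collections import Counter

def ecmp_path(five_tuple, n_paths):
    """Static ECMP: hash the flow five-tuple onto one of n_paths uplinks.
    Two distinct flows may hash to the same uplink (a collision)."""
    digest = hashlib.md5(str(five_tuple).encode()).digest()
    return digest[0] % n_paths

random.seed(0)
N_PATHS = 8
# Eight elephant flows (src, dst, proto, sport, dport=RoCEv2 UDP 4791).
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 17, random.randint(1024, 65535), 4791)
         for i in range(8)]

# Static hashing may concentrate several flows on one uplink.
load = Counter(ecmp_path(f, N_PATHS) for f in flows)
print("static ECMP load per path:", dict(load))

# Explicit scheduling can instead spread flows one-per-uplink,
# so no link carries more than its fair share of elephant flows.
scheduled = Counter(i % N_PATHS for i, _ in enumerate(flows))
print("scheduled load per path:", dict(scheduled))
```

With eight flows and eight uplinks, the scheduled placement is perfectly even, whereas the hashed placement is only even by luck.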
Baidu HPN network: Employs an HPN-AIPod three-level CLOS topology with 1:1 convergence and eight 400 Gbps NICs per server, each linked to one of eight ToR switches, supporting up to 16,000 GPUs. A SuperSpine interconnects spine switches across pods, while joint-affinity scheduling and dynamic load balancing (DLB) mitigate hash conflicts and keep latency low.
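The core idea behind DLB is that a switch picks the next hop from live egress state (e.g. queue depth) rather than from a static hash. The toy model below illustrates that principle under stated assumptions—the class, its queue model, and the 4 KiB flowlet size are all hypothetical, not Baidu's implementation:

```python
import random

class DlbSwitch:
    """Toy dynamic load balancing: per-flowlet path choice based on
    current egress queue depth instead of a static five-tuple hash."""
    def __init__(self, n_ports):
        self.queues = [0] * n_ports   # outstanding bytes per uplink

    def pick_port(self):
        # Send on the least-loaded uplink, breaking ties randomly.
        least = min(self.queues)
        return random.choice(
            [i for i, q in enumerate(self.queues) if q == least])

    def send(self, nbytes):
        port = self.pick_port()
        self.queues[port] += nbytes
        return port

    def drain(self, nbytes):
        # All links transmit concurrently, draining their queues.
        self.queues = [max(0, q - nbytes) for q in self.queues]

random.seed(1)
sw = DlbSwitch(n_ports=8)
for _ in range(800):
    sw.send(4096)   # enqueue a 4 KiB flowlet
    sw.drain(512)
print("queue imbalance (bytes):", max(sw.queues) - min(sw.queues))
```

Because every flowlet goes to the currently least-loaded port, the imbalance between any two queues never exceeds one flowlet (4096 bytes), whereas a static hash can let one queue grow unboundedly ahead of the others.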
Alibaba HPN7 network: Features a dual-plane architecture that abandons the traditional three-tier CLOS, achieving a near-1:1 oversubscription ratio (1.067:1) with 51.2 Tbps Ethernet ASICs. Each server hosts nine NICs (one front-end, eight back-end), each 2 × 200 Gbps, connected to 16 ToR switches. A core layer (15:1 convergence) links multiple pods, and a host-switch collaboration system provides up-to-date link state for non-overlapping path selection, complemented by an application-layer load-balancing scheme that monitors WQE bytes to react to congestion.
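Monitoring WQE bytes amounts to tracking, per RDMA path, how much posted work has not yet completed, and steering the next transfer to the path with the least in-flight data. The following sketch models that bookkeeping; the class name, method names, and transfer sizes are illustrative assumptions, not Alibaba's API:

```python
class PathBalancer:
    """Toy application-layer balancer: track outstanding bytes per RDMA
    path (approximating posted-WQE bytes not yet completed) and place
    each new transfer on the least-loaded path."""
    def __init__(self, n_paths):
        self.inflight = [0] * n_paths

    def post(self, nbytes):
        # Steer the transfer to the path with the least in-flight data.
        path = min(range(len(self.inflight)), key=self.inflight.__getitem__)
        self.inflight[path] += nbytes
        return path

    def complete(self, path, nbytes):
        # A completion (CQE) retires nbytes of in-flight work.
        self.inflight[path] -= nbytes

lb = PathBalancer(n_paths=4)
p0 = lb.post(1 << 20)        # first 1 MiB transfer picks an empty path
p1 = lb.post(1 << 20)        # next 1 MiB goes to a different empty path
lb.complete(p0, 1 << 20)     # completion frees the first path
p2 = lb.post(64 << 10)       # a 64 KiB transfer reuses the freed path
print("paths chosen:", p0, p1, p2)
```

A congested path completes work slowly, so its in-flight byte count stays high and new transfers naturally avoid it—congestion avoidance without switch modifications.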
Tencent Xingmai 2.0 network: Built on RoCE, this architecture introduces the TiTa and TCCL protocols and supports over 100,000 GPUs. It uses a Fat-Tree topology with 1.6 × 10⁴ compute nodes per cluster, each node equipped with eight 400 Gbps NICs (3.2 Tbps total per node). The design is organized into a Block-Pod-Cluster hierarchy, and TiTa 2.0 provides end-to-end active congestion control and rapid self-healing when congestion occurs.
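End-to-end active congestion control generally means the sender adjusts its rate from explicit feedback (e.g. ECN marks) rather than waiting for loss. The sketch below shows a generic AIMD step in the spirit of DCQCN-style RoCE schemes—it is not Tencent's actual TiTa algorithm, and the rate constants are arbitrary assumptions:

```python
def adjust_rate(rate_gbps, ecn_marked, line_rate=400.0,
                alpha=0.5, step=5.0):
    """One AIMD congestion-control step: back off multiplicatively on
    an ECN congestion signal, otherwise probe additively toward line
    rate. (Generic sketch, not the TiTa 2.0 algorithm.)"""
    if ecn_marked:
        return rate_gbps * (1 - alpha)       # multiplicative decrease
    return min(line_rate, rate_gbps + step)  # additive increase, capped

# Replay a short feedback trace on a 400 Gbps NIC.
rate = 400.0
trace = [False, False, True, False, True, False, False]
for mark in trace:
    rate = adjust_rate(rate, mark)
print(f"final rate: {rate:.1f} Gbps")
```

The multiplicative decrease drains congested queues quickly, while the gentle additive increase reclaims bandwidth without immediately re-triggering congestion.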
The article also references numerous external resources covering InfiniBand vs. Ethernet, RoCE in HPC, and detailed analyses of high‑performance networking technologies.
Architects' Tech Alliance
Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.