Design and Comparison of High‑Performance Cloud Data Center Networks for AI Computing
This article analyzes traditional cloud data center network limitations for AI workloads and compares various high‑bandwidth, low‑latency architectures—including two‑layer and three‑layer fat‑tree designs, InfiniBand, and RoCE—providing best‑practice recommendations for building scalable, non‑blocking AI‑Pool networks.
Traditional cloud data center networks are typically designed for north‑south traffic serving external customers, with limited east‑west capacity, which poses challenges for AI (intelligent computing) workloads that require high bandwidth, low latency, and lossless communication.
The article, sourced from "Intelligent Computing Center Network Architecture Selection and Comparison," examines traditional versus AI‑focused networks, two‑layer and three‑layer fat‑tree architectures, and presents best‑practice networking designs.
Challenges in Existing Networks
Blocking (oversubscribed) networks: Leaf uplink bandwidth is often provisioned at only one‑third of downlink bandwidth to reduce cost.
Higher internal latency: Communication between servers across leaf switches traverses spine switches, resulting in three hops.
Insufficient bandwidth: Single NICs typically cap at 200 Gbps, limiting per‑machine throughput.
For AI scenarios, a dedicated high‑performance network is recommended to meet the demands of high bandwidth, low latency, and lossless transmission.
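The oversubscription problem above can be made concrete with a small sketch. The port counts and speeds below are illustrative assumptions, not the specification of any particular switch:

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of total downlink to total uplink capacity; 1.0 means non-blocking."""
    return downlink_gbps / uplink_gbps

# Hypothetical traditional cloud leaf: 48 x 25G down, 4 x 100G up -> 3:1 blocking
blocking = oversubscription_ratio(48 * 25, 4 * 100)

# AI leaf with matched capacity: 32 x 100G down, 32 x 100G up -> 1:1 non-blocking
non_blocking = oversubscription_ratio(32 * 100, 32 * 100)

print(blocking, non_blocking)  # 3.0 1.0
```

A ratio above 1.0 means east‑west traffic can be throttled at the uplinks whenever enough servers transmit at once, which is exactly the condition AI collective communication creates.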
High‑Bandwidth Design
AI servers may be equipped with up to eight GPUs and eight PCIe NIC slots. To support burst bandwidth exceeding 50 Gbps between GPUs across machines, each GPU is typically connected to a network port of at least 100 Gbps, achievable with configurations such as 4 × 2 × 100 Gbps NICs, 8 × 1 × 100 Gbps NICs, or 8 × single‑port 200/400 Gbps NICs.
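A quick check, using the NIC configurations named above, confirms that each of them gives every GPU in an 8‑GPU server at least a dedicated 100 Gbps port:

```python
GPUS_PER_SERVER = 8

# Aggregate NIC bandwidth (Gbps) for each configuration from the text:
# count x ports-per-NIC x speed-per-port
configs = {
    "4 x dual-port 100G": 4 * 2 * 100,
    "8 x single-port 100G": 8 * 1 * 100,
    "8 x single-port 200G": 8 * 1 * 200,
    "8 x single-port 400G": 8 * 1 * 400,
}

for name, total_gbps in configs.items():
    per_gpu = total_gbps / GPUS_PER_SERVER
    print(f"{name}: {per_gpu:.0f} Gbps per GPU")  # 100, 100, 200, 400
```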
Non‑Blocking Design
The key to a non‑blocking network is the Fat‑Tree architecture, where uplink and downlink bandwidth are 1:1; for example, 64 ports of 100 Gbps on the downlink are matched by 64 ports of 100 Gbps on the uplink.
Data‑center‑grade switches with full‑port non‑blocking forwarding capabilities are required.
Low‑Latency AI‑Pool Design
Baidu Smart Cloud implements an AI‑Pool network optimized with rail‑based design. Eight access switches form an AI‑Pool; within this two‑layer topology, communication between same‑numbered GPUs inside the same AI‑Pool requires only one hop.
In the AI‑Pool, NICs with the same index across different AI nodes are connected to the same switch, enabling the communication library to map matching GPU numbers to matching NIC numbers, achieving one‑hop intra‑pool connectivity.
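The wiring rule can be sketched as follows. The switch names and indices are illustrative, not actual device identifiers; the point is that the access switch a NIC attaches to depends only on the NIC index, never on the node:

```python
def access_switch_for(node: int, nic_index: int) -> str:
    """Rail-based wiring: NIC i on every node attaches to access switch i.
    The node argument is kept to emphasize it does not affect the result."""
    return f"leaf-{nic_index}"

# GPU/NIC 3 on node 0 and GPU/NIC 3 on node 7 share one switch: one hop.
assert access_switch_for(0, 3) == access_switch_for(7, 3)

# Different NIC indices land on different switches, so cross-rail traffic
# must climb to the aggregation layer: three hops.
assert access_switch_for(0, 3) != access_switch_for(0, 5)
```

This is why the communication library maps GPU number to NIC number: as long as traffic stays on its own rail, it never leaves the access layer.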
For cross‑AI‑Pool communication, traffic passes through aggregation switches, resulting in three hops.
Two‑Layer Fat‑Tree Architecture
Eight access switches constitute an AI‑Pool. Each P‑port access switch dedicates P/2 ports to server NICs and P/2 ports to aggregation switches, so the two‑layer design can attach a total of P × P/2 GPUs.
Three‑Layer Fat‑Tree Architecture
The three‑layer design adds aggregation and core switch groups, each containing up to P/2 switches. This topology can accommodate P × (P/2) × (P/2) = P³/4 GPUs. With 40‑port 200 Gbps HDR InfiniBand switches, the maximum supported scale reaches 16,000 GPUs, a scale Baidu has deployed in practice.
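The two scaling formulas can be checked directly for the 40‑port HDR switch mentioned in the text:

```python
def two_layer_gpus(p: int) -> int:
    """Two-layer fat-tree: P access switches, each with P/2 ports to GPU NICs."""
    return p * (p // 2)

def three_layer_gpus(p: int) -> int:
    """Three-layer fat-tree: adds aggregation and core groups of P/2 switches,
    giving P * (P/2) * (P/2) = P^3 / 4 GPUs."""
    return p * (p // 2) * (p // 2)

# 40-port 200 Gbps HDR InfiniBand switch:
print(two_layer_gpus(40))    # 800
print(three_layer_gpus(40))  # 16000
```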
Comparison of Two‑Layer and Three‑Layer Fat‑Tree
GPU Scale: With a 40‑port switch, the two‑layer design supports 800 GPUs, while the three‑layer design supports 16,000 GPUs.
Forwarding Path Hops: In the two‑layer AI‑Pool, same‑GPU‑number traffic hops once; different‑GPU traffic (without Rail Local optimization) hops three times. In the three‑layer design, same‑GPU traffic hops three times, and different‑GPU traffic hops five times without optimization.
Typical Practices
Based on mature commercial switches, the following physical network specifications are recommended:
Regular: InfiniBand two‑layer Fat‑Tree, supporting up to 800 GPUs per cluster.
Large: RoCE two‑layer Fat‑Tree with 128‑port 100G Ethernet switches, supporting up to 8,192 GPUs.
XLarge: InfiniBand three‑layer Fat‑Tree with HDR switches, supporting up to 16,000 GPUs.
XXLarge: InfiniBand Quantum‑2 or equivalent Ethernet switches in a three‑layer Fat‑Tree, supporting up to 100,000 GPUs.
Large‑Scale AI Computing Network Practice
In an 8,192‑GPU cluster, each AI‑Pool can host 512 GPUs. The non‑blocking, low‑latency, high‑reliability network enables rapid iteration of AI applications.
XLarge‑Scale AI Computing Network Practice
Baidu Smart Cloud’s ultra‑large cluster uses 200 Gbps InfiniBand HDR switches, providing 1.6 Tbps external bandwidth per GPU server.
Related Reading
InfiniBand: Can It Displace Ethernet?
NVIDIA Quantum‑2 InfiniBand Platform Q&A
Jericho3‑AI Chip: A Potential InfiniBand Alternative?
RoCE Technology in HPC: Analysis and Application
GPU Cluster: NVLink, InfiniBand, RoCE, DDC Technical Analysis
InfiniBand High‑Performance Network Design Overview
Understanding InfiniBand and RoCE Network Technologies
Industrial Switch Research Framework (2024)
Core Switch Link Aggregation, Redundancy, Stacking, Hot‑Backup
High‑Performance Computing: Key Component Knowledge
High‑Performance Computing: RoCE v2 vs. InfiniBand Selection
High‑Performance Computing: The Overlooked National Strategic Asset