Design Principles and Practices for High‑Performance AI Compute Center Networks
The article analyzes the limitations of traditional data‑center networking for AI compute workloads and presents high‑bandwidth, non‑blocking, low‑latency design solutions—including two‑layer and three‑layer fat‑tree architectures, AI‑Pool concepts, and recommended configurations—for building scalable, efficient intelligent computing clusters.
Traditional cloud data‑center networks are designed primarily for north‑south traffic between the data center and external clients, with limited east‑west capacity; this poses challenges for AI compute workloads, which require high bandwidth, low latency, and lossless transmission.
Key challenges include oversubscribed (blocking) networks caused by uplink/downlink bandwidth ratios greater than 1:1, higher intra‑data‑center latency on multi‑hop paths across spine switches, and limited NIC bandwidth (typically ≤200 Gbps per server).
For AI compute scenarios, the recommended practice is to build a dedicated high‑performance network that meets the demands for large bandwidth, low latency, and lossless delivery.
High‑Bandwidth Design – AI servers may be equipped with up to eight GPUs and eight PCIe NIC slots; to support inter‑GPU burst bandwidth >50 Gbps, each GPU is typically connected to a ≥100 Gbps network port, achievable with multiple 2 × 100 Gbps NICs or single 200/400 Gbps NICs.
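As a quick sanity check on the figures above, per‑GPU network bandwidth under the two NIC options can be computed; a minimal sketch (the option labels are illustrative, the numbers come from the text):

```python
# Check that a server's NIC configuration gives each of its eight GPUs
# at least 100 Gbps of dedicated network bandwidth (figures from the text).
GPUS_PER_SERVER = 8
REQUIRED_GBPS_PER_GPU = 100  # headroom for >50 Gbps inter-GPU bursts

def per_gpu_gbps(nic_count: int, ports_per_nic: int, gbps_per_port: int) -> float:
    """Aggregate NIC bandwidth divided evenly across the server's GPUs."""
    return nic_count * ports_per_nic * gbps_per_port / GPUS_PER_SERVER

# Option A: four dual-port 100 Gbps NICs (8 x 100 Gbps ports in total)
assert per_gpu_gbps(4, 2, 100) >= REQUIRED_GBPS_PER_GPU
# Option B: eight single-port 200 Gbps NICs
assert per_gpu_gbps(8, 1, 200) >= REQUIRED_GBPS_PER_GPU
```

Either option dedicates at least one ≥100 Gbps port per GPU, which is the design target stated above.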
Non‑Blocking Design – Implement a Fat‑Tree architecture with a 1:1 uplink/downlink bandwidth ratio, ensuring that if a leaf switch has 64 × 100 Gbps ports downstream, it also has 64 × 100 Gbps upstream ports, using data‑center‑class switches that provide full‑port non‑blocking forwarding.
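The 1:1 rule is easy to verify mechanically: aggregate downstream bandwidth must equal aggregate upstream bandwidth at every leaf. A minimal sketch, using the example port counts from the text:

```python
# A leaf switch is non-blocking when its aggregate downstream bandwidth
# equals its aggregate upstream bandwidth (the Fat-Tree 1:1 ratio).
def is_non_blocking(down_ports: int, up_ports: int,
                    down_gbps: int, up_gbps: int) -> bool:
    return down_ports * down_gbps == up_ports * up_gbps

# 64 x 100 Gbps down and 64 x 100 Gbps up, as in the example above:
assert is_non_blocking(64, 64, 100, 100)
# A 3:1 oversubscribed leaf (48 down, 16 up at equal port speed) is not:
assert not is_non_blocking(48, 16, 100, 100)
```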
Low‑Latency AI‑Pool Design – Baidu Smart Cloud adopts a rail‑optimized AI‑Pool network in which eight access switches form a pool; NICs with the same number on different nodes connect to the same switch, so within a pool, communication between same‑numbered GPUs crosses only one switch.
The AI‑Pool also leverages NCCL's rail‑local communication, which uses intra‑server NVSwitch bandwidth to redirect traffic between different‑numbered GPUs onto same‑numbered‑GPU paths, enabling efficient multi‑node GPU communication.
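A minimal sketch of the rail wiring implied above (the mapping function is an illustration of the scheme, not Baidu's actual cabling plan):

```python
NUM_RAILS = 8  # access switches per AI-Pool, one per GPU/NIC number

def access_switch(gpu_number: int) -> int:
    """Rail-optimized wiring: NIC i of every server in the pool is
    cabled to access switch i, regardless of which server it is on."""
    return gpu_number % NUM_RAILS

def intra_pool_switch_hops(gpu_a: int, gpu_b: int) -> int:
    """Same-numbered GPUs share an access switch (1 hop); otherwise
    traffic crosses the aggregation layer (3 hops) unless NCCL
    rail-local first moves it to the matching local GPU over NVSwitch."""
    return 1 if access_switch(gpu_a) == access_switch(gpu_b) else 3

assert intra_pool_switch_hops(3, 3) == 1  # same rail: one hop
assert intra_pool_switch_hops(0, 5) == 3  # different rails, no rail-local
```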
Two‑layer Fat‑Tree architecture: eight access switches form an AI‑Pool; each P‑port access switch connects P/2 ports to servers and P/2 ports to upstream switches, supporting up to P × P/2 = P²/2 GPUs.
Three‑layer Fat‑Tree architecture adds aggregation and core switch groups, each with up to P/2 switches, supporting up to P × (P/2) × (P/2) = P³/4 GPUs; for example, 40‑port HDR InfiniBand switches can support up to 16,000 GPUs, the largest scale deployed in China at the time.
Comparison of two‑layer vs. three‑layer Fat‑Tree shows significant differences in scalable GPU count and hop count: two‑layer provides 1‑hop same‑GPU communication and 3‑hop different‑GPU communication, while three‑layer requires 3‑hop same‑GPU and 5‑hop different‑GPU paths without AI‑Pool optimizations.
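The hop counts in this comparison can be written down directly; a sketch of the figures quoted above (switch hops between GPUs on different servers, without AI‑Pool optimizations):

```python
def switch_hops(layers: int, same_gpu_number: bool) -> int:
    """Switches traversed between GPUs on different servers in a
    non-blocking Fat-Tree (values as quoted in the comparison)."""
    if layers == 2:
        return 1 if same_gpu_number else 3
    if layers == 3:
        return 3 if same_gpu_number else 5
    raise ValueError("only two- and three-layer Fat-Trees are modeled")

assert switch_hops(2, True) == 1   # same-numbered GPUs, two layers
assert switch_hops(3, False) == 5  # different-numbered GPUs, three layers
```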
Typical Practices – Recommended configurations based on mature commercial switches:
Regular: InfiniBand two‑layer Fat‑Tree, up to 800 GPUs per cluster.
Large: RoCE two‑layer Fat‑Tree with 128‑port 100 G Ethernet switches, up to 8,192 GPUs.
XLarge: InfiniBand three‑layer Fat‑Tree, up to 16,000 GPUs.
XXLarge: InfiniBand Quantum‑2 or equivalent Ethernet switches, three‑layer Fat‑Tree, up to 100,000 GPUs.
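The tier limits above follow from the Fat‑Tree scaling formulas P²/2 (two layers) and P³/4 (three layers); a quick cross‑check (the XXLarge tier is omitted because its exact switch radix is not stated):

```python
def fat_tree_max_gpus(ports: int, layers: int) -> int:
    """Maximum GPUs in a non-blocking Fat-Tree of P-port switches:
    P^2/2 for two layers, P^3/4 for three layers."""
    return ports ** layers // 2 ** (layers - 1)

assert fat_tree_max_gpus(40, 2) == 800     # Regular: 40-port IB, two layers
assert fat_tree_max_gpus(128, 2) == 8192   # Large: 128-port Ethernet
assert fat_tree_max_gpus(40, 3) == 16000   # XLarge: 40-port HDR, three layers
```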
Large‑Scale AI Compute Network Practice – Baidu's single AI compute cluster can host 8,192 GPUs, with each AI‑Pool supporting 512 GPUs; the non‑blocking, low‑latency, high‑reliability network enables rapid iteration of AI applications.
XLarge‑Scale AI Compute Network Practice – Baidu Smart Cloud designed an InfiniBand network for ultra‑large clusters, using 200 Gbps HDR switches since 2021, providing 1.6 Tbps external bandwidth per GPU server.
Architects' Tech Alliance