
Predictable Network and High‑Performance Network Architecture for Large‑Scale AI Training

The article examines how Alibaba Cloud’s Predictable Network, InfiniBand versus Ethernet trade‑offs, and the HPN high‑performance network design together address the extreme bandwidth, latency, scalability and reliability requirements of modern large‑model AI training workloads in cloud data centers.

Alibaba Cloud Infrastructure

General artificial intelligence is approaching, and the surge in large‑model training demands compute clusters far beyond the capacity of any single chip, making ultra‑large‑scale networking a critical piece of infrastructure.

Alibaba Cloud’s Infrastructure Division introduced the Predictable Network, a QoS‑aware networking system that guarantees throughput and latency for AI, big‑data and HPC workloads, moving beyond the traditional "best‑effort" model.

Large‑model training is highly sensitive to network bandwidth and latency: distributed training combines multiple parallelism strategies with frequent synchronization, so even small delays on a single link can stall every worker and prolong or abort a job.
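To get a feel for the scale involved, here is a back‑of‑envelope estimate of per‑step gradient‑synchronization traffic in purely data‑parallel training; the model size, precision, worker count, and line rate are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope estimate of per-step gradient-sync traffic in
# purely data-parallel training. Model size, precision, worker count,
# and line rate below are illustrative assumptions, not figures from
# the article.

def ring_allreduce_bytes(buffer_bytes: float, n_workers: int) -> float:
    """Bytes each worker transmits in one ring AllReduce: 2*(n-1)/n * size."""
    return 2 * (n_workers - 1) / n_workers * buffer_bytes

params = 175e9                 # GPT-3-scale parameter count (assumption)
grad_bytes = params * 2        # fp16 gradients: 2 bytes per parameter
n = 128                        # data-parallel workers (assumption)

per_worker = ring_allreduce_bytes(grad_bytes, n)
print(f"traffic per worker per step: {per_worker / 1e9:.1f} GB")
# At a 400 Gb/s (50 GB/s) line rate this alone takes ~14 seconds, which
# is why gradient sync must overlap with compute and why lost bandwidth
# translates directly into lost GPU time.
```

In practice parallelism is hybrid and communication overlaps computation, but the arithmetic shows why the network, not the GPU, can become the bottleneck.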

Comparing InfiniBand (IB) and Ethernet, IB offers lower microsecond‑level latency for small messages and higher raw bandwidth, while Ethernet, after optimizations, narrows the performance gap to within 5% for AI training.

Alibaba’s HPN (High‑Performance Network) architecture introduces dual‑uplink, dual‑plane forwarding, and two‑layer switching to achieve linear scalability, low latency, and robust load‑balancing for clusters exceeding ten thousand GPUs.

Dual‑uplink eliminates single‑point failures and enables seamless link or switch upgrades; dual‑plane forwarding distributes traffic evenly across two network planes, mitigating hash‑polarization in AI workloads.
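The hash‑polarization problem that dual‑plane forwarding targets can be demonstrated with a toy sketch; the hashing scheme and flow labels below are invented for illustration and are not Alibaba's implementation:

```python
# Toy illustration of ECMP hash polarization and why de-correlated
# per-tier (or per-plane) path decisions restore balance. The hash
# scheme and flow labels are illustrative, not Alibaba's actual design.
import hashlib

def ecmp_hash(flow: str, salt: str = "") -> int:
    """Deterministic per-flow hash, as an ECMP switch would compute."""
    return int(hashlib.md5((salt + flow).encode()).hexdigest(), 16)

flows = [f"10.0.0.{i}:5000->10.0.1.{i}:5001" for i in range(64)]

# Tier 1 splits flows across two uplinks by hash parity.
group0 = [f for f in flows if ecmp_hash(f) % 2 == 0]

# Polarization: if tier 2 applies the *same* hash, every flow in
# group0 picks uplink 0 again, so one tier-2 link carries everything.
assert all(ecmp_hash(f) % 2 == 0 for f in group0)

# A tier-specific salt (or an independent second plane chosen at the
# NIC, as in dual-plane forwarding) de-correlates the decisions and
# re-balances the group.
split = [ecmp_hash(f, salt="tier2") % 2 for f in group0]
print("tier-2 re-split of group0:", split.count(0), "vs", split.count(1))
```

The design point is that balancing decisions at different tiers must be independent; pinning each flow to one of two planes at the NIC achieves this by construction.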

Adaptive routing is employed in both IB and modern Ethernet chips, with per‑packet and flowlet‑level schemes balancing load while managing packet reordering risks.
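A minimal sketch of the flowlet idea, assuming a simple idle‑gap threshold (the class, API, and parameter values here are hypothetical, not taken from any vendor's chip):

```python
# Minimal flowlet-switching sketch: a flow may be moved to a new path
# only when the idle gap since its last packet exceeds the worst-case
# delay skew between paths, so reordering cannot occur. The class, API,
# and threshold value are illustrative assumptions.
import random

FLOWLET_GAP_US = 100.0  # must exceed the max path-delay difference

class FlowletBalancer:
    def __init__(self, n_paths: int):
        self.n_paths = n_paths
        self.state = {}  # flow_id -> (last_seen_us, path)

    def route(self, flow_id: str, now_us: float) -> int:
        last, path = self.state.get(flow_id, (None, None))
        if last is None or now_us - last > FLOWLET_GAP_US:
            # New flowlet: safe to re-balance onto any path.
            path = random.randrange(self.n_paths)
        self.state[flow_id] = (now_us, path)
        return path

lb = FlowletBalancer(n_paths=4)
p1 = lb.route("flowA", 0.0)
p2 = lb.route("flowA", 10.0)    # 10 us gap < threshold: same path
p3 = lb.route("flowA", 500.0)   # 490 us gap: new flowlet, may switch
```

Per‑packet spraying balances load more finely but pushes the reordering problem to the receiver, which is why it is usually paired with hardware reorder buffers or reorder‑tolerant transports.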

Solar‑RDMA, a proprietary high‑performance RDMA protocol, provides multi‑path transmission, fast failover, precise congestion control, and tenant‑aware QoS.
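As a rough sketch of what multi‑path transmission with fast failover can look like, here is a toy sender; the class, path names, and health‑tracking scheme are invented for illustration and are not Solar‑RDMA's actual design:

```python
# Toy multi-path sender with fast failover: traffic is sprayed
# round-robin across healthy paths, and a path marked failed (e.g. on
# timeout) is skipped immediately. Names and API are illustrative.
class MultiPathSender:
    def __init__(self, paths):
        self.paths = list(paths)
        self.healthy = {p: True for p in paths}
        self.rr = 0

    def mark_failed(self, path):
        self.healthy[path] = False  # e.g. after a timeout or NACK burst

    def next_path(self):
        # Spraying across paths keeps one congested or dead link from
        # stalling the whole flow; failover is just shrinking the set.
        live = [p for p in self.paths if self.healthy[p]]
        if not live:
            raise RuntimeError("all paths down")
        p = live[self.rr % len(live)]
        self.rr += 1
        return p

s = MultiPathSender(["plane0/uplink0", "plane0/uplink1",
                     "plane1/uplink0", "plane1/uplink1"])
first = [s.next_path() for _ in range(4)]
s.mark_failed("plane1/uplink0")
after = [s.next_path() for _ in range(3)]
assert "plane1/uplink0" not in after  # failover skips the dead path
```

A production RDMA transport additionally needs per‑path congestion state and sequence handling, but the failover principle is the same: route around failures in round‑trip time rather than waiting for routing convergence.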

ACCL, Alibaba’s high‑performance collective communication library, optimizes AllReduce by leveraging Ethernet and NVLink, achieving up to 20% performance gains over traditional Ring algorithms.
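For reference, the Ring baseline that ACCL is compared against can be simulated in a few lines; this is the generic textbook reduce‑scatter/all‑gather pattern, not ACCL's implementation:

```python
# Generic textbook ring AllReduce (reduce-scatter + all-gather),
# simulated in memory. This is the baseline algorithm, not ACCL's
# implementation; ACCL's gains come from restructuring this pattern
# across NVLink and Ethernet.

def ring_allreduce(data):
    """data[r][c]: rank r's value for chunk c (one chunk per rank).

    Returns the state after the AllReduce: every rank holds the
    element-wise sum. Each rank sends 2*(n-1)/n of the buffer in total.
    """
    n = len(data)
    out = [list(row) for row in data]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of
    # chunk (r + 1) % n.
    for step in range(n - 1):
        snapshot = [list(row) for row in out]  # start-of-step values
        for r in range(n):
            c = (r - step) % n                 # chunk rank r sends
            out[(r + 1) % n][c] += snapshot[r][c]
    # All-gather: circulate each fully reduced chunk to every rank.
    for step in range(n - 1):
        snapshot = [list(row) for row in out]
        for r in range(n):
            c = (r + 1 - step) % n
            out[(r + 1) % n][c] = snapshot[r][c]
    return out

result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result)  # every rank ends with [12, 15, 18]
```

Ring is bandwidth‑optimal for large buffers but serializes 2(n−1) hops, so hierarchical schemes that keep the intra‑node portion on NVLink can beat it.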

The C4 collective‑communication scheduler orchestrates concurrent collectives across multiple tasks, reducing overall communication time by 49% and improving GPU utilization by 67%.

Nimitz, an RDMA‑enabled container network, supports up to 15,000 servers in a single Kubernetes cluster, offering rapid provisioning, multi‑plane routing, and seamless integration with serverless workloads.

NUSA, the unified network service platform, automates RDMA deployment, monitoring, fault detection, and remediation, delivering an out‑of‑the‑box RDMA experience.

In summary, while InfiniBand still leads in raw performance, Ethernet’s open ecosystem, cost‑effectiveness, and recent innovations make it the preferred choice for large‑scale AI training in cloud environments.

Figure: Example of GPT‑3 175B training on 128 GPU nodes.

Figure: Parallel strategy illustration.

Tags: network architecture, cloud computing, High Performance Computing, AI training, InfiniBand, Ethernet, Predictable Network
Written by Alibaba Cloud Infrastructure
