
Overview of Popular GPU/TPU Cluster Networking Technologies for LLM Training

This article examines the main GPU/TPU cluster networking options—including NVLink, InfiniBand, RoCE Ethernet Fabric, and DDC full‑schedule networks—explaining their latency, lossless transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

Architects' Tech Alliance

This article introduces the popular GPU/TPU cluster networking technologies NVLink, InfiniBand, RoCE Ethernet Fabric, and DDC full‑schedule networks, focusing on how they interconnect accelerators and the roles they play in LLM training. Three network properties matter most for these workloads:

1. End‑to‑end latency: Reducing overall data transfer latency between GPUs shortens training time.

2. Lossless transmission: Essential for AI training, because a dropped packet carrying gradients can force the job to roll back to the previous checkpoint.

3. Effective congestion control: In tree topologies, transient and persistent congestion increase tail latency; a single slow link can degrade overall performance.
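To see why a single slow link dominates tail latency, here is a toy Python model (all numbers hypothetical) of a synchronous training step that must wait for the slowest of many parallel transfers:

```python
import random

def allreduce_step_time(n_links: int, base_ms: float, slow_ms: float,
                        p_slow: float) -> float:
    """A synchronous training step finishes only when the slowest
    of n parallel transfers completes (illustrative model)."""
    times = [slow_ms if random.random() < p_slow else base_ms
             for _ in range(n_links)]
    return max(times)

random.seed(0)
steps = sorted(allreduce_step_time(n_links=64, base_ms=2.0,
                                   slow_ms=10.0, p_slow=0.01)
               for _ in range(10_000))
p50 = steps[len(steps) // 2]
p99 = steps[int(len(steps) * 0.99)]
# With only a 1% chance of any single link being slow, nearly half
# of all steps are dragged to the slow-link time: congestion on one
# path dominates the tail of step latency.
print(p50, p99)
```

Even rare per-link slowdowns compound across many links, which is why effective congestion control is listed as a first-class requirement.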

Additional considerations include total system cost, power consumption, and cooling.

NVLink Switching System

NVLink switches (NVSwitch) connect the eight GPUs within a server and can also be used to build inter‑server networks. Nvidia demonstrated an NVSwitch topology linking 32 nodes (256 GPUs) at Hot Chips 2022. NVLink offers higher performance and lower overhead than traditional networks.

The third‑generation NVSwitch provides 64 NVLink ports with up to 12.8 Tbps capacity, supporting multicast and network aggregation to reduce gradient traffic.
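The traffic savings from in‑network aggregation can be illustrated with a back‑of‑the‑envelope comparison; the gradient size and cluster size below are hypothetical, and the ring all‑reduce cost is the standard 2(n−1)/n result:

```python
def ring_allreduce_tx_bytes(grad_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU transmits in a classic ring all-reduce:
    2 * (n-1)/n times the gradient size."""
    return 2.0 * grad_bytes * (n_gpus - 1) / n_gpus

def switch_aggregated_tx_bytes(grad_bytes: float) -> float:
    """With in-switch reduction plus multicast, each GPU sends its
    gradient once; the switch sums the gradients and multicasts
    the result back (simplified model)."""
    return grad_bytes

g = 2.5e9        # hypothetical gradient exchange per step, in bytes
n = 256          # hypothetical cluster size
ring = ring_allreduce_tx_bytes(g, n)
aggregated = switch_aggregated_tx_bytes(g)
print(ring / aggregated)   # approaches 2x less data on the wire
```

Halving the per-GPU gradient traffic is one reason in-switch multicast and aggregation matter at this scale.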

In GPT‑3 training, an NVSwitch fabric is roughly twice as fast as InfiniBand, but its 12.8 Tbps switching capacity is a quarter of that of high‑end 51.2 Tbps Ethernet switches, and scaling beyond 1,000 GPUs is cost‑prohibitive.

InfiniBand Network

InfiniBand (IB) has been a high‑speed alternative since 1999, offering low latency, lossless transmission, and RDMA. It is widely used in HPC, AI/ML clusters, and data centers.

IB provides credit‑based flow control for loss‑less transmission and supports congestion notification similar to ECN. All IB switches support RDMA, allowing direct GPU‑to‑GPU memory transfers without CPU involvement.
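Credit-based flow control can be sketched as a toy model: the sender transmits only while the receiver has advertised free buffer slots, so the link itself never drops a packet. This is an illustrative simplification, not the IB wire protocol:

```python
from collections import deque

class CreditLink:
    """Toy model of credit-based flow control: a packet is sent only
    when the receiver has advertised buffer space, so the link never
    drops packets; a full receiver stalls the sender instead."""
    def __init__(self, rx_buffer_slots: int):
        self.credits = rx_buffer_slots   # advertised free RX slots
        self.rx_queue = deque()

    def try_send(self, pkt) -> bool:
        if self.credits == 0:
            return False                 # sender must wait, not drop
        self.credits -= 1
        self.rx_queue.append(pkt)
        return True

    def receiver_drain(self):
        """Receiver consumes a packet and returns a credit."""
        if self.rx_queue:
            self.credits += 1
            return self.rx_queue.popleft()
        return None

link = CreditLink(rx_buffer_slots=2)
sent = [link.try_send(i) for i in range(4)]  # only 2 fit
drained = link.receiver_drain()
ok = link.try_send(4)                        # a credit came back
print(sent, ok)
```

The key property is that backpressure replaces packet loss, which is what makes retransmission-free gradient exchange possible.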

However, IB switches are more complex to configure, maintain, and scale, especially beyond 32K GPUs, and require specialized hardware, making them costlier than Ethernet.

RoCE Lossless Ethernet

Ethernet spans from 1 Gbps to 800 Gbps (future 1.6 Tbps). Compared to IB, Ethernet offers higher port speeds and total switching capacity, with lower per‑Gbps cost due to competitive ASIC integration.

High‑end Ethernet ASICs can provide up to 51.2 Tbps switching capacity with 800 Gbps ports, double the throughput of Nvidia's 25.6 Tbps Quantum‑2 IB switches.

RoCE achieves lossless transmission via Priority Flow Control (PFC), which supports eight traffic classes, and RoCEv2 carries RDMA over UDP/IP with end‑to‑end congestion control such as DCQCN.
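A heavily simplified sketch of a DCQCN-style reaction loop follows; the real algorithm uses rate targets, timers, and byte counters, and the constants below are hypothetical:

```python
def dcqcn_update(rate_gbps: float, ecn_marked: bool,
                 line_rate_gbps: float = 100.0,
                 alpha: float = 0.5, raise_step: float = 5.0) -> float:
    """Simplified DCQCN-style reaction (assumed parameters):
    cut the sending rate multiplicatively when ECN feedback
    arrives, recover additively toward line rate otherwise."""
    if ecn_marked:
        return rate_gbps * (1 - alpha / 2)     # multiplicative cut
    return min(line_rate_gbps, rate_gbps + raise_step)

rate = 100.0
trace = []
for marked in [True, True, False, False, False]:
    rate = dcqcn_update(rate, marked)
    trace.append(round(rate, 2))
print(trace)  # [75.0, 56.25, 61.25, 66.25, 71.25]
```

The point of the multiplicative-decrease/additive-increase shape is to drain congested queues quickly while ramping back up gently, keeping PFC pauses rare.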

Load balancing uses ECMP with hash‑based path selection; adaptive strategies reserve extra bandwidth or reroute traffic away from congested paths. RoCEv2 can also spray packets across links, though the resulting out‑of‑order delivery requires NICs that can reorder packets.
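Hash-based ECMP can be sketched as follows; `zlib.crc32` stands in for whatever hash function a real switch ASIC uses:

```python
import zlib

def ecmp_pick(src_ip: str, dst_ip: str, src_port: int,
              dst_port: int, proto: str, n_links: int) -> int:
    """Hash the flow 5-tuple so every packet of one flow takes the
    same uplink (preserving ordering), while different flows
    spread across the available links."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_links

# The same flow always maps to the same uplink:
a = ecmp_pick("10.0.0.1", "10.0.1.1", 4791, 4791, "udp", 8)
b = ecmp_pick("10.0.0.1", "10.0.1.1", 4791, 4791, "udp", 8)

# Many flows spread over 8 uplinks, but not necessarily evenly:
# hash collisions are exactly why adaptive routing helps.
links = {ecmp_pick(f"10.0.0.{i}", "10.0.1.1", 49152 + i, 4791,
                   "udp", 8)
         for i in range(64)}
print(a == b, len(links))
```

Per-flow hashing avoids reordering but can map several elephant flows onto the same link, which is the congestion scenario adaptive routing and packet spraying address.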

DDC Full‑Schedule Network

Recent switch/router chips support full‑schedule fabrics (AI Fabric) using Virtual Output Queues (VOQ). Packets are buffered once at the leaf switch, then scheduled to egress switches, reducing head‑of‑line blocking and incast congestion.
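The head-of-line-blocking benefit of VOQs can be shown with a toy model; the egress names and grant mechanism below are illustrative, not a real scheduler protocol:

```python
from collections import defaultdict, deque

class LeafVOQ:
    """Toy Virtual Output Queue model: the ingress leaf keeps a
    separate queue per egress switch, so a packet stuck behind a
    congested egress cannot block traffic bound elsewhere."""
    def __init__(self):
        self.voq = defaultdict(deque)   # egress id -> packet queue

    def enqueue(self, egress_id: str, pkt: str) -> None:
        self.voq[egress_id].append(pkt)

    def schedule(self, grants):
        """Send one packet toward each egress that granted credit."""
        sent = []
        for egress_id in grants:
            if self.voq[egress_id]:
                sent.append((egress_id, self.voq[egress_id].popleft()))
        return sent

leaf = LeafVOQ()
leaf.enqueue("spine-A", "p1")   # spine-A is congested: no grant yet
leaf.enqueue("spine-B", "p2")
sent = leaf.schedule(grants=["spine-B"])
print(sent)  # p2 departs; it is not stuck behind p1
```

With a single shared FIFO instead, `p2` would sit behind `p1` until the congested egress drained, which is exactly the head-of-line blocking VOQs eliminate.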

VOQ systems require sufficient ingress buffer per leaf switch and adequate egress buffer to cover round‑trip latency; otherwise utilization drops.
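The buffer-sizing rule of thumb is bandwidth times round-trip time; a small helper makes the arithmetic concrete (the port speed and RTT below are hypothetical, and real sizing factors are vendor-specific):

```python
def egress_buffer_bytes(link_gbps: float, rtt_us: float) -> float:
    """Buffer needed to keep an egress port busy while the VOQ
    request/grant handshake completes: bandwidth x round-trip time
    (a common sizing rule of thumb)."""
    return link_gbps * 1e9 / 8 * rtt_us * 1e-6

# A hypothetical 800 Gbps port with a 10 us fabric round trip
# needs on the order of 1 MB of egress buffer:
print(egress_buffer_bytes(800, 10) / 1e6, "MB")
```

Undersized buffers leave the port idle during each handshake round trip, which is the utilization drop the paragraph above warns about.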

Despite added handshake latency, VOQ fabrics deliver lower tail latency and better scalability for large GPU clusters.

Summary of Main GPU Cluster Networking Technologies

NVLink provides efficient intra‑server GPU communication but is limited in scale.

InfiniBand offers native RDMA, low latency, and lossless transmission, suited for medium‑scale clusters at higher cost.

RoCE Ethernet leverages the mature Ethernet ecosystem, lower cost, and rapid bandwidth growth, making it suitable for medium‑to‑large GPU training clusters.

DDC full‑schedule networks combine cell‑based switching and VOQ to address Ethernet congestion; the approach is still in the research phase.

For more details, refer to the linked articles and resources.

High Performance Computing · LLM Training · InfiniBand · RoCE · NVLink · GPU networking
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
