
Overview of Popular GPU/TPU Cluster Networking Technologies for LLM Training

This article examines the main GPU/TPU cluster networking options—including NVLink, InfiniBand, RoCE Ethernet Fabric, and DDC full‑schedule networks—explaining their latency, lossless transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

Architects' Tech Alliance

This article introduces the popular GPU/TPU cluster networking technologies NVLink, InfiniBand, RoCE Ethernet Fabric, and DDC full‑schedule networks, focusing on how they interconnect accelerators and the roles they play in LLM training. Three network properties matter most for these workloads:

1. End‑to‑end latency: Reducing overall data transfer latency between GPUs shortens training time.

2. Lossless transmission: Essential for AI training, because a dropped packet carrying gradients can force the job to roll back to the previous checkpoint.

3. Effective congestion control: In tree topologies, transient and persistent congestion increase tail latency; a single slow link can degrade overall performance.
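To see why a single slow link dominates tail latency, here is a toy Python model (all numbers hypothetical) of a synchronous training step that must wait for the slowest of many parallel transfers:

```python
import random

def allreduce_step_time(n_links: int, base_ms: float, slow_ms: float,
                        p_slow: float) -> float:
    """A synchronous training step finishes only when the slowest
    of n parallel transfers completes (illustrative model)."""
    times = [slow_ms if random.random() < p_slow else base_ms
             for _ in range(n_links)]
    return max(times)

random.seed(0)
steps = sorted(allreduce_step_time(n_links=64, base_ms=2.0,
                                   slow_ms=10.0, p_slow=0.01)
               for _ in range(10_000))
p50 = steps[len(steps) // 2]
p99 = steps[int(len(steps) * 0.99)]
# With only a 1% chance of any single link being slow, nearly half
# of all steps are dragged to the slow-link time: congestion on one
# path dominates the tail of step latency.
print(p50, p99)
```

Even rare per-link slowdowns compound across many links, which is why effective congestion control is listed as a first-class requirement.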

Additional considerations include total system cost, power consumption, and cooling.

NVLink Switching System

NVLink switches (NVSwitch) connect the eight GPUs within a server and can also be used to build inter‑server networks. Nvidia demonstrated an NVSwitch topology linking 32 nodes (256 GPUs) at Hot Chips 2022. NVLink offers higher performance and lower overhead than traditional networks.

The third‑generation NVSwitch provides 64 NVLink ports with up to 12.8 Tbps capacity, supporting multicast and network aggregation to reduce gradient traffic.
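The traffic savings from in‑network aggregation can be illustrated with a back‑of‑the‑envelope comparison; the gradient size and cluster size below are hypothetical, and the ring all‑reduce cost is the standard 2(n−1)/n result:

```python
def ring_allreduce_tx_bytes(grad_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU transmits in a classic ring all-reduce:
    2 * (n-1)/n times the gradient size."""
    return 2.0 * grad_bytes * (n_gpus - 1) / n_gpus

def switch_aggregated_tx_bytes(grad_bytes: float) -> float:
    """With in-switch reduction plus multicast, each GPU sends its
    gradient once; the switch sums the gradients and multicasts
    the result back (simplified model)."""
    return grad_bytes

g = 2.5e9        # hypothetical gradient exchange per step, in bytes
n = 256          # hypothetical cluster size
ring = ring_allreduce_tx_bytes(g, n)
aggregated = switch_aggregated_tx_bytes(g)
print(ring / aggregated)   # approaches 2x less data on the wire
```

Halving the per-GPU gradient traffic is one reason in-switch multicast and aggregation matter at this scale.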

In GPT‑3 training, an NVSwitch fabric is roughly twice as fast as InfiniBand, but its 12.8 Tbps switching capacity is a quarter of that of high‑end 51.2 Tbps Ethernet switches, and scaling beyond 1,000 GPUs is cost‑prohibitive.

InfiniBand Network

InfiniBand (IB) has been a high‑speed alternative since 1999, offering low latency, lossless transmission, and RDMA. It is widely used in HPC, AI/ML clusters, and data centers.

IB provides credit‑based flow control for loss‑less transmission and supports congestion notification similar to ECN. All IB switches support RDMA, allowing direct GPU‑to‑GPU memory transfers without CPU involvement.
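Credit-based flow control can be sketched as a toy model: the sender transmits only while the receiver has advertised free buffer slots, so the link itself never drops a packet. This is an illustrative simplification, not the IB wire protocol:

```python
from collections import deque

class CreditLink:
    """Toy model of credit-based flow control: a packet is sent only
    when the receiver has advertised buffer space, so the link never
    drops packets; a full receiver stalls the sender instead."""
    def __init__(self, rx_buffer_slots: int):
        self.credits = rx_buffer_slots   # advertised free RX slots
        self.rx_queue = deque()

    def try_send(self, pkt) -> bool:
        if self.credits == 0:
            return False                 # sender must wait, not drop
        self.credits -= 1
        self.rx_queue.append(pkt)
        return True

    def receiver_drain(self):
        """Receiver consumes a packet and returns a credit."""
        if self.rx_queue:
            self.credits += 1
            return self.rx_queue.popleft()
        return None

link = CreditLink(rx_buffer_slots=2)
sent = [link.try_send(i) for i in range(4)]  # only 2 fit
drained = link.receiver_drain()
ok = link.try_send(4)                        # a credit came back
print(sent, ok)
```

The key property is that backpressure replaces packet loss, which is what makes retransmission-free gradient exchange possible.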

However, IB switches are more complex to configure, maintain, and scale, especially beyond 32K GPUs, and require specialized hardware, making them costlier than Ethernet.

RoCE Lossless Ethernet

Ethernet spans from 1 Gbps to 800 Gbps (future 1.6 Tbps). Compared to IB, Ethernet offers higher port speeds and total switching capacity, with lower per‑Gbps cost due to competitive ASIC integration.

High‑end Ethernet ASICs can provide up to 51.2 Tbps switching capacity with 800 Gbps ports, double the throughput of Nvidia's 25.6 Tbps Quantum‑2 IB switches.

RoCE achieves lossless transmission via Priority Flow Control (PFC), which supports eight traffic classes, and RoCEv2 carries RDMA over UDP/IP with end‑to‑end congestion control such as DCQCN.
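A heavily simplified sketch of a DCQCN-style reaction loop follows; the real algorithm uses rate targets, timers, and byte counters, and the constants below are hypothetical:

```python
def dcqcn_update(rate_gbps: float, ecn_marked: bool,
                 line_rate_gbps: float = 100.0,
                 alpha: float = 0.5, raise_step: float = 5.0) -> float:
    """Simplified DCQCN-style reaction (assumed parameters):
    cut the sending rate multiplicatively when ECN feedback
    arrives, recover additively toward line rate otherwise."""
    if ecn_marked:
        return rate_gbps * (1 - alpha / 2)     # multiplicative cut
    return min(line_rate_gbps, rate_gbps + raise_step)

rate = 100.0
trace = []
for marked in [True, True, False, False, False]:
    rate = dcqcn_update(rate, marked)
    trace.append(round(rate, 2))
print(trace)  # [75.0, 56.25, 61.25, 66.25, 71.25]
```

The point of the multiplicative-decrease/additive-increase shape is to drain congested queues quickly while ramping back up gently, keeping PFC pauses rare.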

Load balancing uses ECMP with hash‑based path selection; adaptive strategies reserve extra bandwidth or reroute traffic away from congested paths. RoCEv2 can also spray packets across links, though the resulting out‑of‑order delivery requires NICs that can reorder packets.
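Hash-based ECMP can be sketched as follows; `zlib.crc32` stands in for whatever hash function a real switch ASIC uses:

```python
import zlib

def ecmp_pick(src_ip: str, dst_ip: str, src_port: int,
              dst_port: int, proto: str, n_links: int) -> int:
    """Hash the flow 5-tuple so every packet of one flow takes the
    same uplink (preserving ordering), while different flows
    spread across the available links."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_links

# The same flow always maps to the same uplink:
a = ecmp_pick("10.0.0.1", "10.0.1.1", 4791, 4791, "udp", 8)
b = ecmp_pick("10.0.0.1", "10.0.1.1", 4791, 4791, "udp", 8)

# Many flows spread over 8 uplinks, but not necessarily evenly:
# hash collisions are exactly why adaptive routing helps.
links = {ecmp_pick(f"10.0.0.{i}", "10.0.1.1", 49152 + i, 4791,
                   "udp", 8)
         for i in range(64)}
print(a == b, len(links))
```

Per-flow hashing avoids reordering but can map several elephant flows onto the same link, which is the congestion scenario adaptive routing and packet spraying address.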

DDC Full‑Schedule Network

Recent switch/router chips support full‑schedule fabrics (AI Fabric) using Virtual Output Queues (VOQ). Packets are buffered once at the leaf switch, then scheduled to egress switches, reducing head‑of‑line blocking and incast congestion.
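The head-of-line-blocking benefit of VOQs can be shown with a toy model; the egress names and grant mechanism below are illustrative, not a real scheduler protocol:

```python
from collections import defaultdict, deque

class LeafVOQ:
    """Toy Virtual Output Queue model: the ingress leaf keeps a
    separate queue per egress switch, so a packet stuck behind a
    congested egress cannot block traffic bound elsewhere."""
    def __init__(self):
        self.voq = defaultdict(deque)   # egress id -> packet queue

    def enqueue(self, egress_id: str, pkt: str) -> None:
        self.voq[egress_id].append(pkt)

    def schedule(self, grants):
        """Send one packet toward each egress that granted credit."""
        sent = []
        for egress_id in grants:
            if self.voq[egress_id]:
                sent.append((egress_id, self.voq[egress_id].popleft()))
        return sent

leaf = LeafVOQ()
leaf.enqueue("spine-A", "p1")   # spine-A is congested: no grant yet
leaf.enqueue("spine-B", "p2")
sent = leaf.schedule(grants=["spine-B"])
print(sent)  # p2 departs; it is not stuck behind p1
```

With a single shared FIFO instead, `p2` would sit behind `p1` until the congested egress drained, which is exactly the head-of-line blocking VOQs eliminate.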

VOQ systems require sufficient ingress buffer per leaf switch and adequate egress buffer to cover round‑trip latency; otherwise utilization drops.
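The buffer-sizing rule of thumb is bandwidth times round-trip time; a small helper makes the arithmetic concrete (the port speed and RTT below are hypothetical, and real sizing factors are vendor-specific):

```python
def egress_buffer_bytes(link_gbps: float, rtt_us: float) -> float:
    """Buffer needed to keep an egress port busy while the VOQ
    request/grant handshake completes: bandwidth x round-trip time
    (a common sizing rule of thumb)."""
    return link_gbps * 1e9 / 8 * rtt_us * 1e-6

# A hypothetical 800 Gbps port with a 10 us fabric round trip
# needs on the order of 1 MB of egress buffer:
print(egress_buffer_bytes(800, 10) / 1e6, "MB")
```

Undersized buffers leave the port idle during each handshake round trip, which is the utilization drop the paragraph above warns about.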

Despite added handshake latency, VOQ fabrics deliver lower tail latency and better scalability for large GPU clusters.

Summary of Main GPU Cluster Networking Technologies

NVLink provides efficient intra‑server GPU communication but is limited in scale.

InfiniBand offers native RDMA, low latency, and lossless transmission, suited for medium‑scale clusters at higher cost.

RoCE Ethernet leverages the mature Ethernet ecosystem, lower cost, and rapid bandwidth growth, making it suitable for medium‑to‑large GPU training clusters.

DDC full‑schedule networks combine cell‑based switching and VOQ to address Ethernet congestion; the approach is still in the research phase.

For more details, refer to the linked articles and resources.

High Performance Computing · LLM Training · InfiniBand · RoCE · NVLink · GPU networking
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
