Understanding RDMA, InfiniBand, and RoCEv2 for High‑Performance Distributed Training
The article explains how distributed AI training performance depends on reducing inter‑card communication latency, introduces RDMA technology and its implementations (InfiniBand, RoCEv2, iWARP), compares their latency and scalability against traditional TCP/IP, and outlines the hardware components and trade‑offs of InfiniBand and RoCEv2 networks.
In distributed training, overall compute power does not increase linearly with the number of nodes: the speedup is sub‑linear, i.e., scaling efficiency falls below 1. The main cause is communication time between cards, so reducing inter‑card communication latency is crucial.
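The effect can be sketched with a toy cost model (all numbers are hypothetical, chosen only for illustration): a fixed per‑step communication cost caps the achievable speedup, and the efficiency loss grows with node count.

```python
def speedup(n_nodes, t_compute=100.0, t_comm=2.0):
    """Speedup over one node when every step adds a fixed communication
    cost t_comm (hypothetical units; real costs depend on topology)."""
    t_parallel = t_compute / n_nodes + t_comm
    return t_compute / t_parallel

for n in (2, 8, 64):
    s = speedup(n)
    print(f"{n:2d} nodes: speedup {s:.1f}x, scaling efficiency {s / n:.0%}")
```

With these illustrative numbers, efficiency drops from ~96% at 2 nodes to well under 50% at 64, which is why shaving communication latency matters more as clusters grow.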
RDMA (Remote Direct Memory Access) is the key technology for lowering end‑to‑end latency in multi‑node, multi‑card training, allowing a host to directly access another host’s memory by bypassing the OS kernel.
RDMA can be implemented via InfiniBand, RoCEv1, RoCEv2, or iWARP. RoCEv1 is obsolete, iWARP is rarely used; the industry mainly adopts InfiniBand and RoCEv2.
Because RDMA bypasses the kernel protocol stack, its end‑to‑end latency is an order of magnitude or more below traditional TCP/IP. In intra‑cluster one‑hop tests, TCP/IP shows ~50 µs latency, while RoCEv2 achieves ~5 µs and InfiniBand ~2 µs.
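Taking the one‑hop figures cited above at face value, the improvement factors work out directly:

```python
# One-hop latencies from the text above (approximate values)
tcp_us, rocev2_us, ib_us = 50.0, 5.0, 2.0

print(f"RoCEv2 vs TCP/IP:     {tcp_us / rocev2_us:.0f}x lower latency")
print(f"InfiniBand vs TCP/IP: {tcp_us / ib_us:.0f}x lower latency")
```

That is roughly a 10x advantage for RoCEv2 and 25x for InfiniBand on a single hop; multi‑hop paths compound the gap.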
After a compute task finishes, results must be quickly synchronized across nodes; insufficient bandwidth or high gradient‑transfer latency prolongs the waiting time and reduces the acceleration ratio.
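To see how both bandwidth and latency enter the synchronization time, a rough estimate using the standard ring all‑reduce cost model helps (the bucket size, GPU count, and link speed below are hypothetical, not from the article):

```python
def allreduce_time_s(size_bytes, n_gpus, bw_bytes_per_s, hop_latency_s):
    """Bandwidth-optimal ring all-reduce: each GPU moves 2*(N-1)/N of the
    buffer, across 2*(N-1) latency-bound steps (standard cost model)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return traffic / bw_bytes_per_s + 2 * (n_gpus - 1) * hop_latency_s

mib = 1 << 20        # one 1 MiB gradient bucket (hypothetical size)
bw = 400e9 / 8       # 400 Gbps link -> bytes per second
for name, lat in (("TCP/IP", 50e-6), ("RoCEv2", 5e-6), ("InfiniBand", 2e-6)):
    t_us = allreduce_time_s(mib, 8, bw, lat) * 1e6
    print(f"{name:10s} hop latency {lat * 1e6:.0f} us -> bucket sync {t_us:.0f} us")
```

For small gradient buckets the latency term dominates, so the per‑hop latency gap between TCP/IP and RDMA translates almost directly into longer waits between compute steps; for very large transfers, link bandwidth dominates instead.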
AI‑computing clusters require low latency, high bandwidth, stable operation, massive scale, and easy maintenance. The most common network solutions meeting these needs are InfiniBand and RoCEv2.
1. InfiniBand Network Overview
Key components include the Subnet Manager (SM), NICs, switches, and cables. NVIDIA dominates the market (>70% share). Current products feature 200 Gbps HDR and 400 Gbps NDR NICs, as well as 100 Gbps EDR (SB7800), 200 Gbps HDR (Quantum), and 400 Gbps NDR (Quantum‑2) switches.
The SM centrally computes and distributes forwarding tables, partitions, QoS, etc. InfiniBand uses dedicated cables and optical modules.
InfiniBand Network Features
(1) Lossless native networking: credit‑based flow control prevents buffer overflow and packet loss.
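The mechanism can be illustrated with a toy simulation (an informal sketch, not InfiniBand's actual link protocol): the receiver grants one credit per free buffer slot, and the sender may transmit only while it holds credits, so the buffer can never overflow and nothing is dropped.

```python
from collections import deque

class CreditLink:
    """Toy model of credit-based, link-level flow control."""
    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # initial credits = free receive slots
        self.buffer = deque()

    def send(self, pkt):
        if self.credits == 0:
            return False              # sender must wait -- never drops
        self.credits -= 1
        self.buffer.append(pkt)
        return True

    def consume(self):
        pkt = self.buffer.popleft()   # receiver drains one slot...
        self.credits += 1             # ...and returns a credit upstream
        return pkt

link = CreditLink(buffer_slots=2)
print([link.send(i) for i in range(3)])  # third send is refused, not dropped
link.consume()                           # draining the buffer frees a credit
print(link.send(3))
```

Contrast this with plain Ethernet, where a full buffer drops packets and recovery falls to higher layers; RoCEv2 approximates losslessness instead with mechanisms such as PFC and ECN.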
(2) Massive scalability: adaptive routing supports tens of thousands of GPUs; many large‑scale GPU clusters (e.g., Baidu, Microsoft) use InfiniBand.
Major vendors: NVIDIA, Intel, Cisco, HPE.
2. RoCEv2 Network Overview
RoCEv2 is a fully distributed network built on standard Ethernet, with no centralized subnet manager; it uses NICs and switches that support RoCEv2, offering broader compatibility and lower cost than InfiniBand.
Supported NIC vendors include NVIDIA, Intel, Broadcom; port speeds start at 50 Gbps and reach 400 Gbps in commercial products. Most data‑center switches (Huawei, H3C, etc.) support RDMA flow control for RoCEv2.
High‑performance RoCEv2 switches typically use Broadcom Tomahawk‑series ASICs.
RoCEv2 Network Features
RoCEv2 is more generic and cheaper, usable in both RDMA and traditional Ethernet environments, though its large‑scale throughput is slightly lower than InfiniBand.
Key vendors: Huawei, H3C, NVIDIA (ConnectX series).
3. InfiniBand vs. RoCEv2 Comparison
Technically, InfiniBand provides higher forwarding performance, faster fault recovery, better scalability, and lower operational complexity.
In practice, InfiniBand offers lower end‑to‑end latency and can support clusters with tens of thousands of GPUs, while RoCEv2 comfortably supports thousands of GPUs with slightly higher latency and cost.
Business considerations (performance, scale, operation, cost, vendor ecosystem) are summarized in a comparison table.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.