Baidu HPN Network: Solving Hash Collision for 95% Physical Network Bandwidth Efficiency in Large Model Training
Baidu's HPN network solves hash‑collision bottlenecks in large‑model training by combining TOR‑affinity scheduling with Dynamic Load Balancing on self‑developed switches, boosting physical network bandwidth efficiency to about 95%, improving throughput by roughly 10% and adding a further 1.5% training‑speed gain via the BCCL library.
GPU communication performance is crucial for large model training. In HPN (High Performance Network) engineering practice, the core focus is on maximizing network hardware resources to enhance end-to-end training performance.
1. HPN Network — AIPod Architecture
Baidu's high-performance network HPN — AIPod uses an 8-rail network architecture. Taking an A800 GPU server as an example, each server is equipped with 8 network cards, and each card connects to a different TOR within an aggregation group of 8 TORs. The TOR and LEAF layers are interconnected in a full mesh; in the three-layer RDMA network, the LEAF and SPINE layers are full-mesh interconnected as well.
In large model training scenarios, under common parallel strategies most cross-machine communication traffic is same-rail traffic, i.e. between GPUs with the same index on different servers. Same-rail communication traverses a single TOR hop at best and the LEAF layer at worst; only cross-rail traffic may need to pass through SPINE.
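The rail alignment above can be sketched as a tiny helper. This is an illustrative model, not Baidu's actual topology management code; the naming scheme (`groupN-torM`) is a hypothetical convention.

```python
RAILS = 8  # 8-rail architecture: 8 NICs per server, 8 TORs per aggregation group

def tor_for_nic(server_group: int, nic_index: int) -> str:
    """NIC i on every server in a group attaches to TOR i of that group,
    so same-rail traffic between two servers needs only one TOR hop."""
    assert 0 <= nic_index < RAILS
    return f"group{server_group}-tor{nic_index}"

# GPU/NIC 3 on two different servers in the same group lands on the same TOR:
print(tor_for_nic(0, 3))  # group0-tor3
```

Because each rank's NIC index matches its GPU index, same-rail collective traffic stays under one TOR, which is why the architecture pushes cross-rail traffic to be the exception.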
2. Hash Collision Problem
Hash collision is a typical problem in HPN networks, mainly because the LEAF and SPINE layers use ECMP (Equal-Cost Multi-Path) to forward packets. RDMA traffic bypasses both the kernel and the CPU, so a single flow can instantly saturate a physical link; when two such full-rate flows hash onto the same link, a hash collision occurs.
Upward Collision: When Machine A initiates a collective communication operation, it sends messages to TOR 1 at the full 100 Gbps line rate. When TOR 1 forwards the traffic to the LEAF layer, it selects LEAF 1 or LEAF 2 according to its hash of the flow's header fields. If Machine B also communicates with other machines through TOR 1, the two flows may hash onto the same uplink, leaving each with only 50 Gbps of bandwidth.
Downward Collision: When Machine A sends data to Machine C while Machine E sends data to Machine D, and both flows pass through the same LEAF and hash onto the same downlink, a downward collision occurs and network throughput is halved.
In large model training, collective communication is synchronous: if one GPU slows down, the other GPUs in its communication group wait for it, which amplifies the impact of a hash collision across the whole job.
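The upward-collision scenario can be reproduced with a toy model of ECMP. This is a sketch, not a real switch's hash function: it hashes each flow's 5-tuple to one of two fixed uplinks, so two full-rate flows that land on the same uplink oversubscribe it 2x and each effectively gets half its bandwidth.

```python
import hashlib

UPLINKS = 2       # TOR-to-LEAF uplinks in the toy topology
LINK_GBPS = 100   # per-link capacity, as in the example above

def ecmp_pick(five_tuple: tuple) -> int:
    """Toy ECMP: deterministically hash a flow's 5-tuple to one fixed uplink."""
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return digest[0] % UPLINKS

# Two 100 Gbps flows from machines A and B behind the same TOR
# (addresses and ports are made up for illustration).
flow_a = ("10.0.0.1", "10.0.1.1", 50001, 4791, "UDP")
flow_b = ("10.0.0.2", "10.0.1.2", 50002, 4791, "UDP")

load = [0] * UPLINKS
for flow in (flow_a, flow_b):
    load[ecmp_pick(flow)] += LINK_GBPS

# If both flows hash to the same uplink, that link is offered 200 Gbps
# against 100 Gbps of capacity, so each flow degrades to ~50 Gbps.
for i, gbps in enumerate(load):
    print(f"uplink {i}: offered {gbps} Gbps")
```

The key property is that the hash is a pure function of the 5-tuple: however congested a link becomes, the same flow keeps landing on it, which is exactly what the solutions below attack.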
3. Solution 1 — Increasing RDMA Flow Count
Increasing the number of RDMA QPs (Queue Pairs), for example to 16 or 64 flows, reduces the probability that traffic concentrates on a single link. However, more QPs bring extra overhead: without hash collisions, multi-QP performance is actually worse than a single QP, yet once collisions are present, multiple flows become necessary.
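Why more flows help can be seen with a small Monte Carlo experiment. This is an illustrative simulation under the simplifying assumption that each flow carries equal traffic and is hashed independently and uniformly onto one of two uplinks; it is not a measurement of real RDMA behavior.

```python
import random

def worst_link_share(num_flows: int, num_links: int = 2, trials: int = 10_000) -> float:
    """Average fraction of total traffic landing on the busiest link when each
    flow is hashed independently and uniformly onto one of num_links links."""
    total = 0.0
    for _ in range(trials):
        load = [0] * num_links
        for _ in range(num_flows):
            load[random.randrange(num_links)] += 1
        total += max(load) / num_flows
    return total / trials

random.seed(0)
for flows in (1, 2, 16, 64):
    share = worst_link_share(flows)
    print(f"{flows:3d} flows -> busiest link carries ~{share:.0%} of the traffic")
```

With a single flow the busiest link always carries 100% of the traffic; with two flows it averages about 75% (half the time both flows collide); with 64 flows the load statistically smooths toward the ideal 50% split, at the cost of the per-QP overhead noted above.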
4. Solution 2 — TOR Affinity Scheduling
This solution reduces the amount of traffic that must be forwarded up to the LEAF layer. By arranging communication groups so that adjacent Trainers sit under the same TOR, roughly half of the traffic can be forwarded directly within the TOR. This requires affinity at two levels: job submission should place machines under the same TOR, and at task startup adjacent Trainers should be assigned to machines sharing a TOR.
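A minimal sketch of the startup-time half of this idea, assuming ring-style collectives where adjacent ranks exchange the most traffic: group machines by TOR, then assign ranks TOR by TOR so that neighbors in rank order share a switch. The function name and inputs are hypothetical, not Baidu's scheduler API.

```python
from collections import defaultdict

def tor_affinity_order(machine_tor: dict) -> list:
    """Order trainer machines so that adjacent ranks share a TOR when possible.

    machine_tor maps machine name -> TOR name (illustrative inputs).
    """
    by_tor = defaultdict(list)
    for machine, tor in machine_tor.items():
        by_tor[tor].append(machine)
    # Concatenating the groups keeps rank-order neighbours under one TOR,
    # so their traffic takes a single TOR hop instead of going via LEAF.
    order = []
    for tor in sorted(by_tor):
        order.extend(sorted(by_tor[tor]))
    return order

machines = {"m0": "tor1", "m1": "tor2", "m2": "tor1", "m3": "tor2"}
print(tor_affinity_order(machines))  # ['m0', 'm2', 'm1', 'm3']
```

In the example, the naive rank order m0, m1, m2, m3 would send every adjacent-rank exchange through the LEAF layer, while the affinity order keeps two of the three adjacent pairs inside a TOR.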
5. Solution 3 — DLB Dynamic Load Balancing
DLB (Dynamic Load Balancing) is the most thorough solution. The root cause of hash collision is that a flow with a fixed 5-tuple is always hashed to the same link. DLB instead allows packets of the same flow to be forwarded over different physical links according to real-time link load.
DLB is implemented based on InfiniBand's AR (Adaptive Routing) extension. When the NIC sends packets, it marks them with special AR bits. When TOR recognizes this mark, it forwards based on actual link load, sending packets to relatively idle physical links. Since different packets from the same flow take different paths, out-of-order delivery occurs and needs reassembly at the receiver.
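The mechanism can be illustrated with a toy model (this is not Baidu's switch firmware or the actual AR protocol): the sender stamps each packet with a sequence number, the "switch" forwards every packet to the currently least-loaded link, and the receiver restores order by sequence number before delivery.

```python
import heapq

def spray(packets, link_load):
    """Per-packet load balancing: forward each packet on the idlest link.
    packets is a list of (seq, payload); link_load tracks bytes per link."""
    routed = []
    for seq, payload in packets:
        link = link_load.index(min(link_load))  # pick the least-loaded link
        link_load[link] += len(payload)
        routed.append((link, seq, payload))
    return routed

def reassemble(received):
    """Receiver side: a heap keyed by sequence number restores packet order."""
    heap = list(received)
    heapq.heapify(heap)
    return [payload for _, payload in (heapq.heappop(heap) for _ in range(len(heap)))]

pkts = [(i, f"chunk{i}") for i in range(4)]
routed = spray(pkts, link_load=[0, 0])

# Packets of one flow now traverse different links, so they may arrive
# out of order; simulate a reversed arrival and reassemble by sequence.
arrived = sorted(((seq, p) for _, seq, p in routed), key=lambda x: -x[0])
print(reassemble(arrived))  # ['chunk0', 'chunk1', 'chunk2', 'chunk3']
```

The trade-off shown here is the one the article names: load-aware per-packet forwarding eliminates fixed-link collisions, but the receiver must buffer and reorder, which is why DLB needs cooperation between the NIC (AR marking) and the switch.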
DLB is implemented on Baidu's self-developed switches, achieving balanced traffic distribution across physical links and fundamentally solving hash collision. In Baidu's internal large model training scenarios, network bandwidth efficiency improves by about 10%.
Using "TOR Affinity Scheduling" combined with "DLB Dynamic Load Balancing" in Baidu Cloud's HPN cluster AIPod can completely solve the physical network cluster hash collision problem, achieving 95% physical network bandwidth efficiency.
Additionally, Baidu's self-developed collective communication library BCCL, combined with framework-level optimization for specific business traffic characteristics (such as local 2-vs-1 issues), provides deep communication-level optimization, improving end-to-end large model training performance by 1.5%.
Baidu Geek Talk