
How to Eliminate Network Hash Collisions in Large‑Model Training

This article examines how GPU communication bottlenecks affect large‑model training, analyzes hash collisions in high‑performance networks, and presents three practical solutions (increasing the number of RDMA streams, affinity‑aware scheduling, and dynamic load balancing) that raise effective network bandwidth to about 95%.

Baidu Intelligent Cloud Tech Hub

Jul 3, 2024

1 HPN Network – AIPod

The Baidu Baige high‑performance network (HPN) AIPod uses an 8‑rail architecture: each A800 GPU server connects its eight NICs to eight different TOR switches, with full‑mesh interconnection at both the TOR and leaf layers.

In large‑model training, most cross‑machine traffic flows between GPUs that have the same card number on different servers. These same‑card flows ideally cross only one TOR hop, while flows between different card numbers may pass through leaf or spine switches.

2 Hash Collision

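Hash collisions arise for two reasons. First, leaf and spine switches forward packets using ECMP, which picks one of several equal‑cost links by hashing each flow, so separate flows can land on the same link. Second, RDMA bypasses the kernel, so a single flow can quickly saturate the physical bandwidth, and any shared link becomes a bottleneck. Collisions occur in both the upstream and downstream directions, causing uneven link utilization and reducing effective bandwidth to around 70 Gbps on a 100 Gbps link.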

Upstream collision: Multiple machines send traffic through the same TOR, which selects an uplink to a leaf switch by hash rather than by load. When two flows land on the same uplink, each gets only about half of the expected bandwidth.

Downstream collision: Traffic from different machines to different destinations may share the same leaf‑to‑spine path, again halving the bandwidth available to each flow.

These collisions amplify the synchronization slowdown inherent in collective communication for large‑model training.
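The collision mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the hash used by any real switch: `ecmp_link` is a hypothetical stand‑in that hashes a flow's 5‑tuple with SHA‑256, the IP addresses and source ports are made up, and `collision_probability` assumes the hash behaves like a uniform random choice.

```python
import hashlib


def ecmp_link(flow, n_links):
    """Map a flow's 5-tuple to a link index.

    Hypothetical stand-in for a switch's ECMP hash: the same flow
    always maps to the same link, regardless of how busy that link is.
    """
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def collision_probability(n_flows, n_links):
    """Probability that at least two flows share a link,
    assuming each flow is hashed uniformly and independently."""
    p_no_collision = 1.0
    for i in range(n_flows):
        p_no_collision *= max(n_links - i, 0) / n_links
    return 1.0 - p_no_collision


if __name__ == "__main__":
    # Eight made-up RoCEv2 flows (UDP destination port 4791) from eight senders.
    flows = [(f"10.0.0.{i}", "10.0.1.1", 49152 + i, 4791, "UDP") for i in range(8)]
    for flow in flows:
        print(flow[0], "-> uplink", ecmp_link(flow, 8))
    print(f"P(collision), 8 flows on 8 links: {collision_probability(8, 8):.4f}")
```

Under the uniform‑hash assumption, eight flows on eight links avoid every collision with probability 8!/8^8, so at least one collision occurs about 99.8% of the time. This is why a small number of heavy flows rarely spreads evenly by itself.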

3 Solutions

Solution 1 – Increase RDMA Streams

Raising the number of RDMA queue pairs (QPs), e.g., from 2 to 16 or 64, spreads traffic across more hash buckets, so load evens out across links and the impact of collisions drops sharply.

Monitoring after enabling multiple QPs shows a more uniform distribution of traffic across links.
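The effect of adding QPs can be explored with a small simulation. The sketch below rests on several assumptions that are not in the original: each QP uses a different UDP source port and is therefore hashed as a separate flow, every sender splits one unit of traffic evenly across its QPs, and `ecmp_link` is a hypothetical SHA‑256 based stand‑in for the switch hash.

```python
import hashlib


def ecmp_link(flow, n_links):
    """Hypothetical stand-in for a switch's ECMP hash over a 5-tuple."""
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def link_loads(n_senders, qps_per_sender, n_links):
    """Per-link load when each sender splits 1.0 unit of traffic across its QPs."""
    loads = [0.0] * n_links
    for sender in range(n_senders):
        for qp in range(qps_per_sender):
            # Assumption: each QP gets its own source port, so it hashes separately.
            flow = (f"10.0.0.{sender}", "10.0.1.1", 49152 + qp, 4791, "UDP")
            loads[ecmp_link(flow, n_links)] += 1.0 / qps_per_sender
    return loads


def imbalance(loads):
    """Busiest link's load divided by the mean load (1.0 means perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    for qps in (2, 16, 64):
        loads = link_loads(n_senders=8, qps_per_sender=qps, n_links=8)
        print(f"{qps:>2} QPs per sender: imbalance = {imbalance(loads):.2f}")
```

An imbalance close to 1.0 means the busiest link carries roughly its fair share. With more QPs, each hash bucket holds many small slices of traffic instead of a few large ones, so one unlucky placement costs far less bandwidth.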

Solution 2 – Affinity Scheduling

Arrange trainers so that pairs of machines that frequently communicate reside under the same TOR, reducing the amount of traffic that must traverse leaf switches and thus cutting collision chances.

Schedule tasks on the same TOR when submitting jobs.

Place adjacent trainers on the same TOR during task launch.
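The placement idea above can be sketched as a rank‑ordering step. This is an illustration only: the host names, the TOR labels, and the assumption that trainer i mainly communicates with trainer i+1 in a ring are all hypothetical, and a real scheduler would take the topology from its own inventory.

```python
from collections import defaultdict


def order_hosts_by_tor(host_to_tor):
    """Return hosts ordered so that hosts under the same TOR get adjacent ranks."""
    groups = defaultdict(list)
    for host, tor in host_to_tor.items():
        groups[tor].append(host)
    ordered = []
    # Largest groups first; ties broken by TOR name to keep the result deterministic.
    for tor in sorted(groups, key=lambda t: (-len(groups[t]), t)):
        ordered.extend(sorted(groups[tor]))
    return ordered


def cross_tor_neighbors(ordered_hosts, host_to_tor):
    """Count neighboring pairs in a ring whose hosts sit under different TORs."""
    n = len(ordered_hosts)
    return sum(
        host_to_tor[ordered_hosts[i]] != host_to_tor[ordered_hosts[(i + 1) % n]]
        for i in range(n)
    )


if __name__ == "__main__":
    # Hypothetical inventory: two hosts under each of two TORs.
    host_to_tor = {"h0": "torA", "h1": "torB", "h2": "torA", "h3": "torB"}
    naive = ["h0", "h1", "h2", "h3"]
    placed = order_hosts_by_tor(host_to_tor)
    print("naive :", naive, cross_tor_neighbors(naive, host_to_tor))    # 4 cross-TOR pairs
    print("placed:", placed, cross_tor_neighbors(placed, host_to_tor))  # 2 cross-TOR pairs
```

In this toy case the TOR‑aware order halves the number of neighbor pairs whose traffic must leave the TOR, which is the traffic exposed to ECMP collisions at the leaf layer.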

Solution 3 – Dynamic Load Balancing (DLB)

DLB leverages per‑packet load‑aware forwarding: NICs mark packets with a special AR bit, TORs detect the mark and forward packets over the least‑loaded physical link, balancing traffic in real time.
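The forwarding decision can be sketched as follows. This is a simplified model, not switch firmware: load is approximated by cumulative bytes sent on each link (a real switch would use live measurements that decay as queues drain), `flow_id % n_links` is a hypothetical stand‑in for the static ECMP hash, and the tuple layout `(flow_id, size, ar_marked)` is invented for the example.

```python
def forward(packets, n_links):
    """Choose an output link for each packet.

    packets: iterable of (flow_id, size_bytes, ar_marked) tuples (hypothetical layout).
    AR-marked packets go to the currently least-loaded link; unmarked packets
    fall back to a static per-flow choice.
    Returns (per-link byte counts, list of chosen links).
    """
    load = [0] * n_links
    chosen = []
    for flow_id, size, ar_marked in packets:
        if ar_marked:
            # Per-packet, load-aware choice; ties go to the lowest link index.
            link = min(range(n_links), key=load.__getitem__)
        else:
            # Stand-in for a static ECMP hash: same flow, same link.
            link = flow_id % n_links
        load[link] += size
        chosen.append(link)
    return load, chosen


if __name__ == "__main__":
    marked = [(7, 1500, True)] * 8
    unmarked = [(7, 1500, False)] * 8
    print(forward(marked, 4)[0])    # [3000, 3000, 3000, 3000]
    print(forward(unmarked, 4)[0])  # [0, 0, 0, 12000]
```

The same eight packets of one flow pile onto a single link under static hashing but spread evenly when marked for load‑aware forwarding. The price is that consecutive packets of a flow now take different paths.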

Although DLB may introduce packet reordering that requires reassembly, it can improve network bandwidth efficiency by roughly 10 % in Baidu’s internal large‑model training clusters.
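The reordering cost mentioned above can be illustrated with a small reorder buffer. This is a generic sketch that assumes every packet carries a sequence number starting at 0. The article does not describe how or where reassembly is actually performed, so the function name and data layout are hypothetical.

```python
def deliver_in_order(arrivals):
    """Release payloads in sequence order from out-of-order arrivals.

    arrivals: iterable of (seq, payload) pairs, with seq starting at 0.
    Packets that arrive early are held until the gap before them is filled.
    """
    pending = {}
    next_seq = 0
    delivered = []
    for seq, payload in arrivals:
        pending[seq] = payload
        # Drain every packet that is now contiguous with what was delivered.
        while next_seq in pending:
            delivered.append(pending.pop(next_seq))
            next_seq += 1
    return delivered


if __name__ == "__main__":
    arrivals = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
    print(deliver_in_order(arrivals))  # ['a', 'b', 'c', 'd']
```

The buffer only grows while a gap is outstanding, so its size depends on how far apart the delays of the parallel paths are.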

Combining affinity scheduling with DLB raises effective network bandwidth to about 95 % and, together with Baidu’s custom collective communication library BCCL, yields an additional 1.5 % end‑to‑end performance gain for large‑model training.
