How to Eliminate Network Hash Collisions in Large‑Model Training
This article examines how GPU communication bottlenecks affect large‑model training, analyzes hash‑collision issues in high‑performance networks, and presents three practical solutions (increasing RDMA streams, affinity‑aware scheduling, and dynamic load balancing) that raise effective network bandwidth to about 95%.
1 HPN Network – AIPod
The Baidu Baige high‑performance network (HPN) AIPod uses an 8‑rail architecture: each A800 GPU server connects its eight NICs to eight TOR switches, with full‑mesh connectivity at both the TOR and leaf layers.
In large‑model training, most cross‑machine traffic flows between GPUs that occupy the same card position on different servers. These same‑card flows ideally traverse only one TOR hop, while flows between different cards may pass through leaf or spine switches.
2 Hash Collision
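Hash collisions arise because leaf and spine switches forward packets using ECMP, which chooses among equal‑cost links by hashing each flow, while RDMA traffic bypasses the kernel and can quickly saturate physical bandwidth. When several such flows hash onto the same link, that link is overloaded while others sit idle. Collisions occur in both the upstream and downstream directions, causing uneven link utilization and reducing effective bandwidth to around 70 Gbps on a 100 Gbps link.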
Upstream collision: Multiple machines send traffic to the same TOR, which randomly selects leaf switches, leading to some links carrying only half the expected bandwidth.
Downstream collision: Traffic from different machines to different destinations may share the same leaf‑to‑spine path, again halving available bandwidth.
These collisions amplify the synchronization slowdown inherent in collective communication for large‑model training.
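To get a feel for how likely such collisions are, the short sketch below computes the probability that at least two flows land on the same link. It assumes each flow is hashed uniformly and independently onto one of several equal‑cost links, which is a simplification of real ECMP hashing; the function name and the example numbers are illustrative and not taken from the AIPod deployment.

```python
from math import perm


def collision_probability(num_flows: int, num_links: int) -> float:
    """Probability that at least two flows share a link.

    Assumes every flow is hashed uniformly and independently onto one of
    num_links equal-cost links (an idealized model of ECMP).
    """
    if num_flows > num_links:
        # More flows than links: a collision is unavoidable.
        return 1.0
    # perm(k, n) counts the collision-free placements of n flows on k links.
    no_collision = perm(num_links, num_flows) / num_links ** num_flows
    return 1.0 - no_collision


if __name__ == "__main__":
    # Hypothetical example: 8 equal-cost uplinks, varying number of flows.
    for flows in (2, 4, 8):
        p = collision_probability(flows, 8)
        print(f"{flows} flows over 8 links: P(collision) = {p:.3f}")
```

Even with as many links as flows, a collision‑free placement is rare under this model, which is consistent with the uneven utilization described above.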
3 Solutions
Solution 1 – Increase RDMA Streams
By raising the number of RDMA QPs (e.g., from 2 to 16 or 64), traffic is spread across more hash buckets, dramatically lowering collision probability.
Monitoring after enabling multiple QPs shows a more uniform distribution of traffic across links.
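The effect of adding QPs can be illustrated with a small Monte Carlo sketch. It assumes every QP carries an equal share of its flow's traffic and is hashed uniformly and independently onto a link; real hashing depends on the switch and on how the NIC assigns header fields per QP, so the numbers are only indicative. The function name, parameters, and link count are hypothetical.

```python
import random


def busiest_link_ratio(num_flows, qps_per_flow, num_links, trials=500, seed=0):
    """Average ratio of the busiest link's load to the ideal even share.

    A ratio of 1.0 means perfectly balanced links; larger values mean the
    busiest link carries that many times its fair share.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    ideal = num_flows / num_links
    total = 0.0
    for _ in range(trials):
        load = [0.0] * num_links
        for _ in range(num_flows):
            share = 1.0 / qps_per_flow  # each QP carries an equal slice
            for _ in range(qps_per_flow):
                # Assumption: each QP is hashed independently and uniformly.
                load[rng.randrange(num_links)] += share
        total += max(load) / ideal
    return total / trials


if __name__ == "__main__":
    # Hypothetical setup: 8 flows sharing 8 equal-cost links.
    for qps in (2, 16, 64):
        ratio = busiest_link_ratio(8, qps, 8)
        print(f"{qps:>2} QPs per flow: busiest link carries {ratio:.2f}x its fair share")
```

Under these assumptions the busiest link moves closer to its fair share as the QP count grows, although it never becomes perfectly balanced.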
Solution 2 – Affinity Scheduling
Arrange trainers so that pairs of machines that frequently communicate reside under the same TOR, reducing the amount of traffic that must traverse leaf switches and thus cutting collision chances.
Schedule tasks on the same TOR when submitting jobs.
Place adjacent trainers on the same TOR during task launch.
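A minimal sketch of the placement idea follows. It assumes the scheduler knows which TOR each host sits under and that adjacent ranks communicate in a ring; the host names, the TOR labels, and both helper functions are hypothetical and do not represent the Baige scheduler's actual interface.

```python
from collections import defaultdict


def order_by_tor(host_to_tor):
    """Return hosts ordered so that hosts under the same TOR are adjacent."""
    groups = defaultdict(list)
    for host, tor in host_to_tor.items():
        groups[tor].append(host)
    ordered = []
    for tor in sorted(groups):
        ordered.extend(sorted(groups[tor]))
    return ordered


def cross_tor_neighbors(order, host_to_tor):
    """Count adjacent pairs in a ring of ranks that sit under different TORs."""
    n = len(order)
    return sum(
        host_to_tor[order[i]] != host_to_tor[order[(i + 1) % n]]
        for i in range(n)
    )


if __name__ == "__main__":
    # Hypothetical cluster: four hosts spread over two TORs.
    host_to_tor = {"h0": "torA", "h1": "torB", "h2": "torA", "h3": "torB"}
    naive = ["h0", "h1", "h2", "h3"]
    tuned = order_by_tor(host_to_tor)
    print("naive order:", naive, "cross-TOR pairs:", cross_tor_neighbors(naive, host_to_tor))
    print("tuned order:", tuned, "cross-TOR pairs:", cross_tor_neighbors(tuned, host_to_tor))
```

In this toy ring, grouping hosts by TOR halves the number of neighbor pairs that must cross the leaf layer, which is the traffic exposed to hash collisions.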
Solution 3 – Dynamic Load Balancing (DLB)
DLB relies on per‑packet, load‑aware forwarding. NICs mark packets with a special AR bit; TORs detect the mark and forward each marked packet over the least‑loaded physical link, balancing traffic in real time.
Although DLB may introduce packet reordering that requires reassembly, it can improve network bandwidth efficiency by roughly 10 % in Baidu’s internal large‑model training clusters.
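The difference between per‑flow hashing and per‑packet, load‑aware forwarding can be seen in a toy model. The sketch below is not how switch hardware implements DLB: the modulo operation stands in for a real ECMP hash, link load is tracked as a simple byte counter, and all names and packet sizes are hypothetical.

```python
def forward_by_flow_hash(packets, num_links):
    """Per-flow forwarding: every packet of a flow uses the same link.

    packets is a list of (flow_id, size) tuples. The modulo is a stand-in
    for a real ECMP hash over header fields.
    """
    load = [0] * num_links
    for flow_id, size in packets:
        load[flow_id % num_links] += size
    return load


def forward_least_loaded(packets, num_links):
    """Per-packet forwarding: each packet goes to the currently lightest link."""
    load = [0] * num_links
    for _flow_id, size in packets:
        link = min(range(num_links), key=load.__getitem__)
        load[link] += size
    return load


if __name__ == "__main__":
    # Two hypothetical flows whose IDs collide under the toy hash (0 % 4 == 4 % 4).
    packets = [(0, 100), (4, 100)] * 10
    print("flow hash    :", forward_by_flow_hash(packets, 4))
    print("least loaded :", forward_least_loaded(packets, 4))
```

Because consecutive packets of one flow take different links in the second function, they can arrive out of order, which is the reordering cost mentioned above.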
Combining affinity scheduling with DLB raises effective network bandwidth to about 95 % and, together with Baidu’s custom collective communication library BCCL, yields an additional 1.5 % end‑to‑end performance gain for large‑model training.
Baidu Intelligent Cloud Tech Hub
