Tag

Topology Awareness


Bilibili Tech
May 24, 2024 · Cloud Computing

Understanding and Optimizing NCCL Collective Communication Libraries for Large‑Scale Model Training

The article explains how NCCL, a collective communication library, enables efficient large‑scale model training. It covers how NCCL parses the GPU‑to‑NIC topology and forms flat‑ring and tree topologies, describes improvements to logging and bandwidth metrics, details the Ring AllReduce primitives, and proposes solutions for missing topology, metric, and mapping information to guide future optimization.

Collective Communication · GPU · NCCL
23 min read