Tag: NCCL


Bilibili Tech
May 24, 2024 · Cloud Computing

Understanding and Optimizing NCCL Collective Communication Libraries for Large‑Scale Model Training

The article explains how NCCL's collective communication library enables efficient large‑scale model training: it parses GPU‑to‑NIC topology, builds flat‑ring and tree communication topologies, improves logging and bandwidth metrics, details the Ring AllReduce primitives, and proposes fixes for missing topology, metric, and mapping information as directions for future optimization.
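The Ring AllReduce primitive the article details is a reduce-scatter phase followed by an all-gather phase around a ring of ranks. A minimal pure-Python simulation of that pattern (an illustrative sketch with a hypothetical `ring_allreduce` helper, not NCCL's actual GPU implementation):

```python
# Minimal pure-Python simulation of Ring AllReduce: a reduce-scatter
# phase followed by an all-gather phase around a ring of n ranks.
# Illustrative sketch only -- NCCL's real implementation runs on GPUs.

def ring_allreduce(buffers):
    """Sum-allreduce equal-length buffers across len(buffers) ranks."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer must split evenly into n chunks"
    c = size // n
    data = [list(b) for b in buffers]  # each rank's working copy

    def sl(idx):
        idx %= n
        return slice(idx * c, idx * c + c)

    # Phase 1: reduce-scatter. Each step, every rank sends one chunk to
    # its right neighbor, which accumulates it; after n-1 steps rank r
    # owns the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        sent = [data[r][sl(r - step)] for r in range(n)]  # simultaneous sends
        for r in range(n):
            src = (r - 1) % n
            s = sl(src - step)
            for i, v in zip(range(s.start, s.stop), sent[src]):
                data[r][i] += v

    # Phase 2: all-gather. Same ring traffic pattern, but the receiver
    # overwrites its chunk instead of accumulating.
    for step in range(n - 1):
        sent = [data[r][sl(r + 1 - step)] for r in range(n)]
        for r in range(n):
            src = (r - 1) % n
            data[r][sl(src + 1 - step)] = sent[src]

    return data  # every rank now holds the elementwise sum
```

Each rank sends (n−1)/n of the buffer per phase, 2(n−1)/n in total, which is why ring all-reduce is near bandwidth-optimal for large messages.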

Collective Communication · GPU · NCCL
23 min read
DataFunSummit
Apr 7, 2022 · Artificial Intelligence

Optimizing Distributed Machine Learning Training on Google Cloud Vertex AI: Fast Socket and Reduction Server

This article explains how Google Cloud Vertex AI improves large‑scale distributed machine learning training performance by addressing the memory‑wall challenge with Fast Socket network stack enhancements for NCCL and a Reduction Server that accelerates gradient aggregation, delivering higher throughput and lower TCO for AI workloads.

Fast Socket · GPU · NCCL
19 min read
DataFunTalk
Mar 17, 2022 · Artificial Intelligence

Optimizing Distributed Machine Learning Training on Google Vertex AI: Fast Socket and Reduction Server

This article explains how Google Vertex AI tackles the memory‑wall challenge of large‑scale distributed training by introducing Fast Socket, a high‑performance NCCL network stack, and a Reduction Server that halves gradient‑aggregation traffic, delivering significant speed‑ups and cost reductions for AI workloads.
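The "halves gradient-aggregation traffic" claim follows from per-worker traffic arithmetic. A back-of-envelope sketch with hypothetical helper names (illustrative formulas, not measurements from the article):

```python
# Per-worker bytes on the wire for a gradient of M bytes across n
# workers. Illustrative arithmetic behind the "halves traffic" claim,
# not measurements from the article.

def ring_allreduce_traffic(n_workers, grad_bytes):
    # Ring all-reduce: each worker sends (and receives) about
    # 2 * (n - 1) / n * M bytes, approaching 2M per direction.
    return 2 * (n_workers - 1) / n_workers * grad_bytes

def reduction_server_traffic(grad_bytes):
    # Reduction server: each worker sends its gradient once (M up)
    # and receives the reduced result once (M down) -- M per
    # direction, roughly half the ring figure for large n.
    return grad_bytes
```

For example, with 100 workers and a 1 GB gradient, ring all-reduce moves about 1.98 GB per direction per worker versus 1 GB with the reduction server.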

AI performance · Fast Socket · NCCL
19 min read
Tencent Cloud Developer
May 22, 2020 · Artificial Intelligence

Distributed Training for WeChat Scan-to-Identify Using Horovod, MPI, and NCCL

WeChat’s Scan‑to‑Identify system now trains its CNN models across multiple GPUs using Horovod’s data‑parallel, synchronous Ring All‑Reduce architecture built on MPI and NCCL, cutting training time from several days to under one day while maintaining accuracy, and future work will target I/O and further scaling.
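The synchronous data-parallel pattern described above — each replica computes local gradients, gradients are averaged with an all-reduce, and every replica applies the same update — can be sketched without Horovod itself. A pure-Python stand-in with hypothetical function names:

```python
# Pure-Python stand-in for synchronous data-parallel SGD: the pattern
# Horovod implements on top of MPI/NCCL. Hypothetical helpers for
# illustration, not Horovod's API.

def allreduce_mean(grads_per_rank):
    """Average each parameter's gradient across all ranks."""
    n = len(grads_per_rank)
    return [sum(g[i] for g in grads_per_rank) / n
            for i in range(len(grads_per_rank[0]))]

def sync_sgd_step(weights, grads_per_rank, lr=0.1):
    """One synchronous step: every rank ends up with identical weights."""
    avg = allreduce_mean(grads_per_rank)
    return [w - lr * g for w, g in zip(weights, avg)]
```

In Horovod, this averaging is performed by wrapping the optimizer with `hvd.DistributedOptimizer`, which issues the all-reduce (over NCCL or MPI) before each weight update.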

AI · Horovod · MPI
12 min read