Tag: MPI


Architects' Tech Alliance
Apr 17, 2023 · Fundamentals

Overview of High‑Performance Computing (HPC): Architecture, Metrics, Cluster Management, Job Scheduling, and Parallel Programming Models

This article provides a comprehensive overview of high‑performance computing, covering system architectures, hardware components, performance metrics, network topologies, common parallel file systems, cluster management functions, mainstream job‑scheduling systems, and MPI‑based parallel programming models.

HPC · High Performance Computing · Job Scheduling
14 min read
Architects' Tech Alliance
May 3, 2022 · Fundamentals

High‑Performance Computing Overview and Resource Guide

This article provides a comprehensive overview of high‑performance computing (HPC), covering its definition, hardware architectures, performance metrics, cluster components, parallel file systems, management and scheduling tools, as well as common MPI implementations and links to further technical resources.

FLOPS · File Systems · HPC
11 min read
DataFunSummit
Nov 29, 2021 · Artificial Intelligence

Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention

This article reviews Horovod, a popular third‑party distributed deep‑learning training plugin, explaining its simple three‑line integration, the challenges of deadlocks in all‑reduce operations, and the architectural components—including background threads, coordinators, and MPI/Gloo controllers—that enable scalable and efficient data‑parallel training.

Data Parallel · Deep Learning · Gloo
8 min read
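The coordinator design mentioned in the summary above can be sketched in plain Python (a simulation of the scheduling idea, not Horovod's actual implementation; `coordinator_schedule` and the tensor names are illustrative): an all-reduce for a gradient is issued only once every worker has requested it, so workers that enqueue gradients in different orders cannot deadlock each other.

```python
from collections import defaultdict

def coordinator_schedule(request_streams):
    """Simulate a Horovod-style coordinator.

    request_streams[r] is the ordered list of tensor names worker r
    asks to all-reduce. An all-reduce is issued only when all workers
    have requested that tensor, so mismatched request orders stall
    temporarily instead of deadlocking.
    """
    num_workers = len(request_streams)
    streams = [list(s) for s in request_streams]
    pending = defaultdict(set)   # tensor name -> ranks that requested it
    issued = []                  # order in which all-reduces actually run
    progress = True
    while progress:
        progress = False
        for rank, stream in enumerate(streams):
            if stream:
                tensor = stream.pop(0)
                pending[tensor].add(rank)
                progress = True
                if len(pending[tensor]) == num_workers:
                    issued.append(tensor)
                    del pending[tensor]
    return issued

# Worker 0 requests grad_a then grad_b; worker 1 the reverse order.
# A naive blocking all-reduce would deadlock here; the coordinator
# simply issues each tensor once both workers are ready.
order = coordinator_schedule([["grad_a", "grad_b"], ["grad_b", "grad_a"]])
```

In the real system the coordinator runs on a background thread of rank 0 and the actual reduction is delegated to the MPI or Gloo controller; this sketch only captures the readiness-matching logic.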
Tencent Cloud Developer
May 22, 2020 · Artificial Intelligence

Distributed Training for WeChat Scan-to-Identify Using Horovod, MPI, and NCCL

WeChat’s Scan‑to‑Identify system now trains its CNN models across multiple GPUs using Horovod’s data‑parallel, synchronous Ring All‑Reduce architecture built on MPI and NCCL. The move cut training time from several days to under one day while maintaining accuracy; future work will target I/O optimization and further scaling.

AI · Deep Learning · Horovod
12 min read
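The Ring All‑Reduce pattern referenced above can be simulated in a few lines of plain Python (a sketch of the algorithm's data movement, not of NCCL itself): each of n ranks owns one chunk of the tensor, and n−1 scatter‑reduce steps followed by n−1 all‑gather steps leave every rank holding the full sum.

```python
def ring_allreduce(chunks_per_rank):
    """Simulate ring all-reduce: chunks_per_rank[r][c] is rank r's
    local value of chunk c. Returns every rank's final chunks."""
    n = len(chunks_per_rank)
    data = [list(c) for c in chunks_per_rank]
    # Scatter-reduce: after n-1 steps, rank r holds the fully reduced
    # chunk (r + 1) % n. Sends within a step are concurrent, so
    # snapshot the outgoing values before applying them.
    for step in range(n - 1):
        sent = [data[r][(r - step) % n] for r in range(n)]
        for r in range(n):
            data[(r + 1) % n][(r - step) % n] += sent[r]
    # All-gather: each rank forwards its freshest fully reduced chunk
    # around the ring, overwriting stale copies downstream.
    for step in range(n - 1):
        sent = [data[r][(r + 1 - step) % n] for r in range(n)]
        for r in range(n):
            data[(r + 1) % n][(r + 1 - step) % n] = sent[r]
    return data

# Three ranks, tensor split into three scalar chunks.
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Every rank ends up with the element-wise sum [12, 15, 18].
```

Because each rank sends and receives only 2·(n−1)/n of the tensor in total, the ring pattern keeps per-link bandwidth roughly constant as the number of GPUs grows, which is why it suits the synchronous data-parallel training described in the article.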
Efficient Ops
Jun 25, 2015 · Big Data

Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing

This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.

Baidu · DAG · Hadoop
21 min read