An Overview of NVIDIA NVLink: Architecture, Topology, and Performance
This article explains NVIDIA's NVLink interconnect technology, covering its history, protocol layers, bandwidth advantages over PCIe, topologies such as the HGX-1/DGX-1 mesh, the NVSwitch extension, and performance gains for deep‑learning and high‑performance computing workloads.
Following the previous discussion on GPUDirect P2P, this article introduces NVIDIA's NVLink bus protocol, which was created to overcome the bandwidth limitations of PCI Express and improve GPU‑GPU and GPU‑CPU communication.
NVLink was announced at NVIDIA's GTC 2014 event, where CEO Jensen Huang showcased the GeForce Titan Z and previewed the upcoming Pascal architecture; although the interconnect received only a brief mention, it signaled NVIDIA's ambition to move beyond PCIe toward higher-performance interconnects.
NVLink's headline advantage is bandwidth: the first NVLink product, the Tesla P100 (released 2016), offers 160 GB/s of aggregate bandwidth per GPU, about five times a PCIe Gen3 ×16 slot (32 GB/s bidirectional), while the Tesla V100 (announced at GTC 2017) with NVLink 2.0 reaches roughly 300 GB/s, nearly ten times PCIe.
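These figures can be sanity-checked on real hardware. Below is a minimal sketch, assuming two NVLink-connected GPUs at device IDs 0 and 1 and omitting error handling, that times repeated device-to-device copies with CUDA events and reports the achieved bandwidth. Note that the quoted 160/300 GB/s figures are aggregate bidirectional numbers across all links, so a single unidirectional copy will report at most roughly half of them.

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 256UL << 20;              /* 256 MiB payload */
    const int reps = 10;
    void *src, *dst;

    /* Allocate a buffer on each GPU (devices 0 and 1 assumed). */
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    /* Enable direct peer access in both directions so the copy can
       take the NVLink path when one exists (PCIe staging otherwise). */
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);

    cudaStream_t s;
    cudaEvent_t start, stop;
    cudaStreamCreate(&s);
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start, s);
    for (int i = 0; i < reps; i++)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, s);  /* GPU0 -> GPU1 */
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0->GPU1: %.1f GB/s\n",
           (reps * (double)bytes / 1e9) / (ms / 1e3));
    return 0;
}
```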
Before NVLink, inter-GPU communication relied on PCIe Gen3, which tops out at 32 GB/s bidirectional per GPU. On-board memory bandwidth was already far higher (e.g., 547.7 GB/s for GDDR5X, 900 GB/s for HBM2), so GPU-GPU and GPU-CPU transfers were bottlenecked by the PCIe bus, and CPU-CPU links such as Intel's QPI offered only 25.6 GB/s.
NVLink can interconnect GPUs with each other, GPUs with CPUs, and even CPUs with CPUs, which led NVIDIA to partner with IBM through the OpenPOWER Foundation to explore NVLink-attached CPU-GPU integration beyond the x86 ecosystem.
The NVLink controller consists of three layers: the physical layer (PHY), which drives the electrical signaling; the data-link layer (DL), which handles packet framing and reliable delivery via CRC and retry; and the transaction layer (TL), which implements the read/write semantics and aggregates multiple links.
On the P100, NVLink 1.0 provides four links per GPU, each delivering 40 GB/s bidirectional (20 GB/s in each direction), for a total of 160 GB/s. The V100's NVLink 2.0 adds two more links (six in total) and raises each link to 50 GB/s, achieving up to 300 GB/s per GPU.
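The per-GPU link count and NVLink generation can be inspected at runtime through NVML. A minimal sketch, assuming the NVML header and library are available (link with -lnvidia-ml) and an NVLink-capable GPU sits at index 0:

```c
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);       /* first GPU */

    /* Probe every possible NVLink port on this GPU. */
    for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; link++) {
        nvmlEnableState_t active;
        unsigned int version;
        if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS)
            continue;                          /* link not present */
        nvmlDeviceGetNvLinkVersion(dev, link, &version);
        printf("link %u: %s, NVLink generation %u\n", link,
               active == NVML_FEATURE_ENABLED ? "active" : "inactive",
               version);
    }
    nvmlShutdown();
    return 0;
}
```

On a P100 this loop should find four populated links, on a V100 six, matching the per-generation totals above.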
The HGX-1/DGX-1 system uses an eight-GPU hybrid cube-mesh topology. Because each V100 exposes only six NVLink ports, the eight GPUs cannot be fully connected point-to-point: directly linked pairs share at most two links (100 GB/s), while the remaining pairs must route through an intermediate GPU or fall back to PCIe. GPU-CPU communication still uses PCIe, and CPU-CPU communication uses QPI.
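Which GPU pairs have a direct peer path in such a topology can be checked from application code. A short sketch, assuming a multi-GPU host (the device count is whatever the machine reports); note that cudaDeviceCanAccessPeer reports P2P capability over either NVLink or PCIe, and the actual transport per pair can be inspected with nvidia-smi topo -m:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n;
    cudaGetDeviceCount(&n);

    /* Print an n x n matrix: 1 if GPU i can directly access
       GPU j's memory, 0 otherwise. On a cube-mesh machine the
       matrix is typically not all-ones. */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int ok = 0;
            if (i != j)
                cudaDeviceCanAccessPeer(&ok, i, j);
            printf("%d ", ok);
        }
        printf("\n");
    }
    return 0;
}
```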
To address the mesh's limitations, NVIDIA introduced NVSwitch at GTC 2018: a node-level switch that enables all-to-all connectivity among up to 16 GPUs, each communicating at the full 300 GB/s, presenting a unified 0.5 TB aggregate GPU memory space and roughly 2 petaFLOPS of compute power.
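Applications rarely program NVLink or NVSwitch directly; collective-communication libraries such as NCCL detect the topology and route traffic over the fastest available links. A minimal single-process all-reduce sketch, assuming four GPUs and the NCCL library (link with -lnccl); the buffer size is arbitrary:

```c
#include <nccl.h>
#include <cuda_runtime.h>

#define NGPUS 4

int main(void) {
    ncclComm_t comms[NGPUS];
    int devs[NGPUS] = {0, 1, 2, 3};
    float *buf[NGPUS];
    cudaStream_t streams[NGPUS];
    const size_t count = 1 << 20;          /* 1M floats per GPU */

    /* One communicator per GPU, all within a single process. */
    ncclCommInitAll(comms, NGPUS, devs);

    for (int i = 0; i < NGPUS; i++) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* In-place sum across all GPUs; NCCL uses NVLink (or
       NVSwitch) paths automatically where they exist. */
    ncclGroupStart();
    for (int i = 0; i < NGPUS; i++)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < NGPUS; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```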
Published results show NVLink improving multi-GPU server throughput by about 31%, and NVSwitch-equipped systems such as the DGX-2 delivering more than 2× speedups on deep-learning and high-performance-computing workloads.
Author: 撷峰 (Jie Feng) – Source: 云栖社区 (Alibaba Cloud's Yunqi Community).