
Detailed Overview of GPU Server Architectures: A100/A800 and H100/H800 Nodes

This article provides a comprehensive technical overview of large‑scale GPU server architectures, detailing the component topology of 8‑GPU A100/A800 and H100/H800 nodes, explaining storage network cards, NVSwitch interconnects, bandwidth calculations, and the trade‑offs between RoCEv2 and InfiniBand for AI workloads.

Architects' Tech Alliance

It is well known that large‑model training typically uses clusters where each server is equipped with multiple GPUs. Building on the previous article "High‑Performance GPU Server AI Network Architecture (Part 1)", this piece dives deeper into common GPU system architectures.

8 × NVIDIA A100 GPU / 8 × NVIDIA A800 GPU Nodes

The topology shown includes two CPU chips (with their attached memory in a NUMA layout), two storage network adapters, four PCIe Gen4 switch chips, six NVSwitch chips, eight GPUs, and eight GPU‑dedicated network adapters.

Two CPU chips (NUMA architecture) : handle general‑purpose computation.

Two storage network adapter cards : provide access to distributed storage.

Four PCIe Gen4 switch chips : enable high‑speed PCIe communication.

Six NVSwitch chips : allow ultra‑fast GPU‑to‑GPU communication essential for large‑scale deep‑learning nodes.

Eight GPUs (A100) : the primary compute units for AI and deep‑learning workloads.

Eight GPU‑dedicated network adapters : optimize inter‑GPU communication.

The following sections decode each component in detail.

Storage Network Cards

In GPU architectures, storage network cards connect to the CPU via PCIe and facilitate communication with distributed storage systems. Their main functions are:

Read/write distributed storage data, crucial for feeding training data and checkpointing results.

Node management tasks such as SSH access, performance monitoring, and data collection.

Although NVIDIA's official recommendation is the BlueField-3 (BF3) DPU, alternatives such as RoCE-capable NICs (cost-effective) or InfiniBand HCAs (high-performance) can be used, provided they meet the bandwidth requirements.

NVSwitch Network Structure

In a full-mesh topology, each GPU can reach every other GPU directly. In this node, the eight GPUs are linked through six NVSwitch chips, forming the NVSwitch fabric.

With NVLink 3 (50 GB/s bidirectional per link), a fully meshed A100 node achieves 12 × 50 GB/s = 600 GB/s total bidirectional bandwidth (300 GB/s per direction). The A800 export variant reduces the link count to eight, yielding 8 × 50 GB/s = 400 GB/s total (200 GB/s per direction).
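The arithmetic above can be sketched as a small helper. This is a minimal illustration, not vendor code: `nvlink_bandwidth` is a hypothetical name, and the link counts and the 50 GB/s-per-link figure are the ones quoted in the text.

```python
def nvlink_bandwidth(num_links, gb_per_link_bidir):
    """Return (total bidirectional, per-direction) NVLink bandwidth in GB/s."""
    total = num_links * gb_per_link_bidir
    return total, total / 2

# A100: 12 NVLink 3 links at 50 GB/s bidirectional each
print(nvlink_bandwidth(12, 50))  # (600, 300.0)

# A800: export variant with only 8 NVLink links
print(nvlink_bandwidth(8, 50))   # (400, 200.0)
```

The same helper reproduces the A800 figure simply by lowering the link count, which is exactly the restriction that distinguishes the export variant.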

The following image illustrates the nvidia‑smi topology for an 8 × A800 device:

GPU‑to‑GPU connections (top‑left): all marked NV8, indicating eight NVLink links.

NIC connections: "NODE" within the same CPU chip (no NUMA crossing), "SYS" across CPUs (requires NUMA traversal).

GPU‑to‑NIC connections: "NODE" when on the same CPU and under the same PCIe switch, "NODE" when on the same CPU but under different PCIe switches (the path additionally crosses the PCIe Host Bridge), and "SYS" across CPUs.
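The labels above come from the matrix printed by `nvidia-smi topo -m`. A minimal lookup table, paraphrasing that tool's legend for just the labels mentioned in this article (not the full legend), can make them easier to interpret in scripts:

```python
# Paraphrased subset of the `nvidia-smi topo -m` legend; descriptions are
# summaries, not the tool's exact wording.
TOPO_LABELS = {
    "NV8":  "connected via 8 NVLink links",
    "NODE": "same NUMA node; traversal stays within one CPU's PCIe hierarchy",
    "SYS":  "different NUMA nodes; traversal crosses the inter-CPU (SMP) interconnect",
}

def explain(label):
    """Return a short description for a topology-matrix label."""
    return TOPO_LABELS.get(label, "unknown label")

print(explain("SYS"))
```

A quick lookup like this is handy when post-processing topology dumps from many hosts to verify that NICs and GPUs landed on the expected NUMA nodes.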

GPU Node Interconnect Architecture

The diagram below shows the inter‑node connectivity:

Compute Network

Connects GPU nodes to support parallel computation, data exchange, and coordinated execution of large‑scale tasks.

Storage Network

Links GPU nodes to storage systems for massive data reads/writes, loading data into GPU memory and persisting results.

RDMA (Remote Direct Memory Access) is critical for high‑performance AI workloads; choosing between RoCEv2 (cost‑effective) and InfiniBand (top‑tier performance) depends on budget and performance goals.

Public cloud providers often use RoCEv2 for their GPU instances (e.g., CX-series NIC configurations providing 8 × 100 Gbps per 8-GPU instance). RoCEv2 is cheaper than InfiniBand while still meeting most performance needs.

Bandwidth Bottlenecks in Data Links

Intra‑host GPU‑to‑GPU via NVLink: 600 GB/s bidirectional (300 GB/s per direction).

GPU‑to‑NIC within the same host via PCIe Gen4 switches: 64 GB/s bidirectional (32 GB/s per direction).

Inter‑host GPU‑to‑GPU via NICs: typically 100 Gbps (12.5 GB/s) per direction, far lower than intra‑host bandwidth.

Thus, upgrading to 400 Gbps NICs provides little benefit on these hosts: a PCIe Gen4 x16 link caps at roughly 32 GB/s (≈256 Gbps) per direction, so PCIe Gen5 is needed before a 400 Gbps NIC can be fully exploited.
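To make the gap concrete, the three per-direction figures above can be normalised to GB/s (assuming 1 byte = 8 bits for the NIC conversion) to pick out the bottleneck. The numbers are the ones from the text; the function and dictionary names are illustrative only.

```python
def gbps_to_gbytes(gbps):
    """Convert a network rate in Gbit/s to GByte/s (1 byte = 8 bits)."""
    return gbps / 8

# Per-direction bandwidths from the text, all in GB/s.
links_per_direction = {
    "intra-host NVLink":         300.0,               # NVLink 3, A100
    "intra-host PCIe Gen4 x16":  32.0,                # GPU <-> NIC
    "inter-host NIC (100 Gbps)": gbps_to_gbytes(100), # 12.5 GB/s
}

bottleneck = min(links_per_direction, key=links_per_direction.get)
print(bottleneck, links_per_direction[bottleneck])
```

The inter-host NIC is slower than intra-host NVLink by a factor of 24, which is why cross-node communication dominates the cost of multi-node training and why collective-communication libraries try to keep traffic inside the node whenever possible.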

8 × NVIDIA H100 / 8 × NVIDIA H800 Hosts

H100 Host Internal Hardware Topology

The overall architecture resembles the A100 eight‑GPU system but differs in NVSwitch count and bandwidth upgrades.

Each H100 host contains four NVSwitch chips (two fewer than the A100 configuration).

H100 chips are built on a 4 nm process and feature 18 fourth-generation NVLink links per chip; at 25 GB/s per direction per link, this delivers 18 × 25 GB/s = 450 GB/s per direction, or 900 GB/s total bidirectional bandwidth.

H100 GPU Chip

Manufactured with cutting‑edge 4 nm technology.

The bottom row hosts 18 NVLink 4 links; at 25 GB/s per direction per link, this gives 450 GB/s per direction (900 GB/s total bidirectional bandwidth).

Central blue area is the L2 cache for fast temporary data storage.

Both sides integrate HBM stacks, providing high-bandwidth graphics memory.
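As a quick sanity check on the H100 figure above, the per-direction and bidirectional totals follow directly from 18 links at 25 GB/s per direction (variable names here are illustrative):

```python
# H100 NVLink 4 figures quoted in the text.
H100_LINKS = 18
GB_PER_LINK_PER_DIR = 25  # GB/s, per direction per link

per_direction = H100_LINKS * GB_PER_LINK_PER_DIR  # 450 GB/s
bidirectional = 2 * per_direction                 # 900 GB/s
print(per_direction, bidirectional)  # 450 900
```

Note the convention difference versus the A100 section: there the per-link figure was given as 50 GB/s bidirectional, here as 25 GB/s per direction; both describe the same per-link rate, counted differently.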

Source: https://community.fs.com/cn/article/unveiling-the-foundations-of-gpu-computing1.html

Written by Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
