
Evolution of NVIDIA GPU Architectures from Fermi to Ampere

This article outlines the progression of NVIDIA GPU architectures—from the early Fermi and Kepler designs through Maxwell, Pascal, Volta, Turing, and the latest Ampere—detailing compute capabilities, SM structures, FP64/FP32 ratios, Tensor Core introductions, and their impact on AI and high‑performance computing.

Architects' Tech Alliance

Fermi

Compute Capability: 2.0, 2.1. Each Streaming Multiprocessor (SM) contains 2 Warp Schedulers, 32 CUDA cores (arranged as 2 × 16), 2 FP64 units (ratio 1:3 to FP32), 16 LD/ST units, and 4 SFUs.
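
As a back-of-the-envelope illustration (not from the article), peak single-precision throughput follows directly from these unit counts: each CUDA core can retire one fused multiply-add (2 FLOPs) per cycle. The GTX 480 figures below (15 SMs, 1.401 GHz shader clock) are assumed example values:

```python
def peak_fp32_gflops(sms: int, cores_per_sm: int, clock_ghz: float) -> float:
    """Theoretical peak: one FMA (2 FLOPs) per CUDA core per cycle."""
    return sms * cores_per_sm * 2 * clock_ghz

# Fermi-class GTX 480: 15 SMs x 32 cores at a 1.401 GHz shader clock (example figures)
print(round(peak_fp32_gflops(15, 32, 1.401)))  # ~1345 GFLOPS
```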

Kepler

Compute Capability: 3.0, 3.2, 3.5, 3.7. The SM (renamed SMX) expands to 4 Warp Schedulers, 8 Dispatch Units, 192 CUDA cores, 64 dedicated FP64 units (ratio 1:3 to FP32), and increased SFU/LD‑ST resources, substantially improving double‑precision performance.
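
The double-precision ratio falls directly out of the unit counts; a quick sketch (illustrative, not from the article):

```python
from fractions import Fraction

def fp64_to_fp32_ratio(fp64_units: int, fp32_units: int) -> Fraction:
    # Assumes both unit types sustain one FMA per cycle, so relative
    # throughput scales with the unit counts.
    return Fraction(fp64_units, fp32_units)

print(fp64_to_fp32_ratio(64, 192))  # Kepler SMX -> 1/3
```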

Maxwell

Compute Capability: 5.0, 5.2, 5.3. The SM slims down to 4 Warp Schedulers, 8 Dispatch Units, 128 CUDA cores, and 32 SFU/LD‑ST units, with the FP64/FP32 ratio dropping to 1:32. The architecture prioritizes efficiency and performance‑per‑watt gains.

Pascal

Compute Capability: 6.0, 6.1, 6.2. The SM contains 2 Warp Schedulers, 4 Dispatch Units, 64 CUDA cores, 32 FP64 units (ratio 1:2 restored), and 16 SFU/LD‑ST units. Introduces native FP16 (half‑precision) arithmetic, NVLink, HBM2 memory, and hardware page faulting for Unified Memory.

Volta

Compute Capability: 7.0, 7.2. Introduces Tensor Cores (8 per SM) for deep‑learning matrix operations. The SM includes 4 Warp Schedulers, 4 Dispatch Units, 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, 8 Tensor Cores, 32 LD/ST units, and 16 SFUs. Supports per‑thread program counters, enabling independent thread scheduling and improved synchronization.
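
Tensor-Core throughput can be estimated the same way: each first-generation Tensor Core performs a 4×4×4 matrix FMA, i.e. 64 FMAs (128 FLOPs) per clock. The V100 figures below (80 SMs, ~1.53 GHz boost clock) are assumed for illustration:

```python
def tensor_core_tflops(sms: int, tc_per_sm: int, clock_ghz: float) -> float:
    # Each Volta Tensor Core: one 4x4x4 matrix FMA = 64 FMAs = 128 FLOPs per clock.
    return sms * tc_per_sm * 128 * clock_ghz / 1e3

# V100: 80 SMs x 8 Tensor Cores at ~1.53 GHz boost (example figures)
print(round(tensor_core_tflops(80, 8, 1.53)))  # ~125 TFLOPS of FP16 tensor math
```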

Turing

Compute Capability: 7.5. The SM integrates 64 FP32 cores, 64 INT32 cores, 8 Tensor Cores, and a dedicated RT Core for ray tracing. Each GPU comprises multiple GPCs, TPCs, and SMs, with a unified L1/shared‑memory block whose partitioning can be configured to suit compute or graphics workloads.

Ampere

Compute Capability: 8.0 (GA100), 8.6 (GA10x). The SM features 64 FP32 cores and 4 third‑generation Tensor Cores, with enhanced RT capabilities on the consumer GA10x parts. The GA100/A100 GPU delivers up to 20× AI training throughput over Volta (TF32 with structured sparsity) and 2.5× FP64 throughput (via FP64 Tensor Cores), and is built on a 7 nm process with 54 billion transistors, HBM2 memory stacks, and third‑generation NVLink.
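
The quoted 2.5× FP64 figure checks out against the published peak numbers; the V100 and A100 peaks below are taken as illustrative inputs, not derived here:

```python
# Published dense peak figures in TFLOPS, used here as illustrative inputs.
V100_FP64 = 7.8          # V100, classic FP64 cores
A100_FP64_TENSOR = 19.5  # A100, FP64 running on third-generation Tensor Cores

print(round(A100_FP64_TENSOR / V100_FP64, 2))  # 2.5
```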

Overall, NVIDIA’s GPU evolution shows a steady increase in SM count, specialized compute units (FP64, Tensor, RT), and memory bandwidth, driving advances in scientific computing, AI training/inference, and graphics rendering.
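
Pulling the compute capabilities above into a single lookup gives a compact summary of the lineage (8.6 covers the consumer GA10x Ampere parts; capabilities beyond this article's scope are omitted):

```python
# Compute capability -> architecture, collected from the sections above.
CC_TO_ARCH = {
    "2.0": "Fermi", "2.1": "Fermi",
    "3.0": "Kepler", "3.2": "Kepler", "3.5": "Kepler", "3.7": "Kepler",
    "5.0": "Maxwell", "5.2": "Maxwell", "5.3": "Maxwell",
    "6.0": "Pascal", "6.1": "Pascal", "6.2": "Pascal",
    "7.0": "Volta", "7.2": "Volta",
    "7.5": "Turing",
    "8.0": "Ampere", "8.6": "Ampere",
}

def architecture(compute_capability: str) -> str:
    """Map a compute-capability string to its architecture name."""
    return CC_TO_ARCH.get(compute_capability, "unknown")

print(architecture("7.5"))  # Turing
```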

AI · CUDA · Hardware · Nvidia · GPU architecture · Tensor Core
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
