Evolution and Architecture of Google TPU Chips
This article traces the development of Google's Tensor Processing Unit (TPU) from the first generation to the latest seventh‑generation chip, detailing architectural improvements, performance specifications, and integration into data‑center pods and mobile devices, and closes with pointers to related AI‑hardware resources.
Evolution of Google TPU
Google's Tensor Processing Unit (TPU) is a custom ASIC built to accelerate machine‑learning workloads. Since its debut, the TPU has undergone several major revisions—v1, v2, v3, v4, and the newest seventh‑generation chip—each delivering notable gains in process technology, die size, memory capacity, clock speed, bandwidth, and thermal design power.
TPU v1 Overview
The first‑generation TPU targets 8‑bit integer matrix multiplication and is driven by a host CPU over PCIe 3.0. Fabricated on a 28 nm process and clocked at 700 MHz, it has a 40 W TDP, 28 MiB of on‑chip memory, and 4 MiB of 32‑bit accumulators serving a 256×256 systolic array. The board also carries 8 GiB of DDR3 SDRAM (34 GB/s bandwidth), and the chip supports matrix‑multiply, convolution, and activation operations.
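The essence of that datapath—narrow 8‑bit operands multiplied into wide 32‑bit accumulators—can be sketched in a few lines. This is plain NumPy rather than a systolic array, and the shapes are illustrative, not the real 256×256 tile size:

```python
import numpy as np

def int8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """8-bit matrix multiply with 32-bit accumulation, in the spirit of TPU v1."""
    assert a.dtype == np.int8 and b.dtype == np.int8
    # Widen to int32 before multiplying so partial sums cannot overflow,
    # mirroring the 32-bit accumulators that sit behind the MAC array.
    return a.astype(np.int32) @ b.astype(np.int32)

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
b = rng.integers(-128, 128, size=(8, 3), dtype=np.int8)
c = int8_matmul(a, b)
print(c.dtype)  # int32
```

Keeping the accumulator wider than the operands is the key design choice: a dot product of 256 int8 pairs can reach roughly ±4.2 million, far beyond the int8 (or even int16) range.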
TPU v1 was optimized primarily for the three model families popular around 2015: MLPs (multilayer perceptrons), CNNs (convolutional neural networks), and RNNs/LSTMs (recurrent neural networks / long short‑term memory). Because of the complexity of RNN/LSTM workloads, the early hardware accelerated inference only for the first two categories.
TPU v2 Overview
Released in May 2017, TPU v2 introduced high‑bandwidth memory (HBM) with 16 GiB capacity and up to 600 GB/s memory bandwidth, delivering 45 TFLOPS of floating‑point performance per chip. Four TPU v2 chips combine into a 180 TFLOPS module, and 64 such modules form a TPU v2 Pod with a theoretical peak of 11.5 PFLOPS.
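The pod figure follows directly from the chip and module counts above, as a quick back‑of‑envelope check shows:

```python
# Back-of-envelope check of the TPU v2 pod peak quoted above.
chip_tflops = 45                   # peak per TPU v2 chip
module_tflops = 4 * chip_tflops    # 4 chips per module
pod_tflops = 64 * module_tflops    # 64 modules per pod
print(module_tflops)               # 180 (TFLOPS per module)
print(pod_tflops / 1000)           # 11.52 (PFLOPS per pod, quoted as 11.5)
```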
TPU v3 Overview
TPU v3 adds 11 % more transistors and a 1.35× increase in clock speed, interconnect bandwidth, and memory bandwidth, while growing die area by only 6 %. The matrix‑unit count doubles, yielding a roughly 2.7× theoretical performance gain over v2. Its 2D torus interconnect scales from 256 chips (v2) to 1,024 chips, raising pod performance from about 12 PFLOPS to 126 PFLOPS (BF16).
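The quoted v2‑to‑v3 per‑chip speedup decomposes cleanly into the two factors named above:

```python
# Per-chip speedup from v2 to v3, from the two factors cited above.
matrix_units_ratio = 2.0   # matrix-unit count doubles
clock_ratio = 1.35         # 1.35x clock (and bandwidth) increase
speedup = matrix_units_ratio * clock_ratio
print(round(speedup, 2))   # 2.7
```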
TPU v4 Overview
Announced in 2021, TPU v4 moves to a 7 nm process—with four times the transistor count of v3—and expands on‑chip memory from 9 MiB to 44 MiB while retaining 32 GiB of HBM2. Memory bandwidth rises 33 % to 1.2 TB/s. The architecture adopts a 3D torus interconnect, supporting up to 4,096 TPU v4 chips per pod and delivering 1.126 exaflops of BF16 peak performance.
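Dividing the pod peak by the chip count recovers the implied per‑chip BF16 throughput, a useful sanity check on the figures above:

```python
# Implied per-chip BF16 peak for TPU v4, derived from the pod figures above.
pod_exaflops = 1.126
chips_per_pod = 4096
per_chip_tflops = pod_exaflops * 1e6 / chips_per_pod  # exaflops -> teraflops
print(round(per_chip_tflops))  # 275
```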
TPU in Mobile Devices
Beyond the data center, Google has carried TPU technology into consumer products. The Pixel 2 and Pixel 3 incorporated the custom Pixel Visual Core, and the Pixel 4 introduced the Pixel Neural Core, based on the Edge TPU. The latest Tensor G3 chip upgrades the CPU, GPU, and ISP and pairs them with a dedicated TPU‑style accelerator to enable on‑device generative AI.
Related Links and Resources
For deeper dives, see the linked articles on GPU distributed training, Tensor Core evolution, and AI‑chip fundamentals.