How Google’s TPU Systolic Array Powered AlphaGo and Large Language Models

Google’s Tensor Processing Unit (TPU) uses a systolic array architecture and low‑precision quantization to overcome the Von Neumann bottleneck, delivering orders‑of‑magnitude higher throughput and energy efficiency for matrix‑multiplication‑heavy AI workloads—from AlphaGo’s inference to today’s massive language models.

Past Memory Big Data

Why Google Built a Custom Chip

In 2016, AlphaGo defeated world champion Lee Sedol, a milestone for AI. The match ran on hardware Google had been developing in secret for over a year. Google’s Tensor Processing Unit (TPU) represents a shift away from general-purpose CPUs toward purpose-built silicon that does less but achieves more.

Demand for TPU

In 2013 Google estimated that if every Android user performed three minutes of voice search daily, the company would need to double its global data‑center footprint. Building more traditional CPU‑filled servers was economically infeasible, and Moore’s Law was slowing, so waiting for the next Intel CPU was not an option.

The root cause lies in the von Neumann architecture, in which the processor and memory communicate over a shared bus. Moving data across that bus can consume more energy than the computation itself, a problem that is especially acute for neural networks, which perform matrix multiplication over and over.

Systolic Array: A Different Computing Model

The TPU’s core is a systolic array, named after systole, the rhythmic contraction of the heart, because it pumps data through the chip in regular beats. If a CPU is a single worker shuttling one bucket at a time, and a GPU is thousands of workers crowding the same road to memory, a systolic array lines the workers up so that data passes hand to hand without returning to memory.

In the first‑generation TPU, a 256 × 256 array provides 65,536 multiply‑accumulate units. The computation proceeds as follows:

Weights are loaded once into each unit and remain stationary.

Input activations flow in from the left, one row at a time.

Each unit multiplies the incoming activation by its resident weight.

The activation is then passed unchanged to the next unit on the right.

Each product is added to the partial sum arriving from the unit above, and the updated partial sum flows downward.

After all data have traversed the array, the final result emerges from the bottom.

This design reads each value from memory once and then reuses it at every unit it passes through, hundreds of times in a 256-wide array. That removes most memory accesses from the inner loop and dramatically reduces energy consumption.
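The dataflow described above can be sketched as a small cycle-by-cycle simulation. This is a minimal illustration, not Google's implementation: the function name `systolic_matmul`, the register layout, and the skewed input schedule (row m of the input enters array row k at cycle m + k) are assumptions chosen so that each activation meets the matching partial sum.

```python
def systolic_matmul(x, w):
    """Compute x @ w on a simulated weight-stationary systolic array.

    x: M x K list of input activations; w: K x N list of weights.
    Cell (k, n) permanently holds w[k][n]. Activations move right,
    partial sums move down, one cell per cycle.
    """
    m_rows, k_dim, n_cols = len(x), len(w), len(w[0])
    act = [[0] * n_cols for _ in range(k_dim)]   # activation latched by each cell
    psum = [[0] * n_cols for _ in range(k_dim)]  # partial sum latched by each cell
    y = [[0] * n_cols for _ in range(m_rows)]
    cycles = m_rows + k_dim + n_cols - 2

    for t in range(cycles):
        new_act = [[0] * n_cols for _ in range(k_dim)]
        new_psum = [[0] * n_cols for _ in range(k_dim)]
        for k in range(k_dim):
            for n in range(n_cols):
                if n == 0:
                    m = t - k  # skewed feed from the left edge
                    a_in = x[m][k] if 0 <= m < m_rows else 0
                else:
                    a_in = act[k][n - 1]  # value held by the left neighbour
                p_in = psum[k - 1][n] if k > 0 else 0  # sum from the cell above
                new_act[k][n] = a_in
                new_psum[k][n] = p_in + a_in * w[k][n]
        act, psum = new_act, new_psum

        # Finished sums leave the bottom row; column n lags column 0 by n cycles.
        for n in range(n_cols):
            m = t - n - (k_dim - 1)
            if 0 <= m < m_rows:
                y[m][n] = psum[k_dim - 1][n]
    return y, cycles


x = [[1, 2], [3, 4], [5, 6]]
w = [[1, 0, 2], [0, 1, 3]]
y, cycles = systolic_matmul(x, w)
print(y, cycles)
```

Note that the weights are indexed inside the loop but never reloaded or moved, and each activation is fetched from `x` exactly once, at the left edge of the array. Everything after that is a neighbour-to-neighbour handoff.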

Supporting Architecture

Matrix Multiply Unit (MXU) : The systolic array itself. TPU v1 used a 256 × 256 array of 8-bit integer units; later generations switched to 128 × 128 arrays with BFloat16 for training, and v6 returned to 256 × 256, which quadruples the number of multiply-accumulate units per MXU.

Unified Buffer : 24 MiB on‑chip SRAM that buffers inputs, intermediate activations, and outputs, offering bandwidth far higher than external memory.

Vector Processing Unit (VPU) : Handles non‑linear operations such as ReLU, sigmoid, and tanh with dedicated circuits that compute each activation in a single cycle.

Accumulators : 32‑bit registers that collect 16‑bit products to avoid overflow during repeated accumulation.

Weight FIFO Buffer : Uses double‑buffering to hide memory latency by loading the next weight block while the current block is being used.

High-Bandwidth Memory (HBM) : TPU v1 used DDR3 memory at 34 GB/s. Ironwood (TPU v7) uses HBM with 7.4 TB/s of bandwidth, roughly a 217× increase.
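The accumulator sizing in the list above can be checked with a little arithmetic. The sketch below assumes signed 8-bit operands and a worst case in which every product down one 256-row column takes its largest possible value; the constant and function names are illustrative.

```python
INT8_MIN = -128
INT16_MAX = 2**15 - 1
INT32_MAX = 2**31 - 1
ARRAY_ROWS = 256  # products summed down one column of a 256 x 256 array

# Largest magnitude an int8 x int8 product can reach; it fits in 16 bits.
largest_product = INT8_MIN * INT8_MIN

# Worst-case sum of one full column of such products.
worst_case_sum = ARRAY_ROWS * largest_product


def max_safe_terms(limit: int) -> int:
    """How many worst-case products can be summed without exceeding limit."""
    return limit // largest_product


print("largest product:", largest_product)
print("worst-case column sum:", worst_case_sum)
print("terms a 16-bit accumulator can hold:", max_safe_terms(INT16_MAX))
print("terms a 32-bit accumulator can hold:", max_safe_terms(INT32_MAX))
```

Under these assumptions a 16-bit accumulator could overflow on the second product, while a 32-bit accumulator has room for hundreds of full column passes.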

Precision Advantage

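TPUs rely on quantization, the use of lower-precision numbers than standard 32-bit floating point, to boost efficiency. Multiplier area grows roughly with the square of the operand width: an 8-bit multiplier occupies about 8² = 64 units of silicon area, whereas the 24-bit significand multiplier inside an FP32 unit needs about 24² = 576. That gap helps explain how TPU v1 could fit 65,536 units on a modest die.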
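The 64 and 576 figures are consistent with a simple rule of thumb: multiplier area grows roughly with the square of the significand width, and FP32 multiplies 24-bit significands (23 stored bits plus an implicit leading one). The sketch below applies that rule; the "area units" are relative and the function name is illustrative, so treat it as a back-of-the-envelope check rather than a silicon model.

```python
def relative_multiplier_area(significand_bits: int) -> int:
    """Rough estimate: an n-bit multiplier needs about n * n partial-product cells."""
    return significand_bits ** 2


# Significand widths, counting the implicit leading bit for floating-point formats.
WIDTHS = {"INT8": 8, "BFloat16": 8, "FP16": 11, "FP32": 24}

areas = {name: relative_multiplier_area(bits) for name, bits in WIDTHS.items()}
for name, area in areas.items():
    ratio = areas["FP32"] / area
    print(f"{name:9s} ~{area:4d} units (FP32 is {ratio:.1f}x larger)")
```

By this estimate a BFloat16 multiplier is about as small as an INT8 one, which is one reason the format suits dense arrays of multiply-accumulate units.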

For inference, 8-bit integers cut memory demand by a factor of four relative to FP32, usually with little impact on classification accuracy. Training requires more dynamic range; Google introduced BFloat16, which keeps the 8-bit exponent of FP32 but shortens the mantissa to 7 bits, preserving FP32’s range while giving up precision that neural networks rarely need.
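Because BFloat16 shares its sign and exponent layout with FP32, a value can be converted by keeping only the upper 16 bits of its FP32 encoding. The sketch below does this with Python's `struct` module. It truncates rather than rounding to nearest, which is a simplification made here for clarity (hardware typically rounds), and the helper names are illustrative.

```python
import struct


def float_to_bfloat16_bits(value: float) -> int:
    """Return a 16-bit BFloat16 pattern by truncating the FP32 encoding of value."""
    (fp32_bits,) = struct.unpack(">I", struct.pack(">f", value))
    return fp32_bits >> 16  # keeps sign (1 bit), exponent (8), top 7 mantissa bits


def bfloat16_bits_to_float(bits: int) -> float:
    """Expand a BFloat16 bit pattern back into a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits << 16))
    return value


def round_trip(value: float) -> float:
    """Convert to BFloat16 and back to show what precision survives."""
    return bfloat16_bits_to_float(float_to_bfloat16_bits(value))


print(round_trip(1.0))      # exactly representable, unchanged
print(round_trip(3.14159))  # keeps only about 2 to 3 significant decimal digits
print(round_trip(1e38))     # still finite: the exponent range matches FP32
```

The last line is the important one for training: a value near the top of FP32's range survives the conversion, whereas it would overflow in IEEE FP16, whose exponent is only 5 bits wide.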

Modern TPUs support multiple precision modes:

Training with BFloat16.

Inference with INT8 (speed‑up on TPU v5e).

FP8 on Ironwood, the first TPU to natively support this format.
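To make the INT8 mode concrete, here is a minimal sketch of symmetric per-tensor quantization, one common way to map floating-point weights onto 8-bit integers. The scheme and the function names are assumptions for illustration; the article does not say which quantization method TPU software actually uses.

```python
def quantize_int8(values):
    """Map floats to int8 codes in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.05, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

print("int8 codes:", codes)
print("restored:  ", [round(r, 4) for r in restored])
print("bytes as FP32:", 4 * len(weights), "bytes as INT8:", len(codes))
```

Each weight now needs one byte instead of four, and the reconstruction error per weight is bounded by half the scale factor.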

Evolution Timeline

TPU v1 (2015) : 28 nm, 40 W, 92 TOPS (8-bit), inference only; 15 to 30× faster and 30 to 80× more energy-efficient than contemporary CPUs and GPUs.

TPU v2 (2017) : Added training support, HBM, and inter‑chip interconnect (ICI), enabling TPU Pods of 256 chips delivering 11.5 PFLOPS.

TPU v3 (2018) : More than doubled performance, to 420 TFLOPS per four-chip board, introduced liquid cooling, and expanded Pods to 1,024 chips.

TPU v4 (2021) : Added SparseCores for recommendation workloads, optical‑circuit switching (OCS), and a 3‑D torus network to lower latency.

Ironwood (TPU v7, 2025) : Designed for the inference era, 4,614 TFLOPS per chip, 192 GB HBM, 7.4 TB/s bandwidth.
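Several headline numbers in this article can be cross-checked with simple arithmetic. The sketch below does so under two assumptions that are not stated in the text: a 700 MHz clock for TPU v1 (the figure Google published for that chip) and the usual convention of counting each multiply-accumulate as two operations.

```python
MAC_UNITS = 256 * 256   # multiply-accumulate units in the TPU v1 array
CLOCK_HZ = 700e6        # assumption: published TPU v1 clock rate of 700 MHz
OPS_PER_MAC = 2         # convention: one multiply plus one add

# Peak throughput of TPU v1 in tera-operations per second.
peak_tops = MAC_UNITS * OPS_PER_MAC * CLOCK_HZ / 1e12
print(f"TPU v1 peak: {peak_tops:.1f} TOPS")

# Memory bandwidth growth from TPU v1 (DDR3) to Ironwood (HBM), both in GB/s.
ddr3_gb_s = 34
ironwood_gb_s = 7.4 * 1000
bandwidth_ratio = ironwood_gb_s / ddr3_gb_s
print(f"Bandwidth ratio: {bandwidth_ratio:.1f}x")

# Per-chip share of a 256-chip TPU v2 Pod rated at 11.5 PFLOPS.
v2_tflops_per_chip = 11.5 * 1000 / 256
print(f"TPU v2 per chip: {v2_tflops_per_chip:.1f} TFLOPS")
```

The first result rounds to the 92 TOPS quoted for TPU v1, and the second matches the roughly 217-fold bandwidth increase cited earlier.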

Conclusion and Trade‑offs

TPUs have demonstrated real-world impact: a single TPU can process over 100 million photos a day for Google Photos; AlphaFold, which cracked the 50-year-old protein-folding problem, ran on TPUs; and training the 540-billion-parameter PaLM model on 6,144 TPU v4 chips sustained 57.8 % hardware utilization over about 50 days.

TPUs excel at large‑scale language‑model training/inference, CNNs and Transformers with heavy matrix ops, high‑throughput batch processing, and energy‑constrained workloads. However, GPUs remain preferable for native PyTorch development, small‑batch scenarios, mixed AI‑graphics workloads, and rapid prototyping across clouds.

Overall, TPUs illustrate the industry shift toward domain‑specific accelerators: when per‑query operations reach trillions, the general‑purpose CPU hits physical limits, and custom silicon delivers performance gains unattainable by generic processor optimizations.
