Inside Google’s Data Centers: How SRE Manages Hardware, Borg, and Global Services
This article explains how Google’s Site Reliability Engineering organization runs its production environment: uniform self‑designed hardware, the Borg cluster manager, cluster‑level storage, SDN networking, and monitoring, tied together by a sample Shakespeare search service that shows how these layers combine into highly available, scalable services.
1. Hardware
Google houses most of its compute capacity in self‑designed data centers with custom power distribution, cooling, networking, and hardware. Within a given data center, the machines are essentially homogeneous.
Machine: a physical server, i.e., the actual hardware (sometimes a VM).
Server: a piece of software that implements a service for users.
Typical topology (Figure 2‑1): tens of machines per rack, racks arranged in rows, one or more rows forming a cluster; a data center building houses multiple clusters, and adjacent buildings form a campus.
Google connects thousands of its own switches in a Clos topology to form Jupiter, a fabric that behaves like a single high‑speed virtual switch with tens of thousands of virtual ports, delivering up to 1.3 Pb/s of bisection bandwidth in its largest configuration. The global backbone, B4, is built on SDN (using the OpenFlow standard) and provides massive bandwidth with dynamic bandwidth management.
2. System Software for Managing Physical Servers
Google developed large‑scale system‑management software to cope with hardware failures; in a single cluster, thousands of machines can fail in a year.
2.1 Managing Physical Servers
Borg is a distributed cluster operating system that schedules tasks across the cluster, similar to Apache Mesos. It runs user‑submitted jobs (e.g., long‑running services or batch MapReduce jobs). Each job consists of one or many task instances.
Borg’s next‑generation open‑source counterpart is Kubernetes.
When Borg starts a task, it assigns each instance to a physical machine and launches the program. Borg continuously monitors instances, terminating and restarting any that fail, possibly on a different machine.
To avoid a one‑to‑one mapping between instances and machines, Borg provides a name resolution system (BNS) that maps a hierarchical name like /bns/&lt;cluster&gt;/&lt;user&gt;/&lt;job&gt;/&lt;instance&gt; to an IP address and port.
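As a rough illustration, BNS resolution amounts to a lookup keyed by the hierarchical name. The table below is hypothetical (cluster, user, and addresses are invented); in production the mapping is kept in a highly available store and updated by Borg whenever a task is rescheduled.

```python
# Toy sketch of BNS-style name resolution. All paths and addresses are
# hypothetical; the real system updates entries as Borg moves tasks.
BNS_TABLE = {
    "/bns/cc/rk/bigtable/0": ("10.1.2.3", 8085),
    "/bns/cc/rk/bigtable/1": ("10.1.2.4", 8085),
}

def resolve_bns(path: str) -> tuple[str, int]:
    """Map a hierarchical BNS name to the task's current (ip, port)."""
    try:
        return BNS_TABLE[path]
    except KeyError:
        raise LookupError(f"no task registered at {path}") from None
```

Because clients resolve the name on every connection attempt, a restarted instance on a different machine is picked up transparently.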
Borg also performs resource allocation based on job specifications (e.g., CPU cores, memory) and spreads instances across racks to avoid a single point of failure. A task that uses more resources than it requested is terminated and restarted.
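A greedy sketch of this placement logic, under invented machine names and capacities (Borg's actual scheduler is far more sophisticated, weighing priorities, preemption, and bin‑packing):

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    rack: str
    free_cpu: float
    free_ram_gb: float

def place_tasks(machines, n_tasks, cpu, ram_gb):
    """Assign each task to a machine with spare capacity, preferring
    racks not yet used so one rack failure cannot take out the job."""
    placements, used_racks = [], set()
    for _ in range(n_tasks):
        candidates = [m for m in machines
                      if m.free_cpu >= cpu and m.free_ram_gb >= ram_gb]
        if not candidates:
            raise RuntimeError("insufficient cluster capacity")
        fresh = [m for m in candidates if m.rack not in used_racks]
        m = (fresh or candidates)[0]
        m.free_cpu -= cpu          # reserve the requested resources
        m.free_ram_gb -= ram_gb
        used_racks.add(m.rack)
        placements.append(m.name)
    return placements
```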
2.2 Storage
Job instances can use local disks for temporary files, but permanent storage is provided by cluster‑level systems similar to Lustre or HDFS.
The storage stack (Figure 2‑3) consists of:
D (for disk): a file server running on almost every machine in a cluster, abstracting its local disks or SSDs.
Colossus: A cluster‑wide file system (an evolution of GFS) that offers traditional file operations plus replication and encryption.
Higher‑level services built on Colossus:
Bigtable – a NoSQL database handling petabytes of data with row, column, and timestamp keys.
Spanner – a globally consistent SQL database.
Other services such as Blob Store.
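Bigtable's data model can be summarized as a sparse map keyed by (row, column, timestamp). A minimal in‑memory sketch of that model (class name and API are invented for illustration; the real system is distributed and petabyte‑scale):

```python
import bisect

class TinyTable:
    """Toy sketch of Bigtable's (row, column, timestamp) data model:
    each cell keeps multiple timestamped versions, newest by default."""
    def __init__(self):
        self._cells = {}  # (row, column) -> sorted list of (ts, value)

    def put(self, row, column, ts, value):
        bisect.insort(self._cells.setdefault((row, column), []), (ts, value))

    def get(self, row, column, ts=None):
        versions = self._cells.get((row, column), [])
        if not versions:
            return None
        if ts is None:
            return versions[-1][1]  # most recent version
        # Latest version at or before the requested timestamp.
        i = bisect.bisect_right(versions, (ts, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None
```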
2.3 Networking
Google’s network hardware is controlled via an OpenFlow‑based software‑defined network (SDN) rather than high‑end routers. A centralized controller computes optimal paths, reducing cost.
Bandwidth is managed by a Bandwidth Enforcer (BwE). Global load balancing (GSLB) operates on three levels: DNS‑based geographic balancing, service‑level balancing (e.g., YouTube, Maps), and RPC‑level balancing.
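The layered balancing can be sketched as a two‑stage pick: choose the nearest region for the user (the DNS/geographic level), then weight backends within the region by capacity (the RPC level). Region names, distances, and weights below are invented for illustration:

```python
import random

# Hypothetical GSLB-style data: per-region distance to user geographies
# and per-backend capacity weights within each region.
REGIONS = {
    "us-east": {"distance": {"us": 1, "eu": 3},
                "backends": {"be-1": 50, "be-2": 150}},
    "eu-west": {"distance": {"us": 3, "eu": 1},
                "backends": {"be-3": 100}},
}

def pick_backend(user_geo, regions=REGIONS, rng=None):
    rng = rng or random.Random()
    # Stage 1 (DNS level): nearest region for this user.
    region = min(regions, key=lambda r: regions[r]["distance"][user_geo])
    # Stage 2 (RPC level): capacity-weighted choice within the region.
    names, weights = zip(*regions[region]["backends"].items())
    return region, rng.choices(names, weights=weights, k=1)[0]
```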
3. Other System Software
3.1 Lock Service
Chubby is a distributed lock service with a filesystem‑like API, using Paxos for consistency. It is used for master‑election and storing critical mappings such as BNS name‑to‑IP relationships.
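The master‑election pattern reduces to "whoever acquires the lock first is master." A single‑process sketch of that contract (class, path, and owner names are invented; Chubby itself is replicated via Paxos and uses leases, not permanent ownership):

```python
import threading

class TinyLockService:
    """Toy sketch of lock-based master election: the first candidate to
    create the lock path becomes master; later candidates are rejected."""
    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}  # lock path -> owner id

    def try_acquire(self, path, owner):
        with self._guard:
            if path not in self._owners:
                self._owners[path] = owner
                return True
            return False

    def owner(self, path):
        with self._guard:
            return self._owners.get(path)
```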
3.2 Monitoring and Alerting
Google runs multiple instances of Borgmon to collect metrics, trigger alerts, compare pre‑ and post‑deployment performance, and track resource usage over time.
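At its core, this kind of monitoring is "collect time series, evaluate rules over them." A minimal sketch, with invented metric names and thresholds (Borgmon's actual rule language is far richer):

```python
def check_alerts(samples, rules):
    """Toy sketch of threshold alerting: compare the latest sample of
    each collected metric against its configured threshold."""
    alerts = []
    for metric, threshold in rules.items():
        series = samples.get(metric, [])
        if series and series[-1] > threshold:
            alerts.append(f"{metric} = {series[-1]} exceeds {threshold}")
    return alerts
```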
4. Software Infrastructure
Google’s low‑level software is highly multithreaded; each service runs an embedded HTTP server for debugging and statistics. All services communicate via Stubby, Google’s internal RPC framework; gRPC is its open‑source successor.
Clients act as frontends, servers as backends. Data is serialized with Protocol Buffers, which are language‑neutral, compact, and much faster than XML.
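Part of that compactness comes from the protobuf wire format's varint encoding, in which small integers take a single byte: each byte carries 7 value bits, with the high bit set on every byte except the last. A self‑contained implementation of just that encoding:

```python
def encode_varint(n: int) -> bytes:
    """Protocol-Buffers-style varint: 7 value bits per byte,
    continuation bit (0x80) set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")
```

For example, 300 encodes to the two bytes 0xAC 0x02, versus the many bytes an XML representation of the same field would require.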
5. Development Environment
All Google engineers share a single monolithic repository. Changes are submitted as Change Lists (CLs) and undergo code review. The build system distributes compilation across data‑center build servers, enabling massive parallel builds and continuous testing. Failed tests trigger notifications, and successful CLs can be auto‑deployed.
6. Shakespeare Search: A Demonstration Service
The service indexes all Shakespeare texts into a Bigtable and provides a frontend that handles user queries.
Batch processing (MapReduce) creates the index and writes it to Bigtable.
A frontend server receives HTTP requests, resolves the appropriate backend via GSLB, and forwards the query.
The backend queries Bigtable, returns results in a protobuf, which the frontend formats into HTML for the user.
Typical request flow (Figure 2‑4): DNS lookup → GSLB → Google Front End (GFE) → Shakespeare frontend → GSLB for backend address → backend → Bigtable → response.
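The frontend/backend split can be sketched end to end in a few functions. Everything here is a hypothetical stand‑in: the index content, the dict standing in for a protobuf response, and the direct function call standing in for a GSLB‑resolved RPC:

```python
# Toy index standing in for the Bigtable built by the MapReduce step.
INDEX = {"to be or not to be": ["Hamlet, Act 3, Scene 1"]}

def bigtable_lookup(phrase):
    return INDEX.get(phrase, [])

def backend(query):
    # In reality the backend returns a protobuf over RPC.
    return {"query": query, "hits": bigtable_lookup(query)}

def frontend(http_query):
    # In reality the backend address is resolved via GSLB first.
    result = backend(http_query)
    rows = "".join(f"<li>{h}</li>" for h in result["hits"])
    return f"<ul>{rows}</ul>"  # render HTML for the user
```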
Load testing shows each server can handle ~100 QPS; a peak of ~3,470 QPS therefore requires at least 35 instances, and N+2 redundancy raises that to 37 (one spare for planned maintenance, one for unexpected failure). Traffic is distributed globally, with replicas of Bigtable deployed per region to reduce latency.
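The capacity arithmetic above generalizes to a one‑line rule (function name is ours):

```python
import math

def required_tasks(peak_qps, qps_per_task, spares=2):
    """Minimum task count for peak load, plus N+2 spares: one task can
    be down for upgrade while another fails unexpectedly."""
    return math.ceil(peak_qps / qps_per_task) + spares
```

With the article's numbers, required_tasks(3470, 100) gives 37.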
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.