An Overview of Google’s Borg Cluster Management System
This article provides a comprehensive overview of Google’s Borg system, detailing its purpose, user perspective, workload types, cluster and cell architecture, job and task management, scheduling algorithms, scalability techniques, and availability mechanisms for large‑scale distributed environments.
1. Introduction
Borg is Google’s server‑cluster management system, comparable to Baidu’s Matrix, Alibaba’s Fuxi, Tencent’s Taifeng platform, and the open‑source Mesos.
Borg provides three main benefits: it hides the details of resource management and failure handling so users can focus on application development; it operates with very high reliability and availability, supporting equally reliable applications; and it lets Google run workloads across tens of thousands of machines effectively.
2. The user perspective
Borg is primarily aimed at system administrators and Google developers who run services and applications as jobs. Each job consists of one or more tasks and runs in a single cell, a set of machines managed as one unit (roughly corresponding to an IDC).
2.1 The workload
Workloads running on Borg usually fall into two categories:
prod: long‑running, latency‑sensitive services such as Gmail, Google Docs, Search, and internal infrastructure platforms like Bigtable, GFS.
non‑prod: batch jobs that are latency‑insensitive and typically finish within minutes to days.
These two types are often mixed within a cell, and scheduling policies must consider their differing characteristics and the IDC’s usage patterns (e.g., high daytime load for user‑facing services, low load at night, etc.).
The primary goal of Borg is to improve machine utilization.
Many Google application frameworks are built on top of Borg, such as MapReduce, FlumeJava, Millwheel, Pregel, as well as storage services like GFS, Bigtable, and Megastore. In these cases the framework’s master and jobs run inside Borg, distinct from Borg’s own master and job concepts.
In a representative cell, prod jobs are allocated about 70% of CPU capacity and 55% of memory capacity, yet account for roughly 60% of actual CPU usage and 85% of actual memory usage.
2.2 Clusters and cells
Data center → cluster → cell.
A cluster usually hosts one large cell and a few smaller test or special‑purpose cells; a median cell contains roughly 10,000 servers. Each machine advertises its resources and attributes: CPU, memory, network, disk, processor type, SSD, IP address, and so on.
When a user submits a job, Borg schedules it onto machines, monitors its state, and may restart it if it fails.
2.3 Jobs and tasks
A job’s attributes include name, owner, tasks, and scheduling constraints (e.g., CPU architecture, OS version, IP address). Constraints can be hard or soft.
Each job runs in a single cell and contains N tasks; each task may run multiple processes. Borg uses lightweight containers (cgroups) rather than full VMs for isolation.
Tasks also have attributes such as resource demand and an index; typically all tasks in a job share the same demand.
Users interact with jobs via RPCs to Borg, usually through a command‑line tool, other Borg jobs, or monitoring systems.
Jobs are described using Google’s internal BCL language, which can be updated via a state‑machine‑based process.
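For reference, the Borg paper includes a hello‑world job description in BCL along these lines (the resource values and replica count are from that example, not recommendations):

```
job hello_world = {
  runtime = { cell = 'ic' }             // which cell to run in
  binary = '.../hello_world_webserver'  // program to run
  args = { port = '%port%' }            // command-line parameters
  requirements = {
    ram = 100M
    disk = 100M
    cpu = 0.1
  }
  replicas = 10000
}
```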
Updates are performed in a rolling fashion and can be limited by the number of task disruptions they cause; updates that would cause more disruptions are skipped.
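The disruption‑limited rolling update can be sketched in a few lines of Python (function and field names here are illustrative, not Borg’s API):

```python
def rolling_update(tasks, new_spec, max_disruptions):
    """Apply new_spec to tasks one at a time, honoring a disruption budget:
    once max_disruptions tasks have been restarted, remaining updates are
    skipped rather than forced (mirrors the behavior described above)."""
    disruptions = skipped = 0
    for task in tasks:
        if task["spec"] == new_spec:
            continue                      # already up to date
        if disruptions >= max_disruptions:
            skipped += 1                  # would exceed the budget; leave it
            continue
        task["spec"] = new_spec           # restart the task with the new spec
        disruptions += 1
    return disruptions, skipped
```

Skipped tasks simply keep running the old spec until a later update round has budget for them.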
2.4 Allocs
An alloc is a reserved set of resources on a machine in which one or more tasks can run. All resources reserved for an alloc count as allocated even when not currently used, and cannot be handed to batch tasks. Alloc resources can be shared concurrently by multiple tasks or reused after a task finishes.
For example, a web‑server job and a log‑collection job can share the same alloc, allowing the log collector to pull logs from the server’s local disk.
Typically a task is associated with one alloc, and a job with an alloc set.
2.5 Priority, quota, and admission control
Each task has a priority (positive integer). Borg defines four priority bands: monitoring, production, batch, and best‑effort. High‑priority tasks can preempt lower‑priority ones, except production tasks cannot preempt each other.
Quota represents the amount of resources (CPU, memory, network bandwidth, disk, etc.) a task may consume. Higher‑priority tasks usually require larger quotas, and users are advised to request slightly more than expected to avoid being killed for over‑commitment.
Priority 0 tasks can request unlimited quota but may remain pending due to insufficient resources.
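The preemption rule above can be expressed as a small predicate. The numeric band ranges below are illustrative assumptions (Borg’s real values are not public); only the band ordering and the “production never preempts production” rule come from the text:

```python
# Illustrative priority bands; Borg's real numeric ranges are not public.
BANDS = {
    "best-effort": range(0, 100),
    "batch":       range(100, 200),
    "production":  range(200, 300),
    "monitoring":  range(300, 400),
}

def band(priority):
    for name, rng in BANDS.items():
        if priority in rng:
            return name
    raise ValueError(f"priority out of range: {priority}")

def can_preempt(attacker_prio, victim_prio):
    """Higher priority preempts lower, but production tasks never
    preempt each other (per the rule described above)."""
    if attacker_prio <= victim_prio:
        return False
    if band(attacker_prio) == band(victim_prio) == "production":
        return False
    return True
```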
2.6 Naming and monitoring
Borg Name Service (BNS) provides automatic service discovery. Each task receives a BNS name composed of cell name, job name, and task index; the name and the task’s hostname + port are stored in Chubby and resolved via DNS.
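The Borg paper gives a concrete pattern for these names: the fiftieth task of job jfoo run by user ubar in cell cc resolves as 50.jfoo.ubar.cc.borg.google.com. A one‑line helper shows the composition (the user component comes from the paper; the summary above omits it):

```python
def bns_name(cell, user, job, index):
    # Pattern from the Borg paper: <index>.<job>.<user>.<cell>.borg.google.com
    return f"{index}.{job}.{user}.{cell}.borg.google.com"
```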
Each task runs an internal HTTP server exposing health and performance metrics. Borg monitors these URLs and restarts tasks that return errors.
Google’s Sigma UI lets users view job, cell, and task status, resource utilization, logs, and history. Logs are rotated and retained for a period after task completion.
If a job is not running, Borg supplies a “why pending?” annotation and guidance for adjusting resource requests.
3. Borg architecture
Each cell contains a controller called Borgmaster, and every machine runs a Borglet agent. Both master and agent are written in C++.
3.1 Borgmaster
Each master runs two processes: the main Borgmaster process, which handles client requests (job creation, queries, etc.), manages state machines for all objects, communicates with Borglets, and offers a web UI as a backup to Sigma; and a separate scheduler process.
Five master replicas each keep an in‑memory copy of the cell state and persist it to a highly‑available Paxos‑based store. One replica is elected leader via Paxos and handles all state‑changing requests. If the leader fails, Chubby elects a new leader within ~10 seconds (up to a minute for very large cells).
Masters periodically checkpoint snapshots and change logs to enable recovery to any point in time.
3.2 Scheduling
When a job is submitted, its metadata is stored in the Paxos store and its tasks are placed in a pending queue. A separate scheduler process scans the queue asynchronously and, upon finding a machine that satisfies a task’s constraints and has sufficient resources, assigns the task to that machine.
The scan proceeds from high to low priority, using round‑robin within each priority to ensure fairness.
Scheduling consists of two phases:
Feasibility checking – find machines that can run the task.
Scoring – pick the best machine among the feasible ones.
During feasibility checking, the scheduler accounts for both free resources and resources held by lower‑priority jobs that can be preempted. For prod tasks, the machine’s available resources are reduced by the task’s limit; for non‑prod tasks, only the resources already in use are subtracted.
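The asymmetric accounting for prod versus non‑prod tasks can be sketched with a single resource dimension (class and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Machine:
    capacity: float        # total CPU (one dimension for simplicity)
    prod_limit_sum: float  # sum of resource limits of prod tasks placed here
    in_use: float          # resources actually being consumed right now

def available_for(machine, prod_task):
    """Resources a new task may count on during feasibility checking.

    A prod task is charged against the full limits of prod tasks already
    on the machine; a non-prod task only against actual usage, since
    lower-priority work can be reclaimed or preempted."""
    if prod_task:
        return machine.capacity - machine.prod_limit_sum
    return machine.capacity - machine.in_use
```

On a lightly loaded machine with large prod reservations, a non‑prod task therefore sees more headroom than a prod task would.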
Scoring uses Borg’s built‑in optimization metrics (minimize preemptions, prefer machines with cached packages, reduce hardware‑failure impact, mix priority levels, etc.) and may also incorporate user‑provided preferences.
Two scoring models are used:
E‑PVM (worst‑fit) – spreads tasks across machines to keep spare capacity, at the cost of fragmentation.
Best‑fit – packs tasks tightly to reduce fragmentation, which can hurt batch workloads.
Borg employs a hybrid model between worst‑fit and best‑fit to minimize idle resources.
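The two extremes being blended can be contrasted in a short sketch (the hybrid model itself is not public, so only worst‑fit and best‑fit are shown; the data shapes are illustrative):

```python
def pick_machine(feasible, demand, policy):
    """Choose among feasible (name, free_capacity) machines.

    worst-fit (E-PVM style) leaves the most headroom per machine but
    fragments capacity across the cell; best-fit packs tightly. Borg's
    actual hybrid sits between these, minimizing stranded resources."""
    leftover = lambda m: m[1] - demand
    if policy == "worst-fit":
        return max(feasible, key=leftover)
    if policy == "best-fit":
        return min(feasible, key=leftover)
    raise ValueError(policy)
```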
If the selected machine lacks sufficient free resources, preemption occurs: lower‑priority jobs are evicted until enough capacity is available. Preempted jobs return to the pending queue and may eventually die if no resources become free.
Because most packages are immutable, Borg prefers machines that already have the required package cached, reducing deployment latency (≈ 25 s).
3.3 Borglet
Borglet is the per‑machine agent responsible for starting/stopping tasks, restarting failed tasks, isolating resources via cgroups and kernel parameters, logging, and reporting task status.
Borgmaster periodically polls all Borglets to collect task states. Heartbeat messages may be compressed to send only diffs. If a Borglet becomes unresponsive, the master assumes the machine has failed and reschedules its tasks. If the Borglet later recovers, the master kills any lingering tasks on that machine.
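The diff‑based reporting can be sketched as a comparison of the previous and current task‑state maps (field names and shape are illustrative, not Borg’s wire format):

```python
def state_diff(prev, curr):
    """Build the delta a Borglet might report instead of its full state:
    only entries that changed since the last poll, plus entries that
    disappeared."""
    changed = {k: v for k, v in curr.items() if prev.get(k) != v}
    removed = [k for k in prev if k not in curr]
    return {"changed": changed, "removed": removed}
```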
Master failures do not affect Borglets or running tasks, and Borglet crashes do not affect tasks.
3.4 Scalability
A typical Borgmaster manages thousands of machines; busy cells may see >10 k task submissions per minute and require 10–14 CPU cores and >50 GB RAM. To handle scale, Borgmaster separates the scheduler into its own process, allowing parallelism and fault‑tolerance.
The scheduler’s responsibilities include receiving cell state from the elected master, updating its local copy, performing pre‑scheduling, and notifying the master of scheduling decisions.
Additional scalability optimizations:
Score caching – reuse machine scores until resources or task attributes change.
Equivalence classes – group tasks with identical requirements and score the class once.
Relaxed randomization – sample a subset of machines or dimensions for scoring to improve efficiency.
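Relaxed randomization, the last optimization above, can be sketched as an early‑exit scan (the `enough` threshold is an illustrative knob, not Borg’s actual setting):

```python
import random

def relaxed_randomized_pick(machines, is_feasible, score, enough=16):
    """Scan machines in random order and stop once `enough` feasible
    candidates are found, instead of scoring every machine in the cell;
    the best-scoring candidate from the sample wins."""
    order = list(machines)
    random.shuffle(order)
    candidates = []
    for m in order:
        if is_feasible(m):
            candidates.append(m)
            if len(candidates) >= enough:
                break
    return max(candidates, key=score) if candidates else None
```

The trade‑off is a slightly less optimal placement in exchange for a scheduling pass whose cost no longer grows with cell size.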
4. Availability
In large distributed systems, single points of failure are common. Borg improves job availability through several mechanisms:
Automatically rescheduling evicted tasks on new machines.
Spreading tasks across failure domains (machines, racks, power zones) to reduce correlated failures.
Rate‑limiting task disruptions and limiting the number of simultaneously down tasks during maintenance.
Using declarative desired‑state representations and idempotent operations so failed clients can safely retry.
Rate‑limiting placement of tasks on unreachable machines to distinguish between large‑scale failures and network partitions.
Avoiding repeat task‑machine pairings that cause crashes.
Re‑running log‑saver tasks to recover intermediate data written to local disks, with configurable retry periods (commonly a few days).
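The failure‑domain spreading from the list above can be sketched as round‑robin placement across racks (rack granularity and the data shapes are illustrative; Borg also spreads across machines and power zones):

```python
def spread_tasks(task_count, machines_by_rack):
    """Place task replicas round-robin across racks so that losing any
    single rack takes down as few replicas as possible."""
    racks = sorted(machines_by_rack)
    free = {r: list(ms) for r, ms in machines_by_rack.items()}
    placement = {}
    cursor = 0
    for task in range(task_count):
        for _ in range(len(racks)):            # find next rack with capacity
            rack = racks[cursor % len(racks)]
            cursor += 1
            if free[rack]:
                placement[task] = (rack, free[rack].pop())
                break
    return placement
```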