
GitHub GLB Director: Open‑Source High‑Performance Data‑Center Load Balancer

GitHub’s GLB Director is an open‑source, layer‑4 load balancer designed for data‑center environments. It scales a single IP across thousands of servers by combining ECMP routing, a stateless director layer, DPDK‑accelerated packet processing, and health‑check mechanisms, providing high availability without disrupting existing connections.

GitHub serves tens of thousands of requests per second from its metal cloud at the network edge. The previously introduced GLB is a scalable load‑balancing solution for bare‑metal data centers, supporting most external services and critical internal systems such as high‑availability MySQL clusters. The GLB Director is now open‑sourced.

GLB Director is a layer‑4 load balancer that expands a single IP across many physical machines while minimizing connection interruptions during changes. It sits in front of existing TCP services (e.g., HAProxy, Nginx) and allows them to scale without each machine needing a unique IP.

Using ECMP to Scale a Single IP

To handle more traffic with a single IP, traffic must be split among backend servers and the load balancer itself must be scalable, which introduces a second layer of load balancing.

In simple routing a single best next‑hop is chosen, but most networks have multiple equal‑cost paths. ECMP distributes packets across these paths using a hash of source/destination IPs and ports, keeping packets of the same TCP flow on the same path while allowing path changes without breaking connections.
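The flow‑hashing idea behind ECMP can be sketched in a few lines of Python. This is an illustrative model, not router firmware: the path names and the CRC32 hash are stand‑ins for whatever hash a real router implements, but the key property is the same, every packet of a TCP flow hashes to the same path.

```python
import zlib

def flow_hash(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Hash the TCP 4-tuple; every packet of a flow hashes identically."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key)

# Hypothetical equal-cost next hops.
paths = ["via-router-a", "via-router-b", "via-router-c"]

def pick_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    """Select one of the equal-cost paths by hashing the flow."""
    return paths[flow_hash(src_ip, dst_ip, src_port, dst_port) % len(paths)]

# The same flow always maps to the same path, so a connection is never
# split across paths mid-stream.
assert pick_path("203.0.113.7", "192.0.2.1", 50123, 443) == \
       pick_path("203.0.113.7", "192.0.2.1", 50123, 443)
```

The same mechanism is what lets ECMP shard traffic across servers announcing the same IP, as described next.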

ECMP can also shard traffic across multiple servers that share the same IP via BGP or similar protocols, but it suffers when servers change because the stateless router may rebalance connections, causing interruptions.

Separating Director and Proxy

Traditional ECMP lacks per‑connection context. GitHub introduced a “director” layer that receives packets via ECMP but performs its own hash and stateful selection of backend proxies, keeping connections stable when proxies are added or removed.

The director stores only a static binary forwarding table (≈65 k rows, ~512 KB) mapping each hash bucket to a primary and secondary server. Each packet’s flow hash is used directly as an index into this array, so every packet of a connection deterministically selects the same primary/secondary pair without the director keeping any per‑connection state.

The table respects several invariants: relative order of servers is preserved, each server appears at most once per row, and distribution across rows and columns is roughly uniform. A hash‑based ordering of server IPs satisfies these constraints.

Proxy‑Related Operations

When adding or removing a proxy, the forwarding table is updated by swapping primary/secondary entries for the affected rows, allowing a graceful “drain” of traffic from a retiring proxy and a “fill” for a new one.
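The drain operation reduces to a pure table transformation: wherever the retiring proxy is primary, swap it with the secondary. A minimal sketch, assuming the `(primary, secondary)` table representation from above:

```python
def drain(table: list[tuple[str, str]], proxy: str) -> list[tuple[str, str]]:
    """Drain a proxy: demote it to secondary in every row where it is
    primary. New connections hash to the old secondary, while packets of
    established flows can still reach the draining proxy because it
    remains present in the row."""
    return [
        (secondary, primary) if primary == proxy else (primary, secondary)
        for primary, secondary in table
    ]

table = [("A", "B"), ("B", "A"), ("A", "C")]
drained = drain(table, "A")   # "A" is now secondary everywhere
```

Filling in a new proxy is the mirror image: it enters as secondary first, then gets promoted once it is known healthy.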

Only one proxy at a time may be in a non‑active (draining or filling) state, which keeps state management simple; a change is complete once flows to the affected proxy have drained, so transitions are bounded by the lifetime of the longest‑lived connections.

Data‑Center Encapsulation

To convey secondary‑server information without breaking existing routers, GitHub moved from IP‑in‑IP tunnels to UDP‑based encapsulations: first Foo‑over‑UDP, then custom GRE payloads, and finally Generic UDP Encapsulation (GUE) with the secondary IP placed in a private GUE header.

Using UDP allows per‑connection hash‑based source ports, enabling ECMP distribution across NIC receive queues, something not possible with pure IP‑in‑IP.
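The shape of such an encapsulation can be sketched with plain `struct` packing. This is a simplified illustration of the idea, not GLB’s on‑wire format: the GUE destination port, the private‑data layout carrying the secondary hop, and the hash‑derived source port are all assumptions made for the example.

```python
import struct
import zlib

GUE_PORT = 19523  # hypothetical tunnel destination port

def encapsulate(inner: bytes, flow_key: bytes, secondary_ip: bytes) -> bytes:
    """GUE-style encapsulation sketch: an outer UDP header whose source
    port is derived from the inner flow's hash (so ECMP and NIC receive
    queues can spread flows), a private extension carrying the secondary
    server's address, then the inner packet."""
    # Pin the source port into a high range, varying per flow.
    src_port = 0x8000 | (zlib.crc32(flow_key) & 0x7FFF)
    length = 8 + 4 + len(secondary_ip) + len(inner)
    udp_header = struct.pack("!HHHH", src_port, GUE_PORT, length, 0)
    private = struct.pack("!I", len(secondary_ip)) + secondary_ip
    return udp_header + private + inner
```

Because the source port is a function of the inner flow, all packets of one connection take the same path and land on the same receive queue, while different connections fan out, exactly the property pure IP‑in‑IP cannot provide.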

When a proxy sends traffic back to the client, it can bypass the director and use Direct Server Return, a common pattern for content providers where outbound traffic dominates.

Introducing DPDK

The GLB Director has been rewritten to use DPDK, allowing user‑space packet processing at line rate on ordinary NICs and scaling across many CPU cores. DPDK works alongside the Linux stack via SR‑IOV and flow bifurcation, directing traffic destined for the GLB IP to a virtual function while keeping other traffic on the physical function.

The director supports IPv4/IPv6 TCP payloads and inbound ICMP “Fragmentation Required” messages for Path MTU Discovery.
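PMTUD works because an ICMP “Fragmentation Required” message embeds the header of the packet that was too big, so the director can recover the TCP flow it belongs to and route the ICMP to the same proxy. A sketch of that extraction, assuming a 20‑byte inner IPv4 header with no options (offsets are illustrative):

```python
import struct

def flow_key_from_frag_needed(icmp_payload: bytes) -> tuple:
    """Recover the original flow's 4-tuple from the IPv4 header (plus
    first 8 payload bytes) embedded in an ICMP Fragmentation Required
    message. The embedded packet traveled proxy -> client, so the tuple
    is swapped to match the client -> proxy flow the director hashes."""
    src_ip = icmp_payload[12:16]                       # inner source address
    dst_ip = icmp_payload[16:20]                       # inner destination address
    src_port, dst_port = struct.unpack("!HH", icmp_payload[20:24])
    return (dst_ip, src_ip, dst_port, src_port)
```

Hashing this recovered tuple with the same function used for data packets sends the ICMP to the proxy that owns the connection, which can then lower its path MTU.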

Testing with Scapy

To validate the DPDK implementation, a test harness uses Scapy and libpcap‑based virtual interfaces to generate end‑to‑end packet flows, including full GUE encapsulation, ICMPv4/v6 handling, and TCP port hashing.

Health Checks

A service called glb-healthcheck continuously probes each backend’s GUE tunnel and HTTP port; on failure it promotes a standby to primary, providing graceful failover without breaking existing connections.
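The failover itself is again just a table transformation driven by probe results. The sketch below pairs a trivial TCP‑connect probe (a stand‑in for glb-healthcheck’s real GUE‑tunnel and HTTP checks) with the promotion rule; function names and the health‑map shape are assumptions for illustration.

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Crude liveness probe: can we open a TCP connection? glb-healthcheck
    additionally checks the GUE tunnel and an HTTP endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def apply_health(table: list[tuple[str, str]],
                 healthy: dict[str, bool]) -> list[tuple[str, str]]:
    """Where a row's primary is unhealthy but its secondary is healthy,
    swap them. Established flows to the failed primary fall through to
    the promoted secondary instead of being reset."""
    out = []
    for primary, secondary in table:
        if not healthy.get(primary, False) and healthy.get(secondary, False):
            out.append((secondary, primary))
        else:
            out.append((primary, secondary))
    return out
```

Because every director applies the same probe results to the same static table, they all converge on the same forwarding decisions without coordinating.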

iptables “Second Chance” Module

Each proxy runs a Netfilter target that inspects incoming GUE packets; if the packet does not belong to a local TCP flow, it is forwarded to the next proxy, giving the connection a “second chance” to be handled elsewhere.
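The actual module is kernel C, but its decision logic is small enough to model in Python. A sketch under the assumption that a flow is represented by its 4‑tuple and that local flows come from the host’s TCP state:

```python
def second_chance(flow: tuple, is_syn: bool,
                  local_flows: set, secondary_ip: str) -> tuple:
    """Model of the 'second chance' decision: a decapsulated packet that
    starts a new connection (SYN) or matches a locally tracked flow is
    delivered here; otherwise the still-encapsulated packet is forwarded
    to the secondary server named in the GUE header."""
    if is_syn or flow in local_flows:
        return ("deliver", None)
    return ("forward", secondary_ip)
```

For example, a mid‑stream packet that lands on a freshly filled proxy with no matching socket gets forwarded to the draining proxy that still holds the connection, which is what keeps long‑lived flows alive across table changes.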

Open‑Source Release

The complete GLB Director, including director, proxy, DPDK integration, health‑check service, and iptables module, is available at https://github.com/github/glb-director for anyone to use as a generic data‑center load‑balancing solution.

Tags: high availability, load balancing, open source, DPDK, data center, ECMP
Written by High Availability Architecture.