
Dynamic Kubernetes Cluster Scaling at Airbnb

Airbnb’s engineering team describes how they migrated to Kubernetes, evolved their clusters through three stages, and built a custom gRPC‑based expander for the Cluster Autoscaler to achieve flexible, cost‑effective, and automated scaling across hundreds of heterogeneous clusters.

Cloud Native Technology Community

Airbnb's Kubernetes Clusters

Over the past few years, Airbnb migrated almost all of its online services from manually managed EC2 instances to Kubernetes, and now operates thousands of nodes across nearly a hundred clusters. The evolution falls into three stages: homogeneous clusters with manual scaling, multiple cluster types with independent scaling, and heterogeneous clusters with automated scaling.

Stage 1: Homogeneous Clusters, Manual Scaling

Initially each service ran on dedicated machines and capacity was manually allocated. After moving to Kubernetes, services ran in a multi‑tenant environment, reducing waste and consolidating capacity management in the control plane, though scaling remained manual.

Stage 2: Multiple Cluster Types, Independent Scaling

Different workloads required different configurations, so Airbnb introduced the abstraction of cluster types. With more clusters to manage, manual scaling became unsustainable, so the Kubernetes Cluster Autoscaler was added to each cluster. It adds nodes when pods are pending and removes under‑utilized nodes, which saved about 5% of cloud costs.
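The two autoscaler decisions described above (add nodes for pending pods, remove under-utilized nodes) can be illustrated with a heavily simplified sketch. It is not the Cluster Autoscaler's algorithm, which simulates the scheduler and checks whether pods can be moved elsewhere. The `Node` type, the CPU-only model, the first-fit-decreasing packing, and the 0.5 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity: float   # cores the node offers
    cpu_requested: float  # cores requested by the pods running on it


def nodes_to_add(pending_pod_cpu, node_cpu):
    """First-fit-decreasing estimate of new nodes needed for pending pods."""
    free = []  # remaining CPU on each node we plan to add
    for request in sorted(pending_pod_cpu, reverse=True):
        if request > node_cpu:
            raise ValueError("pod does not fit on this node type")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] = remaining - request
                break
        else:
            # No planned node has room, so plan one more.
            free.append(node_cpu - request)
    return len(free)


def removal_candidates(nodes, utilization_threshold=0.5):
    """Names of nodes whose requested CPU is below the threshold."""
    return [n.name for n in nodes
            if n.cpu_requested / n.cpu_capacity < utilization_threshold]
```

For example, three pending pods requesting 2.5 cores each need three 4-core nodes in this model, because no two of them fit on the same node.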

Stage 3: Heterogeneous Clusters, Automated Scaling

With over 30 cluster types and 100+ clusters, management grew complex. By consolidating into heterogeneous clusters under a single control plane, Airbnb reduced testing overhead, improved utilization, and enabled custom scaling logic beyond the default autoscaler.

Cluster Autoscaler Improvements

Custom gRPC Expander

The most significant improvement is a pluggable custom expander, the component that decides which node group to scale up. A gRPC client built into the Cluster Autoscaler (the Expander) sends protobuf‑encoded node‑group information to an external gRPC server, which implements the business‑specific scaling decisions.

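syntax = "proto3";

import "k8s.io/api/core/v1/generated.proto";

// Field numbers are added so the definition compiles; the source omits them.
service Expander {
  // Receives the candidate scale-up options and returns the preferred subset.
  rpc BestOptions (BestOptionsRequest) returns (BestOptionsResponse);
}
message BestOptionsRequest {
  repeated Option options = 1;
  // The map's type parameters are missing from the source text;
  // nodeGroupId (string) -> k8s.io.api.core.v1.Node is an assumption.
  map<string, k8s.io.api.core.v1.Node> nodeInfoMap = 2;
}
message BestOptionsResponse {
  repeated Option options = 1;
}
message Option {
  // ID that uniquely identifies the nodeGroup
  string nodeGroupId = 1;
  int32 nodeCount = 2;
  string debug = 3;
  repeated k8s.io.api.core.v1.Pod pod = 4;
}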
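To make the server side more concrete, here is a minimal sketch of the kind of decision a `BestOptions` handler could make. It uses plain Python dataclasses instead of generated gRPC stubs so that it runs on its own. The `Option` class only mirrors the message above, and the price table, node-group names, and "cheapest total cost wins" rule are hypothetical, not Airbnb's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    # Mirrors the fields of the Option message.
    node_group_id: str
    node_count: int
    debug: str = ""
    pods: list = field(default_factory=list)  # pods this option would schedule


# Hypothetical hourly price per node for each node group.
PRICE_PER_NODE = {"spot-m5": 0.04, "ondemand-m5": 0.10}


def best_options(options, price_per_node=PRICE_PER_NODE):
    """Return the option(s) with the lowest total hourly cost."""
    priced = [(price_per_node[o.node_group_id] * o.node_count, o)
              for o in options if o.node_group_id in price_per_node]
    if not priced:
        # No pricing data: express no preference and return every option.
        return list(options)
    cheapest = min(cost for cost, _ in priced)
    return [o for cost, o in priced if cost == cheapest]
```

Given three `spot-m5` nodes or two `ondemand-m5` nodes as candidates, this sketch keeps only the spot option, since its total hourly cost is lower under the assumed prices.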

The server runs independently of the autoscaler, so the business logic can be iterated on quickly without modifying the autoscaler core. Multiple expanders can also be configured as fallbacks for fault tolerance.
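The fallback idea can be sketched as a chain of expanders tried in order. This is an illustration under assumptions, not the upstream implementation: `run_chain`, `unreachable`, and `fewest_nodes` are hypothetical names, and options are modeled as `(node_group_id, node_count)` tuples.

```python
import random


def run_chain(options, expanders, rng=random):
    """Narrow the options with each expander in turn, skipping any that fail."""
    remaining = list(options)
    for expand in expanders:
        try:
            narrowed = expand(remaining)
        except Exception:
            # For example, the external gRPC server is unreachable:
            # keep the current options and let the next expander decide.
            continue
        if narrowed:
            remaining = list(narrowed)
        if len(remaining) == 1:
            break
    # If several options survive, pick one of them at random.
    return rng.choice(remaining) if remaining else None


def unreachable(options):
    """Stands in for a custom expander whose server is down."""
    raise ConnectionError("expander server unavailable")


def fewest_nodes(options):
    """Stands in for a built-in fallback: prefer the smallest scale-up."""
    smallest = min(count for _, count in options)
    return [o for o in options if o[1] == smallest]
```

Calling `run_chain([("a", 3), ("b", 1)], [unreachable, fewest_nodes])` skips the failing expander and lets the fallback choose `("b", 1)`, so a scale-up decision is still made while the custom server is down.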

Airbnb has used this solution internally since 2022, and the custom expander was accepted upstream and will be available in Cluster Autoscaler v1.24.0.

Conclusion

Over the past four years, Airbnb’s enhancements to the Cluster Autoscaler have enabled sophisticated, cost‑aware scaling strategies across heterogeneous clusters, and the features contributed upstream benefit the broader Kubernetes community.

Tags: kubernetes, gRPC, Airbnb, cloud scaling, Cluster Autoscaler
Written by

Cloud Native Technology Community

The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
