
High‑Availability Architecture for etcd in Ant Group’s Massive Kubernetes Clusters

The article describes how Ant Group operates one of the world's largest Kubernetes deployments, at more than 10,000 nodes; details the performance challenges the etcd key-value store faces at that scale; and outlines a comprehensive set of hardware upgrades, configuration tuning, monitoring, data-splitting, and future distributed-etcd strategies for achieving robust high availability.


Ant Group maintains what is arguably the world’s largest Kubernetes (k8s) cluster, exceeding the official 5k‑node scalability guideline and reaching more than 10k nodes, effectively turning the official “Mount Tai” benchmark into a “Mount Everest” for k8s scale‑out.

The stability of this massive cluster hinges on the reliability of its foundational KV store, etcd, which stores all k8s resources, custom CRDs, and event data. The article lists the roles of etcd and its surrounding components (kube‑apiserver, kubelet, controller‑manager, scheduler) and explains why etcd functions both as a KV database and a message router.

At such scale, etcd faces extreme pressure: more than 1 million KV entries, more than 100 k event entries, read traffic peaking above 300 k queries per minute (qpm), write traffic peaking above 200 k qpm, CPU usage often above 900 % of a single core, memory above 60 GiB, disk usage above 100 GiB, and thousands of goroutines and OS threads. These pressures caused severe latency spikes, OOM kills, and multi-minute list-all operations once the cluster grew beyond 7 k nodes.

To address these issues, Ant Group applied a series of high‑availability strategies:

Hardware upgrades: replacing SATA disks with NVMe SSDs (boosting random-write throughput to >70 MiB/s) and mounting event-heavy etcd instances on tmpfs for a ~20 % performance gain.
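A minimal sketch of the tmpfs approach for an event-only etcd instance; the paths, sizes, and instance name below are illustrative, not Ant Group's actual configuration:

```shell
# Mount a RAM-backed filesystem for the event etcd's data directory.
# tmpfs contents are lost on reboot, which is acceptable here only
# because k8s event data is disposable.
mkdir -p /var/lib/etcd-events
mount -t tmpfs -o size=16G tmpfs /var/lib/etcd-events

# Point the dedicated event etcd instance at the tmpfs-backed directory.
etcd --name etcd-events-0 \
     --data-dir /var/lib/etcd-events \
     --quota-backend-bytes=8589934592
```

Trading durability for latency is reasonable for events precisely because losing them degrades observability, not cluster state.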

Filesystem tuning: testing XFS with larger block size (16 KiB) showed modest write improvements, but further gains required memory‑index optimizations.
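The block-size experiment can be sketched as follows; note that mounting XFS with a block size larger than the system page size (4 KiB on x86-64) requires a recent kernel with XFS large-block support, so verify compatibility before applying this to a real etcd host (device path is illustrative):

```shell
# Format the etcd NVMe partition as XFS with a 16 KiB block size
# (default is 4 KiB); larger blocks can reduce metadata overhead
# for boltdb's large sequential writes.
mkfs.xfs -f -b size=16384 /dev/nvme0n1p1
mount /dev/nvme0n1p1 /var/lib/etcd
```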

Kernel tuning: disabling transparent huge pages to eliminate performance jitter.
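Disabling transparent huge pages is a standard Linux tuning step for mmap-heavy stores like etcd's boltdb; a typical sketch:

```shell
# Disable THP at runtime; THP defrag/compaction can stall
# etcd's memory-mapped boltdb and cause latency jitter.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# To persist across reboots, add this kernel boot parameter:
#   transparent_hugepage=never
```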

Configuration tuning focused on etcd’s write‑batch and compaction parameters. Batch write size and interval were reduced proportionally to node count, while compaction intervals were lengthened at the etcd layer (e.g., 1 hour) and fine‑tuned at the kube‑apiserver layer based on cluster size and traffic patterns. Sleep intervals and batch limits for compaction were also adjusted to avoid lock contention.
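The tuning described above maps onto existing etcd and kube-apiserver flags; the specific values below are illustrative placeholders, not the numbers Ant Group shipped:

```shell
# etcd side: smaller/faster backend batch commits than the defaults
# (limit 10000 / interval 100ms), plus hourly MVCC compaction.
etcd --backend-batch-limit=5000 \
     --backend-batch-interval=50ms \
     --auto-compaction-mode=periodic \
     --auto-compaction-retention=1h

# kube-apiserver side: how often the apiserver asks etcd to compact
# old revisions (default 5m; lengthened here for a large, busy cluster).
kube-apiserver --etcd-compaction-interval=10m
```

Lengthening compaction intervals trades disk/revision growth for fewer compaction-induced lock pauses on the hot path.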

Operational practices include a dedicated etcd monitoring platform that provides metrics such as longest‑N KV, top‑N KV, namespace usage, verb‑resource statistics, connection counts, client sources, and redundant data analysis. These insights enable actions like client rate‑limiting, load balancing, cluster splitting, and removal of stale data, which together reduce latency and improve QPS.

Cluster splitting strategies separate high‑volume event data into dedicated etcd clusters, and further partition data by resource type (pods/configmaps, nodes/services, events/leases). This reduces per‑node data size and balances client load across etcd nodes.
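Routing high-volume resources to their own etcd clusters is supported natively by kube-apiserver's `--etcd-servers-overrides` flag; a sketch with illustrative endpoints:

```shell
# Send core-group events to a dedicated etcd cluster while all other
# resources stay on the main cluster. Override format is
# group/resource#servers, with servers separated by semicolons
# (quote the value so the shell does not split on ';').
kube-apiserver \
  --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379 \
  --etcd-servers-overrides='/events#https://etcd-events-0:2379;https://etcd-events-1:2379'
```

Further splits (e.g., pods or leases) follow the same pattern, one override entry per resource.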

Future work focuses on scaling etcd out into distributed clusters, exploring both proxy-based and proxy-less designs. Proxy-based setups incur a 20-25 % performance penalty, while proxy-less designs promise roughly 13 % improvements in overall latency and QPS. Additional directions include leveraging upstream community features, integrating multi-boltdb architectures, and adopting alternative KV back-ends such as an OBKV-backed etcd API.
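For context on the proxy-based option, etcd ships a built-in gRPC proxy that fans client traffic out across members and can coalesce watches; a minimal sketch with illustrative endpoints:

```shell
# Start a stateless gRPC proxy in front of a three-member cluster.
etcd grpc-proxy start \
  --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --listen-addr=0.0.0.0:23790

# Clients then target the proxy instead of the members directly:
etcdctl --endpoints=http://127.0.0.1:23790 endpoint status
```

The extra network hop and serialization through the proxy is the source of the 20-25 % penalty the article cites.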

Overall, Ant Group’s continuous investment in etcd performance, monitoring, and architectural evolution has kept the k8s control plane free of P‑level failures for over six months and positions the platform to handle even larger node counts in the future.

Tags: cloud-native · high availability · kubernetes · performance tuning · etcd · scale-out
Written by the High Availability Architecture official account.