How Ant Group Scales etcd for 10k‑Node Kubernetes Clusters: High‑Availability Secrets
This article examines Ant Group's strategies for achieving high availability of the etcd key‑value store in a massive 10,000‑node Kubernetes cluster, detailing challenges, performance metrics, filesystem upgrades, tuning parameters, operational platform insights, and future directions for distributed etcd deployments.
Ant Group operates one of the world’s largest Kubernetes clusters, exceeding the official 5k‑node limit and reaching 10k nodes. The article focuses on the high‑availability construction of the etcd layer, the key KV store that underpins the cluster.
Challenges
etcd is the cluster's core KV database: the kube-apiserver proxies all reads and writes to it, while kubelet, controller-manager, and scheduler act as producers and consumers of its data. In a cluster of more than 7k nodes, the KV size, event volume, and read/write pressure all explode, leading to minutes-level request latency, OOM kills, and very slow list-all operations.
KV data > 1M entries
Event data > 100k entries
Read QPM > 300k, event read > 10k QPM
Write QPM > 200k, event write > 15k QPM
CPU usage often > 900%
Memory RSS > 60 GiB
Disk usage > 100 GiB
Goroutine count > 9k
User‑space threads > 1.6k
GC latency ~15 ms
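To put the raw QPM figures into more familiar per-second terms, a quick illustrative conversion (the numbers are the thresholds quoted above, not exact production measurements):

```shell
# Convert the reported queries-per-minute figures to per-second rates.
read_qpm=300000
write_qpm=200000
read_qps=$((read_qpm / 60))    # 5000 reads/s
write_qps=$((write_qpm / 60))  # ~3333 writes/s
echo "reads/s:  $read_qps"
echo "writes/s: $write_qps"
```

So a single etcd serving this cluster sustains on the order of 5k reads and 3.3k writes per second.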
High‑Availability Strategies
Improving distributed system availability can be done by:
Enhancing stability and performance of the component itself.
Fine‑grained upstream traffic management.
Ensuring downstream service SLOs.
etcd already provides solid stability; Ant Group further improves resource utilization through scale‑out and scale‑up techniques.
Filesystem Upgrade
Switching from SATA disks to NVMe SSDs raised random write throughput to over 70 MiB/s. Using tmpfs for the event etcd cluster added roughly another 20% performance gain. Changing the underlying filesystem from ext4 to XFS with a 16 KiB block size gave modest write improvements, while further gains now come from memory indexing.
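A minimal sketch of the storage setup described above; device names, mount points, and sizes are placeholders, not Ant Group's actual configuration:

```shell
# Event etcd: keep its data dir on tmpfs. Event data is re-creatable,
# so volatility is acceptable; size the mount above the backend quota.
mount -t tmpfs -o size=16G tmpfs /var/lib/etcd-events

# Main etcd: XFS with a 16 KiB block size on an NVMe device.
# Note: Linux can only mount an XFS filesystem whose block size is
# <= the CPU page size, so 16 KiB blocks require a platform with
# 16 KiB (or larger) pages.
mkfs.xfs -b size=16384 /dev/nvme0n1
mount /dev/nvme0n1 /var/lib/etcd
```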
etcd Tuning
Key tunable parameters include write batch size (default 10k) and interval (100 ms), and compaction settings (interval, sleep interval, batch limit). In large clusters the batch values should be reduced proportionally to node count, and compaction intervals can be lengthened (up to 1 hour) to avoid frequent pauses.
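The tuning knobs above map onto etcd's command-line flags roughly as follows (flag names as of etcd v3.5; the values are illustrative directions, not Ant Group's production settings):

```shell
# Shrink the write-batch limit below the 10k default for very large
# clusters, and lengthen periodic compaction to avoid frequent pauses.
etcd \
  --backend-batch-limit=5000 \
  --backend-batch-interval=100ms \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=1h \
  --experimental-compaction-batch-limit=1000 \
  --experimental-compaction-sleep-interval=100ms
```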
Compaction can be delegated to the kube‑apiserver layer, allowing dynamic adjustment based on traffic patterns (e.g., longer intervals during low‑load periods).
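kube-apiserver already exposes a flag for driving etcd compaction itself, which makes the interval easy to retune per traffic pattern (the endpoint below is a placeholder):

```shell
# Let kube-apiserver request compaction on a longer cycle, e.g. during
# low-load periods (the upstream default is 5 minutes).
kube-apiserver \
  --etcd-servers=https://etcd-1:2379 \
  --etcd-compaction-interval=30m
```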
Operations Platform
The platform provides analysis functions such as the N largest KVs, the N most-accessed KVs, top N namespaces, per-verb/per-resource request statistics, connection counts, client-source statistics, and redundant-data analysis. Based on these insights, Ant Group performs client rate-limiting, load balancing, cluster splitting, redundant-data deletion, and fine-grained traffic analysis.
Splitting the event data into a separate etcd cluster reduces KV size and external traffic, improving RT and QPS. Removing cold data that has not been accessed for weeks can cut memory key count by ~20% and halve latency.
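The event split is supported natively by kube-apiserver, which can route one resource type to a dedicated etcd cluster (endpoints below are placeholders):

```shell
# Send Event objects to their own etcd cluster, keeping the main
# cluster's KV size and traffic down.
kube-apiserver \
  --etcd-servers=https://etcd-main-1:2379 \
  --etcd-servers-overrides=/events#https://etcd-events-1:2379
```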
Future Roadmap
Beyond scale‑up, Ant Group plans to explore distributed etcd clusters (scale‑out) using both proxy‑based and proxy‑less designs. Proxy‑less clusters avoid the 20‑25% performance penalty of proxy routing. Additional work includes tracking upstream etcd features, optimizing compaction algorithms, and integrating multiboltdb architectures.
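For the proxy-based variant, etcd ships a gRPC proxy that can coalesce watches and serve cached serializable reads; a minimal invocation (endpoints are placeholders) looks like:

```shell
# Fan client traffic through etcd's gRPC proxy. This is the design
# that carries the ~20-25% routing overhead mentioned above.
etcd grpc-proxy start \
  --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
  --listen-addr=127.0.0.1:23790
```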
References:
https://www.kubernetes.org.cn/9284.html
https://tech.meituan.com/2020/08/13/openstack-to-kubernetes-in-meituan.html
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles on operations transformation.