
Ensuring Stability and Scalability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

The article explains why operating massive Kubernetes clusters is as challenging as building large systems, outlines three critical stability questions, shares real‑world data collection, visualization, and tooling practices, and provides concrete recommendations for high‑availability, monitoring, and performance optimization.

JD Tech Talk

Some claim that Kubernetes is already mature, open‑source, and equipped with many deployment and monitoring tools, so cluster operations should be easy. In reality, large‑scale cluster operation demands extensive experience, a mature framework, and auxiliary toolchains, making it as difficult as developing a large‑scale system and requiring fine‑grained, strict operational practices.

Because the work deals directly with production environments, even minor oversights can cause catastrophic failures. Before moving to production, teams should ask whether they are truly prepared for massive clusters; many feel lost, so this article shares practical experience to help them get started.

Stability – Three Key Questions

Question 1: Does the failure or congestion of any component affect already‑running containers?

Kubernetes consists of many independent components (cluster management, networking, storage, image handling, etc.). In OpenStack, component failures have limited impact on running workloads because usage is relatively static. In Kubernetes, the presence of controllers introduces higher risk. For example, if several apiserver nodes fail and only one remains, the flood of requests from kubelets, the scheduler, controllers, and external clients can overload that last apiserver, which then rejects many requests with 429 (Too Many Requests) errors. Kubelet heartbeats fail to get through, nodes are marked NotReady, and services lose traffic, even though the containers themselves remain healthy.

Similarly, a network metadata store failure could disrupt container networking. Overall, any component’s failure or congestion can lead to container recreation, network outages, or cascading failures, so each component must be rigorously tested and its behavior fully understood.

Question 2: Can the cluster recover from any component failure?

Component failures become more likely at scale, so high‑availability designs and disaster‑recovery plans are essential. Critical components need predefined recovery procedures and regular drills. Using etcd as an example, multiple recovery strategies were practiced: restoring from the original etcd nodes, migrating data to new nodes, or restoring from scheduled backups (which may lose some data). Even the worst‑case migration was rehearsed to ensure cluster stability.

Question 3: Does every component have proper alerts and handling mechanisms?

Since component failures are inevitable, robust monitoring and alerting are required. Alerts should be defined for resource usage, health checks, and performance metrics of both the underlying machines and the components themselves. Traditional application monitoring techniques can be applied to these platform components to enable automated remediation.
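The same threshold-driven alerting used for applications can be applied to platform components. The sketch below is purely illustrative (the metric names, threshold values, and function names are assumptions, not from the article); it shows the shape of a rule that compares collected component metrics against limits and emits alerts for automated handling.

```python
# Hypothetical threshold-based alerting over component metrics.
# Metric names and limits below are illustrative examples only.
THRESHOLDS = {
    "apiserver_latency_ms": 500,   # p99 request latency
    "etcd_db_size_mb": 1800,       # approaching the 2 GB default quota
    "node_heartbeat_age_s": 40,    # kubelet heartbeat staleness
}

def evaluate(metrics: dict) -> list[str]:
    """Return one alert message for every metric above its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"ALERT {name}={value} exceeds {limit}")
    return alerts
```

In practice each alert would be routed to an on-call channel or a remediation hook rather than returned as a string.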

Operational Data and Visualization

Effective cluster operation relies on data rather than intuition. Metrics are collected from all components to reflect status and performance, guiding capacity planning.

Data sources include component-reported metrics, exposed APIs, and log analysis. For the apiserver, metrics such as requests per second, latency, request method, resource type, namespace, and source IP are aggregated in a time-series database for multi-dimensional analysis.
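A minimal sketch of this kind of multi-dimensional aggregation (the record fields and function name are illustrative assumptions, not the team's actual pipeline): parsed apiserver log records are bucketed by dimension pairs and converted into per-second rates over a time window.

```python
from collections import Counter

def qps_by_dimension(records, window_s):
    """Aggregate parsed apiserver log records into requests/second
    per (verb, resource) pair over a fixed window of window_s seconds.
    Each record is assumed to be a dict with 'verb' and 'resource' keys."""
    counts = Counter((r["verb"], r["resource"]) for r in records)
    return {dim: n / window_s for dim, n in counts.items()}
```

Slicing the same records by namespace or source IP instead is a one-line change to the grouping key, which is what makes this style of analysis useful for pinpointing hot callers.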

In a test cluster of ~1,000 nodes and 25,000 pods (with 15,000 ConfigMaps), raw API QPS reached over 8,500. After analyzing that ConfigMaps accounted for >90% of requests, the team refactored ConfigMap handling to support dynamic and static mounts, reducing API traffic by 98% to around 140 QPS.

Visualizing this data helps operators quickly grasp cluster health, spot trends, and compare pre- and post-optimization performance, while also providing developers with valuable insights. Several concrete optimizations came out of this data-driven approach:

etcd capacity: Default 2 GB can become a bottleneck; separate etcd clusters per resource and raise capacity limits.

etcd read rate: Since most API calls are GET, a Redis cache is used to offload reads.

Scheduler latency: Predicate and priority functions were trimmed and adaptive step sizing introduced, improving throughput with minimal accuracy loss.

Image store scaling: A custom ContainerFS with caching and P2P acceleration ensures scalable image distribution.
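The read-offloading item above follows the classic read-through cache pattern. Below is a minimal sketch of that pattern (class and variable names are illustrative; a plain dict stands in for Redis, and `fetch` stands in for a read against the apiserver/etcd):

```python
# Read-through cache sketch: serve repeated GETs from the cache,
# touching the slow backing store (etcd) only on a miss.
class ReadThroughCache:
    def __init__(self, fetch):
        self.fetch = fetch   # slow backing read (stand-in for etcd)
        self.store = {}      # stand-in for Redis
        self.misses = 0

    def get(self, key):
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.fetch(key)
        return self.store[key]
```

A production version also needs invalidation (e.g., expiring entries on watch events or TTLs), which is where most of the real complexity lives.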

Operations Tools

Large‑scale operations require a complete toolchain to reduce manual effort and enable automation, ensuring cluster stability.

Inspection Tools

Daily inspection systems detect abnormal configurations and states of physical machines and services, especially during high‑traffic events. The modular, plug‑in architecture allows flexible inspection points across components, checking configurations, statuses, and parameters.

The inspection plugin has been contributed to the community and merged into Ansible.

Other Tools

Additional utilities include:

kubesql: Exposes Kubernetes resources (Pod, Service, Node, etc.) as relational‑like tables, enabling SQL queries such as SELECT COUNT(metadata.name) FROM kubepod WHERE metadata.namespace='default'.

event notification: Listens to events, categorizes them, and sends alerts via email or SMS for urgent incidents.
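The categorize-and-route step can be sketched as a simple rule table (the reason strings chosen as "urgent" and the function name are assumptions for illustration; the article does not specify its classification rules):

```python
# Hypothetical event router: urgent incidents go to SMS, the rest to email.
URGENT_REASONS = {"OOMKilled", "NodeNotReady", "FailedScheduling"}

def route_event(event: dict) -> str:
    """Return the notification channel for a Kubernetes event dict."""
    return "sms" if event.get("reason") in URGENT_REASONS else "email"
```

A real implementation would watch the events API and feed each incoming event through a router like this before dispatching.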

pod/node full record: Tracks state changes of pods and nodes, persisting them to a database for historical queries.
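Recording only state *changes*, rather than every observation, keeps the history compact. A minimal sketch of that idea (class name and storage are illustrative; a list stands in for the database table):

```python
# Sketch of a pod/node state-transition recorder for historical queries.
class StateRecorder:
    def __init__(self):
        self.history = []  # stand-in for a database table of transitions
        self.last = {}     # last known phase per object

    def observe(self, name: str, phase: str, ts: int):
        """Persist a (timestamp, name, phase) row only when the phase changed."""
        if self.last.get(name) != phase:
            self.history.append((ts, name, phase))
            self.last[name] = phase
```

Querying "when did node X last flap between Ready and NotReady" then becomes a scan over stored transitions instead of a trawl through raw logs.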

Tags: monitoring, automation, observability, kubernetes, stability, cluster operations
Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.