
Ensuring Stability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

This article shares practical experience operating massive Kubernetes clusters, focusing on three stability questions, operational data collection and visualization, and a suite of tools for running reliable, highly available services in production.

JD Tech

The Three Stability Questions

Stability is the lifeline of infrastructure; without it, advanced features are meaningless. Drawing from experience with OpenStack and Kubernetes, the authors propose three fundamental questions to assess cluster robustness before deployment.

Question 1: Does the failure or congestion of any component affect running containers?

In OpenStack, component failures have limited impact on running containers because resource usage is largely static once instances are provisioned. Kubernetes is more complex because controllers continuously reconcile state: a single overloaded apiserver can become a bottleneck, returning large numbers of 409 errors; kubelet heartbeats then fail to land, nodes are marked NotReady, and pods are restarted unnecessarily, even though the pods themselves remain healthy.
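Heartbeat sensitivity of this kind is tunable. As an illustration only (these values are upstream Kubernetes defaults, not settings from the article), the relevant flags look like:

```shell
# Illustrative sketch, not the authors' actual configuration.

# Apiserver: cap concurrent requests so congestion degrades gracefully
# instead of starving heartbeats entirely.
kube-apiserver \
  --max-requests-inflight=400 \
  --max-mutating-requests-inflight=200

# Kubelet: how often node status (the heartbeat) is reported.
kubelet --node-status-update-frequency=10s

# Controller manager: how long a missed heartbeat is tolerated
# before the node is marked NotReady.
kube-controller-manager --node-monitor-grace-period=40s
```

Raising the grace period trades slower failure detection for fewer spurious NotReady transitions during apiserver congestion.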

Similarly, failures in network metadata storage can disrupt container networking. Therefore, each component must be thoroughly tested for failure modes to prevent cascading effects.

Question 2: Can the cluster recover from any component failure?

High‑availability design and disaster‑recovery plans are essential for large‑scale clusters. Using etcd as an example, the authors describe multiple recovery strategies: restoring from original nodes, migrating data to new nodes, and restoring from scheduled backups, each with trade‑offs such as potential data loss.

Question 3: Are there alerts and response mechanisms for any component anomaly?

Since component failures are inevitable, comprehensive monitoring, alert rules, and automated remediation are required. Traditional application monitoring techniques can be applied to platform components, including resource usage alerts, health checks, and performance metric monitoring.
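A minimal sketch of such a health check against the control-plane components (the ports are upstream defaults; a production setup would forward failures to an alerting system rather than stdout):

```shell
# Toy health probe over component /healthz endpoints (placeholders).
components="
https://127.0.0.1:6443/healthz
http://127.0.0.1:10248/healthz
http://127.0.0.1:10257/healthz
"
for url in $components; do
  if ! curl -ksf --max-time 5 "$url" >/dev/null; then
    echo "ALERT: $url unhealthy"   # hand off to email/SMS in production
  fi
done
```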

Operational Data and Visualization

Effective cluster operation relies on data rather than intuition. Metrics are collected from all components (e.g., apiserver request rates, latency) and stored in a TSDB for multi‑dimensional analysis.

In a test cluster of ~1,000 nodes and 25,000 pods, API QPS reached 8,500+. Analysis revealed that configmaps accounted for over 90% of requests; after redesigning configmap handling, API traffic dropped to 140 QPS.
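One way to perform this kind of per-resource breakdown (an illustration, not necessarily the authors' tooling) is to pull the apiserver's own Prometheus metrics and filter by resource:

```shell
# Break apiserver request counts down by resource to spot hotspots
# such as the configmap traffic described above. Requires cluster access.
kubectl get --raw /metrics \
  | grep '^apiserver_request_total' \
  | grep 'resource="configmaps"'
```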

Visualization of these metrics helps operators and developers quickly grasp cluster health, identify bottlenecks, and compare pre‑ and post‑optimization performance.

Operational Tools

Large‑scale operation requires a complete toolchain to reduce manual effort and increase automation.

Inspection Tools

Daily inspection systems detect abnormal configurations and states of physical machines and services, especially during high‑traffic periods. The inspection framework is plug‑in based, allowing flexible configuration of inspection points across components.
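The framework itself is not published in the article; a toy sketch of the plug-in idea, where each executable in a directory is one inspection point and a non-zero exit marks it failed, might look like:

```shell
# Toy plug-in inspection runner (an assumption about the design, not the
# authors' code): every executable in the given directory is one check.
run_inspections() {
  dir="$1"
  for check in "$dir"/*; do
    [ -x "$check" ] || continue
    if "$check" >/dev/null 2>&1; then
      echo "PASS $(basename "$check")"
    else
      echo "FAIL $(basename "$check")"
    fi
  done
}
```

Adding an inspection point is then just dropping a new script into the directory, and the runner can be scheduled more aggressively during high-traffic windows.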

Other Tools

kubesql: Exposes Kubernetes resources as relational-like tables, enabling SQL queries such as select count(metadata.name) from kubepod where metadata.namespace = 'default'.
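For comparison, the same count can be obtained with stock tooling, assuming kubectl and jq are available:

```shell
# Equivalent of the kubesql query above: count pods in the default
# namespace. Requires access to a running cluster.
kubectl get pods -n default -o json | jq '.items | length'
```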

Event Notification: Listens to events, categorizes them, and sends alerts via email or SMS for urgent issues.

Pod/Node Full Record: Tracks state changes of pods and nodes and stores them in a database for historical queries.

All these tools have been contributed back to the community, with some integrated into Ansible.

Written by

JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
