How to Keep Your Kubernetes Cluster Running When a Node Goes Down
This article explains the architecture and practical techniques for achieving high availability in Kubernetes clusters, covering control‑plane and worker‑node design, network service handling, connection reuse, node eviction, storage considerations, and application‑level strategies to ensure continuous service during node failures.
Overall Architecture
1. Control Plane Node
The control plane is a three‑node kubeadm HA setup using the stacked etcd topology: every master node runs kube‑apiserver, kube‑controller‑manager, kube‑scheduler, and etcd. Each apiserver talks to its local etcd member, while kube‑controller‑manager and kube‑scheduler reach the apiserver through a load balancer, so the loss of any single master does not take down the control plane.
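As a minimal sketch of this topology, a kubeadm ClusterConfiguration that points every kubelet and client at a load‑balanced endpoint rather than a single apiserver (the endpoint address and version here are assumptions; substitute your own VIP/LB and Kubernetes version):

```yaml
# Hedged sketch: kubeadm configuration for the stacked etcd HA topology.
# "k8s-vip.example.com:6443" is an assumed load-balancer address.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.27.0        # assumed version
# All kubelets and clients reach the apiserver through this endpoint,
# so any one of the three masters can fail without losing the API.
controlPlaneEndpoint: "k8s-vip.example.com:6443"
```

The first master is initialized with `kubeadm init --config <file> --upload-certs`, and the other two join with `kubeadm join ... --control-plane`, each bringing up a local stacked etcd member.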
2. Worker Node
Applications are deployed on worker nodes with pod anti‑affinity so that replicas are spread across different nodes; if a node crashes, Kubernetes Service can route traffic to remaining replicas.
Network
1. Service Backend Removal
When a node fails, its pods are not removed from the Service's endpoints immediately. In iptables mode, requests that land on a dead backend fail intermittently until the endpoints controller updates the rules; in ipvs mode, kube-proxy can evict the dead real server faster and retry another backend.
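Switching kube-proxy to ipvs mode is a configuration change; a sketch of the relevant KubeProxyConfiguration fields (normally held in the kube-proxy ConfigMap), with the scheduler choice as an assumption:

```yaml
# Hedged sketch: kube-proxy configuration selecting ipvs mode, which
# cleans up connections to removed backends more promptly than the
# iptables DNAT rules do.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; assumed choice, any ipvs scheduler works
```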
The interval depends on node heartbeat timing. Each kubelet updates its node status (and, in recent versions, a Lease object) every node-status-update-frequency (default 10s). The node controller checks node state every node-monitor-period (default 5s) and marks a node unhealthy after node-monitor-grace-period (default 40s) without a heartbeat. After an additional eviction grace period (default 5 minutes), the node's pods are evicted.
<code>(kubelet) node-status-update-frequency: default 10s
(kube-controller-manager) node-monitor-period: default 5s
(kube-controller-manager) node-monitor-grace-period: default 40s
- Amount of time we allow a running Node to be unresponsive before marking it unhealthy.
- Currently nodeStatusUpdateRetry is constantly set to 5 in kubelet.go</code>
2. Connection Reuse
Long‑lived TCP connections to a dead node can linger for up to about 15 minutes, because TCP keeps retransmitting unacknowledged segments (its ARQ mechanism) until the retry budget is exhausted; the Kubernetes controllers' HTTP keep-alive connections are affected by exactly this [6]. The fix is to set appropriate timeouts at the application level or tune the kernel's TCP parameters.
System‑wide adjustments:
<code># Reduce the number of TCP retransmissions (default 15, roughly 15 minutes)
$ echo 9 > /proc/sys/net/ipv4/tcp_retries2
# Enable TCP keepalive to close dead connections faster
$ echo 30 > /proc/sys/net/ipv4/tcp_keepalive_time
$ echo 30 > /proc/sys/net/ipv4/tcp_keepalive_intvl
$ echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes</code>
When using containers, apply the settings via an initContainer:
<code>initContainers:
- name: init-sysctl
  image: busybox
  command:
  - /bin/sh
  - -c
  - |
    sysctl -w net.ipv4.tcp_keepalive_time=30
    sysctl -w net.ipv4.tcp_keepalive_intvl=30
    sysctl -w net.ipv4.tcp_keepalive_probes=5
  securityContext:
    privileged: true</code>
Node Eviction
Pod eviction is governed by tolerations. By default Kubernetes adds tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, meaning pods remain bound for 5 minutes after a node is marked NotReady or unreachable.
When a node becomes unreachable, the node controller taints it with node.kubernetes.io/unreachable (NoSchedule and NoExecute effects). Pods with matching tolerations survive for the configured tolerationSeconds before being evicted.
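To fail over faster than the 5-minute default, a pod template can override the automatically added tolerations. A sketch, where the 30-second value is an assumption to tune for your environment:

```yaml
# Hedged sketch: shorter tolerationSeconds, so pods are evicted from an
# unreachable or not-ready node after 30s instead of the default 300s.
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 30
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 30
```

Setting the value too low risks evicting pods during brief network blips, so it trades availability of the workload against churn.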
StatefulSets wait for the kubelet to confirm deletion, so a pod on a dead node may stay in the Terminating state indefinitely until the node object is deleted or the pod is force-deleted. Deployments create replacement pods immediately after eviction. DaemonSet pods and static pods carry tolerations that usually prevent eviction.
Storage‑Related Eviction
When a pod uses a ReadWriteOnce (RWO) volume and its node crashes, the volume remains attached to the dead node, blocking the replacement pod from starting elsewhere. The workarounds are to use ReadWriteMany (RWX) volumes or to force-delete the pod:
<code>$ kubectl delete pod/nginx-7b4d5d9fd-nqgfz --force --grace-period=0</code>
The attach-detach controller force-detaches the volume only after a hard-coded 6-minute wait. The author's PR [11] proposes an --attach-detach-reconcile-max-wait-unmount-duration flag to make this timeout configurable.
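For the RWX alternative, a sketch of a PersistentVolumeClaim that multiple nodes can mount (the storage class name "nfs-client" is an assumption; any RWX-capable provisioner works):

```yaml
# Hedged sketch: an RWX claim that several nodes can mount read-write,
# so a replacement pod on another node is not blocked by a stale
# attachment on the dead node.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
  - ReadWriteMany              # RWX: mountable by many nodes at once
  storageClassName: nfs-client # assumed RWX-capable provisioner
  resources:
    requests:
      storage: 10Gi
```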
Storage
1. System Storage – etcd
etcd achieves HA using the Raft consensus algorithm, relying on leader election and log replication to survive node failures.
2. Application Storage – Persistent Volumes
External storage remains available when a node fails. For CSI drivers, HA can be achieved by running the controller-side components (the driver's controller plugin with its external-attacher and external-provisioner sidecars) with multiple replicas, using leader election so that only one replica acts at a time.
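A sketch of that replica-plus-leader-election pattern for a CSI controller Deployment; the image, version, and names are assumptions modeled on the upstream csi-provisioner sidecar, and socket volume mounts are omitted for brevity:

```yaml
# Hedged sketch: two replicas of the CSI controller components; the
# sidecar's leader election ensures a standby takes over on node failure.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-controller
spec:
  replicas: 2
  selector:
    matchLabels:
      app: csi-controller
  template:
    metadata:
      labels:
        app: csi-controller
    spec:
      containers:
      - name: csi-provisioner
        image: registry.k8s.io/sig-storage/csi-provisioner:v3.5.0  # assumed version
        args:
        - "--csi-address=/csi/csi.sock"
        - "--leader-election"   # only the elected replica provisions volumes
```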
Application Layer
1. Stateless Applications
Deployments with anti‑affinity and multiple replicas provide HA. Example:
<code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine</code>
2. Stateful Applications
Stateful workloads can use RWX volumes for HA, or implement their own replication and failover, as in a master‑replica database. A common pattern is Redis with one master, two replicas, and three sentinels, where the sentinels promote a replica when the master's node dies.
Kubernetes controllers themselves achieve HA the same way, through leader election built on distributed locks [2].
Conclusion
This article presented a comprehensive view of Kubernetes high‑availability architecture and analyzed the failure modes and remedies at the network, storage, and application layers when a node crashes, helping practitioners keep recovery times within acceptable limits.
References
[1] Stacked etcd topology: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#stacked-etcd-topology
[2] Leader Election Mechanism: https://github.com/kubernetes/client-go/tree/master/examples/leader-election
[3] Anti‑affinity: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#more-practical-use-cases
[4] iptables proxy mode: https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-iptables
[5] Retry with another backend: https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-iptables
[6] Kubernetes Controller 15‑minute timeout: https://duyanghao.github.io/kubernetes-ha-http-keep-alive-bugs/
[7] Tolerations: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions
[8] Pods stuck in “Terminating”: https://github.com/kubernetes/kubernetes/issues/55713#issuecomment-518340883
[9] At most one semantics: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#statefulset-considerations
[10] By kubelet: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md#current-guarantees-for-pod-lifecycle
[11] Author PR: https://github.com/kubernetes/kubernetes/pull/93776
[12] Raft consensus paper: https://www.infoq.cn/article/raft-paper/
[13] In Search of an Understandable Consensus Algorithm: https://ramcloud.atlassian.net/wiki/download/attachments/6586375/raft.pdf
[14] Add node shutdown KEP: https://github.com/kubernetes/enhancements/pull/1116/files
[15] Recommended Mechanism for Deploying CSI Drivers: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/container-storage-interface.md#recommended-mechanism-for-deploying-csi-drivers-on-kubernetes
[16] Container Storage Interface (CSI): https://github.com/container-storage-interface/spec/blob/master/spec.md
[17] Client should expose a mechanism to close underlying TCP connections: https://github.com/kubernetes/client-go/issues/374
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.