
How to Keep Your Kubernetes Cluster Running When a Node Goes Down

This article explains the architecture and practical techniques for achieving high availability in Kubernetes clusters, covering control‑plane and worker‑node design, network service handling, connection reuse, node eviction, storage considerations, and application‑level strategies to ensure continuous service during node failures.


Overall Architecture

1. Control Plane Node

The control plane is a three‑node kubeadm HA setup using the stacked etcd topology. Every master node runs kube‑apiserver, kube‑controller‑manager, kube‑scheduler, and etcd; each apiserver talks to its local etcd member, while clients and kubelets reach the apiservers through a load balancer.
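A control plane of this shape can be bootstrapped with a kubeadm configuration along the following lines. This is a minimal sketch: the load‑balancer address and Kubernetes version are placeholders, not values from the original setup.

<code># kubeadm-config.yaml -- stacked-etcd HA sketch
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.27.0                       # placeholder
# All kubelets and external clients reach the apiservers through this LB/VIP
controlPlaneEndpoint: "lb.example.internal:6443" # placeholder
etcd:
  local:          # stacked topology: etcd runs on each control-plane node
    dataDir: /var/lib/etcd</code>

The first node is initialized with `kubeadm init --config kubeadm-config.yaml --upload-certs`; the remaining masters join with `kubeadm join ... --control-plane`.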

2. Worker Nodes

Applications are deployed on worker nodes with pod anti‑affinity so that replicas are spread across different nodes; if a node crashes, the Kubernetes Service routes traffic to the remaining replicas.

Network

1. Service Backend Removal

When a node fails, the corresponding pod is not removed from the Service endpoints immediately. In iptables mode this causes intermittent request failures until the endpoints controller updates the rules; IPVS mode can remove the dead backend faster.
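Switching kube‑proxy to IPVS mode is a one‑field change in its KubeProxyConfiguration (a sketch; kube‑proxy pods must be restarted after editing the ConfigMap):

<code>apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"        # empty string falls back to iptables mode
ipvs:
  scheduler: "rr"   # round-robin; other schedulers (lc, sh, ...) exist</code>

Note that IPVS mode requires the relevant kernel modules (ip_vs, ip_vs_rr, etc.) to be loadable on every node.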

The interval depends on node heartbeat timing. Each kubelet updates its Lease object every node-status-update-frequency (default 10s). The node controller checks the leases every node-monitor-period (default 5s) and marks the node unhealthy after node-monitor-grace-period (default 40s). After an additional default 5‑minute toleration period, pods are evicted.

<code>(kubelet)                 node-status-update-frequency: default 10s
(kube-controller-manager) node-monitor-period:          default 5s
(kube-controller-manager) node-monitor-grace-period:    default 40s
  - Amount of time we allow a running Node to be unresponsive
    before marking it unhealthy.
  - Currently nodeStatusUpdateRetry is hard-coded to 5 in kubelet.go.</code>
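These intervals can be shortened so failures are detected faster, at the cost of extra control‑plane load and a higher risk of false positives. The values below are illustrative, not recommendations:

<code># kubelet flag (or nodeStatusUpdateFrequency in the kubelet config file)
--node-status-update-frequency=5s

# kube-controller-manager flags
--node-monitor-period=3s
--node-monitor-grace-period=20s</code>

node-monitor-grace-period must remain several multiples of the kubelet's update frequency, otherwise healthy nodes will flap to NotReady whenever a heartbeat is slightly late.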

2. Connection Reuse

Long‑lived TCP connections can linger for up to 15 minutes after a node goes down, because the client keeps retransmitting via TCP's ARQ (automatic repeat request) mechanism: with the default tcp_retries2=15, exponential backoff adds up to roughly 924 seconds before the connection is finally declared dead. The fix is to set appropriate timeouts at the application level or to tune the kernel TCP parameters.

System‑wide adjustments:

<code># Reduce TCP retransmission timeout
$ echo 9 > /proc/sys/net/ipv4/tcp_retries2
# Enable TCP keepalive to close dead connections faster
$ echo 30 > /proc/sys/net/ipv4/tcp_keepalive_time
$ echo 30 > /proc/sys/net/ipv4/tcp_keepalive_intvl
$ echo 5  > /proc/sys/net/ipv4/tcp_keepalive_probes</code>

When using containers, apply the settings via an initContainer:

<code>- name: init-sysctl
  image: busybox
  command:
  - /bin/sh
  - -c
  - |
    sysctl -w net.ipv4.tcp_keepalive_time=30
    sysctl -w net.ipv4.tcp_keepalive_intvl=30
    sysctl -w net.ipv4.tcp_keepalive_probes=5
  securityContext:
    privileged: true</code>

Node Eviction

Pod eviction is governed by tolerations. By default Kubernetes adds tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, meaning pods remain bound to a node for 5 minutes after it is marked not ready.


When a node becomes unreachable, the node controller adds the node.kubernetes.io/unreachable taints (NoSchedule and NoExecute). Pods with matching tolerations survive for the configured period before being evicted.
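The 5‑minute default can be overridden per pod by declaring the tolerations explicitly with a shorter tolerationSeconds, as in this pod‑spec fragment:

<code>spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30   # evict after 30s instead of the default 300s
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30</code>

Setting this too low means every brief node hiccup triggers a round of rescheduling, so pick a value matched to how quickly your workloads can restart.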

StatefulSets wait for the kubelet to confirm deletion, so a pod on a dead node may stay in Terminating state indefinitely until the node object is removed or the pod is force‑deleted. Deployments create replacement pods immediately. DaemonSets and static pods carry tolerations that usually prevent eviction.

Storage‑Related Eviction

When a pod uses a volume with ReadWriteOnce (RWO) and its node crashes, the volume remains attached, blocking pod creation on other nodes. The solution is to use ReadWriteMany (RWX) volumes or force‑delete the pod.
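If the storage backend supports it (NFS, CephFS, and similar), requesting ReadWriteMany access avoids the stuck‑attachment problem entirely, since the volume can be mounted on several nodes at once. A sketch, where the storage class name is a placeholder:

<code>apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
  - ReadWriteMany              # mountable read-write by pods on many nodes
  storageClassName: nfs-client # placeholder: any RWX-capable class
  resources:
    requests:
      storage: 10Gi</code>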

<code>$ kubectl delete pod/nginx-7b4d5d9fd-nqgfz --force --grace-period=0</code>

The attach‑detach controller force‑detaches the volume only after a hard‑coded 6‑minute wait. An upstream PR (kubernetes/kubernetes#93776) proposed an --attach-detach-reconcile-max-wait-unmount-duration flag to make this timeout configurable so it can be set to a lower value.

Storage

1. System Storage – etcd

etcd achieves HA using the Raft consensus algorithm, relying on leader election and log replication to survive node failures.

2. Application Storage – Persistent Volumes

External storage remains available when a node fails. For CSI drivers, HA can be achieved by deploying the driver, identity controller, external‑attacher, and external‑provisioner with multiple replicas.
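The CSI controller sidecars support leader election, so the controller Deployment can simply run multiple replicas with one active leader. An abridged sketch; the driver name, service account, and image tags are placeholders:

<code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-controller
spec:
  replicas: 2               # one active leader, one hot standby
  selector:
    matchLabels:
      app: csi-controller
  template:
    metadata:
      labels:
        app: csi-controller
    spec:
      serviceAccountName: csi-controller-sa   # placeholder
      containers:
      - name: csi-provisioner
        image: registry.k8s.io/sig-storage/csi-provisioner:v3.5.0  # placeholder tag
        args: ["--csi-address=/csi/csi.sock", "--leader-election"]
      - name: csi-attacher
        image: registry.k8s.io/sig-storage/csi-attacher:v4.3.0     # placeholder tag
        args: ["--csi-address=/csi/csi.sock", "--leader-election"]</code>

With --leader-election enabled, the standby replica acquires the lock and takes over provisioning/attaching if the leader's node goes down.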

Application Layer

1. Stateless Applications

Deployments with anti‑affinity and multiple replicas provide HA. Example:

<code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine</code>

2. Stateful Applications

Stateful workloads can use RWX volumes for HA, or implement their own HA scheme (e.g., master‑replica databases). A typical Redis HA deployment runs one master, two replicas, and three sentinels.


Kubernetes controllers themselves achieve HA through leader election backed by distributed locks (Lease objects).

Conclusion

This article presented a comprehensive view of Kubernetes high‑availability architecture and analyzed the failure modes and mitigations at the network, storage, and application layers when a node crashes, helping practitioners keep recovery times within acceptable limits.


Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
