Kubernetes High Availability: Architecture, Network, Storage, and Application Strategies
This article explains how to achieve Kubernetes high availability: designing a three-node control plane with stacked etcd, spreading replicas with pod anti-affinity, tuning the node-monitor timers, handling stale endpoints, configuring TCP keep-alive, managing node taints and eviction, and choosing RWX storage or appropriate StatefulSet strategies to minimize service disruption after node failures.
In enterprise production environments, high availability (HA) is a mandatory feature of Kubernetes. This article reviews the problems and solutions encountered when a Kubernetes node fails, covering control‑plane design, network behavior, storage handling, and application‑level considerations.
1. Control Plane Node
The control plane is a three-node kubeadm HA setup using a stacked etcd topology. Every master node runs kube-apiserver, kube-controller-manager, kube-scheduler, and etcd; communication between components stays local to the node, and a load balancer fronts the API servers. Leader election in kube-controller-manager and kube-scheduler ensures that if any master node goes down, the remaining instances continue to operate.
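A stacked-etcd HA cluster of this shape is typically bootstrapped by pointing kubeadm at the load balancer rather than at any single master. A minimal sketch of the kubeadm configuration, assuming a hypothetical VIP lb.example.com and an illustrative Kubernetes version:

```yaml
# kubeadm ClusterConfiguration sketch for a stacked-etcd HA control plane.
# lb.example.com:6443 is a placeholder for the load balancer in front of
# the three API servers.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.0
controlPlaneEndpoint: "lb.example.com:6443"
```

The first master is initialized with `kubeadm init --config` plus `--upload-certs`, and the other two join with `kubeadm join --control-plane`, which co-locates an etcd member on each master.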
2. Worker Node
Workloads are deployed with pod anti‑affinity so that replicas are spread across different nodes. If a node crashes, the service can still route traffic to the remaining replicas.
Network Impact of Node Failure
When a node disappears, the Service's endpoint list is not updated instantly. In iptables proxy mode, the stale endpoint causes intermittent request failures until kube-proxy observes the endpoint change and rewrites the iptables rules. The length of the delay is governed by the node-status and node-monitor timers:
(kubelet) node-status-update-frequency: default 10s
(kube-controller-manager) node-monitor-period: default 5s
(kube-controller-manager) node-monitor-grace-period: default 40s
node-monitor-grace-period is the amount of time a running node is allowed to be unresponsive before it is marked unhealthy. During this window, services may experience intermittent failures; lowering these parameters shortens the window.
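As an illustration (the values below are examples, not recommendations), the kubelet side of the timers can be tightened through the kubelet configuration file:

```yaml
# KubeletConfiguration fragment: report node status every 5s
# instead of the 10s default.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: 5s
```

On the controller-manager side, the corresponding flags would be `--node-monitor-period` and `--node-monitor-grace-period`; note that more frequent status updates increase load on the API server, so aggressive values trade cluster-wide overhead for faster failure detection.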
Connection Reuse Issue
Long-lived TCP connections can hang for up to about 15 minutes after a node failure because the kernel's TCP retransmission (ARQ) mechanism keeps retrying the send. The root cause is usually missing timeout settings at the application layer. Solutions include setting appropriate request timeouts or health checks, lowering the retransmission limit (tcp_retries2), and tuning kernel keep-alive parameters such as tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes:
# retransmission backoff: 0.2+0.4+0.8+1.6+3.2+6.4+12.8+25.6+51.2+102.4 ≈ 204.6s
$ echo 9 > /proc/sys/net/ipv4/tcp_retries2

For pods, an init container can apply the sysctl settings:
- name: init-sysctl
  image: busybox
  command:
  - /bin/sh
  - -c
  - |
    sysctl -w net.ipv4.tcp_keepalive_time=30
    sysctl -w net.ipv4.tcp_keepalive_intvl=30
    sysctl -w net.ipv4.tcp_keepalive_probes=5
  securityContext:
    privileged: true

Node Eviction
When a node becomes unresponsive, the node controller adds the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints. By default, pods tolerate these taints for 300 seconds before being evicted:
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300

After eviction is triggered, StatefulSets keep their pods in a "Terminating" state until the kubelet on the failed node can confirm deletion, while Deployments immediately create replacement pods.
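The 300-second default can be shortened per workload by declaring the tolerations explicitly in the pod template. A sketch, assuming a 30-second eviction budget is acceptable for this particular service:

```yaml
# Pod template fragment: evict after 30s instead of the 300s default.
# Too low a value risks needless rescheduling during brief network blips.
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 30
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 30
```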
Storage Considerations
A ReadWriteOnce (RWO) volume can be attached to only one node at a time. If that node crashes, other pods that need the same volume are blocked. Switching to ReadWriteMany (RWX) volumes or force-deleting the stuck pod resolves the issue:
$ kubectl delete pods/nginx-7b4d5d9fd-nqgfz --force --grace-period=0

The attach-detach controller's default 6-minute wait can be reduced via the flag --attach-detach-reconcile-max-wait-unmount-duration:

--attach-detach-reconcile-max-wait-unmount-duration duration
    maximum amount of time the attach-detach controller will wait for a volume to be safely unmounted (default 6m0s)

Application-Level HA
Stateless services (Deployments) achieve HA by using pod anti‑affinity and multiple replicas. Example Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-store
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine

Stateful applications (StatefulSets) rely on storage that supports RWX or implement their own HA mechanism (e.g., master-slave replication or leader election); Redis with Sentinel is a common example.
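For stateful workloads, each replica typically gets its own RWO volume via volumeClaimTemplates, so the volume follows its pod and no RWX sharing is needed. A minimal sketch (the storage class name is a placeholder, and a real Redis HA deployment would add Sentinel configuration on top):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6-alpine
        volumeMounts:
        - name: data
          mountPath: /data
  # One PersistentVolumeClaim per replica: redis-data-0, redis-data-1, ...
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard   # placeholder
      resources:
        requests:
          storage: 1Gi
```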
Conclusion
The article outlines a comprehensive view of Kubernetes HA, from control‑plane design to network, storage, and application layers, and provides practical parameter tweaks to minimize service disruption after node failures.
Tencent Cloud Developer