Kubernetes High Availability: Architecture, Network, Storage, and Application Strategies
This article explains how to achieve Kubernetes high availability: designing a three-node control plane with stacked etcd, spreading replicas with pod anti-affinity, tuning the node-monitor timers, handling stale endpoints, configuring TCP keep-alive, managing node taints and eviction, and choosing RWX storage or appropriate StatefulSet strategies to minimize service disruption after node failures.
In enterprise production environments, high availability (HA) is a mandatory feature of Kubernetes. This article reviews the problems and solutions encountered when a Kubernetes node fails, covering control‑plane design, network behavior, storage handling, and application‑level considerations.
1. Control Plane Node
The control plane is a three-node kubeadm HA setup using a stacked etcd topology. Every master node runs kube-apiserver, kube-controller-manager, kube-scheduler, and etcd; communication between components stays local to the node, and a load balancer fronts the API servers. Leader election in kube-controller-manager and kube-scheduler ensures that if any master node goes down, the remaining instances continue to operate.
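A stacked-etcd HA cluster of this shape is typically bootstrapped by pointing kubeadm at the load balancer rather than at any single master. A minimal sketch of the kubeadm configuration, assuming a hypothetical VIP lb.example.com and an illustrative Kubernetes version:

```yaml
# kubeadm ClusterConfiguration sketch for a stacked-etcd HA control plane.
# lb.example.com:6443 is a placeholder for the load balancer in front of
# the three API servers.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.0
controlPlaneEndpoint: "lb.example.com:6443"
```

The first master is initialized with `kubeadm init --config` plus `--upload-certs`, and the other two join with `kubeadm join --control-plane`, which co-locates an etcd member on each master.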
2. Worker Node
Workloads are deployed with pod anti‑affinity so that replicas are spread across different nodes. If a node crashes, the service can still route traffic to the remaining replicas.
Network Impact of Node Failure
When a node disappears, the Service's endpoint list is not updated instantly. In iptables proxy mode, the stale endpoint causes intermittent request failures until kube-proxy observes the endpoint change and rewrites the iptables rules. The length of the delay is governed by the node-status and node-monitor timers:
(kubelet) node-status-update-frequency: default 10s
(kube-controller-manager) node-monitor-period: default 5s
(kube-controller-manager) node-monitor-grace-period: default 40s
node-monitor-grace-period is the amount of time a running node is allowed to be unresponsive before it is marked unhealthy. During this window, services may experience intermittent failures; lowering these parameters shortens the window.
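As an illustration (the values below are examples, not recommendations), the kubelet side of the timers can be tightened through the kubelet configuration file:

```yaml
# KubeletConfiguration fragment: report node status every 5s
# instead of the 10s default.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: 5s
```

On the controller-manager side, the corresponding flags would be `--node-monitor-period` and `--node-monitor-grace-period`; note that more frequent status updates increase load on the API server, so aggressive values trade cluster-wide overhead for faster failure detection.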
Connection Reuse Issue
Long-lived TCP connections can hang for up to about 15 minutes after a node failure because the kernel's TCP retransmission (ARQ) mechanism keeps retrying the send. The root cause is usually missing timeout settings at the application layer. Solutions include setting appropriate request timeouts or health checks, lowering the retransmission limit (tcp_retries2), and tuning kernel keep-alive parameters such as tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes:
# retransmission backoff: 0.2+0.4+0.8+1.6+3.2+6.4+12.8+25.6+51.2+102.4 ≈ 204.6s
$ echo 9 > /proc/sys/net/ipv4/tcp_retries2

For pods, an init container can apply the sysctl settings:
- name: init-sysctl
  image: busybox
  command:
  - /bin/sh
  - -c
  - |
    sysctl -w net.ipv4.tcp_keepalive_time=30
    sysctl -w net.ipv4.tcp_keepalive_intvl=30
    sysctl -w net.ipv4.tcp_keepalive_probes=5
  securityContext:
    privileged: true

Node Eviction
When a node becomes unresponsive, the node controller adds the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints. By default, pods tolerate these taints for 300 seconds before being evicted:
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300

After eviction is triggered, StatefulSets keep their pods in a "Terminating" state until the kubelet on the failed node can confirm deletion, while Deployments immediately create replacement pods.
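The 300-second default can be shortened per workload by declaring the tolerations explicitly in the pod template. A sketch, assuming a 30-second eviction budget is acceptable for this particular service:

```yaml
# Pod template fragment: evict after 30s instead of the 300s default.
# Too low a value risks needless rescheduling during brief network blips.
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 30
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 30
```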
Storage Considerations
A ReadWriteOnce (RWO) volume can be attached to only one node at a time. If that node crashes, other pods that need the same volume are blocked. Switching to ReadWriteMany (RWX) volumes or force-deleting the stuck pod resolves the issue:
$ kubectl delete pods/nginx-7b4d5d9fd-nqgfz --force --grace-period=0

The attach-detach controller's default 6-minute wait can be reduced via the flag --attach-detach-reconcile-max-wait-unmount-duration:

--attach-detach-reconcile-max-wait-unmount-duration duration
    maximum amount of time the attach-detach controller will wait for a volume to be safely unmounted (default 6m0s)

Application-Level HA
Stateless services (Deployments) achieve HA by using pod anti‑affinity and multiple replicas. Example Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-store
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine

Stateful applications (StatefulSets) rely on storage that supports RWX or implement their own HA mechanism (e.g., master-slave replication or leader election); Redis with Sentinel is a common example.
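For stateful workloads, each replica typically gets its own RWO volume via volumeClaimTemplates, so the volume follows its pod and no RWX sharing is needed. A minimal sketch (the storage class name is a placeholder, and a real Redis HA deployment would add Sentinel configuration on top):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6-alpine
        volumeMounts:
        - name: data
          mountPath: /data
  # One PersistentVolumeClaim per replica: redis-data-0, redis-data-1, ...
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard   # placeholder
      resources:
        requests:
          storage: 1Gi
```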
Conclusion
The article outlines a comprehensive view of Kubernetes HA, from control‑plane design to network, storage, and application layers, and provides practical parameter tweaks to minimize service disruption after node failures.
Tencent Cloud Developer