
Mastering Kubernetes Node Failures with Node Problem Detector

This article explains common Kubernetes component and node failures, introduces the Node Problem Detector (NPD) for monitoring and reporting issues, and provides step‑by‑step deployment, configuration, and remediation procedures to keep clusters reliable.


Component Failures

Component failures are a subset of node failures, originating from Kubernetes core components.

DNS failure: Two of six DNS Pods cannot resolve external names, causing widespread service disruption.

CNI failure: Some nodes lose external network connectivity; pods on those nodes are reachable locally but not from other nodes, rendering health checks ineffective and causing request failures.

Tools such as Kubenurse probe ingress, DNS, the apiserver, and kube-proxy paths to surface exactly this kind of network failure.

Node Failures

Hardware errors: CPU, memory, disk failures.

Kernel issues: deadlocks, corrupted file systems.

Container runtime errors: Docker hangs.

Infrastructure service failures: NTP problems.

Node Problem Detector

Root cause: Kubernetes originally assumed nodes were stable, but as it evolved into a datacenter-scale operating system, node health management became essential, which led to the Node Problem Detector (NPD) project.

Kubernetes supports two reporting mechanisms:

NodeCondition – permanent errors that prevent pods from running on a node; cleared only after node reboot.

Event – temporary issues useful for diagnostics. NPD watches system logs (e.g., the journal) and reports noteworthy findings as events on the node object; these can be exported to Prometheus, recorded for offline analysis, pushed as WeChat alerts, or used to trigger automated remediation.
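The two mechanisms map directly to the rule types in NPD's monitor configs. Here is an abridged sketch in the style of the repository's kernel-monitor.json (the patterns shown are illustrative; check the repository for the full file):

```json
{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "KernelDeadlock",
      "reason": "KernelHasNoDeadlock",
      "message": "kernel has no deadlock"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "KernelOops",
      "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
    },
    {
      "type": "permanent",
      "condition": "KernelDeadlock",
      "reason": "DockerHung",
      "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
    }
  ]
}
```

Rules of type `temporary` emit Events; rules of type `permanent` flip the named NodeCondition.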

Example CNI failure remediation flow:

Search the operations playbook for a matching fix and execute the corresponding action.

If ineffective, delete the CNI‑related Pods on the node to reset routing and iptables.

If still ineffective, restart the container runtime.

If all else fails, raise an alert for manual intervention.
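The escalation ladder above can be sketched as a shell loop. Everything here is a placeholder: the health probe, the step commands, and the node name are assumptions standing in for your real playbook actions, not a working remediator.

```shell
#!/bin/sh
# Sketch of the CNI remediation ladder. All probes and fixes are stubs.
NODE="${NODE:-worker-1}"

cni_healthy() {
  # Placeholder probe: in practice, curl a pod on $NODE from another
  # node, or reuse a kubenurse-style network check.
  [ -e "/tmp/cni-ok-${NODE}" ]
}

playbook_fix()    { echo "step 1: apply matching playbook fix on ${NODE}"; }
reset_cni_pods()  { echo "step 2: delete CNI pods on ${NODE} to rebuild routes/iptables"; }
restart_runtime() { echo "step 3: restart container runtime on ${NODE}"; }

remediate() {
  # Try each step in order; stop as soon as the node recovers.
  for step in playbook_fix reset_cni_pods restart_runtime; do
    "$step"
    if cni_healthy; then
      echo "recovered after ${step}"
      return 0
    fi
  done
  echo "ALERT: manual intervention needed on ${NODE}"
  return 1
}

remediate || true
```

The key design point is that each rung is cheaper and less disruptive than the next, so the loop only escalates when the health probe still fails.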

Deploying NPD requires a Kubernetes cluster with at least one worker node; see the GitHub repository for the full manifests. Key startup flags:

<code>--prometheus-address: address the built-in Prometheus exporter binds to (default 127.0.0.1); change it to expose metrics for scraping.
--config.system-log-monitor: list of system log monitor configs; NPD launches a separate log monitor per config (e.g., config/kernel-monitor.json).
--config.custom-plugin-monitor: list of custom plugin monitor configs; NPD launches a separate custom plugin monitor per config (e.g., config/custom-plugin-monitor.json).
</code>

Create the ConfigMap and DaemonSet:

<code>kubectl create -f node-problem-detector-config.yaml
kubectl create -f node-problem-detector.yaml
</code>
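For orientation, node-problem-detector.yaml is typically a DaemonSet along the following lines. This is an abridged sketch: the image tag, labels, and mounted paths are assumptions, so defer to the manifests in the repository.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19  # tag is illustrative
        command:
        - /node-problem-detector
        - --logtostderr
        - --config.system-log-monitor=/config/kernel-monitor.json
        securityContext:
          privileged: true           # needed to read /dev/kmsg on the host
        volumeMounts:
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
        - name: config
          mountPath: /config
          readOnly: true
      volumes:
      - name: kmsg
        hostPath:
          path: /dev/kmsg
      - name: config
        configMap:
          name: node-problem-detector-config
```

Because it is a DaemonSet, one NPD pod runs on every node, each watching only its own node's logs.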

Validate NPD captures:

<code>sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"
# Check for KernelOops event via kubectl describe node
sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg"
# Check for DockerHung event via kubectl describe node
</code>

If events are sent to Prometheus, configure alerting rules to forward them to WeChat.
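A minimal alerting-rule sketch, assuming NPD's Prometheus exporter metric `problem_gauge` (verify the metric name against your NPD version); routing the firing alert to WeChat is then an Alertmanager webhook-receiver concern.

```yaml
groups:
- name: node-problem-detector
  rules:
  - alert: NodePermanentProblem
    # problem_gauge is 1 while a permanent NodeCondition (e.g. KernelDeadlock)
    # is active on a node; label names here are assumptions.
    expr: problem_gauge > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} reports {{ $labels.type }}"
```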

Tags: operations, Kubernetes, DNS, CNI, Node Problem Detector, Cluster Monitoring
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
