
Why Does PLEG ‘Not Healthy’ Make a Kubernetes Node NotReady?

This article explains the role of the Pod Lifecycle Event Generator (PLEG) in Kubelet, why the “PLEG is not healthy” error causes nodes to become NotReady, common failure scenarios, and a step‑by‑step troubleshooting method that ultimately resolves the issue by upgrading systemd.

Ops Development Stories

Problem Description

Environment: Ubuntu 18.04 with a self‑built Kubernetes 1.18 cluster using Docker as the container runtime. A node repeatedly becomes NotReady, and kubectl describe node shows the error “PLEG is not healthy: pleg was last seen active 3m46.752815514s ago; threshold is 3m0s”, occurring every 5‑10 minutes.
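When this message appears in node conditions or Kubelet logs, it helps to extract the two durations so that they can be compared or graphed over time. The following is a minimal sketch, assuming the message keeps the wording quoted above and uses Go-style duration strings; the function names are hypothetical and are not part of any Kubernetes tooling.

```python
import re
from typing import Optional, Tuple

# Seconds per unit for Go-style duration strings such as "3m46.752815514s".
_UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9,
}
# Multi-letter units come first so "ms" is not read as "m" followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
# Assumed message wording, taken from the error quoted in this article.
_PLEG = re.compile(
    r"pleg was last seen active ([0-9.a-zµ]+) ago; threshold is ([0-9.a-zµ]+)"
)


def go_duration_seconds(text: str) -> float:
    """Convert a Go-style duration string to seconds."""
    parts = _PART.findall(text)
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"not a duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def parse_pleg_message(message: str) -> Optional[Tuple[float, float]]:
    """Return (seconds since last activity, threshold seconds), or None."""
    match = _PLEG.search(message)
    if match is None:
        return None
    return go_duration_seconds(match.group(1)), go_duration_seconds(match.group(2))
```

For the message quoted above, the first value is about 226.75 seconds and the second is 180 seconds, so the node was roughly 47 seconds past the threshold when the condition was reported.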

What is PLEG?

PLEG (Pod Lifecycle Event Generator) is a module inside Kubelet. Its main responsibility is to watch pod‑level events, reconcile the container runtime state, and update the pod cache so that the cache reflects the latest pod status.

Kubelet runs on each node and must react promptly to two sources of change:

Changes defined in the pod spec.

Changes in the container runtime state.

To keep the cache up‑to‑date, Kubelet watches the pod spec and periodically polls the container runtime (default every 10 seconds). As the number of pods grows, this polling creates significant CPU overhead and can overload the runtime, reducing reliability and scalability. PLEG was introduced to reduce this overhead by limiting work during idle periods and decreasing concurrent runtime queries.

Kubelet and pod relationship diagram

Kubelet acts both as a cluster controller (fetching resources from the API server and driving pod execution) and as a node‑status monitor (periodically reporting node health to the API server). The NodeStatus mechanism relies heavily on PLEG to decide whether a node is Ready.

PLEG Workflow

PLEG periodically checks the state of pods on the node. If it detects changes, it packages them into events and sends them to Kubelet’s main sync loop. When PLEG has not completed a check within its threshold (3 minutes, as the error message shows), NodeStatus marks the node as NotReady.
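The workflow above can be sketched as a small model: compare the previous and current container states, emit an event for each difference, and record when the last check finished so that a health check can compare that time against the threshold. This is a simplified illustration, not Kubelet's actual code; the class name MiniPLEG, the event tuples, and the list_containers callback are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

RELIST_THRESHOLD_S = 180.0  # the "threshold is 3m0s" value from the error message


@dataclass
class MiniPLEG:
    # Hypothetical callback returning {container ID: state} from the runtime.
    list_containers: Callable[[], Dict[str, str]]
    clock: Callable[[], float] = time.monotonic
    last_relist: Optional[float] = None
    _old: Dict[str, str] = field(default_factory=dict)

    def relist(self) -> List[Tuple[str, str]]:
        """Query the runtime once and return (container ID, new state) events."""
        current = self.list_containers()  # blocks here if the runtime hangs
        events: List[Tuple[str, str]] = []
        for cid in sorted(set(self._old) | set(current)):
            old, new = self._old.get(cid), current.get(cid)
            if old != new:
                events.append((cid, new if new is not None else "removed"))
        self._old = current
        self.last_relist = self.clock()  # only updated when the query returns
        return events

    def healthy(self) -> bool:
        """False once the last completed relist is older than the threshold."""
        if self.last_relist is None:
            return True  # nothing has run yet, so there is nothing to judge
        return self.clock() - self.last_relist <= RELIST_THRESHOLD_S
```

The point of the model is that the timestamp only advances when the runtime query returns. If the runtime hangs inside list_containers, relist() never completes, the timestamp goes stale, and healthy() starts returning False once the threshold has passed.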

PLEG workflow diagram

Why “PLEG is not healthy” Happens

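The error usually indicates that the container runtime (here, the Docker daemon) is unhealthy or slow to respond, so PLEG cannot finish its checks in time. Historically Docker used a monolithic daemon, but modern Docker delegates lifecycle management to containerd and runc. PLEG checks the runtime through its own relist() function, which asks the runtime for the state of every pod and container on the node, similar to running docker ps and docker inspect on all containers. If anything in that chain (Docker, containerd, or runc) hangs, relist() hangs with it.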

Docker runtime architecture diagram

Common Scenarios Leading to the Error

Container runtime becomes unresponsive or times out (e.g., Docker daemon hangs).

Too many containers on the node, causing the relist process to exceed the 3‑minute timeout.

A deadlock bug in relist (fixed in Kubernetes 1.14).

Network issues, such as CNI problems while retrieving pod network status.

Investigation Steps

1. On the problematic node, top shows a scope process consuming 100% CPU. This is a systemd scope unit (systemd.scope), which manages a group of externally created processes.

2. docker ps hangs, confirming the runtime is stuck.

3. Alibaba’s Kubernetes troubleshooting guide links this combination of symptoms to systemd.
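The hang check from step 2 can be automated so that it does not block a terminal or a monitoring script. The following is a rough sketch, assuming a 10-second budget is a reasonable cutoff for docker ps on the node; the function names and the timeout value are illustrative choices, not values from the original investigation.

```python
import subprocess
import sys
from typing import Sequence


def command_hangs(cmd: Sequence[str], timeout_s: float) -> bool:
    """Return True if cmd does not finish within timeout_s seconds."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # on expiry the child is killed and TimeoutExpired is raised
            check=False,        # a non-zero exit code is not a hang
        )
    except subprocess.TimeoutExpired:
        return True
    return False


def main() -> int:
    """Probe the Docker CLI; the 10-second budget is an assumed value."""
    hung = command_hangs(["docker", "ps"], timeout_s=10.0)
    print("docker ps hung" if hung else "docker ps responded", file=sys.stderr)
    return 1 if hung else 0
```

Calling sys.exit(main()) on the affected node gives an exit code that a cron job or monitoring agent can act on. A failing docker ps that returns quickly counts as responsive here, because the symptom being tested is a hang rather than an error.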

What is D‑Bus?

D‑Bus is an inter‑process communication mechanism on Linux.

D‑Bus architecture

RunC and D‑Bus Interaction

RunC (the container runtime) writes to a D‑Bus socket with an org.freedesktop field, where it can become blocked.

RunC blocked on D‑Bus

Resolution

Restarting systemd with systemctl daemon-reexec clears the blockage, and the node returns to Ready. The root cause is a bug in the installed systemd version; upgrading systemd to v242‑rc2 and rebooting the host resolves the issue permanently.
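To find which nodes still run an affected systemd, the version can be read from the first line of systemctl --version output, which begins with the word systemd followed by the version number. The sketch below assumes that comparing the major number against 242 is precise enough; it treats any 242 build as fixed, although the article only names v242‑rc2. The function names are hypothetical.

```python
import re
from typing import Optional

FIXED_MAJOR = 242  # from the article: upgrading to v242-rc2 resolved the issue


def systemd_major(version_output: str) -> Optional[int]:
    """Extract the major version from `systemctl --version` output."""
    match = re.match(r"\s*systemd\s+(\d+)", version_output)
    return int(match.group(1)) if match else None


def needs_upgrade(version_output: str, fixed_major: int = FIXED_MAJOR) -> bool:
    """True if the reported systemd is older than the version named as the fix."""
    major = systemd_major(version_output)
    if major is None:
        raise ValueError("could not find a systemd version in the output")
    return major < fixed_major
```

The text passed in would typically come from capturing the output of systemctl --version on each node. Distribution packages sometimes backport fixes without changing the major number, so a True result is a prompt to investigate rather than proof that the bug is present.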

Summary

In this case the “PLEG is not healthy” error was triggered by a malfunctioning systemd, which blocked the container runtime and kept it from responding. Running systemctl daemon-reexec restores the node temporarily; upgrading systemd to a fixed version and rebooting the host resolves the problem for good.

References

Kubelet: Pod Lifecycle Event Generator (PLEG)

Kubelet: Runtime Pod Cache

relist() in kubernetes/pkg/kubelet/pleg/generic.go

Past bug about CNI — PLEG is not healthy error, node marked NotReady

https://www.infoq.cn/article/t_ZQeWjJLGWGT8BmmiU4

https://cloud.tencent.com/developer/article/1550038

Written by Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
