7 Quick Ways to Diagnose a Kubernetes Pod Stuck in Pending

This guide walks through seven systematic troubleshooting directions for a Kubernetes Pod that remains in the Pending state: node resource shortages, taints and tolerations, node selectors and affinity, PVC binding issues, image pull problems, quota limits, and priority or topology constraints. Each direction comes with concrete commands, examples, and remediation steps to get the Pod running.

MaGe Linux Operations

Problem background

A Pod is the smallest schedulable unit in Kubernetes. After kubectl apply -f pod.yaml, a Pod normally moves through Pending → Running → Succeeded/Failed. A Pending phase of a few seconds is normal; if it lasts tens of seconds or minutes, either scheduling or container creation is blocked.

Scheduling chain:

1. User creates Pod → API Server writes to etcd

2. Scheduler watches unscheduled Pods, filters nodes, scores candidates and writes the chosen node name to spec.nodeName

3. Kubelet on the target node pulls the image, creates the container and starts it

If a Pod stays in Pending, either the Scheduler cannot find a suitable node (step 2) or the Kubelet fails before the container starts (step 3).
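The split between the scheduling step and the Kubelet step can be read off the Pod object itself. The following is a minimal sketch, assuming the Pod has been loaded as a dict (for example from kubectl get pod <pod-name> -n <namespace> -o json); the function name pending_stage is hypothetical.

```python
def pending_stage(pod: dict) -> str:
    """Classify where a Pending Pod is stuck, based on its parsed manifest.

    Returns "scheduler" when no node has been assigned yet, "kubelet" when a
    node is assigned but the containers have not started, and "not-pending"
    for any other phase.
    """
    if pod.get("status", {}).get("phase") != "Pending":
        return "not-pending"
    # The Scheduler records its decision in spec.nodeName.
    if pod.get("spec", {}).get("nodeName"):
        return "kubelet"
    return "scheduler"


if __name__ == "__main__":
    unscheduled = {"spec": {}, "status": {"phase": "Pending"}}
    scheduled = {"spec": {"nodeName": "node-1"}, "status": {"phase": "Pending"}}
    print(pending_stage(unscheduled))
    print(pending_stage(scheduled))
```

A result of "scheduler" points toward directions 1 to 3, 6 and 7 below, while "kubelet" points toward image pull or volume problems.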

Applicable scenarios

New workload on a cluster with unknown capacity

Application version upgrade that changes image size or resource requests

Node maintenance causing massive rescheduling and resource pressure

New taints or affinity rules added without updating existing Pods

Namespace quota or limit range exhausted after an admin adjustment

StatefulSet PVC with wrong StorageClass or access mode

Core commands (quick reference)

kubectl describe pod <pod-name> -n <namespace> – inspect Events for the first clue

kubectl get events -n <namespace> --sort-by='.lastTimestamp' – view recent events

kubectl top nodes / kubectl top pods -n <namespace> – show resource usage (requires Metrics Server)

kubectl get pod <pod-name> -n <namespace> -o yaml – inspect spec.nodeName, nodeSelector, affinity, tolerations, resources, status.conditions

kubectl get nodes -o wide and kubectl describe node <node-name> – node labels, taints, conditions and allocated resources

kubectl get pvc -n <namespace> and kubectl describe pvc <pvc-name> -n <namespace> – PVC status

kubectl get resourcequota -n <namespace> and kubectl describe resourcequota <quota-name> -n <namespace> – quota usage

kubectl get limitrange -n <namespace> and kubectl describe limitrange <name> -n <namespace> – default requests/limits

Direction 1 – Node resource insufficiency

Symptoms : Events contain Insufficient cpu, Insufficient memory or Insufficient ephemeral-storage.

The Scheduler filters nodes based on the sum of requests (not limits). If adding the new Pod would push the total requested CPU or memory on a node past the node’s allocatable capacity, the Scheduler rejects that node even if actual usage is low.
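The request-based filter described above can be sketched as a simple sum-and-compare. This is an illustration only, not the Scheduler's actual implementation: the function name and the flat dict layout are assumptions, and values are expected to be pre-converted to integers (millicores for CPU, bytes for memory).

```python
GIB = 1024 ** 3


def insufficient_resources(allocatable: dict, existing_requests: list, new_pod: dict) -> list:
    """Return the resources for which the node cannot fit the new Pod.

    allocatable:       {"cpu": millicores, "memory": bytes} for the node
    existing_requests: one dict per Pod already on the node, same keys
    new_pod:           requests of the Pod being scheduled, same keys
    """
    shortages = []
    for resource, capacity in allocatable.items():
        already_requested = sum(p.get(resource, 0) for p in existing_requests)
        # Only requests count here; limits and live usage are ignored.
        if already_requested + new_pod.get(resource, 0) > capacity:
            shortages.append(resource)
    return shortages


if __name__ == "__main__":
    # Hypothetical node: 4000m CPU and 30 GiB memory allocatable,
    # with 3500m and 28 GiB already requested by running Pods.
    node = {"cpu": 4000, "memory": 30 * GIB}
    running = [{"cpu": 3500, "memory": 28 * GIB}]
    print(insufficient_resources(node, running, {"cpu": 250, "memory": 4 * GIB}))
```

In this hypothetical case the CPU request fits but the memory request does not, which corresponds to an Insufficient memory event for that node.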

Steps :

Show the last lines of kubectl describe pod <pod> -n <ns> to see the warning.

Check the Pod’s resources.requests with kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].resources.requests}'. Reduce oversized requests (e.g. memory: 16Gi on an 8 Gi node).

If the Namespace has a LimitRange that supplies default requests, inspect it with kubectl get limitrange -n <ns> and kubectl describe limitrange <name> -n <ns>.

Inspect node‑level allocated resources: kubectl describe nodes | grep -A5 "Allocated resources". Example output shows cpu 3500m (87%) and memory 28Gi (93%). Ensure the new Pod’s requests fit into the remaining capacity.

Optionally view each node’s usage with kubectl top nodes. If requests are far higher than actual usage, lower them or enable a Vertical Pod Autoscaler.

Fixes (priority order) :

Option A : Decrease the Pod’s resources.requests (fast, and low risk as long as the requests still cover real usage).

Option B : Expand the cluster (add nodes or increase node size).

Option C : Evict low‑priority Pods ( kubectl get pods --all-namespaces --sort-by='.spec.priority' then delete after business confirmation).

Direction 2 – Taints & Tolerations

Symptoms : Events contain messages like 0/5 nodes are available: 1 node(s) had taint {key: value}, that the pod didn’t tolerate.

A taint on a node rejects Pods that lack a matching toleration. Common taints include the control-plane taint node-role.kubernetes.io/control-plane:NoSchedule (node-role.kubernetes.io/master:NoSchedule on older clusters), custom GPU taints, or taints added automatically for node health issues.

Steps :

List node taints: kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}' or kubectl describe node <node> | grep -A5 "Taints".

Show the Pod’s tolerations: kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.tolerations}'.

Compare taints and tolerations; add missing tolerations to the Pod or remove the offending taint.

Check node Conditions that automatically add taints (e.g. MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable) via kubectl describe node <node> | grep -A8 "Conditions:" (eight lines of context are enough to show the header and all standard conditions).
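The comparison of taints against tolerations in the steps above can be sketched in a few lines. This is a simplified illustration of the documented matching rules, assuming taints and tolerations are dicts shaped like the entries in spec.taints and spec.tolerations; the function names are hypothetical.

```python
def toleration_matches(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration tolerates a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key")
    if not key:
        # An empty key is only meaningful with Exists and matches all keys.
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(taints: list, tolerations: list) -> list:
    """Return the taints that would keep the Pod off this node."""
    # PreferNoSchedule is only a soft preference, so it never blocks.
    blocking = [t for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return [t for t in blocking
            if not any(toleration_matches(tol, t) for tol in tolerations)]


if __name__ == "__main__":
    gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
    print(untolerated_taints([gpu_taint], []))
    print(untolerated_taints([gpu_taint], [{"key": "gpu", "operator": "Exists"}]))
```

An empty result means taints are not the reason this node was filtered out, and another direction should be checked.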

Fixes :

Add the required toleration to the Pod/Deployment (YAML snippet omitted for brevity).

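Remove a stale taint: kubectl taint nodes <node> node.kubernetes.io/disk-pressure:NoSchedule- (note the trailing -). Condition-based taints such as this one are reapplied automatically while the underlying node condition persists, so resolve the condition first.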

If the node was manually cordoned, uncordon it: kubectl uncordon <node>.

Direction 3 – NodeSelector & Affinity

Symptoms : Events contain 0/5 nodes are available: 5 node(s) didn’t match node selector or similar affinity messages.

nodeSelector is a simple label match; nodeAffinity adds expressive rules (requiredDuringSchedulingIgnoredDuringExecution vs preferredDuringSchedulingIgnoredDuringExecution). PodAffinity/PodAntiAffinity control co‑location.

Steps :

Show the Pod’s selector and affinity: kubectl get pod <pod> -n <ns> -o yaml | grep -A10 "nodeSelector\|affinity".

List node labels: kubectl get nodes --show-labels or inspect a specific node with kubectl describe node <node> | grep -A2 "Labels:".

Cross‑check required labels. If the Pod asks for nodeSelector: disktype: ssd but no node has that label, the Scheduler rejects it.

If the affinity rule is too strict, consider changing requiredDuringSchedulingIgnoredDuringExecution to preferredDuringSchedulingIgnoredDuringExecution, or relax the match expression.

For PodAntiAffinity, ensure the topology key and replica count allow placement (e.g., with a required rule on kubernetes.io/hostname, three replicas on two nodes leave the third replica Pending).
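The label cross-check in the steps above can be sketched as follows. This is a simplified model, assuming node labels are available as plain dicts (for example extracted from kubectl get nodes -o json); it covers nodeSelector plus the In, NotIn, Exists and DoesNotExist operators of a single node affinity term, and deliberately leaves out Gt and Lt. The function names are hypothetical.

```python
def matches_expression(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, operator = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if operator == "In":
        return labels.get(key) in values
    if operator == "NotIn":
        # A node without the label also satisfies NotIn.
        return labels.get(key) not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {operator}")


def eligible_nodes(nodes: dict, node_selector: dict = None, expressions: list = None) -> list:
    """Return names of nodes whose labels satisfy the selector and all expressions.

    nodes maps node name -> label dict.
    """
    node_selector = node_selector or {}
    expressions = expressions or []
    result = []
    for name, labels in nodes.items():
        selector_ok = all(labels.get(k) == v for k, v in node_selector.items())
        expressions_ok = all(matches_expression(labels, e) for e in expressions)
        if selector_ok and expressions_ok:
            result.append(name)
    return sorted(result)


if __name__ == "__main__":
    cluster = {
        "node-a": {"disktype": "ssd", "zone": "z1"},
        "node-b": {"disktype": "hdd"},
    }
    print(eligible_nodes(cluster, {"disktype": "ssd"}))
    print(eligible_nodes(cluster, {"disktype": "nvme"}))
```

An empty list for the Pod's own selector reproduces the situation in which every node is reported as not matching.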

Fixes :

Add the missing label to a node: kubectl label node <node> disktype=ssd or edit the Pod/Deployment to remove the selector.

Convert a hard requirement to a soft preference by switching to preferredDuringSchedulingIgnoredDuringExecution.

Relax anti‑affinity rules or increase node count.

Direction 4 – PersistentVolumeClaim binding issues

Symptoms : The Pod stays Pending while the PVC is Pending. The Pod’s Events may not show PVC errors, so the PVC must be inspected directly.

Root causes include missing StorageClass, failing provisioner, mismatched PV capacity or access mode, or an exhausted PV pool.

Steps :

Check PVC status: kubectl get pvc -n <ns>. If Status: Pending, run kubectl describe pvc <pvc> -n <ns> and look at the Events section.

Verify the referenced StorageClass exists: kubectl get storageclass and kubectl describe storageclass <sc>.

For static PVs, list PVs with kubectl get pv and ensure STATUS is Available, capacity matches the claim, and storageClassName aligns.

Check that the provisioner pod for the StorageClass (e.g., AWS EBS CSI, NFS) is running: kubectl get pods -n kube-system | grep -E "ebs|csi|nfs|provisioner".
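For static PVs, the checks in the steps above (status, capacity, access mode, StorageClass) can be sketched as a small comparison. This is a deliberately simplified model: the flat dict layout and field names such as capacity_gi are assumptions made for illustration, and real binding also considers label selectors, volumeMode and node affinity.

```python
def pv_rejections(pv: dict, claim: dict) -> list:
    """List the reasons a static PV cannot satisfy a PVC (empty list = compatible)."""
    reasons = []
    if pv["phase"] != "Available":
        reasons.append(f"PV phase is {pv['phase']}, not Available")
    if pv["capacity_gi"] < claim["request_gi"]:
        reasons.append("PV capacity is smaller than the requested storage")
    # Every access mode the claim asks for must be offered by the PV.
    missing_modes = set(claim["access_modes"]) - set(pv["access_modes"])
    if missing_modes:
        reasons.append("missing access modes: " + ", ".join(sorted(missing_modes)))
    if pv.get("storage_class", "") != claim.get("storage_class", ""):
        reasons.append("storageClassName differs")
    return reasons


if __name__ == "__main__":
    volume = {"phase": "Available", "capacity_gi": 10,
              "access_modes": ["ReadWriteOnce"], "storage_class": "fast"}
    claim = {"request_gi": 20, "access_modes": ["ReadWriteMany"],
             "storage_class": "fast"}
    for reason in pv_rejections(volume, claim):
        print(reason)
```

Running the check against every PV in kubectl get pv output shows quickly whether any volume could ever bind to the Pending claim.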

Fixes :

Create the missing StorageClass (example YAML omitted).

Delete a mismatched PVC and recreate it with correct spec.resources.requests.storage and spec.storageClassName.

If the CSI driver pod is failing, inspect its logs and fix IAM permissions, API limits, or quota issues.

Direction 5 – Image pull problems

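Symptoms : The Pod has already been scheduled, but its phase is still Pending because no container has started. The STATUS column of kubectl get pods shows ContainerCreating, ErrImagePull or ImagePullBackOff, and Events contain Failed to pull image, ErrImagePull or ImagePullBackOff.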

Steps :

Confirm the Pod’s conditions (PodScheduled should be True once a node has been assigned): kubectl describe pod <pod> -n <ns> | grep -A20 "Conditions:".

Search Events for image‑related messages: kubectl describe pod <pod> -n <ns> | grep -A5 -E "Pull|image|Image".

Check imagePullSecrets: kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.imagePullSecrets}'. If empty and the registry is private, create a Docker registry secret and attach it to the ServiceAccount.

Manually pull the image on the target node (Docker or crictl) to verify network and credentials.

Validate that the image tag exists in the registry; correct typos such as latests instead of latest.

If the registry is unreachable, test connectivity with curl -v https://<registry-host>/v2/ and DNS with nslookup <registry-host>.
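The image-related waiting reasons searched for in the steps above are also recorded per container in the Pod status. The following is a minimal sketch, assuming the Pod has been loaded as a dict from kubectl get pod <pod> -n <ns> -o json; the function name and the exact set of reasons checked are choices made for this example.

```python
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def image_pull_failures(pod: dict) -> dict:
    """Map container name -> waiting reason for containers stuck on image pulls."""
    status = pod.get("status", {})
    statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
    failures = {}
    for container_status in statuses:
        # state.waiting is only present while the container has not started.
        waiting = container_status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_PULL_REASONS:
            failures[container_status["name"]] = waiting["reason"]
    return failures


if __name__ == "__main__":
    pod = {"status": {"containerStatuses": [
        {"name": "app", "state": {"waiting": {"reason": "ImagePullBackOff"}}},
        {"name": "sidecar", "state": {"running": {}}},
    ]}}
    print(image_pull_failures(pod))
```

A non-empty result names the container whose image reference or credentials need attention.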

Fixes :

Correct the image tag or repository name.

Create or fix the imagePullSecrets (e.g., kubectl create secret docker-registry regcred ... and patch the ServiceAccount).

Ensure firewall rules allow outbound traffic to the registry (usually port 443) and that the node has internet access.

Direction 6 – ResourceQuota / LimitRange restrictions

Symptoms : Events show errors like exceeded quota: compute-resources, requested: cpu=500m,memory=2Gi. Because admission control rejects the request, the Pod object may never be created, so the error appears on the owning ReplicaSet or Job rather than on a Pod.

Principle : ResourceQuota caps namespace‑wide CPU, memory, PVC count, etc. LimitRange supplies default requests/limits and per‑container caps.

Steps :

Inspect ReplicaSet events (if the Pod was never created) with kubectl describe rs <rs> -n <ns> | tail -20.

List and describe the namespace’s quotas: kubectl get resourcequota -n <ns> and kubectl describe resourcequota <quota> -n <ns>. Example output shows used vs hard limits for requests.cpu, requests.memory, etc.

Check LimitRange defaults: kubectl get limitrange -n <ns> and kubectl describe limitrange <name> -n <ns>. If the default request exceeds the remaining quota, the Pod will be rejected.
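The used-versus-hard comparison in the steps above can be sketched as simple arithmetic. This is an illustration under assumptions: the function name is hypothetical, the quota numbers are invented, and all values are expected as integers in one unit per resource (millicores for CPU, GiB for memory here).

```python
def quota_violations(hard: dict, used: dict, requested: dict) -> dict:
    """Return resource -> amount by which used + requested would exceed hard."""
    violations = {}
    for resource, limit in hard.items():
        total = used.get(resource, 0) + requested.get(resource, 0)
        if total > limit:
            violations[resource] = total - limit
    return violations


if __name__ == "__main__":
    # Hypothetical namespace quota; the request mirrors cpu=500m, memory=2Gi.
    hard = {"requests.cpu": 4000, "requests.memory": 16}
    used = {"requests.cpu": 3800, "requests.memory": 12}
    new_pod = {"requests.cpu": 500, "requests.memory": 2}
    print(quota_violations(hard, used, new_pod))
```

In this made-up namespace the memory request fits, but the CPU request overshoots the quota by 300m, so admission would reject the Pod until quota is raised, usage is freed, or the request is reduced.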

Fixes :

Option A : Increase the quota limits (e.g., kubectl edit resourcequota <quota> -n <ns> or kubectl patch resourcequota <quota> -n <ns> --patch '{"spec":{"hard":{"requests.cpu":"6","requests.memory":"24Gi"}}}').

Option B : Clean up unused Pods, Jobs, or completed workloads to free quota.

Option C : Reduce the Pod’s requests to fit within the existing quota.

Direction 7 – PriorityClass preemption & TopologySpreadConstraints

Symptoms : Pod stays Pending with no obvious resource, taint or selector errors. The cluster may be using priority‑based preemption or strict topology spread rules.

Principle : A high‑priority Pod can preempt lower‑priority Pods unless a PodDisruptionBudget blocks eviction. TopologySpreadConstraints enforce even distribution; with DoNotSchedule the Scheduler will refuse placement if the constraint cannot be satisfied.

Steps :

Check the Pod’s priority class: kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.priorityClassName}' and then kubectl describe priorityclass <pc> for value and globalDefault.

Search for preemption events: kubectl get events -n <ns> --sort-by='.lastTimestamp' | grep Preempt. Look for Preempting or FailedPreemption.

If a PodDisruptionBudget limits eviction, describe it: kubectl get pdb -n <ns>; kubectl describe pdb <pdb> -n <ns>.

Inspect topology constraints: kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.topologySpreadConstraints}' | python3 -m json.tool. Pay attention to maxSkew, topologyKey and whenUnsatisfiable (DoNotSchedule vs ScheduleAnyway).

Verify actual Pod distribution across the topology key (e.g., zones) with kubectl get pods -n <ns> -l app=<app> -o wide or a JSONPath that prints node names.
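Once the per-domain Pod counts are known, the maxSkew rule can be checked by hand or with a few lines of code. The following is a simplified sketch of the DoNotSchedule check for a single constraint: placing one more Pod in a domain is allowed only if that domain's new count minus the current minimum across all domains stays within maxSkew. The function name is hypothetical, and the counts dict must include eligible domains that currently hold zero matching Pods.

```python
def allowed_domains(counts: dict, max_skew: int) -> list:
    """Return the topology domains where one more matching Pod may be placed.

    counts maps domain (e.g. zone name) -> number of matching Pods there.
    """
    if not counts:
        return []
    global_min = min(counts.values())
    # Skew after placement = (count in domain + 1) - current global minimum.
    return sorted(domain for domain, n in counts.items()
                  if n + 1 - global_min <= max_skew)


if __name__ == "__main__":
    distribution = {"zone-a": 2, "zone-b": 1, "zone-c": 0}
    print(allowed_domains(distribution, max_skew=1))
    print(allowed_domains(distribution, max_skew=2))
```

If the only allowed domain has no node with free capacity or matching labels, the Pod stays Pending even though other zones have room.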

Fixes :

Adjust the PriorityClass value, loosen restrictive PDB settings, or increase the number of replicas of the lower‑priority workload so that its PDB leaves room for eviction.

Relax the topology constraint by changing whenUnsatisfiable to ScheduleAnyway or increasing maxSkew.

If the cluster has only one zone, change topologyKey to kubernetes.io/hostname or another appropriate key.

Additional troubleshooting techniques

Scheduler logs : kubectl logs -n kube-system -l component=kube-scheduler --tail=100. Identify the leader pod and increase verbosity with --v=4 if needed.

Node readiness and schedulability : kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\t"}{range .status.conditions[?(@.type=="Ready")]}{.status}{end}{"\n"}{end}' to spot cordoned or NotReady nodes.
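The same readiness check can be run over JSON output when the cluster is large. This is a minimal sketch, assuming the input is the parsed result of kubectl get nodes -o json; the function name problem_nodes is hypothetical.

```python
def problem_nodes(node_list: dict) -> dict:
    """Map node name -> list of reasons the node cannot accept new Pods."""
    problems = {}
    for node in node_list.get("items", []):
        reasons = []
        # spec.unschedulable is set to true by kubectl cordon / drain.
        if node.get("spec", {}).get("unschedulable"):
            reasons.append("cordoned")
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c.get("status") for c in conditions if c.get("type") == "Ready"),
                     "Unknown")
        if ready != "True":
            reasons.append(f"Ready={ready}")
        if reasons:
            problems[node["metadata"]["name"]] = reasons
    return problems


if __name__ == "__main__":
    sample = {"items": [
        {"metadata": {"name": "node-1"}, "spec": {},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-2"}, "spec": {"unschedulable": True},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]}
    print(problem_nodes(sample))
```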

Resource units : CPU is expressed in millicores ( 500m = 0.5 CPU). Memory uses binary units ( Mi, Gi).
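To compare requests against node capacity programmatically, the quantity strings first need converting to plain numbers. The following is a minimal sketch covering the common cases only (millicore and whole-core CPU values, binary and decimal memory suffixes); it is not a full implementation of the Kubernetes quantity format, and the function names are hypothetical.

```python
BINARY_SUFFIXES = {"Ki": 2 ** 10, "Mi": 2 ** 20, "Gi": 2 ** 30, "Ti": 2 ** 40}
DECIMAL_SUFFIXES = {"k": 10 ** 3, "M": 10 ** 6, "G": 10 ** 9, "T": 10 ** 12}


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m", "2" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def parse_memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi", "8Gi" or "1G" to bytes."""
    for suffix, factor in {**BINARY_SUFFIXES, **DECIMAL_SUFFIXES}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # no suffix: the value is already in bytes


if __name__ == "__main__":
    print(parse_cpu_millicores("500m"), parse_cpu_millicores("0.5"))
    # A 16Gi request can never fit on a node with 8Gi of memory.
    print(parse_memory_bytes("16Gi") > parse_memory_bytes("8Gi"))
```

Note that Gi and G differ: 1Gi is 1073741824 bytes while 1G is 1000000000 bytes, so mixing the two in requests and capacity estimates skews the comparison.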

Container runtime differences : For clusters using containerd, use crictl or ctr instead of Docker commands.

Dry‑run validation : kubectl apply -f pod.yaml --dry-run=server -n <ns> checks admission controllers without creating the object.

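SchedulingGate (introduced as alpha in K8s 1.26, enabled by default from 1.27) : Verify spec.schedulingGates is empty; otherwise the Scheduler skips the Pod and kubectl get pods reports it as SchedulingGated.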

Production‑level operational guidelines

Risks of evicting Pods

Check for PodDisruptionBudget protection before using kubectl drain or deleting Pods.

Confirm Service endpoints have multiple ready Pods to avoid downtime.

Be aware that draining a node evicts every Pod on that node except DaemonSet‑managed and static Pods; drain one node at a time during low‑traffic windows.

Risks of changing scheduling policies

Modifying tolerations, node selectors or affinity in a Deployment’s Pod template triggers a rolling update.

Editing a node directly (kubectl edit node) takes effect immediately for new scheduling but does not evict existing Pods, unless the change adds a NoExecute taint.

Always back up the current Deployment YAML before changes.

Expanding nodes

In cloud environments nodes become Ready within minutes; in self‑managed clusters allow extra time.

Verify node status with kubectl get nodes before confirming the issue is resolved.

If using Cluster Autoscaler, check its logs for errors.

Adjusting ResourceQuota

Ensure the cluster actually has the resources you are allocating.

Prefer patching the quota over deleting it.

After increasing a quota, recreate affected Pods or trigger a rollout.

Execution window recommendations

Adjust Deployment requests/limits – low‑traffic period (medium risk).

kubectl drain a node – low‑traffic period (high risk).

Modify node taints – maintenance window (high risk).

Delete PVC – maintenance window (critical, possible data loss).

Adjust ResourceQuota – any time (low risk).

Add node label – any time (low risk).

Scale out nodes – any time (low risk).

Summary

Start with kubectl describe pod and examine Events for the first clue.

If the event points to resource shortage, lower requests or add capacity.

If it points to a taint, add the matching toleration or remove the taint.

If it points to a selector or affinity, align node labels or relax the rule.

If no clear event appears, check PVC status, ResourceQuota, image pull errors, priority/preemption, and topology constraints.

After any fix, verify that a Scheduled event appears and the Pod reaches Running.
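The decision flow in this summary can be sketched as a keyword lookup over the event message. This is only a triage aid under assumptions: the keywords are taken from the example messages quoted in this article plus the PersistentVolumeClaim wording, exact message text varies between Kubernetes versions, and the function name triage is hypothetical.

```python
# (keyword, direction) pairs, matched case-insensitively against the message.
RULES = [
    ("insufficient", "Direction 1: node resources"),
    ("taint", "Direction 2: taints and tolerations"),
    ("node selector", "Direction 3: nodeSelector and affinity"),
    ("affinity", "Direction 3: nodeSelector and affinity"),
    ("persistentvolumeclaim", "Direction 4: PVC binding"),
    ("failed to pull image", "Direction 5: image pull"),
    ("errimagepull", "Direction 5: image pull"),
    ("imagepullbackoff", "Direction 5: image pull"),
    ("exceeded quota", "Direction 6: ResourceQuota / LimitRange"),
    ("preempt", "Direction 7: priority and topology"),
    ("topology spread", "Direction 7: priority and topology"),
]


def triage(message: str) -> list:
    """Return the troubleshooting directions suggested by an event message."""
    lowered = message.lower()
    directions = []
    for keyword, direction in RULES:
        if keyword in lowered and direction not in directions:
            directions.append(direction)
    return directions


if __name__ == "__main__":
    event = ("0/5 nodes are available: 1 node(s) had taint {key: value}, "
             "that the pod didn't tolerate, 4 Insufficient cpu.")
    for direction in triage(event):
        print(direction)
```

An empty result corresponds to the "no clear event" branch above, where PVC status, quotas, preemption and topology constraints are checked one by one.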
