
How a Misconfigured Liveness Probe Crashed a Service – Lessons & Fixes

An overnight outage at a financial firm, caused by a misconfigured Kubernetes liveness probe that returned 200 before the app was ready, led to massive losses; the article explains the difference between liveness and readiness probes, proper configuration examples, real‑world scenarios, troubleshooting steps, and best‑practice recommendations to avoid similar failures.

Full-Stack DevOps & Kubernetes

Incident Overview

At 02:00, monitoring reported an outage in the payment service. Multiple microservices entered CrashLoopBackOff, and the firm lost more than CNY 1 million in 37 minutes. The root cause was a misconfigured liveness probe whose endpoint returned 200 before the application was fully ready.

Understanding Liveness and Readiness Probes

Liveness Probe

Question: Is the application dead?

Failure consequence: Kubernetes restarts the container.

Typical use‑cases: Detect deadlocks, thread hangs, unrecoverable exceptions.

Key principle: Must be idempotent and have no side effects.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # should exceed the longest expected startup time
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Readiness Probe

Question: Is the application ready to receive traffic?

Failure consequence: Pod is removed from Service endpoints.

Typical use‑cases: Slow start‑up, initialization of resources, external dependencies not ready.

Key principle: Prevent traffic from reaching a pod that cannot serve requests.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 2

In one line: liveness asks whether the container is still alive, readiness asks whether it can serve traffic right now.
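To show how the two probes sit side by side on one container, here is a minimal sketch. The probe values are copied from the snippets above; the Deployment name, labels, replica count, and image are hypothetical placeholders, not details from the incident.

```yaml
# Sketch only: name, labels, replicas, and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: app
          image: example.com/payment-service:1.0   # placeholder image
          ports:
            - containerPort: 8080
          livenessProbe:             # failure restarts the container
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:            # failure removes the pod from Service endpoints
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 2
```

Keeping the two probes on separate paths lets the application report "alive but not yet ready" during warm-up.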

Practical Health‑Check Strategies

Common Probe Types

HTTP GET: Most common for web services; any status code from 200 to 399 counts as success.

Exec: Runs a command inside the container, e.g. cat /tmp/healthy; exit code 0 counts as success.

TCP Socket: Checks that a connection to the port can be opened; useful for databases, Redis, etc.
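The HTTP form appears in the snippets earlier in the article; the other two types might look like the sketch below. The Redis port and the timing values are assumptions chosen for illustration.

```yaml
# Exec probe: passes when the command exits with status 0.
livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  initialDelaySeconds: 5    # assumed value
  periodSeconds: 5          # assumed value

# TCP probe: passes when a connection to the port can be opened.
readinessProbe:
  tcpSocket:
    port: 6379              # Redis default port, used here as an example
  initialDelaySeconds: 5    # assumed value
  periodSeconds: 10         # assumed value
```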

Real‑World Scenarios and Best Practices

Scenario 1 – Long Startup Dependencies

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 20
  failureThreshold: 10   # give dependent services enough time

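When the application needs extra time to start, raise initialDelaySeconds and failureThreshold so the probe tolerates the warm-up period. A failing readiness probe does not restart the container; it only keeps the pod out of Service endpoints until the check passes.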
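For slow starters, Kubernetes also provides a startupProbe, which the configuration above does not use. While it is still failing, liveness and readiness checks are held back, so the liveness probe can stay strict without a large initialDelaySeconds. A minimal sketch, assuming the /healthz endpoint from earlier and a startup budget of about five minutes:

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # assumed budget: up to 30 x 10 s = 300 s to start
livenessProbe:           # only begins once the startup probe has succeeded
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

If the startup probe exhausts its budget, the container is restarted, so the budget should sit above the slowest startup you have observed.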

Scenario 2 – Over‑eager Liveness Probe

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 2
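
Frequent checks combined with a short timeout can restart a healthy pod that is merely slow under load, so keep the interval generous and the failure threshold above one.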
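The snippet above does not set timeoutSeconds, so the Kubernetes default of 1 second applies to each check. A variant that spells out the tolerance might look like this; the timeout and threshold values are assumptions and should be tuned against the service's real latency:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 5      # assumed value; the default is 1 second when omitted
  failureThreshold: 3    # assumed value; restart only after 3 consecutive failures
```

With these numbers, a hung container is restarted after roughly 90 seconds of failed checks, while a single slow response is ignored.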

Scenario 3 – Mixed Probe for a Database Service

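livenessProbe:
  tcpSocket:
    port: 3306               # passes as soon as the MySQL port accepts connections
  initialDelaySeconds: 300   # wait 5 minutes before the first check

readinessProbe:
  exec:
    command:                 # passes only if the query succeeds (exit code 0)
      - mysql
      - -h127.0.0.1
      - -e
      - "SELECT 1"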

The liveness probe only checks that the port is open; the readiness probe verifies that the database actually answers queries before traffic is routed to it.

Common Pitfalls & Remedies

Probe too sensitive: Pods restart endlessly – increase failureThreshold and initialDelaySeconds.

Wrong readiness path: Pod never becomes ready – ensure the endpoint returns 200.

Heavy probe logic: Causes CPU spikes – use lightweight checks such as in‑memory flags.

Ignoring startup order: Dependencies not ready – use Init Containers or delayed strategies.
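For the startup-order pitfall, an init container can block the main container until a dependency is reachable. The sketch below is illustrative: the Service name payment-db, the images, and the port are hypothetical, and it assumes the nc in the chosen busybox image supports the -z flag.

```yaml
# Sketch only: "payment-db" and both images are hypothetical.
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command:
        - sh
        - -c
        # Loop until a TCP connection to the database port succeeds.
        - until nc -z payment-db 3306; do echo waiting for payment-db; sleep 2; done
  containers:
    - name: app
      image: example.com/payment-service:1.0   # placeholder image
```

Init containers run to completion before the app container starts, so the liveness and readiness probes on the app never see the "dependency missing" phase.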

Fast Troubleshooting Checklist

Inspect Probe Events

kubectl describe pod POD_NAME

Look for messages like “Liveness probe failed” or “Readiness probe failed”.

Check Pod Logs

kubectl logs POD_NAME
kubectl logs POD_NAME --previous   # output of the previous, crashed container instance

Verify Pod Ready State

kubectl get pods -o wide

The READY column shows ready containers / total containers, for example 1/1.

Conclusion

Liveness and readiness probes are core mechanisms for self‑healing and high availability in Kubernetes. Proper configuration enables automatic recovery, zero‑downtime deployments, traffic protection, and overall system resilience, while a single mis‑configuration can cause catastrophic outages.
