Operations 11 min read

Ultimate Production Incident Response Handbook: Quick Commands, Root Cause Analysis, and Preventive Architecture

This comprehensive guide presents a unified framework for diagnosing and resolving production incidents—covering CPU spikes, OOM, disk exhaustion, log overload, port failures, container crashes, Kubernetes pod issues, SSH attacks, I/O bottlenecks, MySQL connection limits, Redis memory saturation, message‑queue backlogs, deployment failures, certificate expirations, file‑handle exhaustion, time drift, mining malware, and DDoS—by providing rapid‑check commands, immediate remediation steps, root‑cause classification, and architectural safeguards.

Ray's Galactic Tech
Ray's Galactic Tech
Ray's Galactic Tech
Ultimate Production Incident Response Handbook: Quick Commands, Root Cause Analysis, and Preventive Architecture

The guide defines a consistent incident‑handling workflow: symptom → quick‑check commands → emergency remediation ("stop‑bleeding") → root‑cause classification → architecture‑level prevention. It then enumerates common production problems with concrete Linux, Docker, and Kubernetes commands, diagnostic tips, and long‑term safeguards.

0️⃣ Universal Quick‑Check Overview

CPU/Memory/Disk status: uptime, top, free -m, df -h,

ss -lntup | head

1️⃣ CPU 100% / High Load

Symptom : interface timeout, load surge, CPU 800%+

Quick‑check : top, ps aux --sort=-%cpu | head, top -Hp PID Remediation : kill -9 PID, systemctl restart app Root‑cause : infinite loop, Full GC, thread‑pool exhaustion, regex catastrophe

Prevention : CPU limits, circuit‑breaker/flow‑control, thread‑pool monitoring, expose JVM metrics to Prometheus

2️⃣ Memory Exhaustion / OOM

Symptom : service restart, Pod OOMKilled

Quick‑check : free -m, dmesg | tail, ps aux --sort=-%mem | head, kubectl describe pod xxx | grep -i oom Remediation :

kubectl set resources deployment app \
--limits=memory=2Gi --requests=memory=1Gi

Root‑cause : JVM Xmx > limit, memory leak, unbounded cache growth

Prevention : container resource specs, memory monitoring, heap‑dump analysis

3️⃣ Disk Full

Symptom : service cannot write files, MySQL "No space" error

Quick‑check : df -h, du -sh /* 2>/dev/null | sort -hr | head, lsof +L1 Remediation : rm -f /var/log/*.log, PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY); Root‑cause : log explosion, uncleaned binlog, temporary file accumulation

Prevention : logrotate, disk‑water‑level monitoring, partition isolation

4️⃣ Log Explosion

Symptom : dozens of GB of logs per day, disk fills

Quick‑check : du -sh /var/log/* Remediation : truncate with > app.log Root‑cause : debug logging left on, endless exception loops

Prevention : unified log levels, centralised ELK collection

5️⃣ Port Unreachable

Symptom : service runs but cannot be accessed

Quick‑check : ss -lntup | grep 8080, iptables -L -n, curl -I http://127.0.0.1:8080 Remediation : iptables -I INPUT -p tcp --dport 8080 -j ACCEPT Root‑cause : firewall block, service not listening

Prevention : security‑group standardisation, start‑up health‑check scripts

6️⃣ Docker Container Anomaly

Symptom : container keeps restarting

Quick‑check : docker ps -a, docker logs container Remediation : docker restart container Root‑cause : misconfiguration, dependent service not ready

Prevention : health‑check, dependency detection

7️⃣ Kubernetes Pod Issues

Symptom : CrashLoopBackOff, OOMKilled

Quick‑check : kubectl get pod, kubectl describe pod xxx, kubectl logs --previous xxx Remediation : kubectl rollout undo deployment app Root‑cause : config error, insufficient resources, bad image

Prevention : canary releases, resource specifications, CI YAML validation

8️⃣ SSH Brute‑Force

Symptom : rapid growth of /var/log/secure Quick‑check : grep "Failed password" /var/log/secure | tail Remediation : iptables -A INPUT -s IP -j DROP Root‑cause : public SSH exposure, weak passwords

Prevention : Fail2ban, disable root login, VPN‑only access

9️⃣ I/O Bottleneck

Symptom : low CPU but high response time (RT)

Quick‑check : iostat -x 1, iotop Remediation : restart slow‑query service, apply rate‑limiting

Root‑cause : slow SQL, disk bottleneck

Prevention : SSD storage, slow‑SQL monitoring

🔟 MySQL Connection Saturation

Symptom : "Too many connections" error

Quick‑check : show processlist; Remediation : kill 1234; Root‑cause : connection pool not released, traffic surge

Prevention : connection‑pool monitoring, circuit‑breaker

1️⃣1️⃣ Nginx 502/504

Symptom : 502 page

Quick‑check : tail -f /var/log/nginx/error.log Remediation : systemctl reload nginx Root‑cause : backend service down, timeout too low

Prevention : upstream health checks, reasonable timeout settings

1️⃣2️⃣ Redis Memory Full

Symptom : write failures

Quick‑check : redis-cli info memory Remediation : redis-cli flushall # emergency only Root‑cause : missing TTL, hot keys

Prevention : enforce TTL, LRU eviction policy

1️⃣3️⃣ Message‑Queue Backlog

Quick‑check : kafka-consumer-groups.sh --describe, rabbitmqctl list_queues Root‑cause : consumer dead, insufficient consumption capacity

Prevention : consumption rate‑limiting, horizontal scaling of consumers

1️⃣4️⃣ Deployment Failure

Symptom : many 500 errors after release

Quick‑check : kubectl rollout history deployment app Remediation : kubectl rollout undo deployment app Root‑cause : config error, code defect

Prevention : canary rollout, automatic rollback

1️⃣5️⃣ HTTPS Certificate Expiry

Symptom : browser certificate error

Quick‑check :

openssl s_client -connect domain:443 | openssl x509 -noout -dates

Remediation : replace certificate immediately

Prevention : automatic renewal, expiry alerts

1️⃣6️⃣ File‑Handle Exhaustion

Symptom : "too many open files"

Quick‑check : lsof | wc -l, ulimit -n Prevention : raise fd limits, manage connection pools

1️⃣7️⃣ Time Skew

Symptom : token invalid, log timestamps out of order

Quick‑check : date, ntpq -p Prevention : NTP synchronization across hosts

1️⃣8️⃣ Mining Malware

Symptom : CPU 100% with xmrig process

Quick‑check : ps aux | grep xmrig Remediation : kill -9 PID Prevention : regular vulnerability scanning, minimise exposed ports

1️⃣9️⃣ Early‑Stage DDoS

Symptom : massive SYN flood

Quick‑check : ss -s Prevention : cloud scrubbing services, high‑availability IPs

2️⃣0️⃣ Automated Inspection Script (On‑Site Rescue)

#!/bin/bash
uptime
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head
df -h
ss -lntup | head

The final block summarises the capability chain: "Incident detection → emergency mitigation → post‑mortem governance → architectural safeguards".

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

OperationsKubernetesLinuxIncident Responseproduction
Ray's Galactic Tech
Written by

Ray's Galactic Tech

Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.