Ultimate Production Incident Response Handbook: Quick Commands, Root Cause Analysis, and Preventive Architecture
This guide presents a unified framework for diagnosing and resolving production incidents. It covers CPU spikes, OOM, disk exhaustion, log overload, port failures, container crashes, Kubernetes pod issues, SSH attacks, I/O bottlenecks, MySQL connection limits, Nginx 502/504 errors, Redis memory saturation, message‑queue backlogs, deployment failures, certificate expirations, file‑handle exhaustion, time drift, mining malware, and DDoS. For each problem it provides rapid‑check commands, immediate remediation steps, root‑cause classification, and architectural safeguards.
The guide defines a consistent incident‑handling workflow: symptom → quick‑check commands → emergency remediation ("stop‑bleeding") → root‑cause classification → architecture‑level prevention. It then enumerates common production problems with concrete Linux, Docker, and Kubernetes commands, diagnostic tips, and long‑term safeguards.
0️⃣ Universal Quick‑Check Overview
CPU/Memory/Disk status: uptime, top, free -m, df -h
Listening ports: ss -lntup | head
1️⃣ CPU 100% / High Load
Symptom : API timeouts, load surge, CPU 800%+
Quick‑check : top, ps aux --sort=-%cpu | head, top -Hp PID
Remediation : kill -9 PID, systemctl restart app
Root‑cause : infinite loop, Full GC, thread‑pool exhaustion, catastrophic regex backtracking
Prevention : CPU limits, circuit‑breaker/flow‑control, thread‑pool monitoring, expose JVM metrics to Prometheus
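When the hot process is a JVM, the thread IDs shown by top -Hp PID can be matched against a thread dump. The sketch below assumes jstack is installed and that the JDK prints each thread's native ID as a hexadecimal nid field (JDK 8 and 11 do; check the format on your version). `tid_to_nid` is a hypothetical helper name.

```shell
# Convert a decimal thread ID (from `top -Hp PID`) to the hex form
# that jstack prints in its "nid=0x..." field.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Example usage (PID and TID are placeholders read from top):
#   jstack PID | grep -A 20 "nid=$(tid_to_nid TID)"
```

The grep then shows the stack of the thread that is burning CPU, which usually separates an application loop from GC threads.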
2️⃣ Memory Exhaustion / OOM
Symptom : service restart, Pod OOMKilled
Quick‑check : free -m, dmesg | tail, ps aux --sort=-%mem | head, kubectl describe pod xxx | grep -i oom
Remediation :
kubectl set resources deployment app \
  --limits=memory=2Gi --requests=memory=1Gi
Root‑cause : JVM Xmx > limit, memory leak, unbounded cache growth
Prevention : container resource specs, memory monitoring, heap‑dump analysis
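The "JVM Xmx > limit" cause can be caught before deployment with a small check. This is a minimal sketch: `to_mib` and `heap_fits` are hypothetical helper names, only whole-number sizes with the m/g and Mi/Gi suffixes are handled, and the 75% headroom is an assumed rule of thumb (the JVM also needs non-heap memory), not a value from this guide.

```shell
# Convert a JVM size (512m, 2g) or a Kubernetes quantity (512Mi, 2Gi) to MiB.
# Only these four suffixes and whole numbers are handled in this sketch.
to_mib() {
  case "$1" in
    *Gi)   echo $(( ${1%Gi} * 1024 )) ;;
    *Mi)   echo "${1%Mi}" ;;
    *[gG]) echo $(( ${1%[gG]} * 1024 )) ;;
    *[mM]) echo "${1%[mM]}" ;;
    *)     return 1 ;;
  esac
}

# Succeed only if the heap (-Xmx value) is at most 75% of the container limit.
# The 75% figure is an assumption; tune it for your workload.
heap_fits() (
  xmx=$(to_mib "$1") || exit 2
  limit=$(to_mib "$2") || exit 2
  [ $(( xmx * 100 )) -le $(( limit * 75 )) ]
)

# Example usage: heap_fits 1536m 2Gi || echo "Xmx too close to the memory limit"
```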
3️⃣ Disk Full
Symptom : service cannot write files, MySQL "No space left on device" error
Quick‑check : df -h, du -sh /* 2>/dev/null | sort -hr | head, lsof +L1
Remediation : rm -f /var/log/*.log, PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY); if lsof +L1 still lists a deleted log, its space is released only once the owning process closes or reopens the file
Root‑cause : log explosion, uncleaned binlog, temporary file accumulation
Prevention : logrotate, disk‑usage threshold monitoring, partition isolation
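Disk monitoring can start as a filter over df output. The sketch below is one possible shape, assuming POSIX `df -P` output (one line per filesystem) and mount points without spaces. `disk_over` is a hypothetical name and the threshold is whatever you pass in.

```shell
# Read `df -P` output on stdin and print "mountpoint usage%" for every
# filesystem at or above the given percentage.
# Mount points containing spaces are not handled in this sketch.
disk_over() {
  awk -v t="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t + 0) print $6, use "%"
  }'
}

# Example usage: df -P | disk_over 80
```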
4️⃣ Log Explosion
Symptom : dozens of GB of logs per day, disk fills
Quick‑check : du -sh /var/log/*
Remediation : truncate with > app.log
Root‑cause : debug logging left on, endless exception loops
Prevention : unified log levels, centralised ELK collection
5️⃣ Port Unreachable
Symptom : service runs but cannot be accessed
Quick‑check : ss -lntup | grep 8080, iptables -L -n, curl -I http://127.0.0.1:8080
Remediation : iptables -I INPUT -p tcp --dport 8080 -j ACCEPT
Root‑cause : firewall block, service not listening
Prevention : security‑group standardisation, start‑up health‑check scripts
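A start‑up health check can confirm both that the port is open and which address it is bound to. A listener bound only to 127.0.0.1 answers a local curl but is unreachable from other hosts. The sketch below assumes the column layout of `ss -lnt`, where Local Address:Port is the fourth field; with `ss -lntup` an extra Netid column shifts it to the fifth. `listeners_on` is a hypothetical name.

```shell
# Read `ss -lnt` output on stdin and print every local address bound to
# the given port. Exits non-zero when nothing is listening on it.
# Assumes the `ss -lnt` layout, where Local Address:Port is field 4.
listeners_on() {
  awk -v p=":$1" 'NR > 1 {
    addr = $4
    if (substr(addr, length(addr) - length(p) + 1) == p) { print addr; found = 1 }
  } END { exit !found }'
}

# Example usage: ss -lnt | listeners_on 8080 || echo "nothing listening on 8080"
```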
6️⃣ Docker Container Anomaly
Symptom : container keeps restarting
Quick‑check : docker ps -a, docker logs container
Remediation : docker restart container
Root‑cause : misconfiguration, dependent service not ready
Prevention : health‑check, dependency detection
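Dependency detection can be as simple as a bounded retry in the container's entrypoint, so the application starts only once its dependency answers. This is a minimal sketch: `retry` is a hypothetical helper, and the host `db`, port 5432, and `./app` in the usage line are placeholders, not values from this guide.

```shell
# Run a command until it succeeds, up to a maximum number of attempts,
# sleeping between attempts. Returns 0 on the first success, 1 otherwise.
# Arguments: ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" && exit 0
    [ "$i" -ge "$attempts" ] && exit 1
    i=$(( i + 1 ))
    sleep "$delay"
  done
)

# Example usage (placeholder host, port, and binary):
#   retry 30 2 nc -z db 5432 && exec ./app
```

Failing after a fixed number of attempts keeps the restart loop visible in docker ps -a instead of hiding it behind an endless wait.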
7️⃣ Kubernetes Pod Issues
Symptom : CrashLoopBackOff, OOMKilled
Quick‑check : kubectl get pod, kubectl describe pod xxx, kubectl logs --previous xxx
Remediation : kubectl rollout undo deployment app
Root‑cause : config error, insufficient resources, bad image
Prevention : canary releases, resource specifications, CI YAML validation
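On a busy namespace, the kubectl get pod listing is easier to scan once healthy pods are filtered out. The sketch below assumes the default single-namespace layout (NAME, READY, STATUS, ...) and a hypothetical helper name, `unhealthy_pods`. It does not catch pods that are Running but not ready.

```shell
# Read `kubectl get pod` output on stdin and print "name status" for every
# pod whose STATUS column is neither Running nor Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example usage: kubectl get pod | unhealthy_pods
```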
8️⃣ SSH Brute‑Force
Symptom : rapid growth of /var/log/secure (/var/log/auth.log on Debian/Ubuntu)
Quick‑check : grep "Failed password" /var/log/secure | tail
Remediation : iptables -A INPUT -s IP -j DROP
Root‑cause : public SSH exposure, weak passwords
Prevention : Fail2ban, disable root login, VPN‑only access
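Before blocking anything, it helps to see which addresses account for most of the failures. The sketch below assumes the usual OpenSSH message form ("Failed password for ... from IP port ..."). `top_failed_ips` is a hypothetical name, and the optional argument is the number of rows to print (default 10).

```shell
# Read sshd log lines on stdin and print "ip count" for the most frequent
# sources of failed password attempts, highest count first.
top_failed_ips() {
  grep 'Failed password' |
    awk '{ for (i = 1; i < NF; i++) if ($i == "from") { print $(i + 1); break } }' |
    sort | uniq -c | sort -rn | head -n "${1:-10}" |
    awk '{ print $2, $1 }'
}

# Example usage: top_failed_ips 5 < /var/log/secure
```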
9️⃣ I/O Bottleneck
Symptom : low CPU but high response time (RT)
Quick‑check : iostat -x 1, iotop
Remediation : restart slow‑query service, apply rate‑limiting
Root‑cause : slow SQL, disk bottleneck
Prevention : SSD storage, slow‑SQL monitoring
🔟 MySQL Connection Saturation
Symptom : "Too many connections" error
Quick‑check : show processlist;
Remediation : kill 1234;
Root‑cause : connections not returned to the pool, traffic surge
Prevention : connection‑pool monitoring, circuit‑breaker
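Leaked pool connections typically show up in the process list as long-lived Sleep entries. The sketch below filters them out of batch-mode output. It assumes the tab-separated layout printed by `mysql -B -e 'SHOW PROCESSLIST'` (Id, User, Host, db, Command, Time, State, Info), and `idle_connections` is a hypothetical name. Review the printed IDs before killing anything.

```shell
# Read tab-separated SHOW PROCESSLIST output on stdin and print the Id of
# every connection that has been in the Sleep state for at least the given
# number of seconds.
idle_connections() {
  awk -F '\t' -v t="$1" 'NR > 1 && $5 == "Sleep" && $6 + 0 >= t + 0 { print $1 }'
}

# Example usage: mysql -B -e 'SHOW PROCESSLIST' | idle_connections 600
```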
1️⃣1️⃣ Nginx 502/504
Symptom : 502/504 error page
Quick‑check : tail -f /var/log/nginx/error.log
Remediation : systemctl reload nginx
Root‑cause : backend service down (502), timeout too low (504)
Prevention : upstream health checks, reasonable timeout settings
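The error log usually shows which of the two causes dominates. The sketch below counts two message fragments that Nginx writes for upstream problems, "connect() failed" and "upstream timed out". `upstream_errors` is a hypothetical name, and the mapping of each fragment to 502 or 504 is the typical case rather than a guarantee.

```shell
# Read an Nginx error log on stdin and count two common upstream failures:
# connection failures (typically surfacing as 502) and upstream timeouts
# (typically surfacing as 504).
upstream_errors() {
  awk '/connect\(\) failed/ { c++ }
       /upstream timed out/ { t++ }
       END { printf "connect_failed=%d timed_out=%d\n", c, t }'
}

# Example usage: upstream_errors < /var/log/nginx/error.log
```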
1️⃣2️⃣ Redis Memory Full
Symptom : write failures
Quick‑check : redis-cli info memory
Remediation : redis-cli flushall # emergency only, deletes every key
Root‑cause : missing TTL, hot keys
Prevention : enforce TTL, LRU eviction policy
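Memory pressure can be read from the same info memory output before writes start failing. The sketch below compares the used_memory and maxmemory fields; `redis_mem_pct` is a hypothetical name. A maxmemory of 0 means no limit is configured, which the function reports as "unlimited".

```shell
# Read `redis-cli info memory` output on stdin and print used memory as a
# whole-number percentage of maxmemory, or "unlimited" when maxmemory is 0.
# INFO lines end in CRLF, so carriage returns are stripped first.
redis_mem_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", used * 100 / max
    }'
}

# Example usage: redis-cli info memory | redis_mem_pct
```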
1️⃣3️⃣ Message‑Queue Backlog
Quick‑check : kafka-consumer-groups.sh --bootstrap-server BROKER --describe --group GROUP, rabbitmqctl list_queues
Root‑cause : consumer dead, insufficient consumption capacity
Prevention : consumption rate‑limiting, horizontal scaling of consumers
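For Kafka, the backlog is the sum of the LAG column across partitions. Because the column order differs between Kafka versions, the sketch below locates the column by its header instead of assuming a position. `total_lag` is a hypothetical name; partitions whose lag is shown as "-" are skipped.

```shell
# Read `kafka-consumer-groups.sh --describe` output on stdin and print the
# total lag across all partitions. The LAG column is found by header name.
total_lag() {
  awk 'col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
       $col ~ /^[0-9]+$/ { sum += $col }
       END { print sum + 0 }'
}

# Example usage (BROKER and GROUP are placeholders):
#   kafka-consumer-groups.sh --bootstrap-server BROKER --describe --group GROUP | total_lag
```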
1️⃣4️⃣ Deployment Failure
Symptom : many 500 errors after release
Quick‑check : kubectl rollout history deployment app
Remediation : kubectl rollout undo deployment app
Root‑cause : config error, code defect
Prevention : canary rollout, automatic rollback
1️⃣5️⃣ HTTPS Certificate Expiry
Symptom : browser certificate error
Quick‑check :
echo | openssl s_client -connect domain:443 -servername domain 2>/dev/null | openssl x509 -noout -dates
Remediation : replace certificate immediately
Prevention : automatic renewal, expiry alerts
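An expiry alert only needs the number of days left on the served certificate. The sketch below is one way to compute it, assuming openssl and GNU date (for `date -d`) are available. `days_left` and `cert_days_left` are hypothetical names, and the 14-day threshold in the usage line is an arbitrary example.

```shell
# Whole days between two Unix timestamps (expiry first, current time second).
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# Days until the certificate served by a host on port 443 expires.
# Requires openssl and GNU date; the host is sent as the SNI name as well.
cert_days_left() (
  end=$(echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null |
    openssl x509 -noout -enddate | cut -d= -f2)
  days_left "$(date -d "$end" +%s)" "$(date +%s)"
)

# Example usage (domain is a placeholder):
#   [ "$(cert_days_left domain)" -lt 14 ] && echo "certificate expires soon"
```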
1️⃣6️⃣ File‑Handle Exhaustion
Symptom : "too many open files"
Quick‑check : lsof | wc -l, ulimit -n
Prevention : raise fd limits, manage connection pools
1️⃣7️⃣ Time Skew
Symptom : token invalid, log timestamps out of order
Quick‑check : date, ntpq -p
Prevention : NTP synchronization across hosts
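The number that matters in the ntpq -p listing is the offset of the selected peer. The sketch below extracts it, assuming the standard ten-column layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter), in which ntpq marks the selected peer with a leading "*". `selected_offset_ms` is a hypothetical name.

```shell
# Read `ntpq -p` output on stdin and print the offset (in milliseconds) of
# the currently selected peer, which ntpq marks with a leading "*".
# Prints nothing when no peer is selected.
selected_offset_ms() {
  awk '/^\*/ { print $9 }'
}

# Example usage: ntpq -p | selected_offset_ms
```

Empty output is itself a finding: the host has no synchronised time source.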
1️⃣8️⃣ Mining Malware
Symptom : CPU 100% with xmrig process
Quick‑check : ps aux | grep xmrig
Remediation : kill -9 PID
Prevention : regular vulnerability scanning, minimise exposed ports
1️⃣9️⃣ Early‑Stage DDoS
Symptom : massive SYN flood
Quick‑check : ss -s
Prevention : cloud scrubbing services, DDoS‑protected IPs
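The summary from ss -s does not always break out half-open connections, so a per-state count from `ss -ant` can confirm a SYN flood. The sketch below counts sockets in the SYN-RECV state; `syn_recv_count` is a hypothetical name, and what counts as "too many" depends on the host's normal traffic.

```shell
# Read `ss -ant` output on stdin and count sockets in the SYN-RECV state.
syn_recv_count() {
  awk 'NR > 1 && $1 == "SYN-RECV" { n++ } END { print n + 0 }'
}

# Example usage: ss -ant | syn_recv_count
```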
2️⃣0️⃣ Automated Inspection Script (On‑Site Rescue)
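#!/bin/bash
# One-shot triage snapshot of load, top consumers, disk, and listening ports.
uptime                        # load averages
ps aux --sort=-%cpu | head    # top CPU consumers
ps aux --sort=-%mem | head    # top memory consumers
df -h                         # disk usage per filesystem
ss -lntup | head              # listening TCP/UDP sockets
The handbook closes by summarising the capability chain: "Incident detection → emergency mitigation → post‑mortem governance → architectural safeguards".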
