Encountering Nginx 502 Errors? A Step‑by‑Step Guide to Fast Troubleshooting
Nginx 502 Bad Gateway is one of the most frequent operational issues. This article outlines a systematic, layered approach, from checking Nginx error logs and backend service status to network connectivity, resource limits, timeout settings, and permission problems. It provides concrete commands, example scenarios, and preventive measures for quickly identifying and resolving the root cause.
Problem background
Nginx returns 502 Bad Gateway when it successfully receives a client request but fails to obtain a valid response from the upstream service. The error indicates a communication problem between Nginx and the backend, not an internal server error.
Common root‑cause categories
Backend service unavailable: crashed processes, OOM kills, ports not listening, firewall blocks, or services that start but do not respond.
Resource exhaustion: insufficient memory, CPU saturation, full disks, exhausted file descriptors, or system limits.
Misconfiguration: overly short proxy_connect_timeout, proxy_send_timeout, proxy_read_timeout, incorrect health-check intervals, or wrong load-balancing settings.
Permission / socket issues: mismatched run-user between Nginx and the backend, wrong Unix socket permissions, or SELinux/AppArmor blocks.
Network problems: different network zones, VPC security-group changes, Docker network misconfiguration, or Kubernetes service discovery failures.
Troubleshooting path and core commands
Step 1 – Inspect Nginx error log
Default location: /var/log/nginx/error.log
tail -n 100 /var/log/nginx/error.log
tail -f /var/log/nginx/error.log
Typical entries: connect() failed (111: Connection refused) or upstream prematurely closed connection.
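When the error log is long, a quick tally of the upstream-related messages can show which failure mode dominates. The following is a minimal sketch, not a complete classifier: the function name summarize_upstream_errors and its category labels are made up for illustration, and the pattern list covers only a handful of common Nginx messages.

```shell
#!/usr/bin/env bash
# Tally upstream-related messages in Nginx error-log text read from stdin.
# Prints "<count> <label>" lines, most frequent first.
# The labels and the pattern list are illustrative, not exhaustive.
summarize_upstream_errors() {
  awk '
    /Connection refused/                     { n["connection_refused"]++ }
    /upstream prematurely closed connection/ { n["prematurely_closed"]++ }
    /upstream timed out/                     { n["timed_out"]++ }
    /Permission denied/                      { n["permission_denied"]++ }
    /no live upstreams/                      { n["no_live_upstreams"]++ }
    END { for (k in n) print n[k], k }
  ' | sort -k1,1nr -k2,2
}
```

A typical call would be tail -n 1000 /var/log/nginx/error.log | summarize_upstream_errors. A large connection_refused count points toward Step 4 (backend not listening), while permission_denied points toward the socket scenario later in the article.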
Step 2 – Analyse access‑log status distribution
# Count 502 occurrences
grep " 502 " /var/log/nginx/access.log | wc -l
# Show recent 502 sources (client IP, request path, first token of the User-Agent in the combined log format)
grep " 502 " /var/log/nginx/access.log | tail -n 50 | awk '{print $1, $7, $12}'
# Top 20 URLs causing 502
grep " 502 " /var/log/nginx/access.log | awk -F'"' '{print $2}' | sort | uniq -c | sort -rn | head -20
Step 3 – Verify network connectivity
# Test backend port
curl -v http://127.0.0.1:8080/api 2>&1
# Measure response time
curl -w "\nTime: %{time_total}s\n" http://127.0.0.1:8080/api
# Check listening sockets
ss -tlnp | grep :8080
netstat -tlnp | grep :8080
Step 4 – Check backend service status
PHP‑FPM
systemctl status php-fpm
ps aux | grep php-fpm
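# Port 9000 speaks FastCGI, not HTTP, so curl cannot query it directly.
# With pm.status_path = /status set in the pool, query it with cgi-fcgi (from the FastCGI toolkit):
SCRIPT_NAME=/status SCRIPT_FILENAME=/status REQUEST_METHOD=GET cgi-fcgi -bind -connect 127.0.0.1:9000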
cat /var/log/php-fpm/www-error.log
grep -E "^pm" /etc/php-fpm.d/www.conf
Java (Tomcat/Jetty)
ps aux | grep java
tail -n 100 /var/log/tomcat/catalina.out
# The status page needs the Tomcat manager app and valid credentials (curl -u user:password)
curl -s http://127.0.0.1:8080/manager/status
Node.js / Python
ps aux | grep -E 'node|python'
ps -p $(pgrep -f "node server.js") -o pid,etime,cmd
tail -f /var/log/myapp/application.log
Step 5 – Examine server resource usage
# CPU
top -b -n 1 | head -20
# Memory
free -h
# Disk
df -h
# Process count
ps aux | wc -l
# Open file descriptors
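lsof 2>/dev/null | wc -l    # rough count; lsof can list the same file more than once
cat /proc/sys/fs/file-nr    # allocated handles, unused handles, system-wide maximum
cat /proc/sys/fs/file-max
ulimit -n
Step 6 – Review Nginx timeout configuration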
proxy_connect_timeout 60s;
proxy_send_timeout 60s;
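proxy_read_timeout 60s;
These values match the Nginx defaults of 60 seconds. Increase them if backend processing legitimately takes longer. A timeout that expires inside Nginx is normally reported as 504 Gateway Timeout, while 502 appears when the backend, or a timeout on the backend side, closes the connection before a complete response is sent.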
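To compare the configured timeouts with measured backend latency, it helps to see them all in seconds. The sketch below is one way to do that, under stated assumptions: the function names nginx_time_to_seconds and list_proxy_timeouts are hypothetical, only single values with the s, m, h, or d suffix (or a bare number of seconds) are handled, and each directive is expected on its own line.

```shell
#!/usr/bin/env bash
# Convert a single Nginx time value (60s, 2m, 1h, 1d, or a bare number of
# seconds) to whole seconds. Combined values such as "1h 30m" and the
# ms, w, M, y units are deliberately left out of this sketch.
nginx_time_to_seconds() {
  local value=$1 n unit
  [[ $value =~ ^([0-9]+)([smhd]?)$ ]] || return 1
  n=${BASH_REMATCH[1]}
  unit=${BASH_REMATCH[2]}
  case $unit in
    ''|s) echo $(( 10#$n )) ;;
    m)    echo $(( 10#$n * 60 )) ;;
    h)    echo $(( 10#$n * 3600 )) ;;
    d)    echo $(( 10#$n * 86400 )) ;;
  esac
}

# Read Nginx configuration text on stdin and print every proxy_*_timeout
# directive together with its value in seconds.
list_proxy_timeouts() {
  local name value secs
  while read -r name value _; do
    [[ $name == proxy_*_timeout ]] || continue
    value=${value%;}                      # drop the trailing semicolon
    secs=$(nginx_time_to_seconds "$value") || secs="unparsed"
    echo "$name $secs"
  done
}
```

One possible use is nginx -T 2>/dev/null | list_proxy_timeouts, which feeds the full effective configuration through the filter so that values set in included files are not missed.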
Typical scenarios and fixes
Scenario 1 – All PHP‑FPM workers died
Symptoms: every PHP page returns 502; error log shows connect() failed or no live workers.
# Verify PHP‑FPM is running
systemctl status php-fpm
ps aux | grep php-fpm
# Inspect PHP‑FPM error logs
cat /var/log/php-fpm/error.log
cat /var/log/php-fpm/www-error.log
# Restart if missing
systemctl start php-fpm
systemctl status php-fpm -l
journalctl -u php-fpm -n 50
Common causes:
Memory leak – set pm.max_requests = 500 to recycle workers.
Wrong socket/port – ensure listen = /var/run/php-fpm/php-fpm.sock matches Nginx fastcgi_pass.
Socket permission – set listen.owner = nginx, listen.group = nginx, listen.mode = 0660 and restart.
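The socket or port mismatch described above can be checked mechanically. The following sketch compares the pool's listen value with the first fastcgi_pass target in an Nginx file. The three function names are hypothetical, the file paths are passed in by the caller, and the comparison is purely textual, so a pool that listens on a bare port such as 9000 would need manual checking.

```shell
#!/usr/bin/env bash
# Print the "listen" value from a php-fpm pool file ($1).
# Lines such as listen.owner or commented-out entries are not matched.
fpm_listen_target() {
  sed -n -E 's/^[[:space:]]*listen[[:space:]]*=[[:space:]]*([^[:space:];]+).*/\1/p' "$1" | head -n 1
}

# Print the first fastcgi_pass target from an Nginx config file ($1),
# without the "unix:" prefix and the trailing semicolon.
nginx_fastcgi_target() {
  sed -n -E 's/^[[:space:]]*fastcgi_pass[[:space:]]+(unix:)?([^[:space:];]+);.*/\2/p' "$1" | head -n 1
}

# Compare both values; returns non-zero on a mismatch.
check_fpm_socket_match() {
  local fpm nginx
  fpm=$(fpm_listen_target "$1")
  nginx=$(nginx_fastcgi_target "$2")
  if [[ -n $fpm && $fpm == "$nginx" ]]; then
    echo "MATCH $fpm"
  else
    echo "MISMATCH php-fpm=$fpm nginx=$nginx"
    return 1
  fi
}
```

For example, check_fpm_socket_match /etc/php-fpm.d/www.conf /etc/nginx/conf.d/site.conf, where site.conf stands in for whichever file holds the PHP location block.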
Scenario 2 – Backend Java / Node.js / Python crash
# OOM detection
dmesg -T | grep -i "out of memory"
dmesg -T | grep -i "killed process"
# Service status
systemctl status myapp
journalctl -u myapp -n 100 | grep -E "Exit|error|fail"
If OOM is the cause, increase JVM heap or container limits:
export JAVA_OPTS="-Xms512m -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError"
A database connection timeout can also surface as a 502; test with:
mysql -h db-server -u appuser -p -e "SELECT 1"
grep -E "spring.datasource|spring.jpa|connection-pool" /app/config/application.yml
Scenario 3 – Memory exhaustion and OOM Killer
# Memory check
free -m
ps aux --sort=-%mem | head -15
dmesg -T | grep -i "out of memory"
dmesg -T | grep -i "killed process"
# Release pagecache (temporary)
sync && echo 3 > /proc/sys/vm/drop_caches
# Restart heavy services
systemctl restart myapp
Long-term: add swap and tune oom_score_adj for critical processes.
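Before tuning oom_score_adj, it is useful to know which processes the OOM killer has actually been choosing. The sketch below counts victims by process name from kernel log text. The function name oom_victims is hypothetical, and the expected line shape ("Killed process <pid> (<name>)") is an assumption based on common kernel output that may differ between kernel versions.

```shell
#!/usr/bin/env bash
# Read kernel log text on stdin and print "<count> <process-name>" for each
# process reported as killed by the OOM killer, most frequent first.
# Assumes lines containing "Killed process <pid> (<name>)"; names that
# contain spaces are cut at the first space.
oom_victims() {
  sed -n -E 's/.*[Kk]illed process [0-9]+ \(([^)]+)\).*/\1/p' \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2,2 \
    | awk '{ print $1, $2 }'
}
```

It can be fed from dmesg -T | oom_victims (root may be required to read the kernel log). If the backend itself tops the list, a memory limit or leak in that service is the more likely root cause than a noisy neighbour.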
Scenario 4 – Backend response timeout
# Measure latency
time curl -v http://backend:8080/api/heavy
# Inspect Nginx timeout settings
grep -E "timeout" /etc/nginx/conf.d/*.conf
# Check Java slow‑query logs
grep -E "slowquery|slow query" /var/log/myapp/application.log
# Database performance
mysql -h db -u app -p -e "SHOW FULL PROCESSLIST;"
Increase the Nginx timeouts (for example to 120 s) and optimise backend processing.
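If the access log records upstream timing, slow endpoints can be listed directly instead of probed one at a time. The sketch below assumes each log line contains the literal text "upstream_response_time: <seconds>" and carries the request in the first pair of double quotes, as it would with a log_format that appends that field. The function name slow_upstream_requests is hypothetical.

```shell
#!/usr/bin/env bash
# Print "<upstream_response_time> <request>" for access-log lines on stdin
# whose upstream_response_time exceeds the threshold in seconds ($1).
# Lines without a numeric value (for example "-") are skipped; when Nginx
# logs several comma-separated times, only the first one is considered.
slow_upstream_requests() {
  local threshold=$1
  awk -v limit="$threshold" '
    match($0, /upstream_response_time: [0-9.]+/) {
      # 24 is the length of the label "upstream_response_time: "
      t = substr($0, RSTART + 24, RLENGTH - 24) + 0
      if (t > limit + 0) {
        split($0, q, "\"")   # q[2] is the quoted request line
        print t, q[2]
      }
    }
  '
}
```

For instance, slow_upstream_requests 5 < /var/log/nginx/access.log lists every request whose backend took longer than five seconds, which gives a concrete basis for choosing a timeout value.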
Scenario 5 – Unix socket permission problem
# Socket permissions
ls -la /var/run/php-fpm/
# Verify run users
ps aux | grep "php-fpm: master" | awk '{print $1}'
ps aux | grep "nginx: master" | awk '{print $1}'
# Nginx error log for permission
tail -n 20 /var/log/nginx/error.log | grep -i permission
# Fix socket config (set these in the php-fpm pool file, e.g. /etc/php-fpm.d/www.conf, then restart)
listen = /var/run/php-fpm/php-fpm.sock
listen.owner = nginx
listen.group = nginx
listen.mode = 0660
systemctl restart php-fpm
Scenario 6 – Docker / Kubernetes network issue
# Docker Compose status
docker-compose ps
docker-compose logs -f backend
docker-compose exec nginx sh -c "curl -v http://backend:8080/health"
docker network ls
docker network inspect <network_name>
# Kubernetes checks
kubectl get pods -n <namespace>
kubectl logs -n <namespace> <pod-name>
kubectl get endpoints -n <namespace>
kubectl exec -it -n <namespace> nginx-pod -- sh -c "curl -v http://backend-service:8080/health"
kubectl describe pod -n <namespace> <pod-name> | grep -E "Events|Conditions"
Fix container start order with health checks, or adjust the network driver.
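One common way to handle start order is to have the proxy's entrypoint wait until the backend answers its health check. The sketch below shows a generic retry helper for that purpose. The name wait_for and its argument order are made up for illustration, and it uses a fixed delay rather than backoff to keep the example short.

```shell
#!/usr/bin/env bash
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  local max_attempts=$1 delay=$2 attempt
  shift 2
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last failure.
    if (( attempt < max_attempts )); then
      sleep "$delay"
    fi
  done
  echo "not ready after $max_attempts attempt(s)" >&2
  return 1
}
```

Inside the Nginx container this might be called as wait_for 30 2 curl -fsS -o /dev/null http://backend:8080/health before starting Nginx, reusing the health URL from the checks above. In Kubernetes a readiness probe on the backend usually serves the same purpose without a wrapper script.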
Scenario 7 – Connection‑limit exhaustion under high load
# Nginx connections
ss -s
# Backend connection count
ss -ant | grep :8080 | wc -l
# System file‑descriptor limit
ulimit -n
# Increase limits temporarily
ulimit -n 65535
# Persist limits ( /etc/security/limits.conf )
cat >> /etc/security/limits.conf <<'EOF'
* soft nofile 65535
* hard nofile 65535
nginx soft nofile 65535
nginx hard nofile 65535
EOF
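# Restart Nginx (limits.conf applies to PAM login sessions; for a systemd-managed Nginx, set LimitNOFILE= in the unit or rely on worker_rlimit_nofile below)
systemctl restart nginx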
# Nginx worker settings (nginx.conf)
worker_processes auto;
worker_rlimit_nofile 65535;
events {
    worker_connections 65535;
    use epoll;
    multi_accept on;
}
# Backend thread pool (Spring Boot example; Spring Boot 2.3 and later use server.tomcat.threads.max instead of max-threads)
server:
  tomcat:
    max-threads: 800
    max-connections: 16384
    accept-count: 200
Prevention measures
Health‑check configuration
upstream backend {
    server backend1:8080 max_fails=3 fail_timeout=30s;
    server backend2:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
Monitoring & alerting (Prometheus + Grafana)
# Nginx stub_status endpoint
server {
    listen 8888;
    server_name localhost;
    stub_status on;
    allow 127.0.0.1;
    deny all;
}
# Prometheus scrape config
# stub_status output is not in Prometheus format; the target must be an exporter that translates it.
scrape_configs:
  - job_name: 'nginx'
    static_configs:
      - targets: ['localhost:8888']
# Alert rule for high 502 rate
# Assumes an exporter that publishes nginx_http_requests_total with a status label; stub_status alone has no per-status counters.
groups:
  - name: nginx_502_alert
    rules:
      - alert: High502Rate
        expr: rate(nginx_http_requests_total{status="502"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx 502 error rate too high"
          description: "502 responses exceed 0.1 per second, current value: {{ $value }}"
Enhanced logging
log_format main '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" upstream_addr: $upstream_addr upstream_status: $upstream_status request_time: $request_time upstream_response_time: $upstream_response_time';
access_log /var/log/nginx/access.log main;
Verification after fix
# Functional curl checks
for path in "/" "/api/users" "/api/products" "/health"; do
    status=$(curl -s -o /dev/null -w "%{http_code}" "http://127.0.0.1$path")
    if [[ "$status" == "200" || "$status" == "301" || "$status" == "302" ]]; then
        echo "[PASS] $path -> $status"
    else
        echo "[FAIL] $path -> $status"
    fi
done
# Simulate concurrency
for i in {1..100}; do curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1/api/users & done
wait
# Ensure no new 502 in access log (compare with the count taken before the fix)
grep " 502 " /var/log/nginx/access.log | wc -l
Summary
The workflow starts with Nginx error logs, then verifies backend liveness and resources, followed by network checks and timeout configuration review. Root causes fall into five categories, each with specific commands and remediation steps. Implementing proactive health checks, comprehensive monitoring, detailed logging, and regular disaster‑recovery drills prevents recurrence and shortens MTTR for future 502 incidents.
MaGe Linux Operations
