One Nginx Config Change Triggered a P0 Outage on Promotion Day – 5 Hard‑Earned Lessons

A missing upstream keepalive setting in Nginx caused a P0 outage during a sales promotion. This article walks through four common configuration pitfalls (logging, WebSocket timeouts, Docker worker counts, and reload behavior) and five real incidents, including an SSL certificate expiry, with concrete configuration fixes and preventive practices.

dbaplus Community

1. Missing response time in logs

Nginx's default combined log format records no timing fields, so slow requests cannot be traced from the access log. Define a custom log format that includes rt ($request_time), uct ($upstream_connect_time), uht ($upstream_header_time) and urt ($upstream_response_time). Example:

log_format detail '$remote_addr [$time_local] "$request" $status '
               'rt=$request_time '
               'uct=$upstream_connect_time '
               'uht=$upstream_header_time '
               'urt=$upstream_response_time "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log detail;

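Understanding how these fields relate lets you pinpoint where latency originates. rt covers the whole request as seen by Nginx, from the first byte read from the client to the last byte sent back; uct is the time to establish the upstream connection; uht is the time until the upstream response header arrives; urt is the time until the full upstream response has been received. A high uct points to the network or a saturated connection queue on the backend, a large gap between uct and uht points to slow processing in the upstream service, and an rt far above urt points to a slow client or downstream network. Additionally, enable daily log rotation with logrotate and have its postrotate step send the USR1 signal (not reload) to the Nginx master, which reopens the log files without disrupting keepalive connections.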

2. WebSocket connections drop after minutes

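The most common cause is the default proxy_read_timeout of 60 seconds: if the backend sends nothing for that long, Nginx closes the connection, which kills idle long‑lived connections. Another frequent mistake is omitting the Upgrade and Connection headers during protocol upgrade; they are hop‑by‑hop headers, so Nginx does not pass them to the backend unless they are set explicitly.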

# map to set Connection header based on Upgrade presence
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;  # ordinary HTTP requests
}
location /ws/ {
    proxy_pass http://websocket_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
}

When WebSocket traffic is load‑balanced, use ip_hash only as a temporary fix; the proper solution is to make the backend stateless and store connection context in Redis, allowing true horizontal scaling.
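As a stopgap, session affinity with ip_hash could look like the sketch below. The upstream name websocket_backend is taken from the location block above; the two server addresses are placeholders, not values from the original incident.

```nginx
# Temporary affinity: requests from the same client address reach the same backend.
upstream websocket_backend {
    ip_hash;                   # key is the client IP (first three octets for IPv4)
    server 10.0.0.11:8080;     # placeholder address
    server 10.0.0.12:8080;     # placeholder address
}
```

Because the hash key is the client address, all users behind one NAT gateway land on the same backend, which is one more reason to treat this as temporary.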

3. Docker Nginx memory bloat

Inside a container, worker_processes auto counts the host's CPUs rather than the container's CPU quota, so a container limited to 2 CPUs on a 64‑core host can, for example, spawn 64 workers and exhaust memory. The fix is to set the worker count based on the container's allocated CPUs:

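#!/bin/sh
# entrypoint.sh
# nproc honors cpusets but usually not CFS quotas (docker --cpus), so check the cgroup v2 quota too.
CORES=$(nproc)
if [ -r /sys/fs/cgroup/cpu.max ] && read -r QUOTA PERIOD < /sys/fs/cgroup/cpu.max && [ "$QUOTA" != "max" ]; then
    LIMIT=$(( (QUOTA + PERIOD - 1) / PERIOD ))   # round the quota up to whole CPUs
    [ "$LIMIT" -lt "$CORES" ] && CORES=$LIMIT
fi
sed -i "s/worker_processes[[:space:]]*auto/worker_processes $CORES/" /etc/nginx/nginx.conf
exec nginx -g "daemon off;"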

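Starting Nginx with exec replaces the shell, so Nginx runs as PID 1 and receives the container's stop signal directly instead of depending on a shell that does not forward it. Note that Nginx treats SIGTERM as a fast shutdown and SIGQUIT as a graceful one, so set the container's stop signal to SIGQUIT (for example with STOPSIGNAL SIGQUIT in the Dockerfile) to let in‑flight requests finish.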

4. Reload is not always lossless

For short‑lived HTTP requests, nginx -s reload appears seamless, but with long‑lived connections (WebSocket, large downloads) the old workers keep running until their existing connections close. Without worker_shutdown_timeout they wait indefinitely, so old workers pile up across reloads; with it set, any connection still open when the timeout elapses is closed, cutting active streams.

# Set a generous shutdown timeout for services with long connections
worker_shutdown_timeout 3600s;

Hot upgrade steps must follow a precise signal sequence:

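# 0. Record the PID of the running master
OLD_PID=$(cat /var/run/nginx.pid)
# 1. Start a new master from the new binary; the old pid file is renamed to nginx.pid.oldbin
kill -USR2 "$OLD_PID"
# 2. Gracefully shut down the old workers; the old master stays alive for rollback
kill -WINCH "$OLD_PID"
# 3. After confirming the new master works, quit the old master
kill -QUIT "$(cat /var/run/nginx.pid.oldbin)"
# Rollback (run instead of step 3, while the old master is still alive)
kill -HUP "$(cat /var/run/nginx.pid.oldbin)"   # old master restarts its workers
kill -QUIT "$(cat /var/run/nginx.pid)"         # new master and its workers exit gracefully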

Never use kill -9 on old processes during a hot upgrade, as it aborts all existing connections.

5. Five real incidents and lessons

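Incident 1: An upstream entry pointed to a decommissioned IP, causing a site‑wide 502 for five minutes. Best practice: Store upstreams as DNS names, manage config changes via Git, and tighten passive health‑check parameters (e.g., max_fails=2 fail_timeout=10s). Keep in mind that Nginx resolves a hostname in an upstream block when the configuration is loaded, so a DNS change takes effect only after a reload unless the server entry uses the resolve parameter.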
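A minimal sketch of such an upstream follows. It assumes an Nginx build that supports the resolve parameter on upstream servers (historically a commercial feature, available in open source only in recent releases); drop resolve, the zone line, and the resolver line if yours does not. The names app_backend, app-1.internal, app-2.internal and the resolver address are hypothetical.

```nginx
# Goes in the http block. All names and addresses are placeholders.
resolver 10.0.0.2 valid=30s;      # DNS server Nginx queries at runtime

upstream app_backend {
    zone app_backend 64k;         # shared memory zone, required by "resolve"
    # After 2 failed attempts within 10s, skip this server for 10s.
    server app-1.internal:8080 max_fails=2 fail_timeout=10s resolve;
    server app-2.internal:8080 max_fails=2 fail_timeout=10s resolve;
}
```

Running nginx -t before every reload, ideally in the Git pipeline that ships the config, catches syntax errors and unresolvable hostnames, though not an address that resolves but no longer serves traffic.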

Incident 2: An expired SSL certificate led to a 40‑minute outage. Best practice: Automate renewal with Let’s Encrypt or similar, schedule daily certbot renew, and monitor expiry with a 30‑day pre‑alert.

Incident 3: A misplaced if in a location block routed all POST requests to a backend lacking the endpoint, resulting in 404s. Best practice: Avoid if inside location; use map to select backends based on request method.
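Method-based routing with map can be sketched as follows. The upstream groups read_backend and write_backend, their addresses, and the /api/ prefix are hypothetical; the original incident's actual routing rules are not given.

```nginx
# Goes in the http block. Pick an upstream group from the request method.
map $request_method $api_backend {
    default read_backend;         # GET, HEAD, and any method not listed
    POST    write_backend;
    PUT     write_backend;
    DELETE  write_backend;
}

upstream read_backend  { server 10.0.0.21:8080; }   # placeholder address
upstream write_backend { server 10.0.0.22:8080; }   # placeholder address

server {
    listen 80;
    location /api/ {
        # The variable's value is looked up among the defined upstream groups.
        proxy_pass http://$api_backend;
    }
}
```

The map is evaluated only when the variable is used, and it keeps the routing table in one declarative place instead of scattering conditions across location blocks.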

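Incident 4: Enabling gzip for application/json produced garbled JSON in the frontend for clients that had not advertised gzip support in Accept‑Encoding. Best practice: Nginx itself compresses only when the request carries Accept-Encoding: gzip, so turn on gzip_vary to make intermediate caches keep compressed and uncompressed variants apart, and use gzip_proxied to restrict compression for requests that arrive through another proxy (those with a Via header), e.g.:

gzip_proxied expired no-cache no-store private no_last_modified no_etag auth;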
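A fuller compression block might look like the sketch below. The MIME type list and the 1024-byte threshold are assumptions chosen for illustration, not values from the incident.

```nginx
# Goes in the http, server, or location block.
gzip on;
gzip_types application/json text/css application/javascript;  # text/html is always included
gzip_min_length 1024;     # skip small responses (assumed threshold)
gzip_vary on;             # add "Vary: Accept-Encoding" so caches separate the variants
gzip_proxied expired no-cache no-store private no_last_modified no_etag auth;
```

With gzip_vary on, a cache or CDN in front of Nginx is told to store compressed and uncompressed responses under different keys, which avoids handing a gzip body to a client that never asked for one.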

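Incident 5: The opening promotion‑day outage was caused by missing keepalive settings toward the upstream, so every proxied request opened a new TCP connection and the backend was overwhelmed. Best practice: Enable keepalive in the upstream block (e.g., keepalive 64;, keepalive_requests 1000;, keepalive_timeout 60s;), set proxy_http_version 1.1 and clear the Connection header in the proxying location so that the pooled connections are actually reused, and validate connection reuse in load‑testing scripts.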
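Put together, an upstream keepalive setup can be sketched like this. The group name backend_pool and the server address are placeholders; the three keepalive values are the ones quoted above.

```nginx
upstream backend_pool {
    server 10.0.0.31:8080;        # placeholder address
    keepalive 64;                 # idle upstream connections cached per worker process
    keepalive_requests 1000;      # max requests served over one upstream connection
    keepalive_timeout 60s;        # close an idle upstream connection after 60s
}

server {
    listen 80;
    location / {
        proxy_pass http://backend_pool;
        proxy_http_version 1.1;            # upstream keepalive needs HTTP/1.1
        proxy_set_header Connection "";    # do not send "Connection: close" upstream
    }
}
```

One way to check reuse during a load test is to watch the uct field in the access log, which typically stays at 0.000 for requests served over an already open connection, and to watch the count of TIME_WAIT sockets toward the backend.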

All these pitfalls appear trivial in hindsight but cause massive disruption when they occur. Proactively review configurations, validate changes with realistic load tests, and include these checks in pre‑promotion checklists to avoid costly emergencies.
