Debugging Rare Core Dumps and Memory Leaks in High‑Concurrency Nginx with OpenSSL
This article describes a real-world investigation of extremely rare core-dump bugs and a memory leak in a heavily modified Nginx + OpenSSL stack under high concurrency. It details the debugging workflow: custom stress-test tools, gdb, valgrind, AddressSanitizer, perf, flame graphs, and the performance-tuning lessons learned along the way.
Project Background
We made deep modifications to the Nginx event framework and the OpenSSL stack to improve HTTPS full-handshake performance, which originally reached only ~400 qps per core for ECDHE_RSA.
Core Dump Debugging
Core dumps occurred with a probability of about one in a hundred million requests at sustained rates above 10k qps, often clustering at specific times of day. Traditional gdb sessions and debug logs were ineffective because the asynchronous event model splits a logical request flow across many callbacks.
Defensive NULL-pointer checks prevented individual crashes but only masked the underlying issue, so core dumps kept resurfacing in different locations.
Reproducing the Bug
To accelerate debugging, we needed a stable environment that could reliably trigger the core dumps. Observations suggested a correlation with weak network conditions during night-time maintenance windows.
Constructing Weak Network Conditions
Rather than shaping traffic with tc directly, we generated abnormal requests that simulate network instability, focusing on the TCP and SSL handshake phases.
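For context, the direct route would have been kernel traffic shaping with tc/netem; a minimal sketch of that approach (the interface name and the delay/loss figures here are assumptions for illustration, and the commands need root):

```shell
# Add 200ms +/- 50ms of latency and 5% packet loss on eth0 (assumed interface).
tc qdisc add dev eth0 root netem delay 200ms 50ms loss 5%
tc qdisc show dev eth0      # verify the qdisc is in place
tc qdisc del dev eth0 root  # remove the shaping afterwards
```

Shaping like this degrades every flow on the box indiscriminately; crafted abnormal requests let us target only the handshake phases we suspected.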
WRK Stress‑Test Tool
We selected wrk for its multi-threaded, event-driven architecture, which can generate very high request rates from a single machine. A typical invocation against the test server:

wrk -t500 -c2000 -d30s https://127.0.0.1:8443/index.html
Distributed Automated Test System
A controller machine orchestrates multiple client machines to achieve the required aggregate QPS, supporting configurable protocols, ports, URLs, SSL versions, and cipher suites.
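The controller-side fan-out can be sketched as a loop over client hosts; everything concrete below (host list, URL, per-client wrk parameters) is illustrative rather than from the article, and the real system also pushed protocol, port, SSL-version, and cipher-suite settings per run. The sketch is a dry run by default so it only prints the commands it would launch:

```shell
# Hypothetical controller fan-out: launch wrk on each client in parallel.
RUN=${RUN:-echo}   # dry-run by default; set RUN= (empty) to really execute
CLIENTS="10.0.0.11 10.0.0.12 10.0.0.13"
URL="https://10.0.0.1:8443/index.html"
for h in $CLIENTS; do
  $RUN ssh "$h" "wrk -t100 -c2000 -d60s $URL" &
done
wait   # block until every client run finishes
```

Aggregate QPS then scales roughly with the number of client machines, which is how the harness reached the required load.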
Abnormal Request Construction
Randomly close TCP sockets with a 10% probability.
Randomly abort SSL handshakes at the ClientHello or ClientKeyExchange stages with a 10% probability.
Send HTTPS requests encrypted with an incorrect public key (10% probability) to force decryption failures.
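The second injection above can be approximated with stock tools: start a TLS handshake and kill the client a few milliseconds in, so the server sees a peer that vanishes between ClientHello and ClientKeyExchange. The 10% gate mirrors the article; the timeout/openssl combination is our stand-in for the purpose-built client (which also handled the socket-close and wrong-public-key cases):

```shell
# Sketch: with ~10% probability per iteration, open a TLS connection to the
# test server and abort it mid-handshake by killing the client after 20ms.
for i in $(seq 1 100); do
  r=$(od -An -N1 -tu1 /dev/urandom)   # portable random byte (no bash $RANDOM)
  if [ $((r % 10)) -eq 0 ]; then
    timeout 0.02 openssl s_client -connect 127.0.0.1:8443 </dev/null >/dev/null 2>&1
  fi
done
echo "injection pass complete"
```

Driving such aborted handshakes at high rates alongside normal wrk traffic is what exercised the server's error paths.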
Core Bug Fix Summary
With the reproducible test harness, core dumps could be triggered within seconds, allowing rapid iteration of code changes, additional logging, and gdb analysis until the root cause was identified and fixed: misuse of a non-reusable connection structure under extreme concurrency.
Memory Leak
High‑concurrency tests also revealed a memory leak of ~500 MiB per hour.
Valgrind Limitations
Valgrind provides comprehensive memory error detection but reduces performance by 10‑50×, making it unsuitable for reproducing leaks that only appear under heavy load.
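For completeness, the usual way to point valgrind at nginx is to run a single foreground worker so the leak report maps to one process; the binary path here is an assumption:

```shell
# One non-daemonized worker under memcheck; expect the 10-50x slowdown
# noted above, which is why this failed to reproduce the load-dependent leak.
valgrind --leak-check=full --show-leak-kinds=all --log-file=vg-%p.log \
    /usr/local/nginx/sbin/nginx -g 'daemon off; master_process off;'
```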
AddressSanitizer Advantages
ASan offers fast detection with only ~2× slowdown. By recompiling Nginx with clang and -fsanitize=address, we isolated the leak to the OpenSSL error-handling logic.
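The rebuild can be sketched with nginx's standard configure options (--with-cc, --with-cc-opt, and --with-ld-opt are real configure flags; the extra -fno-omit-frame-pointer, -O1, and -g for readable stacks are our assumptions beyond the article's -fsanitize=address):

```shell
# Rebuild nginx with AddressSanitizer/LeakSanitizer under clang.
./configure --with-cc=clang \
            --with-cc-opt='-fsanitize=address -fno-omit-frame-pointer -O1 -g' \
            --with-ld-opt='-fsanitize=address'
make -j"$(nproc)"

# Leak reports print when the process exits, so run a foreground worker,
# apply load, and then stop it cleanly to flush the report.
ASAN_OPTIONS=detect_leaks=1 ./objs/nginx -g 'daemon off; master_process off;'
```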
Performance Hotspot Analysis
After fixing crashes and leaks, we focused on profiling to locate remaining bottlenecks using tools such as perf, oprofile, gprof, and systemtap.
Flame Graph
Generating a flame graph with:

perf record -F 99 -p PID -g -- sleep 10
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > out.svg

revealed that rsaz_1024_mul_avx2 and rsaz_1024_sqr_avx2 consumed ~75% of samples, guiding further optimization.
Mindset
The three‑week debugging effort was stressful but valuable; it reinforced the importance of treating hard bugs as learning opportunities, leveraging off‑hours for fresh thinking, and openly discussing problems with teammates.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.