Root Cause Analysis and Resolution of OutOfMemoryError in a Java Backend Service
This article details the investigation of a Java backend service that suffered repeated OutOfMemoryError failures caused by an unbounded userId list in a count query. It covers the monitoring findings, heap dump analysis, and practical mitigations including request limiting and JVM tuning.
Phenomenon
An online service exhibited extremely slow response times. Monitoring showed a large gap between when a request arrived and when processing began, even though the actual processing time was short, and many requests during the period followed the same pattern.
Root Cause Analysis
Tracing through the monitoring chain showed that requests reached the service but waited about 3 seconds before being processed. CPU usage spiked and frequent long GC cycles occurred, eventually filling the heap and causing the pod to be killed. The logs contained an OutOfMemoryError:
system error: org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1055)
...
A large batch job running at the same time was suspected, but its code showed no obvious issue. The JVM parameters -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=/logs/oom_dump/xxx.log -XX:HeapDumpPath=/logs/oom_dump/xxx.hprof were already set, yet the container killed the pod before the dump could be retained.
After the pod restarted, a 4.8 GB heap dump was generated. jvisualvm could not fully load the dump, but it did identify the thread that triggered the OOM. The dump revealed a massive count SQL statement that had allocated a 1.07 GB byte array and a 1.03 GB char array.
The offending userId list originated from an external system and was 64 MB in size. Investigation uncovered a bug in that upstream system which caused it to send the IDs of all users in the query.
Solution
The upstream system now enforces a limit on the number of userId values it sends, and the same restriction is applied defensively on the receiving side, turning a complex troubleshooting effort into a simple validation change.
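The defensive check on the receiving side can be sketched as a simple size guard that rejects oversized lists before they ever reach the SQL layer. This is a minimal illustration: the class name `UserIdGuard` and the cap of 1000 IDs are assumptions, not values from the incident.

```java
import java.util.List;

public class UserIdGuard {
    // Hypothetical cap; tune to what the downstream SQL can safely handle.
    private static final int MAX_USER_IDS = 1000;

    /** Rejects empty or oversized userId lists before query construction. */
    public static List<Long> checkSize(List<Long> userIds) {
        if (userIds == null || userIds.isEmpty()) {
            throw new IllegalArgumentException("userIds must not be empty");
        }
        if (userIds.size() > MAX_USER_IDS) {
            throw new IllegalArgumentException(
                "userIds size " + userIds.size() + " exceeds limit " + MAX_USER_IDS);
        }
        return userIds;
    }
}
```

Failing fast here turns a heap-filling query into an immediate, diagnosable 4xx error for the caller.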
Wait, There’s Another One
A similar incident occurred later: multiple alerts and machine crashes caused by memory exhaustion. Heap dumps (up to 12 GB) analyzed with Eclipse MAT revealed huge String objects. The root cause was a full-table query on TiDB with no WHERE clause, which loaded the entire user table into memory.
Conclusion
When facing OOM issues without obvious code defects, the following JVM options are valuable, especially in containerized environments:
-XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=/logs/oom_dump/xxx.log -XX:HeapDumpPath=/logs/oom_dump/xxx.hprof
Additionally, -XX:+ExitOnOutOfMemoryError forces the JVM to terminate promptly on an OutOfMemoryError, allowing Kubernetes to restart a fresh instance rather than leaving a degraded one running.
For SQL statements lacking a WHERE clause, enforce a sensible LIMIT to prevent full‑table scans from overwhelming the system.
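One way to apply that rule is a guard that appends a LIMIT to SELECT statements that have neither a WHERE nor a LIMIT clause. This is a crude string-matching sketch for illustration; the class name `SqlLimitGuard` and the cap of 500 rows are assumptions, and a production system would typically hook a query interceptor (e.g. a MyBatis Interceptor) rather than inspect raw SQL strings.

```java
public class SqlLimitGuard {
    // Hypothetical default cap on unfiltered queries.
    private static final int DEFAULT_LIMIT = 500;

    /**
     * Appends a LIMIT to SELECT statements that lack both a WHERE
     * and a LIMIT clause; returns other statements unchanged.
     */
    public static String enforceLimit(String sql) {
        String upper = sql.toUpperCase();
        if (upper.startsWith("SELECT")
                && !upper.contains(" WHERE ")
                && !upper.contains(" LIMIT ")) {
            return sql + " LIMIT " + DEFAULT_LIMIT;
        }
        return sql;
    }
}
```

Even this blunt check would have bounded the full-table query in the second incident to a few hundred rows instead of the entire user table.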
IT Services Circle