How a Hidden Java Memory Leak Crashed Our Service and the Steps We Took to Fix It
During a weekend on‑call shift, our Java detection service repeatedly timed out. What first looked like network packet loss turned out to be a memory leak: a Map that stored every request result grew until constant full GC cycles drove CPU usage to extremes. We diagnosed and fixed it with JVM tools such as jstat, jstack, and MAT.
Origin
During a recent on‑call rotation we monitored our detection service, handling alarm emails, bug investigations, and operational issues. Network problems—frequent switch or router failures—caused intermittent timeouts that the detection service captured, prompting us to consider bypassing the service’s keep‑alive mechanism.
Problem
Network Issue?
At around 7 pm we started receiving alarm emails reporting timeouts on several interfaces. The stack trace showed the thread stuck in java.io.BufferedReader.readLine: the HTTP request had been sent and the server had responded, but the response packet was lost in the network layer.
Our HTTP DNS timeout was set to 1 s, the connect timeout to 2 s, and the read timeout to 3 s. The logs confirmed the server had processed the request correctly, so we attributed the timeouts to network packet loss.
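The timeout settings quoted above can be sketched with `HttpURLConnection`; the class name and endpoint here are illustrative, not taken from the actual service, and the DNS timeout is resolver-level rather than something this API exposes.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutConfig {
    // Illustrative sketch of the timeout values from the article.
    public static HttpURLConnection configure(String endpoint) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(2_000); // connect timeout: 2 s
        conn.setReadTimeout(3_000);    // read timeout: 3 s, the limit the stuck readLine exceeded
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = configure("http://example.com/detect");
        System.out.println(conn.getConnectTimeout() + "/" + conn.getReadTimeout());
    }
}
```

No connection is actually opened until `connect()` is called, so the sketch only shows where the limits are set.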
One interface, which uploads a 4 MB file and returns a 2 MB response, timed out more frequently, suggesting larger data transfers increased the chance of loss.
Issue Explosion
At about 8 pm the alarms surged, affecting almost all interfaces, especially the high‑I/O one. Monitoring metrics looked normal and manual tests succeeded, but an attempt to stop the detection task itself hung, pointing to a deeper problem.
Solution
Memory Leak
Logging into the detection server revealed an abnormally high CPU usage of 900%.
The Java process normally stays between 100% and 200%; a spike like this points to either an infinite loop or excessive garbage collection.
Running jstat -gc <pid> <interval> showed a full GC occurring roughly once per second.
We captured a thread dump with jstack <pid> > jstack.log and a heap dump with jmap -dump:format=b,file=heap.hprof <pid>, then restarted the service, which stopped the alarm emails.
jstat
jstat is a powerful JVM monitoring tool. Common options include:
-class    Class loading information
-compile  JIT compiler statistics
-gc       Garbage collection statistics
-gc<xxx>  Detailed GC info for a specific region (e.g., -gcold)
It is very helpful for locating JVM memory issues.
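When shell access is limited, the same counters that jstat -gc reports can also be read in-process through the GC MXBeans; a minimal sketch (not part of the original toolchain):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Each bean corresponds to one collector (e.g., young-gen and old-gen).
        // A collection count climbing every second, as ours was, signals trouble.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```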
Investigation
Analyzing the Stack
We checked the thread count and states:
<code>grep 'java.lang.Thread.State' jstack.log | wc -l
464
</code>464 threads were active in total, a normal count, so a thread explosion was ruled out.
<code>grep -A 1 'java.lang.Thread.State' jstack.log | grep -v 'java.lang.Thread.State' | sort | uniq -c | sort -n
10 at java.lang.Class.forName0(Native Method)
10 at java.lang.Object.wait(Native Method)
16 at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
44 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
344 at sun.misc.Unsafe.park(Native Method)
</code>No abnormal thread states were found, so we moved on to the heap analysis.
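As an aside, the per-state tally that the grep pipeline produces can also be computed inside the JVM with Thread.getAllStackTraces(); a sketch, not from the original toolchain:

```java
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCount {
    public static Map<Thread.State, Integer> count() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // Tally every live thread's state, mirroring the sort | uniq -c pipeline.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        count().forEach((state, n) -> System.out.println(n + " " + state));
    }
}
```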
Downloading the Heap Dump
The heap dump was large (4 GB) and needed compression before transfer; gzip -6 provided a good balance between speed and compression ratio.
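The level trade-off can be seen programmatically with java.util.zip (level 6 is also gzip's default); the sample data below stands in for a heap dump, which is typically highly repetitive and compresses well:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressLevels {
    // Compress the same buffer at a given level and return the output size.
    public static int compressedSize(byte[] data, int level) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, new Deflater(level))) {
            dos.write(data);
        }
        return out.size();
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "detectionResult,".repeat(10_000).getBytes();
        for (int level : new int[]{1, 6, 9}) {
            System.out.println("level " + level + ": " + compressedSize(sample, level) + " bytes");
        }
    }
}
```

Higher levels spend more CPU for smaller output; level 6 sits in the middle, which is why it worked well for a one-off 4 GB transfer.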
Analyzing the Heap with MAT
We opened the .hprof file in Eclipse Memory Analyzer (MAT) and ran the “Leak Suspects” report. The dominant memory consumer was a single object.
The culprit was a Bean containing a Map that stored every detection result in an ArrayList. Because the Bean was never reclaimed and the Map had no cleanup routine, the collection grew over days until it exhausted the heap; the resulting GC pressure is what made threads appear stuck in readLine.
Code Analysis
Searching the codebase revealed the offending Bean and its Map field. The Map accumulated results without ever being cleared, leading to the memory leak.
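The actual PR isn't reproduced here, but a common remedy for this pattern is to bound the map so old entries are evicted instead of accumulating; a sketch using LinkedHashMap, with the class name and cap purely illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public ResultCache(int maxEntries) {
        super(16, 0.75f, true); // access-order, so the least recently used entry is eldest
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict instead of growing without bound
    }

    public static void main(String[] args) {
        Map<String, String> cache = new ResultCache<>(2);
        cache.put("a", "result-1");
        cache.put("b", "result-2");
        cache.put("c", "result-3"); // evicts "a"
        System.out.println(cache.keySet()); // [b, c]
    }
}
```

An explicit cleanup job or a caching library with TTL support would achieve the same goal; the key point is that nothing should write into a long-lived Map without a matching eviction path.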
We submitted a PR to fix the issue, and the problem was resolved.
Conclusion
Initially, the alarm emails showed stack traces like:
<code>groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:166)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)
</code>These point to errors inside the application (here, JSON parsing) rather than the network; recognizing such patterns early can help pinpoint problems before they cascade.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.