
How a Hidden Java Memory Leak Crashed Our Service and the Steps We Took to Fix It

During a weekend on‑call shift, our Java detection service repeatedly timed out on requests. What first looked like network packet loss turned out to be a memory leak in a Map that stored request results, driving CPU usage to 900% and triggering back‑to‑back full GC cycles. We diagnosed and fixed it with standard JVM tools: jstat, jstack, jmap, and MAT.

Efficient Ops

Origin

During a recent on‑call rotation we monitored our detection service, handling alarm emails, bug investigations, and operational issues. Network problems were common: frequent switch and router failures caused intermittent timeouts that the detection service captured, to the point that we considered bypassing its keep‑alive mechanism.

Problem

Network Issue?

At around 7 pm we started receiving alarm emails indicating timeouts on several interfaces. The stack trace showed the thread stuck in <code>java.io.BufferedReader.readLine</code>: the HTTP request had been sent and the server had responded, but the response was lost at the network layer.

Our HTTP DNS timeout was set to 1 s, the connect timeout to 2 s, and the read timeout to 3 s. The server logs confirmed each request had been processed correctly, so the failures pointed to packet loss in the network.
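These timeout values map onto standard JDK settings. Here is a minimal sketch using <code>HttpURLConnection</code>; the class name and URL are illustrative, not from the actual service, and the DNS timeout is a resolver‑level setting that <code>HttpURLConnection</code> does not expose:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of the timeout configuration described above.
// The class name and URL are illustrative, not from the actual service.
public class TimeoutConfig {
    public static HttpURLConnection configure(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2_000); // connect timeout: 2 s
        conn.setReadTimeout(3_000);    // read timeout: 3 s (covers the readLine call)
        // Note: the 1 s DNS timeout is configured at the resolver/client layer,
        // not on HttpURLConnection itself.
        return conn;
    }
}
```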

One interface, which uploads a 4 MB file and returns a 2 MB response, timed out more frequently, suggesting larger data transfers increased the chance of loss.

Issue Explosion

At about 8 pm the alarms surged, affecting almost all interfaces, especially the high‑I/O one. Monitoring showed normal metrics and manual tests succeeded, yet attempting to stop the detection task itself hung, indicating a deeper problem.

Solution

Memory Leak

Logging into the detection server revealed abnormally high CPU usage: 900%.

A Java process like this one normally stays between 100% and 200%; a spike that large points to either an infinite loop or excessive garbage collection.

Running <code>jstat -gc <pid> <interval></code> showed a full GC occurring once per second.

We captured a thread dump with <code>jstack <pid> > jstack.log</code> and a heap dump with <code>jmap -dump:format=b,file=heap.hprof <pid></code>, then restarted the service, which stopped the alarm emails.

jstat

jstat is a powerful JVM monitoring tool. Common options include:

-class View class loading information

-compile Compilation statistics

-gc Garbage collection information

-gc<xxx> Detailed GC info for specific regions (e.g., -gcold)

It is very helpful for locating JVM memory issues.
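Beyond running jstat from a shell, the same GC counters can be read from inside the process through the standard <code>GarbageCollectorMXBean</code> API. A small sketch (the class name is ours, not from the codebase):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Reads cumulative GC counts from the JVM's management beans. A total that
// climbs by roughly one per second mirrors the full-GC-per-second pattern
// that jstat -gc revealed on the detection server.
public class GcStats {
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the count is unavailable
            if (count > 0) total += count;
        }
        return total;
    }
}
```

Sampling this counter on a timer is a cheap way to alert on GC storms without shelling out to jstat.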

Investigation

Analyzing the Stack

We checked the thread count and states:

<code>grep 'java.lang.Thread.State' jstack.log | wc -l
464
</code>

464 threads were active, which is a normal count for this service.

<code>grep -A 1 'java.lang.Thread.State' jstack.log | grep -v 'java.lang.Thread.State' | sort | uniq -c | sort -n
     10     at java.lang.Class.forName0(Native Method)
     10     at java.lang.Object.wait(Native Method)
     16     at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
     44     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    344     at sun.misc.Unsafe.park(Native Method)
</code>

No abnormal thread states were found, so we moved on to the heap analysis.

Downloading the Heap Dump

The heap dump was large (4 GB) and needed compression before transfer; <code>gzip -6</code> provided a good balance between speed and compression ratio.

Analyzing the Heap with MAT

We opened the <code>.hprof</code> file in Eclipse Memory Analyzer (MAT) and ran the “Leak Suspects” report. The dominant memory consumer was a single object.

The culprit was a Bean containing a <code>Map</code> that stored every detection result in an <code>ArrayList</code>. Because the Bean was never reclaimed and the Map had no cleanup routine, the collection grew over days until it exhausted the heap. The resulting constant full GCs stalled the service's threads, which is why even the blocked <code>readLine</code> calls never recovered.

Code Analysis

Searching the codebase revealed the offending Bean and its Map field. The Map accumulated results without ever being cleared, leading to the memory leak.
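The pattern, and one common way to fix it, can be sketched as follows. All names here are hypothetical, not the actual codebase; the fix bounds the map using <code>LinkedHashMap.removeEldestEntry</code>, a standard technique for stopping unbounded growth:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction of the leaky pattern: results accumulate in a
// Map that is never cleared. The fix caps the map so the oldest entries are
// evicted once a size threshold is crossed.
public class ResultStore {
    private static final int MAX_ENTRIES = 1000; // eviction threshold (illustrative)

    private final Map<String, List<String>> results =
        new LinkedHashMap<String, List<String>>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > MAX_ENTRIES; // evict the oldest entry once over the cap
            }
        };

    public void record(String taskId, String result) {
        List<String> list = results.get(taskId);
        if (list == null) {
            list = new ArrayList<>();
            results.put(taskId, list); // put() triggers the eviction check above
        }
        list.add(result);
    }

    public int size() {
        return results.size();
    }
}
```

A TTL‑based cache (for example Guava's <code>CacheBuilder</code>) or simply clearing the map at the end of each detection cycle would also work; the right choice depends on how long results must be retained.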

We submitted a PR to fix the issue, and the problem was resolved.

Conclusion

Initially, the alarm emails showed stack traces like:

<code>groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:166)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)
</code>

These frames point to errors inside the process rather than on the network; recognizing such patterns early can help pinpoint problems before they cascade.

Tags: Java, Backend Development, memory-leak, gc, JVM monitoring
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
