How to Diagnose and Fix JVM GC Pauses in High‑Concurrency Microservices
This article walks through a real‑world production case, detailing how to systematically detect, analyze, and resolve severe JVM garbage‑collection pauses in a high‑concurrency Spring Boot microservice, covering resource analysis, JVM flag tuning, G1GC migration, JMX listeners, and GC‑log investigation.
Introduction
This article walks through a real‑world production case to systematically diagnose and resolve JVM garbage‑collection (GC) performance problems in a high‑concurrency microservice built with Spring Boot.
System Background
The service runs as a microservice with the following stack:
Application framework: Spring Boot
Metrics collection: Micrometer
Monitoring system: Datadog
Micrometer supports many back‑ends such as AppOptics, Atlas, Dynatrace, Elastic, Ganglia, Graphite, Humio, Influx, Instana, JMX, KairosDB, New Relic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront, etc.
Problem Symptoms
Problem Description
Monitoring revealed severe GC pauses on one node:
Maximum GC pause time frequently > 400 ms
Peak pause reached 546 ms on 2020‑02‑04 09:20:00
Business Impact
Service timeout: the call timeout is 1 s, so long GC pauses put requests at risk of timing out
Performance requirement: max pause < 200 ms, average pause < 100 ms
Business impact: severe effect on customer trading strategies
Investigation Process
Step 1 – System Resource Analysis
CPU Load
CPU usage was examined; the monitoring chart shows:
Observed values: system load 4.92, CPU utilization ~7 %.
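These readings can be cross-checked from inside the JVM itself. A minimal sketch (not part of the original investigation) using the standard <code>OperatingSystemMXBean</code>; the printed values will of course differ per host:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    // 1-minute load average as seen by the OS (-1.0 if unavailable, e.g. on Windows)
    static double loadAverage() {
        return ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
    }

    // CPUs visible to the JVM -- this number becomes important later in the investigation
    static int visibleCpus() {
        return ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("load average: " + loadAverage());
        System.out.println("visible CPUs: " + visibleCpus());
    }
}
```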
GC Memory Usage
Memory usage around 09:25 shows a sharp drop in old_gen, indicating a Full GC; however, around 09:20 memory rose gradually with no Full GC, meaning the long pause was not caused by a Full GC.
Step 2 – JVM Configuration Analysis
Startup Parameters
<code>-Xmx4g -Xms4g</code>
JDK version: 8
GC: default ParallelGC
Heap size: 4 GB (initial and max)
Initial Hypothesis
ParallelGC may be the root cause because it optimizes throughput at the expense of pause time.
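This hypothesis can be confirmed at runtime by listing the active collector beans; under the JDK 8 default ParallelGC the names are typically "PS Scavenge" and "PS MarkSweep" (a sketch, not from the original article):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public class GcCheck {
    // Names of the collectors the running JVM is actually using
    static List<String> collectorNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // e.g. [PS Scavenge, PS MarkSweep] under ParallelGC,
        //      [G1 Young Generation, G1 Old Generation] under G1GC
        System.out.println(collectorNames());
    }
}
```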
First Optimization Attempt – Switch to G1GC
Why G1GC
Stability in JDK 8
Good latency control
Suitable for low‑latency workloads
Configuration
Initial (failed) config
<code># Parameter typo caused startup failure
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMills=50ms</code>
Errors:
Typo: MaxGCPauseMills → MaxGCPauseMillis
Value format: 50ms → 50
Corrected config
<code>-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50</code>
After redeployment the service started successfully and monitoring showed GC pauses mostly under 50 ms.
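One way to double-check that the intended flags actually took effect after a redeployment is to read them back from the running JVM. A small sketch (an addition for illustration) using the standard <code>RuntimeMXBean</code>:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class FlagCheck {
    // The arguments the JVM was actually started with, e.g. -Xmx4g, -XX:+UseG1GC, ...
    static List<String> jvmArgs() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        for (String arg : jvmArgs()) {
            System.out.println(arg);
        }
    }
}
```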
Unexpected “Easter Egg”
Later a pause of 1300 ms appeared, and subsequent analysis showed the same pattern of long pauses.
Register GC Event Listener via JMX
Code to register a listener for each GarbageCollectorMXBean:
<code>import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Register a listener on each garbage collector MXBean
for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
    if (!(mbean instanceof NotificationEmitter)) {
        continue; // this bean does not support notifications
    }
    NotificationEmitter emitter = (NotificationEmitter) mbean;
    NotificationListener listener = getNewListener(mbean);
    emitter.addNotificationListener(listener, null, null);
}
</code>
The listener prints detailed GC event JSON, revealing a young‑generation pause of 1.869 s with 48 GC worker threads.
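The article elides the <code>getNewListener</code> helper. A plausible sketch (the implementation below is an assumption, not the author's original code) that extracts the pause details from each notification via the HotSpot-specific <code>com.sun.management</code> API:

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;
import java.lang.management.GarbageCollectorMXBean;

public class GcListeners {
    // Hypothetical implementation of the getNewListener helper used above
    static NotificationListener getNewListener(GarbageCollectorMXBean mbean) {
        return (Notification notification, Object handback) -> {
            // Only GC notifications carry GarbageCollectionNotificationInfo payloads
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return;
            }
            GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                    .from((CompositeData) notification.getUserData());
            // getDuration() is the pause length in milliseconds
            System.out.println(info.getGcName()
                    + " cause=" + info.getGcCause()
                    + " duration=" + info.getGcInfo().getDuration() + "ms");
        };
    }
}
```

In a real service the listener would feed these fields into the metrics pipeline rather than print them.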
<code>{
  "duration": 1869,
  "maxPauseMillis": 1869,
  "promotedBytes": "139MB",
  "gcCause": "G1 Evacuation Pause",
  "collectionTime": 27281,
  "gcAction": "end of minor GC",
  "afterUsage": {
    "G1 Old Gen": "1745MB",
    "Code Cache": "53MB",
    "G1 Survivor Space": "254MB",
    "Compressed Class Space": "9MB",
    "Metaspace": "81MB",
    "G1 Eden Space": "0"
  },
  "gcId": 326,
  "collectionCount": 326,
  "gcName": "G1 Young Generation",
  "type": "jvm.gc.pause"
}
</code>
GC Log Analysis
Enabling <code>-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps</code> produced logs showing a 1.87 s pause with 48 parallel GC threads, while the container was limited to 4 CPU cores.
The mismatch between JVM‑detected CPU count (≈72) and the pod limit (4 cores) caused massive thread contention.
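The 48-thread figure is consistent with HotSpot's commonly cited default sizing heuristic: on machines with more than 8 CPUs, ParallelGCThreads defaults to roughly 8 plus 5/8 of the CPUs beyond 8, so 72 visible CPUs yield 8 + 64 × 5/8 = 48 workers. A quick check of the arithmetic:

```java
public class GcThreadDefaults {
    // HotSpot's default heuristic for ParallelGCThreads:
    // all CPUs up to 8, then 5/8 of the CPUs beyond 8 (integer arithmetic)
    static int defaultParallelGcThreads(int ncpus) {
        return ncpus <= 8 ? ncpus : 8 + (ncpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        System.out.println(defaultParallelGcThreads(72)); // 48, matching the GC log
        System.out.println(defaultParallelGcThreads(4));  // 4, matching the pod quota
    }
}
```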
Final Solution – Limit GC Parallel Threads
Adding <code>-XX:ParallelGCThreads=4</code> aligns GC workers with the pod’s CPU quota:
<code>-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=4 -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps</code>
After restart, GC pauses stayed within the 50 ms target.
Case Summary and Takeaways
Quantitative monitoring is essential for JVM performance tuning.
In containerized environments, JVM‑visible CPU cores must be reconciled with Kubernetes limits.
Adjusting <code>ParallelGCThreads</code> (or switching to G1GC) can dramatically reduce pause times.
Combining metric monitoring, JVM flag tuning, GC‑log analysis, and JMX listeners provides a systematic troubleshooting workflow.