
How to Diagnose and Fix JVM GC Pauses in High‑Concurrency Microservices

This article walks through a real‑world production case, detailing how to systematically detect, analyze, and resolve severe JVM garbage‑collection pauses in a high‑concurrency Spring Boot microservice, covering resource analysis, JVM flag tuning, G1GC migration, JMX listeners, and GC‑log investigation.


Introduction

This case study documents, step by step, how severe GC pauses in a high‑concurrency Spring Boot microservice were detected, analyzed, and resolved in production.

System Background

The service runs as a microservice with the following stack:

Application framework: Spring Boot

Metrics collection: Micrometer

Monitoring system: Datadog

Micrometer supports many back‑ends such as AppOptics, Atlas, Dynatrace, Elastic, Ganglia, Graphite, Humio, Influx, Instana, JMX, KairosDB, New Relic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront, etc.

Problem Symptoms

Problem Description

Monitoring revealed severe GC pauses on one node:

Maximum GC pause time frequently > 400 ms

Peak pause reached 546 ms on 2020‑02‑04 09:20:00

GC pause time chart

Business Impact

Service timeout: calls time out at 1 s, so long GC pauses put requests at risk of client‑visible timeouts

Performance requirement: max pause < 200 ms, average pause < 100 ms

Business impact: severe effect on customer trading strategies

Investigation Process

Step 1 – System Resource Analysis

CPU Load

CPU usage was examined; the monitoring chart shows:

CPU load chart

Observed values: system load 4.92, CPU utilization ~7%.

GC Memory Usage

Memory usage around 09:25 shows a sharp drop in <code>old_gen</code>, indicating a Full GC; the period around 09:20, however, shows only a gradual increase with no Full GC, meaning the long pause was not caused by a Full GC.

Old generation memory chart
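This kind of reasoning can be verified directly in code rather than only from dashboards. The sketch below samples heap pool usage via the standard platform MXBeans; the pool names are collector‑specific (e.g. "PS Old Gen" under ParallelGC, "G1 Old Gen" under G1), so treat the filtering by name as an assumption about the running collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class OldGenProbe {
    public static void main(String[] args) {
        // Walk every heap pool and print its current occupancy;
        // watching the old-generation pool over time reveals Full GC drops.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip Metaspace, Code Cache, etc.
            }
            long usedMb = pool.getUsage().getUsed() / (1024 * 1024);
            System.out.printf("%s: %d MB used%n", pool.getName(), usedMb);
        }
    }
}
```

Polling this periodically (or exporting it through Micrometer, as the service here does) gives the same old‑generation curve the monitoring chart shows.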

Step 2 – JVM Configuration Analysis

Startup Parameters

<code>-Xmx4g -Xms4g</code>

JDK version: 8

GC: default ParallelGC

Heap size: 4 GB (initial and max)

Initial Hypothesis

ParallelGC may be the root cause because it optimizes throughput at the expense of pause time.
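Before changing collectors, it is worth confirming at runtime which collectors are actually active rather than trusting startup flags alone. A minimal stdlib sketch (the names are HotSpot‑specific: "PS Scavenge"/"PS MarkSweep" for ParallelGC, "G1 Young Generation"/"G1 Old Generation" for G1):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {
    public static void main(String[] args) {
        // List each active collector with its cumulative collection
        // count and total collection time since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

A high total collection time relative to uptime is a first quantitative hint that GC, not application code, is consuming the latency budget.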

First Optimization Attempt – Switch to G1GC

Why G1GC

Stability in JDK 8

Good latency control

Suitable for low‑latency workloads

Configuration

Initial (failed) config

<code># Parameter typo caused startup failure
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMills=50ms</code>

Errors:

Typo: MaxGCPauseMills should be MaxGCPauseMillis

Value format: 50ms should be 50 (the flag takes a bare millisecond count; unit suffixes are rejected)

Corrected config

<code>-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50</code>

After redeployment the service started successfully and monitoring showed GC pauses mostly under 50 ms.

G1GC early effect chart

Unexpected “Easter Egg”

Later a pause of 1300 ms appeared, and subsequent occurrences showed the same pattern of recurring long pauses.

Long pause chart

Register GC Event Listener via JMX

Code to register a listener for each <code>GarbageCollectorMXBean</code>:

<code>import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Register a notification listener on each garbage collector MXBean
for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
    if (!(mbean instanceof NotificationEmitter)) {
        continue; // this collector does not support notifications
    }
    NotificationEmitter emitter = (NotificationEmitter) mbean;
    NotificationListener listener = getNewListener(mbean);
    emitter.addNotificationListener(listener, null, null);
}
</code>

The listener prints detailed GC event JSON, revealing a young‑generation pause of 1.869 s with 48 GC worker threads.

<code>{
  "duration":1869,
  "maxPauseMillis":1869,
  "promotedBytes":"139MB",
  "gcCause":"G1 Evacuation Pause",
  "collectionTime":27281,
  "gcAction":"end of minor GC",
  "afterUsage":{
    "G1 Old Gen":"1745MB",
    "Code Cache":"53MB",
    "G1 Survivor Space":"254MB",
    "Compressed Class Space":"9MB",
    "Metaspace":"81MB",
    "G1 Eden Space":"0"
  },
  "gcId":326,
  "collectionCount":326,
  "gcName":"G1 Young Generation",
  "type":"jvm.gc.pause"
}
</code>
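The article does not show <code>getNewListener</code> itself. One plausible implementation, sketched below, uses the HotSpot‑specific <code>com.sun.management.GarbageCollectionNotificationInfo</code> class (not part of the standard Java API, so this assumes an Oracle/OpenJDK HotSpot JVM); the real listener would emit JSON like the event above rather than plain text.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;

public class GcListener {
    // Hypothetical counterpart to the article's getNewListener(mbean)
    static NotificationListener getNewListener(GarbageCollectorMXBean mbean) {
        return (Notification n, Object handback) -> {
            // Only GC notifications carry a GarbageCollectionNotificationInfo payload
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(n.getType())) {
                return;
            }
            GarbageCollectionNotificationInfo info =
                    GarbageCollectionNotificationInfo.from((CompositeData) n.getUserData());
            System.out.printf("%s / %s / cause=%s / duration=%d ms%n",
                    info.getGcName(), info.getGcAction(), info.getGcCause(),
                    info.getGcInfo().getDuration());
        };
    }

    public static void main(String[] args) {
        int registered = 0;
        for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (mbean instanceof NotificationEmitter) {
                ((NotificationEmitter) mbean).addNotificationListener(getNewListener(mbean), null, null);
                registered++;
            }
        }
        System.out.println("Listeners registered: " + registered);
    }
}
```

<code>getGcInfo().getDuration()</code> is the pause duration in milliseconds; this is the source of the 1869 ms figure in the event above.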

GC Log Analysis

Enabling <code>-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps</code> produced logs showing a 1.87 s pause handled by 48 parallel GC threads, while the container was limited to 4 CPU cores.

GC log excerpt

The mismatch between the JVM‑detected CPU count (≈72, the host's cores) and the pod limit (4 cores) caused massive GC thread contention: JDK 8 builds prior to 8u191 are not cgroup‑aware, so the JVM sized its GC worker pool for the full host rather than the pod's quota.

CPU load chart with pod limit
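The 48 worker threads in the log are no coincidence: HotSpot's default for <code>ParallelGCThreads</code> is, approximately, all of the first 8 cores plus 5/8 of the remainder. A sketch of that heuristic (an approximation of the VM's ergonomics code, not an exact reimplementation):

```java
public class GcThreadErgonomics {
    // Approximates HotSpot's default ParallelGCThreads:
    // use every core up to 8, then 5/8 of the cores beyond that.
    static int defaultParallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        // For the 72-core host: 8 + (64 * 5 / 8) = 48, matching the GC log
        System.out.println(defaultParallelGcThreads(72));
        // For the 4-core pod quota the default would be 4
        System.out.println(defaultParallelGcThreads(4));
    }
}
```

Plugging in the host's 72 cores yields exactly the 48 GC workers observed, confirming the JVM was ergonomically sized for the host, not the pod.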

Final Solution – Limit GC Parallel Threads

Adding <code>-XX:ParallelGCThreads=4</code> aligns the GC worker count with the pod's CPU quota:

<code>-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=4 -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps</code>

After restart, GC pauses stayed within the 50 ms target.

Post‑tuning GC pause chart

Case Summary and Takeaways

Quantitative monitoring is essential for JVM performance tuning.

In containerized environments, JVM‑visible CPU cores must be reconciled with Kubernetes limits.

Adjusting <code>ParallelGCThreads</code> (together with a latency‑oriented collector such as G1GC) can dramatically reduce pause times.

Combining metric monitoring, JVM flag tuning, GC‑log analysis, and JMX listeners provides a systematic troubleshooting workflow.

Tags: JVM, Microservices, Kubernetes, Garbage Collection, performance tuning, G1GC
Written by IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.