
How to Diagnose and Fix JVM GC Pauses in High‑Concurrency Microservices

This article walks through a real‑world production case, detailing how to systematically detect, analyze, and resolve severe JVM garbage‑collection pauses in a high‑concurrency Spring Boot microservice, covering resource analysis, JVM flag tuning, G1GC migration, JMX listeners, and GC‑log investigation.


Introduction

This case study documents, step by step, how severe GC pauses in a high‑concurrency Spring Boot microservice were detected, analyzed, and resolved in production.

System Background

The service runs as a microservice with the following stack:

Application framework: Spring Boot

Metrics collection: Micrometer

Monitoring system: Datadog

Micrometer supports many back‑ends such as AppOptics, Atlas, Dynatrace, Elastic, Ganglia, Graphite, Humio, Influx, Instana, JMX, KairosDB, New Relic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront, etc.

Problem Symptoms

Problem Description

Monitoring revealed severe GC pauses on one node:

Maximum GC pause time frequently > 400 ms

Peak pause reached 546 ms on 2020‑02‑04 09:20:00

GC pause time chart

Business Impact

Service timeout: calls time out at 1 s, so long GC pauses put requests at risk of client‑visible timeouts

Performance requirement: max pause < 200 ms, average pause < 100 ms

Business impact: severe effect on customer trading strategies

Investigation Process

Step 1 – System Resource Analysis

CPU Load

CPU usage was examined; the monitoring chart shows:

CPU load chart

Observed values: system load 4.92, CPU utilization ~7%.

GC Memory Usage

Memory usage around 09:25 shows a sharp drop in <code>old_gen</code>, indicating a Full GC; the period around 09:20, however, shows only a gradual increase with no Full GC, meaning the long pause was not caused by a Full GC.

Old generation memory chart
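This kind of reasoning can be verified directly in code rather than only from dashboards. The sketch below samples heap pool usage via the standard platform MXBeans; the pool names are collector‑specific (e.g. "PS Old Gen" under ParallelGC, "G1 Old Gen" under G1), so treat the filtering by name as an assumption about the running collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class OldGenProbe {
    public static void main(String[] args) {
        // Walk every heap pool and print its current occupancy;
        // watching the old-generation pool over time reveals Full GC drops.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip Metaspace, Code Cache, etc.
            }
            long usedMb = pool.getUsage().getUsed() / (1024 * 1024);
            System.out.printf("%s: %d MB used%n", pool.getName(), usedMb);
        }
    }
}
```

Polling this periodically (or exporting it through Micrometer, as the service here does) gives the same old‑generation curve the monitoring chart shows.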

Step 2 – JVM Configuration Analysis

Startup Parameters

<code>-Xmx4g -Xms4g</code>

JDK version: 8

GC: default ParallelGC

Heap size: 4 GB (initial and max)

Initial Hypothesis

ParallelGC may be the root cause because it optimizes throughput at the expense of pause time.
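Before changing collectors, it is worth confirming at runtime which collectors are actually active rather than trusting startup flags alone. A minimal stdlib sketch (the names are HotSpot‑specific: "PS Scavenge"/"PS MarkSweep" for ParallelGC, "G1 Young Generation"/"G1 Old Generation" for G1):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {
    public static void main(String[] args) {
        // List each active collector with its cumulative collection
        // count and total collection time since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

A high total collection time relative to uptime is a first quantitative hint that GC, not application code, is consuming the latency budget.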

First Optimization Attempt – Switch to G1GC

Why G1GC

Stability in JDK 8

Good latency control

Suitable for low‑latency workloads

Configuration

Initial (failed) config

<code># Parameter typo caused startup failure
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMills=50ms</code>

Errors:

Typo: MaxGCPauseMills should be MaxGCPauseMillis

Value format: 50ms should be 50 (the flag takes a bare millisecond count; unit suffixes are rejected)

Corrected config

<code>-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50</code>

After redeployment the service started successfully and monitoring showed GC pauses mostly under 50 ms.

G1GC early effect chart

Unexpected “Easter Egg”

Later a pause of 1300 ms appeared, and subsequent occurrences showed the same pattern of recurring long pauses.

Long pause chart

Register GC Event Listener via JMX

Code to register a listener for each <code>GarbageCollectorMXBean</code>:

<code>import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Register a notification listener on each garbage collector MXBean
for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
    if (!(mbean instanceof NotificationEmitter)) {
        continue; // this collector does not support notifications
    }
    NotificationEmitter emitter = (NotificationEmitter) mbean;
    NotificationListener listener = getNewListener(mbean);
    emitter.addNotificationListener(listener, null, null);
}
</code>

The listener prints detailed GC event JSON, revealing a young‑generation pause of 1.869 s with 48 GC worker threads.

<code>{
  "duration":1869,
  "maxPauseMillis":1869,
  "promotedBytes":"139MB",
  "gcCause":"G1 Evacuation Pause",
  "collectionTime":27281,
  "gcAction":"end of minor GC",
  "afterUsage":{
    "G1 Old Gen":"1745MB",
    "Code Cache":"53MB",
    "G1 Survivor Space":"254MB",
    "Compressed Class Space":"9MB",
    "Metaspace":"81MB",
    "G1 Eden Space":"0"
  },
  "gcId":326,
  "collectionCount":326,
  "gcName":"G1 Young Generation",
  "type":"jvm.gc.pause"
}
</code>
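The article does not show <code>getNewListener</code> itself. One plausible implementation, sketched below, uses the HotSpot‑specific <code>com.sun.management.GarbageCollectionNotificationInfo</code> class (not part of the standard Java API, so this assumes an Oracle/OpenJDK HotSpot JVM); the real listener would emit JSON like the event above rather than plain text.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;

public class GcListener {
    // Hypothetical counterpart to the article's getNewListener(mbean)
    static NotificationListener getNewListener(GarbageCollectorMXBean mbean) {
        return (Notification n, Object handback) -> {
            // Only GC notifications carry a GarbageCollectionNotificationInfo payload
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(n.getType())) {
                return;
            }
            GarbageCollectionNotificationInfo info =
                    GarbageCollectionNotificationInfo.from((CompositeData) n.getUserData());
            System.out.printf("%s / %s / cause=%s / duration=%d ms%n",
                    info.getGcName(), info.getGcAction(), info.getGcCause(),
                    info.getGcInfo().getDuration());
        };
    }

    public static void main(String[] args) {
        int registered = 0;
        for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (mbean instanceof NotificationEmitter) {
                ((NotificationEmitter) mbean).addNotificationListener(getNewListener(mbean), null, null);
                registered++;
            }
        }
        System.out.println("Listeners registered: " + registered);
    }
}
```

<code>getGcInfo().getDuration()</code> is the pause duration in milliseconds; this is the source of the 1869 ms figure in the event above.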

GC Log Analysis

Enabling <code>-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps</code> produced logs showing a 1.87 s pause handled by 48 parallel GC threads, while the container was limited to 4 CPU cores.

GC log excerpt

The mismatch between the JVM‑detected CPU count (≈72, the host's cores) and the pod limit (4 cores) caused massive GC thread contention: JDK 8 builds prior to 8u191 are not cgroup‑aware, so the JVM sized its GC worker pool for the full host rather than the pod's quota.

CPU load chart with pod limit
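The 48 worker threads in the log are no coincidence: HotSpot's default for <code>ParallelGCThreads</code> is, approximately, all of the first 8 cores plus 5/8 of the remainder. A sketch of that heuristic (an approximation of the VM's ergonomics code, not an exact reimplementation):

```java
public class GcThreadErgonomics {
    // Approximates HotSpot's default ParallelGCThreads:
    // use every core up to 8, then 5/8 of the cores beyond that.
    static int defaultParallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        // For the 72-core host: 8 + (64 * 5 / 8) = 48, matching the GC log
        System.out.println(defaultParallelGcThreads(72));
        // For the 4-core pod quota the default would be 4
        System.out.println(defaultParallelGcThreads(4));
    }
}
```

Plugging in the host's 72 cores yields exactly the 48 GC workers observed, confirming the JVM was ergonomically sized for the host, not the pod.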

Final Solution – Limit GC Parallel Threads

Adding <code>-XX:ParallelGCThreads=4</code> aligns the GC worker count with the pod's CPU quota:

<code>-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=4 -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps</code>

After restart, GC pauses stayed within the 50 ms target.

Post‑tuning GC pause chart

Case Summary and Takeaways

Quantitative monitoring is essential for JVM performance tuning.

In containerized environments, JVM‑visible CPU cores must be reconciled with Kubernetes limits.

Adjusting <code>ParallelGCThreads</code> (together with a latency‑oriented collector such as G1GC) can dramatically reduce pause times.

Combining metric monitoring, JVM flag tuning, GC‑log analysis, and JMX listeners provides a systematic troubleshooting workflow.

Tags: JVM, Microservices, Kubernetes, Garbage Collection, performance tuning, G1GC
Written by IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.