Case Study: Long GC Times Caused by Database Connection Pool Issues and Mitigation Strategies
This article analyzes a production incident where excessive Full GC pauses during a high‑traffic promotion were traced to stale database connections in the DBCP pool, explains the investigation steps, root cause, and presents several JVM and connection‑pool configuration solutions to prevent similar performance degradations.
Introduction: This article analyzes a production incident in which excessive Full GC pauses during a promotion period were traced to a keep-alive problem in the database connection pool.
Problem description: Monitoring showed Full GC times exceeding 500 ms, coinciding with increased interface timeouts.
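The kind of measurement behind this observation can be approximated from inside the JVM. The following is a minimal sketch, not the monitoring stack used in the incident: it reads the standard GarbageCollectorMXBean counters and derives an average time per collection between two samples. The class and method names are hypothetical. The reported time is whatever the JVM accounts to each collector, so it is not necessarily pure stop-the-world time; GC logs remain the more precise source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original article.
final class GcPauseProbe {

    /** Cumulative collection time in ms per collector, keyed by collector name. */
    static Map<String, Long> cumulativeGcMillis() {
        Map<String, Long> result = new LinkedHashMap<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() returns -1 when the collector does not report it.
            result.put(bean.getName(), Math.max(0L, bean.getCollectionTime()));
        }
        return result;
    }

    /** Average ms per collection between two samples; 0 when no collection ran in between. */
    static double averagePauseMillis(long timeBefore, long countBefore,
                                     long timeAfter, long countAfter) {
        long collections = countAfter - countBefore;
        return collections <= 0 ? 0.0 : (double) (timeAfter - timeBefore) / collections;
    }

    public static void main(String[] args) {
        cumulativeGcMillis().forEach(
            (name, ms) -> System.out.println(name + ": " + ms + " ms total"));
    }
}
```

Sampling the counters periodically and alerting when the old-generation collector's average exceeds a threshold such as 500 ms would surface the symptom described above.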
Application basics: The service runs on a JVM with the CMS collector, a 6 GB heap, and a 2 GB young generation (options: -XX:+UseConcMarkSweepGC -Xms6144m -Xmx6144m -Xmn2048m -XX:ParallelGCThreads=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:+ParallelRefProcEnabled) and connects to MySQL through a DBCP connection pool.
Investigation process: Heap dumps taken before and after a Full GC showed that many database-related objects were being reclaimed. OQL analysis found more connection objects on the heap than maxActive allows, which indicates that many of them were stale connections the pool had already discarded. The pool's eviction task (org.apache.commons.pool.impl.GenericObjectPool.Evictor) removes idle connections according to minEvictableIdleTimeMillis and testWhileIdle. The evicted connections had already been promoted to the old generation, so they could only be reclaimed there, which prolonged GC.
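The idle-time part of the eviction rule can be modelled in a few lines. This is a simplified sketch, not the Commons Pool source: IdleConn and evictOnce are hypothetical names, and the model ignores testWhileIdle validation and the per-run limit on how many objects the real Evictor examines. It does follow the documented Commons Pool 1.x rule that a non-positive minEvictableIdleTimeMillis disables eviction by idle time alone.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified model for illustration; names are hypothetical.
final class IdleEvictionSketch {

    /** Stand-in for a pooled connection; only tracks when it became idle. */
    static final class IdleConn {
        final int id;
        final long idleSinceMillis;

        IdleConn(int id, long idleSinceMillis) {
            this.id = id;
            this.idleSinceMillis = idleSinceMillis;
        }
    }

    /**
     * One eviction run: removes connections idle for longer than the threshold
     * and returns their ids. A non-positive threshold evicts nothing.
     */
    static List<Integer> evictOnce(List<IdleConn> idle, long nowMillis,
                                   long minEvictableIdleTimeMillis) {
        List<Integer> evicted = new ArrayList<>();
        if (minEvictableIdleTimeMillis <= 0) {
            return evicted;
        }
        for (Iterator<IdleConn> it = idle.iterator(); it.hasNext();) {
            IdleConn c = it.next();
            if (nowMillis - c.idleSinceMillis > minEvictableIdleTimeMillis) {
                it.remove();          // the pool drops its reference here
                evicted.add(c.id);    // the object itself stays on the heap until GC
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        List<IdleConn> idle = new ArrayList<>();
        idle.add(new IdleConn(1, 0L));
        idle.add(new IdleConn(2, 50_000L));
        idle.add(new IdleConn(3, 110_000L));
        System.out.println("evicted ids: " + evictOnce(idle, 120_000L, 60_000L));
        System.out.println("still pooled: " + idle.size());
    }
}
```

The point of the model is the comment on the eviction line: eviction only removes the pool's reference, so a connection that has already been tenured remains in the old generation until an old-generation collection reclaims it.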
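Root cause: During high traffic, pooled connections live long enough to be promoted to the old generation before the Evictor discards them as idle. The discarded connections then leave a large volume of connection-related objects in the old generation that only a Full GC can reclaim, which lengthens the Full GC pauses and in turn causes interface timeouts.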
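A rough piece of arithmetic shows why connections are usually tenured before they are evicted. The sketch below assumes the simplest model, in which an object is promoted after surviving MaxTenuringThreshold young collections; HotSpot's dynamic age calculation can promote earlier, so the result is an upper bound. The class name and all numbers in main are hypothetical, not measurements from the incident.

```java
// Back-of-the-envelope model for illustration; names and numbers are hypothetical.
final class PromotionTimingSketch {

    /** Approximate ms until a live object is tenured under the simple age-threshold model. */
    static long millisUntilPromotion(long youngGcIntervalMillis, int maxTenuringThreshold) {
        return youngGcIntervalMillis * maxTenuringThreshold;
    }

    /**
     * True when a connection would reach the old generation before it can be
     * evicted for being idle. Assumes minEvictableIdleTimeMillis is positive.
     */
    static boolean tenuredBeforeIdleEviction(long youngGcIntervalMillis,
                                             int maxTenuringThreshold,
                                             long minEvictableIdleTimeMillis) {
        return millisUntilPromotion(youngGcIntervalMillis, maxTenuringThreshold)
                < minEvictableIdleTimeMillis;
    }

    public static void main(String[] args) {
        // Assumed: a young GC every 2 s, threshold 6, idle eviction after 30 minutes.
        System.out.println(millisUntilPromotion(2_000L, 6) + " ms until promotion");
        System.out.println("tenured first: "
                + tenuredBeforeIdleEviction(2_000L, 6, 1_800_000L));
    }
}
```

Under these assumed numbers a connection is tenured within seconds, while idle eviction takes minutes, so essentially every evicted connection is old-generation garbage.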
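Solutions: (1) Switch to G1 GC to reduce pause times and raise MaxTenuringThreshold, so that connections take longer to reach the old generation; (2) Set minEvictableIdleTimeMillis to 0 so that the Evictor no longer discards connections for being idle (in Commons Pool 1.x a non-positive value disables eviction based on idle time alone); (3) Adjust the other eviction settings, or use a pool with keep-alive support (e.g., Druid's KeepAlive option).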
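Option (2) could be expressed as pool properties along the following lines. This is a sketch under assumptions: it only builds a java.util.Properties object using DBCP 1.x property names, on the assumption that the application creates its pool from properties (for example via BasicDataSourceFactory.createDataSource). The eviction interval and validation query are illustrative values, not the settings from the incident.

```java
import java.util.Properties;

// Illustrative configuration; values are assumptions, not the incident's settings.
final class PoolKeepAliveConfig {

    static Properties keepAliveProperties() {
        Properties p = new Properties();
        // Non-positive: never evict a connection just because it has been idle.
        p.setProperty("minEvictableIdleTimeMillis", "0");
        // Still validate idle connections so broken ones are dropped.
        p.setProperty("testWhileIdle", "true");
        // Assumed interval between evictor runs (ms); must be positive for the evictor to run.
        p.setProperty("timeBetweenEvictionRunsMillis", "60000");
        // Assumed lightweight query for MySQL validation.
        p.setProperty("validationQuery", "SELECT 1");
        return p;
    }

    public static void main(String[] args) {
        Properties p = keepAliveProperties();
        for (String name : p.stringPropertyNames()) {
            System.out.println(name + "=" + p.getProperty(name));
        }
    }
}
```

With these settings the evictor keeps running and validates idle connections, but it no longer turns healthy connections into old-generation garbage merely because they have been idle.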
Extended knowledge: The original discussion also covers Druid pool behavior, when the validation query is executed, FIFO versus LIFO (FILO) borrowing order, the impact of phantom references on GC, and the differences in MaxTenuringThreshold among the CMS, Parallel, and G1 collectors.
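The effect of borrowing order on idle time can be illustrated with a small simulation. The sketch below is a toy model, not any pool's actual implementation: it assumes a single caller that borrows and returns one connection at a time, and the class and method names are hypothetical. It counts how many distinct connections get used under each order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Toy model for illustration; names are hypothetical.
final class BorrowOrderSketch {

    /**
     * Simulates sequential borrow/return cycles and returns how many distinct
     * connections were used. LIFO returns a connection to the head of the idle
     * queue, FIFO returns it to the tail.
     */
    static int distinctConnectionsUsed(int poolSize, int cycles, boolean lifo) {
        if (poolSize <= 0) {
            return 0;
        }
        Deque<Integer> idle = new ArrayDeque<>();
        for (int id = 0; id < poolSize; id++) {
            idle.addLast(id);
        }
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < cycles; i++) {
            Integer conn = idle.pollFirst();   // always borrow from the head
            used.add(conn);
            if (lifo) {
                idle.addFirst(conn);           // most recently used is reused next
            } else {
                idle.addLast(conn);            // connections rotate through the pool
            }
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println("LIFO: " + distinctConnectionsUsed(8, 100, true));
        System.out.println("FIFO: " + distinctConnectionsUsed(8, 100, false));
    }
}
```

In this model LIFO keeps reusing the same connection while the rest sit idle and become candidates for eviction, whereas FIFO touches every connection in turn and so keeps all of them from going idle for long.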
Conclusion: Properly configuring the database connection pool to maintain “keep‑alive” and selecting an appropriate GC algorithm are essential to avoid long GC pauses and service degradation during traffic spikes.
JD Tech