JVM GC Tuning for Java Service Migration to a Private Cloud: A Multi‑Round Optimization Case Study
During the migration of a billion‑record Java service from physical servers to a private‑cloud Docker environment, a series of JVM GC tuning steps—including adaptive young generation sizing, larger young generation, reduced concurrent GC threads, and phantom‑reference cleanup—significantly reduced stop‑the‑world pauses and restored service performance.
Background: A Java service storing billions of user records in MySQL (sharded across >5 instances) was migrated from a 32‑core/128 GB physical server to an 8‑core/8 GB private‑cloud Docker host. After migration, cache timeouts, increased response latency, and occasional latency spikes were observed, suspected to be caused by JVM GC stop‑the‑world pauses.
Initial Environment: The service ran on JDK 1.6 with the following JVM options (using ParNew + CMS):
-server
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+DisableExplicitGC
-Xms10g
-Xmx10g
-Xmn4g
-Xss1024K
-XX:PermSize=256m
-XX:MaxPermSize=512m
-XX:SurvivorRatio=10
-XX:+ParallelRefProcEnabled
-XX:+CMSParallelRemarkEnabled
-XX:+UseCMSCompactAtFullCollection
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70
-XX:CMSMaxAbortablePrecleanTime=30000
-XX:SoftRefLRUPolicyMSPerMB=0
After migration the JVM memory was reduced to 7 GB.
Problem Identification: GC logs showed frequent young‑generation pauses of 10‑13 ms every ~4 s, and occasional long pauses (>100 ms) in the old generation, correlating with the cache timeouts.
Round 1 – Adaptive Young Generation: Enabled GC logging and removed the explicit -Xmn, letting the JVM size the young generation itself. The new logs showed a JVM‑chosen young generation of ~156 MB, with pauses still at 10‑13 ms.
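The article does not list the logging switches used in this round; on JDK 1.6 the usual combination is something like the following (the log path is illustrative):

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/opt/logs/gc.log
```

PrintGCApplicationStoppedTime is what surfaces the total stop‑the‑world time per pause, which is the number correlated with the cache timeouts here.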
Round 2 – Increase Young Generation Size: Set -Xmn2g to lengthen the GC interval. Pauses grew to 12‑17 ms (young) and 8 ms / 698 ms (old), with limited improvement.
Round 3 – Reduce Concurrent GC Threads: Set -XX:ParallelGCThreads=8 to match the limited CPU cores available in Docker. Young‑gen pauses dropped to 8‑10 ms, but old‑gen remark time increased to ~729 ms.
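One likely reason this flag had to be set by hand: JDK 1.6 predates container awareness, so the JVM derives its default GC thread count from the processor count it can observe, which inside a container may be the underlying host's core count rather than the container's CPU limit. A minimal way to see what the JVM thinks it has:

```java
public class CpuCheck {
    public static void main(String[] args) {
        // On JDK 6 this reports the CPUs visible to the JVM process; in a
        // container without cgroup awareness it can be the physical host's
        // core count, inflating the default ParallelGCThreads.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("JVM sees " + cpus + " processors");
    }
}
```

Comparing this number against the container's CPU quota tells you whether the GC thread defaults need manual pinning, as was done here.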
Round 4 – Full Heap Dump & Analysis: A full heap dump analyzed with Eclipse MAT revealed that the MySQL driver’s com.mysql.jdbc.NonRegisteringDriver retained a large map of PhantomReference objects (~812 MB), causing long reference‑processing times during the CMS remark phase.
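The dump itself can be taken with the JDK 6 jmap tool; the file path is illustrative and the PID is a placeholder:

```
# Binary dump of live objects only, suitable for opening in Eclipse MAT.
# Note: -dump:live forces a full GC before dumping.
jmap -dump:live,format=b,file=/tmp/service-heap.hprof <pid>
```

In MAT, the dominator tree view is what exposes a single retainer (here the driver class) holding hundreds of megabytes of references.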
Round 5 – Clean Connection Reference Map: Implemented a background thread that periodically clears the driver’s connection reference map. Subsequent GC logs showed young‑gen pauses of ~8 ms (at ~8 s intervals) and old‑gen pauses reduced to 11‑131 ms, with the gains holding over days of operation.
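The article does not show the cleanup thread's code. Below is a minimal sketch of the idea: reflectively fetch the driver's static phantom‑reference map and clear it on a schedule. The nested FakeDriver class is a stand‑in for com.mysql.jdbc.NonRegisteringDriver so the sketch is self‑contained, and the field name connectionPhantomRefs is taken from the Connector/J 5.1 source; verify it against the driver version actually deployed before using this pattern.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConnectionRefCleaner {

    /** Stand-in for com.mysql.jdbc.NonRegisteringDriver (illustrative only). */
    static class FakeDriver {
        protected static final Map<Object, Object> connectionPhantomRefs =
                new ConcurrentHashMap<Object, Object>();
    }

    /** Reflectively fetch the driver's static reference map and clear it. */
    static int clearRefMap(Class<?> driverClass, String fieldName) throws Exception {
        Field f = driverClass.getDeclaredField(fieldName);
        f.setAccessible(true);                     // field is protected in the driver
        Map<?, ?> refs = (Map<?, ?>) f.get(null);  // static field, so null receiver
        int n = refs.size();
        refs.clear();                              // DBCP owns the connections; these refs are dead weight
        return n;
    }

    public static void main(String[] args) throws Exception {
        // Simulate the driver accumulating phantom references.
        for (int i = 0; i < 3; i++) {
            Object ref = new Object();
            FakeDriver.connectionPhantomRefs.put(ref, ref);
        }

        // In the real service this ran periodically; one shot is enough to demo.
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.schedule(new Runnable() {
            public void run() {
                try {
                    int n = clearRefMap(FakeDriver.class, "connectionPhantomRefs");
                    System.out.println("cleared " + n + " refs");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 0, TimeUnit.MILLISECONDS);
        ses.shutdown();
        ses.awaitTermination(5, TimeUnit.SECONDS);

        System.out.println("map size now " + FakeDriver.connectionPhantomRefs.size());
    }
}
```

This is safe only because the connections are owned by the pool, not by the driver's cleanup machinery; clearing the map simply stops the remark phase from walking hundreds of megabytes of references.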
Additional Notes: The service uses Tomcat DBCP for connection pooling, so the MySQL driver’s own connection cleanup is redundant. The pool is configured with <initialSize>16</initialSize>, <maxActive>16</maxActive>, etc. Upgrading to JDK 1.8 was considered but deemed too risky due to compatibility concerns.
Conclusion: By iteratively tuning GC parameters, aligning GC thread counts with the CPU actually available to the container, and fixing a memory‑leak‑like issue in the MySQL driver, the service’s GC‑induced pauses were dramatically reduced, cache timeouts disappeared, and overall response times returned to, or exceeded, pre‑migration levels.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.