How to Eliminate GC Pauses in High‑QPS Java Services: A Step‑by‑Step JVM Tuning Guide
This article investigates a high‑concurrency Java service that suffers long GC pauses during large index swaps, identifies the YGC Object Copy phase as the root cause, and walks through a series of JVM tuning attempts (MaxTenuringThreshold, InitialTenuringThreshold, AlwaysTenure, G1HeapRegionSize, ZGC) before settling on an Eden‑preheat strategy that achieves near‑zero service disruption and a 99.995% success rate.
Problem Background
A high‑traffic (up to 100k QPS), low‑latency Java service becomes unstable when swapping a ~0.5 GB in‑memory index: upstream calls time out, and the success rate drops to roughly 95% during the swap.
Root Cause Analysis
GC logs show that during index swaps the Young Generation GC (YGC) spends excessive time in the Object Copy phase, copying the large index multiple times and causing long STW pauses that block all request threads.
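A rough way to confirm that an operation such as an index swap coincides with extra GC work is to sample the JVM's standard `GarbageCollectorMXBean` counters before and after it. The sketch below is illustrative only: `GcDeltaProbe` and its methods are hypothetical names, the allocation loop is a stand-in for the real swap, and the reported time is the collector's accumulated collection time, not a per-phase breakdown. The Object Copy figure itself still has to come from the GC logs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: measures GC activity around a block of work. */
final class GcDeltaProbe {

    /** Returns {total collections, accumulated collection time in ms} over all collectors. */
    static long[] snapshot() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, timeMs};
    }

    /** Runs the task and returns {collections, ms} that occurred while it ran. */
    static long[] measure(Runnable task) {
        long[] before = snapshot();
        task.run();
        long[] after = snapshot();
        return new long[] {after[0] - before[0], after[1] - before[1]};
    }

    public static void main(String[] args) {
        long[] delta = measure(() -> {
            // Stand-in for an index swap: allocate a lot of short-lived garbage.
            for (int i = 0; i < 2_000; i++) {
                byte[] junk = new byte[1 << 20];
                junk[0] = 1;
            }
        });
        System.out.println("collections=" + delta[0] + ", gcTimeMs=" + delta[1]);
    }
}
```

The counters cover the whole JVM, so the delta also includes collections triggered by other threads during the measured window.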
Optimization Attempts
Let the index promote early to the Old Generation – Set MaxTenuringThreshold and InitialTenuringThreshold to 1. G1GC already performs direct tenuring, so this had limited effect.
Force always‑tenure – Enable AlwaysTenure to skip the Survivor space; results were similar to the above.
Direct allocation to Old – Tried PretenureSizeThreshold and G1HeapRegionSize, but G1GC ignores them for the index’s many small objects.
Accelerate copy – Tweaked MaxGCPauseMillis, ParallelGCThreads, and ConcGCThreads; no noticeable gain.
Upgrade to ZGC (JDK 11) – Reduced pause times dramatically, raising success rate to 99.5%, but occasional Allocation Stalls remained.
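When experimenting with flags like those above, it helps to check what the running JVM actually uses, since a collector may override or ignore a setting. The following is a minimal sketch, assuming a HotSpot‑based JVM that exposes `com.sun.management.HotSpotDiagnosticMXBean`; the class name `GcFlagReport` is hypothetical, and flags that the JVM does not know are reported as unavailable rather than failing.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reports the effective values of the GC flags discussed above. */
final class GcFlagReport {

    static final String[] FLAGS = {
        "MaxTenuringThreshold", "InitialTenuringThreshold", "AlwaysTenure",
        "PretenureSizeThreshold", "G1HeapRegionSize",
        "MaxGCPauseMillis", "ParallelGCThreads", "ConcGCThreads"
    };

    /** Maps each flag name to "value (origin)" or to "(not available)". */
    static Map<String, String> read() {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Map<String, String> out = new LinkedHashMap<>();
        for (String flag : FLAGS) {
            if (bean == null) {
                out.put(flag, "(not available)");
                continue;
            }
            try {
                VMOption opt = bean.getVMOption(flag);
                // Origin tells whether the value is a default or was set explicitly.
                out.put(flag, opt.getValue() + " (" + opt.getOrigin() + ")");
            } catch (IllegalArgumentException e) {
                // Thrown when this JVM build has no option with that name.
                out.put(flag, "(not available)");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        read().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Running it once with default settings and once with the tuned command line makes it easy to see which values changed and which were left at their defaults.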
Final Solution: Eden‑Preheat + Batch Gray Release
During a controlled “gray release”, while the instance is temporarily taken out of service, the switch routine allocates temporary objects to exhaust the Eden space, forcing a YGC that moves the new index to the Old Generation before traffic resumes. Subsequent YGCs are then fast (milliseconds), which eliminates the timeout spikes.
```java
public boolean switchIndex(String indexPath) {
    try {
        // 1. Load new index (service is offline)
        MyIndex newIndex = loadIndex(indexPath);
        // 2. Switch index
        this.index = newIndex;
        // 3. Eden pre-heat: allocate short-lived arrays to force a GC
        for (int i = 0; i < 10000; i++) {
            char[] tempArr = new char[524288];
        }
        // 4. Notify upstream that switch is complete
        return true;
    } catch (Exception e) {
        return false;
    }
}
```
Result
After applying the pre‑heat strategy together with the batch gray release, the service achieves a stable success rate of 99.995% with virtually no observable impact during index swaps.
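To check that a pre‑heat step really triggered a collection before traffic resumes, the allocation loop can watch the JVM's GC counters instead of relying on a fixed iteration count. This is a sketch under assumptions, not the article's implementation: `EdenPreheat`, `preheat`, the 64 KB chunk size, and the byte cap are all hypothetical choices, and "a collection was observed" is inferred from `GarbageCollectorMXBean.getCollectionCount()`.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical variant of the pre-heat loop that stops once a GC is observed. */
final class EdenPreheat {

    // Publishing each array to a volatile field keeps the allocation observable.
    static volatile Object sink;

    /** Sum of collection counts over all collectors (undefined values count as 0). */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /**
     * Allocates short-lived arrays until at least one collection is observed
     * or maxBytes have been allocated. Returns true if a collection was seen.
     */
    static boolean preheat(int chunkBytes, long maxBytes) {
        long start = totalCollections();
        long allocated = 0;
        boolean observed = false;
        while (allocated < maxBytes && !observed) {
            sink = new byte[chunkBytes];   // becomes garbage on the next iteration
            allocated += chunkBytes;
            observed = totalCollections() > start;
        }
        sink = null;                        // drop the last array as well
        return observed;
    }

    public static void main(String[] args) {
        boolean ok = preheat(64 * 1024, 8L << 30);
        System.out.println(ok ? "GC observed, safe to resume traffic"
                              : "no GC observed within the allocation cap");
    }
}
```

The chunk size is kept small on purpose: under G1, an object of at least half a region is allocated as a humongous object outside Eden, so very large arrays may not fill Eden at all. One observed collection also does not guarantee that the index has been promoted; how many collections are needed depends on the tenuring settings in use.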
Key Takeaways
Identify GC phases that cause long pauses (Object Copy).
Adjust tenuring thresholds, but recognize G1GC’s built‑in optimizations.
Consider newer collectors (ZGC) for low‑pause requirements.
When GC pauses cannot be eliminated, use operational tricks (pre‑heat, gray release) to move heavy objects to Old before traffic resumes.
DaTaobao Tech
Official account of DaTaobao Technology