
Full GC Diagnosis and Tuning for Xianyu Backend Services

The article details three Xianyu backend incidents where Full GC pauses caused latency spikes or outages, analyzes root causes ranging from survivor space shortage and mis‑tuned CMS to oversized async Log4j events and massive string‑concatenation logs, and presents remediation steps such as switching to G1GC, adjusting Log4j settings, and using parameterized logging to eliminate the pauses.

Xianyu Technology

The Xianyu backend heavily relies on the Java stack and the JVM's managed heap. While automatic memory management simplifies development, garbage collection (GC) can introduce pauses, especially Full GC (FGC), which stops all application threads and may cause service outages.

This article presents three real‑world FGC cases from Xianyu’s backend services, describing the symptoms, analysis process, root causes, and remediation steps.

Case 1: Product Domain Core Application

Phenomenon: Occasional FGCs during normal traffic caused latency spikes; during deployments, FGC frequency increased dramatically on machines not in the deployment batch.

Analysis: Monitoring a problematic node revealed that the JVM used the ParNew + CMS collectors. When Old Gen usage reached ~1.7 GB, a CMS Old Gen GC was triggered (CMSInitiatingOccupancyFraction=80). Survivor space was insufficient, so the allocation guarantee (担保分配) kicked in and objects were promoted directly to Old Gen.
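The collector setup described above can be reconstructed as JDK 8 HotSpot flags (the collector choice and the 80% occupancy trigger come from the analysis; the jar name and the explicit flag spelling are illustrative assumptions):

```
# Reconstruction of the pre-tuning CMS setup (JDK 8; UseParNewGC was removed
# in later JDKs). Only CMSInitiatingOccupancyFraction=80 is from the article.
java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:CMSInitiatingOccupancyFraction=80 \
     -jar app.jar
```

Note the arithmetic is consistent with the observation: an Old Gen of roughly 2.1 GB would cross the 80% trigger at about 1.7 GB (2.1 × 0.8 ≈ 1.7), which matches the occupancy at which the CMS cycles started.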

GC logs showed a rise of ~28 MB in Old Gen after a Young GC, indicating survivor overflow. Heap dump analysis identified four large objects occupying ~687 MB of Old Gen, leaving little headroom for CMS.

Root Cause: The Survivor space shortage triggered the allocation guarantee, pushing objects into Old Gen, which quickly reached the CMS occupancy threshold and resulted in frequent CMS Old Gen GCs.

Solution: Switch to G1GC with G1NewSizePercent=30, InitiatingHeapOccupancyPercent=60, MaxGCPauseMillis=120, and a region size of 4 MB. G1’s mixed GC can reclaim promoted objects incrementally, reducing pause times.
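Spelled out as standard HotSpot options, the tuning above looks roughly like this (the heap size and jar name are illustrative assumptions; note that G1NewSizePercent is an experimental flag and must be unlocked):

```
# Sketch of the G1 settings from the solution; verify against the JDK in use
java -XX:+UseG1GC \
     -XX:+UnlockExperimentalVMOptions \
     -XX:G1NewSizePercent=30 \
     -XX:InitiatingHeapOccupancyPercent=60 \
     -XX:MaxGCPauseMillis=120 \
     -XX:G1HeapRegionSize=4m \
     -jar app.jar
```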

Result: Deployment jitter disappeared, FGCs vanished, and total GC time dropped by nearly 50%.

Case 2: Home Page Application

Phenomenon: During each deployment, a small subset of machines entered FGC and required manual restart. Normal operation showed no FGC.

Analysis: Monitoring showed Old Gen usage climbing after deployment, leading to prolonged FGC pauses. A heap dump revealed massive Log4j RingBufferLogEvent objects (up to 1.4 MB each) filling the async logging queue.

Log4j 2.8.2 uses a synchronous AsyncQueueFullPolicy, causing the logging thread to block when the ring buffer is full. Additionally, each log event copies the message multiple times, inflating memory usage.

The condition in Log4j’s Constants that enables thread‑local event reuse (which drives the extra message copies):

!IS_WEB_APP && PropertiesUtil.getProperties().getBooleanProperty("log4j2.enable.threadlocals", true)

Further code in Log4j’s PatternLayout performs placeholder substitution, adding extra overhead:

public void format(final LogEvent event, final StringBuilder toAppendTo) { ... }

Root Cause: Large asynchronous log events saturated the RingBuffer, driving YGC pressure and eventually a stop‑the‑world FGC; the FGC in turn paused the Log4j consumer thread, creating a vicious cycle.

Solution: Limit the RingBuffer to 2048 slots. Extend the cold‑JVM warm‑up period. Upgrade Log4j to 2.14.1, which uses a blocking policy instead of the synchronous fallback. Disable placeholder substitution where it is not needed.
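One way to express the ring‑buffer cap is via Log4j system properties (the property name follows the Log4j 2.10+ naming scheme; verify it against the version actually deployed, and note the Discard line is an assumption, not part of the incident fix):

```
# log4j2.component.properties -- sketch, property names per Log4j 2.10+
# cap the async-logger ring buffer at 2048 slots (default is 256Ki)
log4j2.asyncLoggerRingBufferSize=2048

# optional alternative (assumption, not from this incident): shed events
# instead of stalling callers when the buffer fills
log4j2.asyncQueueFullPolicy=Discard
```

For placeholder substitution, PatternLayout supports the %m{nolookups} message option in the 2.x versions discussed here, which turns off ${...} lookups inside log messages.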

Result: After applying the fixes, no FGC occurred during deployments.

Case 3: Player Business Application

Phenomenon: The application suddenly began experiencing frequent FGCs despite stable traffic and normal YGC rates.

Analysis: GC logs showed a G1 “to‑space exhausted” event, indicating an evacuation failure: the Survivor and free regions could not hold all surviving objects. Subsequent Old Gen usage spikes suggested allocation of humongous objects (larger than half a G1 region, here >16 MB).
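The humongous‑object threshold can be checked with a few lines of arithmetic. This sketch assumes a 32 MB region size, which is the assumption implied by the >16 MB threshold above (the class and method names are illustrative, not from the article):

```java
// Sketch: G1 treats any allocation of at least half a region as "humongous"
// and places it directly in dedicated humongous regions, not in Eden.
public class HumongousCheck {
    // assumed region size, consistent with the >16 MB threshold in the text
    static final long REGION_BYTES = 32L * 1024 * 1024;

    static boolean isHumongous(long objectBytes) {
        return objectBytes >= REGION_BYTES / 2;
    }

    public static void main(String[] args) {
        // the char[] behind a ~10M-character string: ~20 MB of payload
        long bigCharArray = 10_000_000L * 2;
        System.out.println(isHumongous(bigCharArray)); // prints true
    }
}
```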

Investigation traced the large allocation to a log statement that concatenated a massive MapWrapper object using the ‘+’ operator, which invoked Lombok‑generated toString() and created a huge temporary string. This resulted in several large char[] allocations in Old Gen.

// type definition
@Data
public class MapWrapper {
    public Map manyEntryMap;
}

MapWrapper instance = xxx;
log.info("print some log with plus concatenation! obj: " + instance);
// Correct way:
// log.info("print some log with plus concatenation! obj: {}", instance);

Root Cause: Logging large objects via string concatenation created massive temporary objects that were promoted to Old Gen, triggering FGC.

Solution: Replace concatenation with parameterized logging (using {} placeholders) and eliminate unnecessary large‑object logs.
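The mechanism behind this fix can be sketched in plain Java. The `info` method below is a hypothetical stand‑in for a parameterized logger call, not the real SLF4J/Log4j API; the point is that with ‘+’ the argument’s toString() runs before the log call every time, while a placeholder lets the logger skip rendering entirely when the level is disabled:

```java
import java.util.HashMap;
import java.util.Map;

public class LazyLogDemo {
    static int toStringCalls = 0;

    static class MapWrapper {
        Map<String, String> manyEntryMap = new HashMap<>();
        @Override
        public String toString() {
            toStringCalls++;                // count expensive renders
            return manyEntryMap.toString(); // potentially a huge String
        }
    }

    // hypothetical stand-in for log.info(pattern, arg)
    static void info(boolean levelEnabled, String pattern, Object arg) {
        if (!levelEnabled) {
            return;                         // arg.toString() never runs
        }
        System.out.println(pattern.replace("{}", String.valueOf(arg)));
    }

    public static void main(String[] args) {
        MapWrapper w = new MapWrapper();

        // '+' concatenation: renders the whole object eagerly, always
        String eager = "obj: " + w;         // toStringCalls -> 1

        // parameterized style with the level disabled: nothing is rendered
        info(false, "obj: {}", w);          // toStringCalls stays 1

        System.out.println(toStringCalls);  // prints 1
    }
}
```

Even when the level is enabled, the parameterized form avoids building the intermediate concatenated String before the call, which is what produced the huge char[] allocations in this case.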

Result: After the code changes, the FGCs disappeared.

Summary

The three cases illustrate that FGC can stem from JVM tuning, middleware configuration, or application code. Proper monitoring, heap dump analysis, and understanding of GC algorithms are essential for rapid diagnosis and remediation. Balancing latency and throughput remains the core challenge of GC tuning.

Tags: backend, Java, JVM, Garbage Collection, Log4j, Full GC, Performance Tuning
Written by Xianyu Technology, the official account of the Xianyu technology team.