
Root Cause Analysis of Nighttime Service Latency Caused by Major Page Faults in Category Client

The nightly service latency spikes were traced to major page faults triggered by the category client’s lazy mmap of a 1.7 GB off‑heap store during package switches, not to GC or CPU throttling, and were resolved by upgrading to a version that pre‑loads the store in 64 MB chunks with brief pauses.

Xianyu Technology

Background: A critical application in Xianyu depends on a rich‑client category system to provide CPV data. Every night the service experiences severe latency spikes (100 ms → 3‑5 s), RPC success rate drops (100% → ~92%), and the RPC thread pool thread count rises (50 → ~100) during the data‑package switch.

Investigation – Heap Space: The container originally ran with 4 C/8 G and a 4 G heap (‑Xms4g ‑Xmx4g …). Frequent Full GC (FGC) was observed during the spikes, which led to the hypothesis that heap usage doubled during the switch (old and new data packages coexisting in the heap). Increasing the container to 4 C/16 G with a 10 G heap eliminated FGC, but the RPC success rate still fell (100% → 97%) and the thread count still rose slightly, so heap pressure alone did not explain the latency.

Investigation – CPU Adaptive Rate Limiting: Sentinel logs showed CPU spikes (up to 415% on a 4‑core node) at the same time, suggesting auto‑throttling. However, monitoring showed only ~20% CPU usage, and disabling Sentinel throttling did not stop the latency, indicating throttling was not the root cause.

Source Code Insight: The category client loads attribute data lazily: the first request after a package switch triggers an mmap of the store file and reads it into memory. The relevant code is:

MappedByteBuffer mappedBuffer = storeFile.getFileChannel().map(MapMode.READ_ONLY, segmentPosition, segmentSize); // create mmap mapping
mappedBuffer.load(); // load data into memory
Buffer buffer = FixedByteBuffer.wrap(mappedBuffer);
for (int i = (t == 0 ? 0 : blockIdFlags[t - 1]); i < blockIdFlags[t]; i++) {
    getReadBlock(i); // pre‑create blockId → block buffer mapping
}

This loading runs on a business thread pool rather than on the Netty handler threads, and can block a business thread for several seconds.
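The lazy pattern means the first caller after a package switch pays the full load cost inline on its own thread. A minimal sketch of that pattern (the `LazyStore` class and all names below are our illustration, not the client's actual code):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical sketch of lazy loading: the first request after a switch pays the full cost. */
final class LazyStore<T> {
    private final AtomicReference<T> loaded = new AtomicReference<>();
    private final Supplier<T> loader;

    LazyStore(Supplier<T> loader) {
        this.loader = loader;
    }

    T get() {
        T value = loaded.get();
        if (value == null) {
            // Expensive: the mmap + load() equivalent runs on this business thread,
            // stalling the request that happened to arrive first after the switch.
            value = loader.get();
            if (!loaded.compareAndSet(null, value)) {
                value = loaded.get(); // another thread won the race; reuse its result
            }
        }
        return value;
    }
}
```

Every request that arrives while the load is in progress either blocks behind it or triggers a duplicate load, which is consistent with the observed rise in RPC thread-pool usage.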

Page Fault Analysis: The load operation triggers a storm of major page faults because the 1.7 GB store file is not yet in the page cache. Each major fault blocks the faulting thread while the kernel reads data from disk, and with load() walking the entire mapping these stalls add up to the observed multi-second hangs. sar and pmap measurements confirmed a surge of major page faults and a jump in RSS after the load.
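Mechanically, load() touches one byte in every page of the mapping, and each touch of a non-resident page is one major fault served from disk. A sketch of the equivalent page-touch loop (the `PageToucher` class is our illustration; the 4 KB page size is the typical Linux default, not something stated in the article):

```java
import java.nio.MappedByteBuffer;

/** Illustrative equivalent of MappedByteBuffer.load(): fault every page of a mapping into RAM. */
final class PageToucher {
    private static final int PAGE_SIZE = 4096; // typical Linux page size (assumption)

    static long touchPages(MappedByteBuffer buffer) {
        long checksum = 0;
        for (int pos = 0; pos < buffer.capacity(); pos += PAGE_SIZE) {
            checksum += buffer.get(pos); // first touch of a cold page = one major page fault
        }
        return checksum;
    }
}
```

For a 1.7 GB mapping that is over 400,000 pages; faulting them all in one uninterrupted pass is what turns the load into a multi-second stall.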

Conclusion: The latency is caused neither by GC nor by Sentinel throttling, but by the lazy loading of a large off‑heap store file, which incurs major page faults and stalls business threads for seconds.

Final Solution: Upgrade to the latest category client version, which (1) pre‑loads the store before the package switch, and (2) loads the store in 64 MB chunks with short sleeps between chunks, avoiding a single massive page‑fault‑induced pause.
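The fixed client's chunked preloading can be sketched as follows. Only the 64 MB chunk size comes from the article; the `ChunkedPreloader` class, the 10 ms pause value, and the file-handling details are our assumptions:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of the fix: pre-fault the store in 64 MB chunks with brief pauses. */
public class ChunkedPreloader {
    private static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB per chunk (from the article)
    private static final long PAUSE_MILLIS = 10;              // brief pause between chunks (illustrative)

    public static void preload(Path storeFile) throws IOException, InterruptedException {
        try (FileChannel channel = FileChannel.open(storeFile, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long position = 0; position < size; position += CHUNK_SIZE) {
                long chunkSize = Math.min(CHUNK_SIZE, size - position);
                MappedByteBuffer chunk = channel.map(FileChannel.MapMode.READ_ONLY, position, chunkSize);
                chunk.load();               // fault this chunk's pages in now, before the switch
                Thread.sleep(PAUSE_MILLIS); // yield between chunks to cap the disk I/O burst
            }
        }
    }
}
```

Because the pages are already resident when the package switch happens, the first request afterward sees no major-fault storm, and the pauses keep the warm-up from monopolizing disk bandwidth.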

Tags: debugging, JVM, performance, container, memory, page fault
Written by Xianyu Technology, the official account of the Xianyu technology team.
