Root Cause Analysis of Memory Leak and High Latency in a Netty‑Based Real‑Time Risk Control System Using JDK 17 ZGC
This article investigates the severe memory growth and latency spikes observed when synchronizing data across data centers in a Netty‑driven online computation service, analyzes the impact of JDK 17 ZGC and direct‑buffer allocation, and presents the debugging steps, source‑code insights, and configuration changes that ultimately resolved the issue.
The Tianwang risk‑control system is an in‑memory, high‑throughput online computation service built on Netty for TCP communication between client and server. Initial optimizations achieved throughput above 200,000 QPS per core, but under a cross‑data‑center test the server exhibited rapid memory growth, persistent 20% CPU usage, and frequent GC pauses.
To address the latency problem, the team upgraded to JDK 11+ ZGC and later JDK 17, noting that ZGC can reduce pause times to sub‑millisecond levels. However, the issue persisted, prompting a deeper investigation.
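The upgrade itself is a launcher‑flag change. A minimal sketch of the relevant JVM options (heap size and application jar are placeholders, not values from the article):

```shell
# JDK 17: ZGC is production-ready and enabled with a single flag
java -XX:+UseZGC -Xmx16g -Xlog:gc* -jar app.jar

# JDK 11: ZGC was still experimental and must be unlocked first
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx16g -jar app.jar
```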
Investigation Steps
Enabled Netty leak detection (PARANOID) – no leak logs were produced.
Observed that WriteTask objects accumulated in Netty's MpscUnboundedArrayQueue, causing memory bloat.
Compared JDK 8 and JDK 17 behavior; the problem disappeared on JDK 8.
Debug logs revealed a critical message about the direct‑buffer constructor being unavailable:
```
[2023-08-23 11:16:16.163] DEBUG [] - io.netty.util.internal.PlatformDependent0 - direct buffer constructor: unavailable: Reflective setAccessible(true) disabled
```

Source‑code analysis showed that Netty allocates direct memory via PooledByteBufAllocator. When PlatformDependent.useDirectBufferNoCleaner() returns false (the default on JDK 17 without special JVM flags), Netty falls back to ByteBuffer.allocateDirect, which can trigger a synchronous System.gc() and block EventLoop threads when direct memory is exhausted.
Key code excerpts:

```java
// PoolArena.DirectArena
protected PoolChunk<ByteBuffer> newChunk(...) {
    // critical code
    ByteBuffer memory = allocateDirect(chunkSize);
    ...
}

// PoolArena.DirectArena#allocateDirect
private static ByteBuffer allocateDirect(int capacity) {
    return PlatformDependent.useDirectBufferNoCleaner()
            ? PlatformDependent.allocateDirectNoCleaner(capacity)
            : ByteBuffer.allocateDirect(capacity);
}

// PlatformDependent (static initializer)
if (maxDirectMemory == 0 || !hasUnsafe() || !PlatformDependent0.hasDirectBufferNoCleanerConstructor()) {
    USE_DIRECT_BUFFER_NO_CLEANER = false;
} else {
    USE_DIRECT_BUFFER_NO_CLEANER = true;
}
```

On JDK 9+ the private constructor java.nio.DirectByteBuffer(long, int) is only reflectively accessible when the JVM is started with -Dio.netty.tryReflectionSetAccessible=true and the java.nio package is opened via --add-opens=java.base/java.nio=ALL-UNNAMED. Without these flags the constructor is unavailable, so Netty takes the problematic ByteBuffer.allocateDirect allocation path.
The root cause was identified as the EventLoop thread blocking inside direct‑memory allocation while executing a WriteTask, so subsequent WriteTasks piled up unexecuted, leaving a backlog of unflushed entries and massive memory consumption.
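The failure mode can be shown in miniature with plain JDK classes: a single‑threaded pool stands in for the EventLoop, and a task that never completes stands in for a write stuck in direct‑memory allocation (all names here are illustrative, not Netty's):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class EventLoopBacklogDemo {

    /** Blocks the single worker thread, then counts tasks stuck in the queue. */
    public static int backlogAfterBlocking(int producers) {
        // One thread + unbounded queue, like an EventLoop with an MPSC task queue.
        ThreadPoolExecutor loop = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch running = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Stand-in for a write blocked inside direct-memory allocation.
            loop.submit(() -> {
                running.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
            });
            running.await(); // ensure the "blocked write" occupies the only thread
            // Every further "WriteTask" just accumulates in the queue.
            for (int i = 0; i < producers; i++) {
                loop.submit(() -> { });
            }
            return loop.getQueue().size();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            loop.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("queued tasks while the loop is blocked: "
                + backlogAfterBlocking(100));
    }
}
```

With the single thread stalled, every submitted task is retained in the queue, which is exactly how unflushed WriteTasks inflate memory.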
Solution steps included:
Adding a connection pool and random channel selection for cross‑data‑center sync to increase parallelism.
Enabling the required JVM flags (-Dio.netty.tryReflectionSetAccessible=true and --add-opens=java.base/java.nio=ALL-UNNAMED) so Netty can use allocateDirectNoCleaner instead of the blocking ByteBuffer.allocateDirect path.
Monitoring non‑heap memory accurately and adding write‑and‑flush error listeners to detect OutOfMemoryError early.
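On the monitoring point, one subtlety is worth a sketch: the JDK's "direct" BufferPoolMXBean only tracks cleaner‑managed buffers, so memory allocated through the no‑cleaner path (Unsafe‑based) does not appear there, which is why dashboards must be aligned with the actual allocation path. A minimal pure‑JDK reading (class name is mine):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectMemoryMonitor {

    /** Returns bytes used by the JVM's "direct" buffer pool (cleaner-managed only). */
    public static long directPoolUsed() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1; // pool not found (should not happen on HotSpot)
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MiB via the cleaner path
        System.out.println("direct pool used: " + directPoolUsed()
                + " bytes (capacity of our buffer: " + buf.capacity() + ")");
    }
}
```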
Additional reflections highlighted the importance of proper back‑pressure handling (low/high watermarks) and the need to align monitoring metrics with actual direct‑memory usage.
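For the watermark point, a configuration sketch using Netty 4.x's real WriteBufferWaterMark API (the threshold values are illustrative, not from the article; this fragment assumes an existing channel and msg):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

// Bound the per-channel outbound buffer: once pending bytes exceed the high
// watermark, channel.isWritable() flips to false until they drain below the low one.
ServerBootstrap b = new ServerBootstrap();
b.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(32 * 1024, 64 * 1024)); // low = 32 KiB, high = 64 KiB

// Producers should consult writability instead of writing blindly:
if (channel.isWritable()) {
    channel.writeAndFlush(msg);
} else {
    // back-pressure: pause the upstream producer, or fail fast on a bounded queue
}
```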
Overall, the case demonstrates how JVM version differences, Netty’s memory allocation strategy, and missing JVM options can combine to produce severe latency and memory‑leak‑like symptoms in high‑throughput backend services.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.