Optimizing Log4j2 Asynchronous Logging: Configuration, Diagnosis, and Load‑Testing
This article presents a detailed case study of a severe service outage caused by Log4j2 asynchronous logging bottlenecks, explains step‑by‑step diagnostics of JVM, disk, and RingBuffer metrics, and demonstrates how adjusting log4j2.asyncQueueFullPolicy and log4j2.discardThreshold dramatically improves recovery time during load testing.
The incident began on 2023‑12‑15 when a critical RPC service’s availability dropped from 100% to 0.72% after a sudden surge of error logs; rapid investigation traced the issue to the dependent system’s failure and subsequent logging overload.
Initial checks of GC activity showed increased Young GC counts but no Full GC, while CPU and thread usage remained within acceptable limits. Disk I/O metrics appeared normal, though brief spikes in usage were observed.
Deep analysis of the client’s JVM memory dump revealed that many threads were blocked in the Log4j2 enqueue method and that the RingBuffer (≈1.61 GB) was full of WARN and ERROR events. Once the RingBuffer is full, AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull makes producer threads take a lock before enqueueing, and that lock contention turned parallel processing into a serial bottleneck.
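The same symptom can be checked on a live JVM without a full memory dump. The sketch below is a minimal illustration using only the JDK's ThreadMXBean; the class name EnqueueStallProbe is hypothetical, and the method-name fragment "enqueue" is an assumption taken from the wording above, so adjust it to the frames that actually appear in your own thread dumps.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: counts live threads that currently have a stack frame
// whose class name starts with a prefix and whose method name contains a fragment.
final class EnqueueStallProbe {

    static int countThreadsIn(String classPrefix, String methodFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // false, false: skip monitor and synchronizer details, keep stack traces
        ThreadInfo[] infos = mx.dumpAllThreads(false, false);
        int count = 0;
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // thread terminated while the dump was taken
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().startsWith(classPrefix)
                        && frame.getMethodName().contains(methodFragment)) {
                    count++;
                    break; // count each thread at most once
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed filter: any Log4j2 frame whose method name mentions "enqueue".
        int stalled = countThreadsIn("org.apache.logging.log4j", "enqueue");
        System.out.println("threads inside a Log4j2 enqueue frame: " + stalled);
    }
}
```

A count that stays high across several samples suggests producers are queuing up behind the RingBuffer rather than doing business work.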
Key Log4j2 settings identified were:
log4j2.asyncQueueFullPolicy=Discard – discard logs when the queue is full instead of blocking.
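log4j2.discardThreshold=ERROR – when the queue is full, discard events at ERROR level or less severe (ERROR, WARN, INFO, DEBUG, TRACE); more severe events such as FATAL are still enqueued.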
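How the threshold decides what gets dropped can be sketched as a small routing function. This is a simplified model written for illustration, not Log4j2's own code: the names DiscardThresholdModel, Level, and Route are hypothetical, and the model ignores the special handling Log4j2 applies when the logging call comes from the background thread itself.

```java
// Simplified, hypothetical model of a discard policy with a level threshold.
final class DiscardThresholdModel {

    // Declared from most severe to least severe, so a larger ordinal
    // means a less severe event.
    enum Level { FATAL, ERROR, WARN, INFO, DEBUG, TRACE }

    enum Route { ENQUEUE, DISCARD }

    static Route route(Level event, Level threshold, boolean queueFull) {
        // The policy only matters once the queue is full.
        if (queueFull && event.ordinal() >= threshold.ordinal()) {
            return Route.DISCARD; // at the threshold or less severe: drop
        }
        return Route.ENQUEUE;     // otherwise the producer still waits for space
    }

    public static void main(String[] args) {
        for (Level threshold : new Level[] { Level.INFO, Level.ERROR }) {
            for (Level event : Level.values()) {
                System.out.println("threshold=" + threshold + " event=" + event
                        + " -> " + route(event, threshold, true));
            }
        }
    }
}
```

In this model a threshold of INFO still enqueues WARN and ERROR events, which are exactly the levels that filled the RingBuffer during the incident; a threshold of ERROR drops them and leaves only more severe events waiting.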
Load‑testing verified the impact of different log4j2.discardThreshold values:
# Example configuration
log4j2.discardThreshold=INFO
With INFO, the client required a restart to recover availability. Setting the threshold to WARN reduced recovery time to about 8 minutes, while ERROR or FATAL restored full availability within 2 minutes without manual intervention.
The article also provides a concise overview of Log4j2’s asynchronous logging architecture using Disruptor’s RingBuffer, and a table of recommended configuration parameters (e.g., RingBuffer size, wait strategies, retry counts).
In conclusion, three measures keep logging from becoming a system bottleneck under extreme load: combining log4j2.asyncQueueFullPolicy=Discard with log4j2.discardThreshold=ERROR, avoiding blocking appenders such as KafkaAppender in production, and disabling immediate flush so that writes are batched.
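One way to apply the two properties is to set them as JVM system properties before Log4j2 initializes, that is, before the first logger is created. The sketch below assumes that approach and uses only JDK calls; the class name AsyncLoggingDefaults is hypothetical, and passing the same keys as -D flags or in a log4j2.component.properties file would work equally well.

```java
// Hypothetical bootstrap helper: call apply() at the very start of main(),
// before any class creates a Log4j2 logger.
final class AsyncLoggingDefaults {

    static void apply() {
        setIfAbsent("log4j2.asyncQueueFullPolicy", "Discard");
        setIfAbsent("log4j2.discardThreshold", "ERROR");
    }

    // Keeps any value already supplied on the command line with -D.
    private static void setIfAbsent(String key, String value) {
        if (System.getProperty(key) == null) {
            System.setProperty(key, value);
        }
    }

    public static void main(String[] args) {
        apply();
        System.out.println(System.getProperty("log4j2.asyncQueueFullPolicy"));
        System.out.println(System.getProperty("log4j2.discardThreshold"));
    }
}
```

Setting the values only when absent keeps an operator's explicit -D override in effect, which helps when a threshold needs tuning during a load test without a rebuild.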
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.