
Performance Troubleshooting: JVM Safepoint, Biased Locking, Netty Worker Blocking, and Log Framework Issues

This article walks through a systematic investigation of several backend performance problems, including CMS GC delays, biased-lock revocation, Netty worker thread blocking, and log-framework overhead, and provides root-cause analysis, JVM tuning parameters, and practical recommendations for mitigating latency spikes in production services.

Sohu Tech Products

Before diving in, the author revisits a previous case (Case 1), in which long CMS GC pauses were traced to a misconfigured c3p0 connection pool, and notes that performance-troubleshooting skill is hard to acquire from any single article.

The article then lists common performance‑issue categories such as long biased‑lock revocation, Netty misuse, Redis host performance, GC‑induced timeouts, HTTP/1.1 head‑of‑line blocking, thread‑pool queuing, CPU‑time‑slice starvation, and proxy (nginx) configuration problems.

Problem Detection: Client Runtime Significantly Higher Than Server

In Case 1 the author solved a c3p0 configuration error that caused rapid old‑generation growth and long CMS GC pauses. Monitoring continued to reveal similar symptoms.

Communication between services A and B is illustrated: A calls B with HttpClient, and B serves HTTP with Netty. When B does not start processing a request immediately, several possible causes are considered:

Potential stop‑the‑world (STW) events in B.

JVM parameters to expose safepoint information:

-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=500
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCTimeStamps

Other causes, such as queuing inside Netty, HttpClient, or Hystrix.
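Putting the safepoint flags above together, a launch command might look like the following sketch (flag names apply to JDK 8; the jar name is a placeholder, not the article's actual service):

```shell
# Hypothetical launch command combining the safepoint-diagnostic flags above;
# app.jar stands in for the real service jar.
java \
  -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500 \
  -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
  -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCTimeStamps \
  -jar app.jar
```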

Problem One: Irregular Log Output Disrupts Diagnosis

Adding the safepoint logging flags revealed that only part of the output was being captured: an ops script redirected stdout to the same file that Logback managed, and Logback rolled that file over daily.

<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
  <fileNamePattern>server.%d{yyyy-MM-dd}.log</fileNamePattern>
  <maxHistory>15</maxHistory>
</rollingPolicy>

The launch script java xxx >server.log ties the process's stdout to the file descriptor of the original server.log. When Logback rolls the file over by renaming it, the open descriptor follows the renamed file, so stdout keeps writing into the rolled archive while the freshly created server.log receives nothing, which is why logs went missing.
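The rename-follows-descriptor behavior can be reproduced in a few lines; this is a minimal sketch of the failure mode (class and file names are illustrative), simulating the shell redirection with a `FileOutputStream` and the Logback rollover with a rename:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RedirectDemo {
    /** Simulates `java App > server.log` followed by a Logback time-based rollover. */
    static String demo() throws IOException {
        Path dir = Files.createTempDirectory("redirect-demo");
        Path original = dir.resolve("server.log");
        Path rolled = dir.resolve("server.2020-01-01.log");

        // The shell's redirection holds an open descriptor on server.log.
        try (FileOutputStream stdout = new FileOutputStream(original.toFile())) {
            stdout.write("before rollover\n".getBytes());

            // Logback's TimeBasedRollingPolicy renames the active file at midnight...
            Files.move(original, rolled);

            // ...but the open descriptor follows the renamed file, so this
            // write lands in the rolled archive, not in a new server.log.
            stdout.write("after rollover\n".getBytes());
        }
        return Files.readString(rolled);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

Running it shows both lines in the renamed file: once the descriptor is open, the path the shell originally pointed at no longer matters.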

Recommendations:

Avoid stdout logging in production.

Do not configure logging frameworks to output to stdout.

If stdout must be used, write to a uniquely timestamped file or append‑only log.
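One way to follow the last recommendation is to stamp the redirect target at launch time, so a later rename by the logging framework cannot silently orphan the stream (the jar name below is a placeholder):

```shell
# Give stdout its own uniquely named file per process start;
# app.jar stands in for the actual service jar.
LOG_FILE="server.$(date +%Y-%m-%d_%H%M%S).log"
java -jar app.jar > "$LOG_FILE" 2>&1
```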

Problem Two: Biased‑Lock Revocation Causing Timeouts

GC logs showed a 1.5 s stop‑the‑world pause; safepoint logs displayed a RevokeBias operation lasting 1542 ms.

Because the cost of revoking a biased lock can be high, the author disabled biased locking with the JVM flag:

-XX:-UseBiasedLocking

Problem Three: Netty Worker Thread Blocking

Even after disabling biased locking, timeouts persisted. Investigation revealed that Netty’s workerGroup threads sometimes processed blocking business logic, leading to request queuing.

Example code for calculating worker thread count:

private int workerThreadSize = Runtime.getRuntime().availableProcessors() * 2 - 1;

With 48 logical cores, the pool size becomes 95, yet blocking operations in the worker threads still cause latency. The author suggests separating business logic into its own asynchronous thread pool.
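The suggested separation can be sketched with a plain `ExecutorService` (pool size, class, and method names here are illustrative, not the article's actual code): the I/O worker only dispatches, and the blocking call runs on a dedicated business pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffloadDemo {
    // Dedicated pool for blocking business logic, sized independently
    // of the Netty worker (I/O) threads.
    static final ExecutorService businessPool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 2);

    // Called from an I/O worker thread: returns immediately; the blocking
    // work (simulated here by a sleep) runs on the business pool instead.
    static CompletableFuture<String> handle(String request) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(100); // stand-in for a blocking DB or RPC call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "response:" + request;
        }, businessPool);
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        CompletableFuture<String> f = handle("req-1");
        long dispatchMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("dispatch took " + dispatchMs + " ms; result = " + f.get());
        businessPool.shutdown();
    }
}
```

Within Netty itself, a similar effect can be had by registering the blocking handler on a separate `EventExecutorGroup` via `pipeline.addLast(group, handler)`, which keeps the event-loop threads free for I/O.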

Further analysis of request traces showed that an nginx proxy added additional latency, and limited visibility into nginx logs (four nodes, no ELK) hampered debugging.

Problem Four: Log‑Framework Performance Overhead

At midnight, runtime spikes were observed. The root cause was Logback's daily rolling policy, which renames the active file and deletes expired logs synchronously at rollover. Deleting a large backlog (roughly 100 GB) of old log files caused multi-second pauses.

Upgrading to Logback 1.2.3, which deletes expired archives asynchronously, resolved the issue.
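Alongside the upgrade, it can help to cap how much history accumulates in the first place. This is a sketch of the rolling policy shown earlier, extended with Logback's `totalSizeCap` and `cleanHistoryOnStart` options (the cap value is illustrative):

```xml
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
  <fileNamePattern>server.%d{yyyy-MM-dd}.log</fileNamePattern>
  <maxHistory>15</maxHistory>
  <!-- Delete the oldest archives once the total exceeds the cap,
       so no single midnight cleanup has ~100 GB to remove. -->
  <totalSizeCap>20GB</totalSizeCap>
  <!-- Also run history cleanup at startup, not only at rollover. -->
  <cleanHistoryOnStart>true</cleanHistoryOnStart>
</rollingPolicy>
```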

Summary

The article reviews four representative performance problems (CMS GC delays, biased-lock revocation, Netty worker thread blocking, and log-framework overhead), explaining how each can produce large gaps between client-side and server-side runtimes, and emphasizes the importance of distinguishing OS-level TCP buffering from application-level blocking. It also highlights how difficult log-framework bottlenecks are to diagnose and recommends profiling tools such as SkyWalking to capture thread snapshots without manual instrumentation.

Tags: backend, JVM, performance, Netty, safepoint, logging, troubleshooting
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
