
Diagnosing and Resolving Random JVM Hang Issues in a High-Concurrency Application

The article outlines a six-step method for diagnosing sporadic JVM hangs in a high-concurrency Xianyu service: code review, live state capture, I/O checks, lock analysis, resource-exhaustion assessment, and finally framework thread-pool tuning. The investigation uncovers lost-lock behavior and a severe thread-pool imbalance that together cause prolonged lock waits despite low CPU load.

Xianyu Technology

Background: A core application of Xianyu occasionally experiences JVM instances that become suspended, with many threads waiting for a lock that no thread holds. The affected machines show low CPU load but a large number of threads.

The problem was hard to pin down: it occurred randomly across the cluster at unpredictable times and low frequency (roughly once every 1-2 days), with complex symptoms (high thread count, low load, threads waiting on locks), and it was difficult to reproduce or capture in the act.

Solution: A six‑step systematic approach is proposed:

Step 1 – Code bug investigation: examine business logic and logs for obvious errors.

Step 2 – On-site capture: monitor JVM state in real time using agents (C-level or Java agents) or JMX, noting the limitations of intrusive monitoring.
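As an illustration of the JMX-based capture route, the sketch below (our example, not the article's code) uses the standard `ThreadMXBean` to dump every live thread with its state, the lock it is blocked on, and the lock's owner, which is exactly the information needed to spot "threads waiting for a lock that no thread holds":

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadStateSampler {
    /** Returns a human-readable dump of all live threads and their lock state. */
    public static String dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // lockedMonitors/lockedSynchronizers = true also records what each
        // thread currently holds, so a lock with no owner stands out.
        for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
            sb.append(ti.getThreadName())
              .append(" state=").append(ti.getThreadState());
            if (ti.getLockName() != null) {
                sb.append(" waitingOn=").append(ti.getLockName())
                  .append(" heldBy=").append(ti.getLockOwnerName()); // null => no owner
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpThreads());
    }
}
```

Sampling this periodically from a background thread is far less intrusive than attaching a native agent, at the cost of a brief safepoint per dump.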

Step 3 – IO hang check: verify I/O performance via container metrics and flame graphs.

Step 4 – Lock analysis: differentiate deadlock (unlikely) from “lost lock” related to coroutines; examine stack files.
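Deadlock can be ruled in or out programmatically before digging into the subtler "lost lock" case; a minimal check using the JDK's built-in deadlock detector (our sketch, not the article's code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    /** Returns the names of deadlocked threads, or an empty array if none. */
    public static String[] findDeadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no cycle exists
        if (ids == null) return new String[0];
        ThreadInfo[] infos = mx.getThreadInfo(ids);
        String[] names = new String[infos.length];
        for (int i = 0; i < infos.length; i++) names[i] = infos[i].getThreadName();
        return names;
    }

    public static void main(String[] args) {
        String[] names = findDeadlockedThreadNames();
        System.out.println(names.length == 0 ? "no deadlock" : String.join(",", names));
    }
}
```

A "lost lock", by contrast, shows threads blocked on a monitor whose owner field is empty, which this detector will not flag, so an empty result here does not end the investigation.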

Step 5 – Resource exhaustion: assess software resources (heap, metaspace) and hardware resources (CPU, memory). Memory pressure caused threads to spill to disk, leading to prolonged lock acquisition.
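Software-resource pressure (heap, metaspace) can be sampled from inside the JVM via the standard `MemoryPoolMXBean` API; the sketch below (illustrative, not from the article) reports how full a named pool is:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MemoryPressure {
    /** Returns used/max ratio (0..1) for a named pool, or -1 when max is undefined. */
    public static double usedRatio(String poolNameSubstring) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains(poolNameSubstring)) {
                MemoryUsage u = pool.getUsage();
                return u.getMax() > 0 ? (double) u.getUsed() / u.getMax() : -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "Metaspace" is a standard HotSpot pool name; its max is -1 (unbounded)
        // unless -XX:MaxMetaspaceSize is set. Heap pool names vary by collector.
        System.out.printf("Metaspace used ratio: %.2f%n", usedRatio("Metaspace"));
    }
}
```

Hardware pressure (the memory-to-disk spill the article describes) is not visible through this API and needs container or OS metrics.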

Step 6 – Framework source review: analyze thread‑pool configurations in HSF, Netty, Mtop, etc., discovering an imbalanced Provider‑to‑Consumer thread‑pool ratio (≈200:1) that amplifies loop‑call effects.
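The amplification effect of an imbalanced provider-to-consumer ratio can be reproduced with two plain `java.util.concurrent` pools; the sizes below are hypothetical and only mirror the ~200:1 shape described above, not any framework's actual configuration:

```java
import java.util.concurrent.*;

public class PoolImbalanceDemo {
    /**
     * Submits `providers` loop calls from a large provider pool into a
     * single-threaded consumer pool and reports how many are still queued
     * after `sampleMillis`. With 200 providers and 1 consumer, nearly all
     * provider threads end up blocked behind the bottleneck.
     */
    public static int queuedAfter(int providers, long sampleMillis) throws Exception {
        ExecutorService provider = Executors.newFixedThreadPool(providers);
        ThreadPoolExecutor consumer =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(1); // the bottleneck
        for (int i = 0; i < providers; i++) {
            provider.submit(() -> {
                try {
                    // Loop call: the provider thread blocks until the single
                    // consumer thread gets around to serving it.
                    consumer.submit(() -> {
                        try { Thread.sleep(5); } catch (InterruptedException ignored) {}
                    }).get();
                } catch (Exception ignored) {}
            });
        }
        Thread.sleep(sampleMillis);
        int queued = consumer.getQueue().size();
        provider.shutdownNow();
        consumer.shutdownNow();
        return queued;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("queued behind consumer: " + queuedAfter(200, 100));
    }
}
```

Even this toy version shows the backlog growing far faster than the consumer side can drain it, which is the loop-call degradation the article's load test confirmed at scale.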

Experimental verification using Alibaba's PTS load-testing platform showed severe latency (average RT ≈ 2,500 ms) and low TPS, confirming that even modest traffic can trigger the loop-call degradation.

Conclusion: When JVM stalls occur, start with business‑code review, then capture live state, and investigate IO, locks, and resource limits. Framework‑level thread‑pool tuning is essential to prevent cascading failures.

Key code snippets used in lock handling. First, Log4j 1.x's Category.callAppenders, which synchronizes on every category in the hierarchy while appending:

public void callAppenders(LoggingEvent event) {
  int writes = 0;
  for (Category c = this; c != null; c = c.parent) {
    // Protected against simultaneous call to addAppender, removeAppender,...
    synchronized (c) {
      if (c.aai != null) {
        writes += c.aai.appendLoopOnAppenders(event);
      }
      if (!c.additive) {
        break;
      }
    }
  }
  if (writes == 0) {
    repository.emitNoAppenderWarning(this);
  }
}
Second, the HotSpot ObjectMonitor exit path (C++) that moves waiters from the cxq onto the EntryList:

for (p = w; p != NULL; p = p->_next) {
  guarantee(p->TState == ObjectWaiter::TS_CXQ, "Invariant");
  p->TState = ObjectWaiter::TS_ENTER;
  p->_prev = q;
  q = p;
}
if (_EntryList != NULL) {
  q->_next = _EntryList;
  _EntryList->_prev = q;
}
Tags: debugging, JVM, lock analysis, performance, resource exhaustion, Thread Dump
Written by Xianyu Technology, the official account of the Xianyu technology team.