Investigation of JSF Thread‑Pool Exhaustion During R2M Redis Upgrade
This article details a step‑by‑step investigation of a JSF thread‑pool exhaustion error that occurred when upgrading the Redis version of JD's internal R2M distributed cache, analyzing stack traces, lock contention, ForkJoinPool behavior, and the eventual remediation steps.
The issue originated when upgrading the Redis version of JD's internal distributed cache service R2M, where a few nodes began reporting the error RpcException: [Biz thread pool of provider has been exhausted]. Monitoring showed the problem was isolated to one or two nodes, prompting an immediate shutdown of the affected nodes via the JSF framework.
Log excerpts captured the error:
24-03-13 02:21:20.188 [JSF-SEV-WORKER-57-T-5] ERROR BaseServerHandler - handlerRequest error msg:[JSF-23003] Biz thread pool of provider has been exhausted, the server port is 22003
24-03-13 02:21:20.658 [JSF-SEV-WORKER-57-T-5] WARN BusinessPool - [JSF-23002] Task:com.alibaba.ttl.TtlRunnable - com.jd.jsf.gd.server.JSFTask@0 has been reject for ThreadPool exhausted! pool:80, active:80, queue:300, taskcnt: 1067777
Initial analysis suggested that the JSF business thread pool, sized statically at service start (here 80 threads with a queue of 300), became saturated when incoming traffic exceeded the available threads, leaving no thread to handle new requests.
Further investigation using SGM stack dumps and an online thread‑dump analyzer revealed that many JSF threads were blocked inside JedisClusterInfoCache#getSlaveOfSlotFromDc, which acquires a read lock at method entry. The read lock is paired with a write lock that is held by a periodic topology‑update task.
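The blocking pattern can be sketched with a plain ReentrantReadWriteLock. The class and method names below echo the Jedis internals described above, but the bodies are simplified assumptions, not the real Jedis source:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of the JedisClusterInfoCache locking pattern described
// above; field values and method bodies are illustrative assumptions.
class ClusterInfoCacheSketch {
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
    private volatile String slotOwner = "node-a";

    // Worker threads: take the read lock with no timeout, so they block
    // for as long as any writer holds the write lock.
    String getSlaveOfSlot(int slot) {
        rwl.readLock().lock(); // blocks indefinitely while a writer is active
        try {
            return slotOwner;
        } finally {
            rwl.readLock().unlock();
        }
    }

    // Periodic topology update: holds the write lock for the whole refresh,
    // so every reader in the JVM waits until it finishes.
    void renewSlotCache() {
        rwl.writeLock().lock();
        try {
            slotOwner = "node-b"; // stand-in for the real topology refresh
        } finally {
            rwl.writeLock().unlock();
        }
    }
}
```

If the writer thread stalls inside renewSlotCache before reaching unlock, every caller of getSlaveOfSlot parks forever, which matches the thread dumps described above.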
In normal operation the topology‑update task acquires the write lock, refreshes the Redis topology, and releases the lock. In this incident, however, the task stalled mid‑refresh and never reached the release, and because the worker threads acquired the read lock without a timeout, they remained blocked waiting for it indefinitely.
Additional analysis showed that the application relied on parallelStream().forEach and Caffeine's asynchronous refresh, both of which run on ForkJoinPool.commonPool() by default. The common pool's default size (CPU cores − 1) was too small for the workload: its few shared workers blocked on the same Redis lock, and the topology updater's own work queued behind them, producing dead‑lock‑like behavior.
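The shared-pool hazard is easy to demonstrate: both parallel streams and Caffeine's default async refresh use ForkJoinPool.commonPool(), whose default parallelism is one less than the number of available processors. A minimal check:

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CommonPoolDemo {
    public static void main(String[] args) {
        // Default parallelism is Runtime.getRuntime().availableProcessors() - 1
        // (minimum 1), unless overridden with
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
        int parallelism = ForkJoinPool.commonPool().getParallelism();
        System.out.println("commonPool parallelism = " + parallelism);

        // parallelStream() work runs on the common pool by default, so any
        // task that blocks (e.g. waiting on the Redis read lock) ties up one
        // of these few workers for every common-pool user in the whole JVM.
        List<String> threadNames = IntStream.range(0, 4)
                .parallel()
                .mapToObj(i -> Thread.currentThread().getName())
                .distinct()
                .collect(Collectors.toList());
        System.out.println(threadNames);
    }
}
```

On a small container with few cores, the printed parallelism is correspondingly tiny, which is exactly why a handful of blocked tasks was enough to stall both the business logic and the topology updater.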
Verification confirmed that three ForkJoinPool.commonPool‑worker threads were stuck waiting for the Redis connection lock, while the topology‑updater thread was blocked in the for‑each business logic.
Root cause: improper use of shared thread pools without custom sizing or timeout settings, combined with a write‑lock that was not released, resulted in thread‑pool exhaustion and service disruption.
Remediation steps included making the topology‑update operation synchronous so it no longer depended on the shared common pool, configuring dedicated thread pools for Caffeine refresh and for parallel streams, and adding proper lock‑timeout handling.
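Two of these remediations can be sketched with JDK‑only code: running a parallel stream in a dedicated ForkJoinPool instead of the shared common pool, and acquiring the read lock with a timeout instead of blocking forever. (Caffeine's refresh work can be isolated the same way via Caffeine.newBuilder().executor(dedicatedPool); that is omitted here to keep the sketch dependency‑free.) The pool size of 8 is an arbitrary illustrative choice:

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class RemediationSketch {
    public static void main(String[] args) throws Exception {
        // 1. Submit the parallel stream to a dedicated pool: tasks forked
        //    inside the submitted lambda run in this pool, not commonPool(),
        //    so blocked tasks cannot starve other common-pool users.
        ForkJoinPool dedicated = new ForkJoinPool(8);
        List<Integer> doubled = dedicated.submit(() ->
                IntStream.range(0, 4).parallel()
                         .map(i -> i * 2)
                         .boxed()
                         .collect(Collectors.toList())
        ).get();
        System.out.println(doubled);

        // 2. Take the read lock with a timeout so callers fail fast instead
        //    of parking forever when the topology writer holds the write lock.
        ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
        boolean acquired = rwl.readLock().tryLock(500, TimeUnit.MILLISECONDS);
        System.out.println("read lock acquired = " + acquired);
        if (acquired) {
            rwl.readLock().unlock();
        }
        dedicated.shutdown();
    }
}
```

When tryLock times out, the caller can return a degraded result or retry, rather than silently consuming a pooled thread for an unbounded time.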
Key takeaways: when using asynchronous processing in Java, always configure thread‑pool sizes and timeouts; monitor lock usage; and ensure that long‑running tasks do not hold write locks indefinitely.
JD Tech