Backend Development · 13 min read

Why Our Redis Cluster Pipeline Deadlocked: Thread Locks Explained

This article walks through a production incident where a Redis Cluster pipeline caused Dubbo threads to block and eventually deadlock, detailing the root‑cause analysis, code inspection, and verification steps using jstack, jmap, and MAT to confirm the deadlock and propose fixes.


1. Background

Redis Pipeline is an efficient batch‑command mechanism that reduces network latency and improves read/write throughput. Redis Cluster Pipeline extends this to a Redis Cluster, packaging multiple operations and sending them to several nodes at once.

The project uses a pipeline to batch-query reservation game information from a Redis Cluster via an internal JedisClusterPipeline utility.

2. Incident Record

An alert indicated that a Dubbo thread pool was exhausted. Only one machine showed the problem, and the number of completed tasks never increased.

Monitoring revealed that request counts dropped to zero, confirming the machine had hung. Arthas showed all 400 Dubbo threads in a WAITING state.

3. Fault Analysis

3.1 Threads waiting for a connection

The thread stack traces showed they were blocked inside org.apache.commons.pool2.impl.GenericObjectPool#borrowObject(long). Because the pool's blockWhenExhausted default is true and borrowMaxWaitMillis was not set (default -1), threads waited indefinitely for an idle connection.

<code>public T borrowObject(long borrowMaxWaitMillis) throws Exception {
    // ...
    while (p == null) {
        if (blockWhenExhausted) {
            p = idleObjects.pollFirst();
            if (p == null) {
                if (borrowMaxWaitMillis < 0) {
                    p = idleObjects.takeFirst(); // blocks forever
                } else {
                    p = idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
                }
            }
            // ...
        }
    }
    return p.getObject();
}
</code>

Since the business code did not set borrowMaxWaitMillis, threads kept waiting for a connection indefinitely.
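One mitigation this analysis points to is bounding the wait. A minimal sketch, assuming the commons-pool2 configuration API that Jedis builds on (the exact pool wiring inside the project's JedisClusterPipeline utility is not shown in the article):

```java
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

GenericObjectPoolConfig config = new GenericObjectPoolConfig();
config.setMaxTotal(20);              // pool size used in the incident
config.setBlockWhenExhausted(true);  // the default: block when no idle connection
config.setMaxWaitMillis(2000);       // bound the wait: borrowObject() now throws
                                     // NoSuchElementException instead of blocking forever
```

With a bounded wait, a thread stuck in a circular acquisition eventually fails, releases its held connections, and breaks the cycle.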

3.2 Threads unable to obtain a connection

Two possibilities were considered: inability to create a Redis connection, or all connections in the pool being occupied. Network jitter was observed, but the problematic machine could still connect to Redis, ruling out the first case.

Connection leakage was also examined; the project uses Jedis 2.9.0, which does not exhibit the connection leak known to affect version 2.10.0.

3.3 Potential deadlock

Without a timeout, pipeline mode can deadlock when threads acquire connections from multiple pools in different orders (the classic hold-and-wait condition). In the example, four threads each need connections from two Redis nodes; acquiring them in opposite orders produces a circular wait.
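The hold-and-wait pattern can be reproduced with plain JDK primitives. In this sketch (the class name is illustrative, and a ReentrantLock stands in for an unbounded borrowObject call on each node's pool), two threads each grab one "pool" and then wait forever for the other's:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class PoolDeadlockDemo {
    // Two locks stand in for connections from two per-node connection pools.
    static final ReentrantLock poolA = new ReentrantLock();
    static final ReentrantLock poolB = new ReentrantLock();
    // Ensures both threads hold their first "connection" before requesting the second.
    static final CountDownLatch bothHoldFirst = new CountDownLatch(2);

    static void acquireInOrder(ReentrantLock first, ReentrantLock second) {
        Thread t = new Thread(() -> {
            first.lock();               // take a connection from the first pool
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();  // the other thread now holds the lock we need next
                second.lock();          // circular wait: this never returns
                second.unlock();
            } catch (InterruptedException ignored) {
            } finally {
                first.unlock();
            }
        });
        t.setDaemon(true);              // let the JVM exit despite the stuck threads
        t.start();
    }

    public static boolean deadlockDetected() throws InterruptedException {
        acquireInOrder(poolA, poolB);   // thread 1: pool A, then pool B
        acquireInOrder(poolB, poolA);   // thread 2: pool B, then pool A
        Thread.sleep(500);              // give both threads time to block
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // detects ReentrantLock cycles
        return ids != null && ids.length == 2;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlock detected: " + deadlockDetected());
    }
}
```

The production case is the same cycle scaled up: instead of two threads and two locks, 400 threads held connections from some pools while waiting on others.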

4. Deadlock Proof

4.1 Identify which pool each thread is waiting on

Using

jstack

and

jmap

, the lock address each thread waited for was extracted (e.g., thread 383 waiting on

0x6a3305858

). MAT was then used to trace the lock back to a specific

JedisPool

instance.
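The dumps referenced here are typically captured with the standard JDK tools (the &lt;pid&gt; placeholder is the hung JVM's process id):

```shell
jstack -l <pid> > threads.txt                   # thread dump, -l adds lock details
jmap -dump:live,format=b,file=heap.hprof <pid>  # binary heap dump for MAT
```

The thread dump gives each waiting thread's lock address; the heap dump lets MAT resolve that address to a concrete object.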

4.2 Identify which pools each thread currently holds

MAT was used to search for all JedisClusterPipeline objects (one per Dubbo thread). The poolToJedisMap field revealed which connection pools each pipeline currently held connections from.
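In MAT, those instances can be listed with an OQL query along these lines (the regex class pattern is an assumption, since the utility's full package name is not given in the article):

```sql
SELECT * FROM ".*JedisClusterPipeline"
```

From each result, expanding the poolToJedisMap field shows the pools whose connections that pipeline is holding.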

4.3 Analyze deadlock conditions

Out of 12 Redis master nodes, all 400 Dubbo threads were waiting on only five connection pools, each sized at 20. Those five pools' 100 connections were all held by threads that were themselves waiting on other pools in the set, leaving no free connection for anyone and confirming a circular wait, i.e., a deadlock.

5. Summary

The article demonstrates a systematic approach to diagnosing a production failure: capturing heap and thread dumps, using Arthas for live inspection, reading source code to understand blocking behavior, forming hypotheses, and finally confirming a deadlock with MAT by correlating waiting locks and held connections. It highlights the importance of configuring connection‑pool timeouts and sizing pools appropriately to avoid similar deadlocks.

Tags: Java, Performance, deadlock, Redis, Jedis, Troubleshooting, pipeline

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
