Backend Development 13 min read

Analysis of Service Avalanche Caused by Jedis Parameter Misconfiguration During Redis Cluster Failover

During a Redis 3.x cluster master‑slave failover, Jedis's default 2‑second connection timeout combined with six automatic retries let each request's Redis calls accumulate up to 60 seconds of latency, triggering Nginx timeouts and a service avalanche; lowering the timeout and retry settings resolved the issue.

Sohu Tech Products

Redis is widely used as a remote cache in Internet services, and Jedis is a popular Java client for accessing Redis. This article analyzes how unreasonable Jedis parameter settings during a master‑slave failover in Redis 3.x cluster mode can trigger a service avalanche.

Background

The author’s project runs a Redis 3.x cluster with multiple master‑slave nodes and uses Jedis as the client. A physical‑machine failure caused a master‑slave switch in one cluster node, which activated Jedis’s retry mechanism and eventually led to a service avalanche.

Fault Record

During the incident, a message‑queue (MQ) backlog alarm fired: 159 412 messages were queued against a threshold of 100 000. System monitoring showed a sharp drop in request volume and average interface response times close to 60 seconds. Downstream services reported that Redis access volume dropped to near zero, while the average Redis response time hovered around 2 seconds. The number of threads in the waiting state increased dramatically.

Further investigation confirmed that the Redis cluster performed a master‑slave switch at the same time as the service degradation.

Failure Process Analysis

The analysis focuses on three questions:

Why does increased interface latency cause a steep drop in request volume?

Why does the latency during Redis failover hover around 2 seconds?

Why does the average interface response time approach 60 seconds?

1. Traffic Drop : Nginx logs showed many "connection timed out" errors, leading Nginx to mark the backend as unavailable ("no live upstreams"), which caused the request volume to fall.
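For context, the behaviour described here is governed by Nginx's upstream failure accounting and proxy timeouts. A hedged sketch of the relevant directives (host names and values are illustrative, not taken from the incident):

```nginx
upstream backend {
    # After max_fails failed attempts within fail_timeout, Nginx marks the
    # server unavailable for fail_timeout. Once every server in the group
    # is marked down, requests fail with "no live upstreams".
    server app1.example.com:8080 max_fails=3 fail_timeout=10s;
    server app2.example.com:8080 max_fails=3 fail_timeout=10s;
}

server {
    location / {
        proxy_pass http://backend;
        # Requests exceeding these timeouts produce the "connection timed out"
        # errors seen in the logs and count as failures against max_fails.
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }
}
```

When backend latency balloons past the proxy timeouts, every request counts as a failure, so the whole upstream group is quickly marked dead and traffic collapses.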

2. Latency Issue : Jedis threw a "connect timed out" exception during connection acquisition. The default connection timeout in Jedis is DEFAULT_TIMEOUT = 2000 ms.

3. Retry Mechanism : Jedis performed up to six attempts, each blocking for the default 2‑second connection timeout, so a single Redis call could take about 12 seconds. Since a single external request may invoke Redis five times, the total response time can reach 60 seconds, matching the observed latency.
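The worst‑case arithmetic above can be written out as a self‑contained sketch. The figures (2 000 ms default connect timeout, 6 attempts, 5 Redis calls per request) come from the incident; the class itself is purely illustrative:

```java
public class AvalancheLatency {

    /**
     * Every attempt blocks for the full connect timeout before failing, so one
     * Redis call costs maxAttempts * connectTimeoutMs in the worst case, and a
     * request pays that once per Redis call it makes.
     */
    static long worstCaseMillis(long connectTimeoutMs, int maxAttempts, int redisCallsPerRequest) {
        long perCallMs = connectTimeoutMs * maxAttempts;
        return perCallMs * redisCallsPerRequest;
    }

    public static void main(String[] args) {
        // 2 s timeout * 6 attempts = 12 s per call; 5 calls = 60 s per request.
        System.out.println(worstCaseMillis(2000, 6, 5)); // 60000
        // With the recommended settings the same request is bounded by
        // 100 ms * 2 attempts * 5 calls = 1 000 ms.
        System.out.println(worstCaseMillis(100, 2, 5));  // 1000
    }
}
```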

Jedis Execution Flow

The following code snippets illustrate the key parts of Jedis’s internal processing.

```xml
<redis-cluster name="redisCluster" timeout="3000" maxRedirections="6">
    <properties>
        <property name="maxTotal" value="20" />
        <property name="maxIdle" value="20" />
        <property name="minIdle" value="2" />
    </properties>
</redis-cluster>
```

```java
public class JedisCluster extends BinaryJedisCluster implements JedisCommands,
        MultiKeyJedisClusterCommands, JedisClusterScriptingCommands {

    @Override
    public String set(final String key, final String value, final String nxxx,
            final String expx, final long time) {
        return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
            @Override
            public String execute(Jedis connection) {
                // actual command execution
                return connection.set(key, value, nxxx, expx, time);
            }
        }.run(key);
    }
}
```

```java
public abstract class JedisClusterCommand<T> {

    public abstract T execute(Jedis connection);

    public T run(String key) {
        return runWithRetries(SafeEncoder.encode(key), this.maxAttempts, false, false);
    }

    private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
        Jedis connection = null;
        try {
            if (asking) {
                // omitted
            } else {
                if (tryRandomNode) {
                    connection = connectionHandler.getConnection();
                } else {
                    connection = connectionHandler.getConnectionFromSlot(
                            JedisClusterCRC16.getSlot(key));
                }
            }
            return execute(connection);
        } catch (JedisConnectionException e) {
            // retry logic: on the last attempt, refresh the slot cache and give up
            if (attempts <= 1) {
                connectionHandler.renewSlotCache();
                throw e;
            }
            return runWithRetries(key, attempts - 1, tryRandomNode, asking);
        } finally {
            releaseConnection(connection);
        }
    }
}
```
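The retry behaviour above can be modelled with a short, self‑contained sketch: a failing operation is re‑run until the attempt budget is exhausted, and every attempt pays the full connection timeout. The names here are illustrative, not the Jedis API:

```java
import java.util.function.Supplier;

public class RetryModel {

    /** Runs op up to maxAttempts times; rethrows the last failure. */
    static <T> T runWithRetries(Supplier<T> op, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e; // in Jedis, renewSlotCache() runs here on the final attempt
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        try {
            runWithRetries(() -> {
                calls[0]++; // count how often the dead node is dialled
                throw new RuntimeException("connect timed out");
            }, 6);
        } catch (RuntimeException ignored) {
        }
        // With maxAttempts = 6, a dead node is dialled 6 times before the caller
        // sees the exception -- 6 attempts * 2 s timeout = 12 s per Redis call.
        System.out.println(calls[0]); // 6
    }
}
```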

```java
public final class JedisClusterCRC16 {
    public static int getSlot(byte[] key) {
        // slot calculation logic
    }
}
```

```java
public class JedisSlotBasedConnectionHandler extends JedisClusterConnectionHandler {

    @Override
    public Jedis getConnectionFromSlot(int slot) {
        JedisPool connectionPool = cache.getSlotPool(slot);
        if (connectionPool != null) {
            return connectionPool.getResource();
        } else {
            renewSlotCache();
            connectionPool = cache.getSlotPool(slot);
            if (connectionPool != null) {
                return connectionPool.getResource();
            } else {
                return getConnection();
            }
        }
    }
}
```

```java
protected Connection sendCommand(final Command cmd, final byte[]... args) {
    try {
        connect();
        Protocol.sendCommand(outputStream, cmd, args);
        pipelinedCommands++;
        return this;
    } catch (JedisConnectionException ex) {
        broken = true;
        throw ex;
    }
}
```

Key Findings

The service avalanche was primarily caused by Jedis’s default connection timeout (2 seconds) and a high maximum retry count (6). Each retry added significant latency, and multiple Redis calls per request compounded the problem.

Recommendations

Set maxAttempts to a reasonable value (e.g., 2).

Configure connectionTimeout and soTimeout according to the expected latency (e.g., 100 ms).

Monitor Redis cluster health and adjust client parameters promptly after failover events.
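For clients constructed in code rather than XML, Jedis exposes these settings through a `JedisCluster` constructor that takes explicit timeouts (available in Jedis 2.8+; verify against the version in use). This is a configuration sketch, not a runnable program — it requires the Jedis library on the classpath, and the host names and pool settings are illustrative:

```java
Set<HostAndPort> nodes = new HashSet<>();
nodes.add(new HostAndPort("redis-node1.example.com", 7000));
nodes.add(new HostAndPort("redis-node2.example.com", 7000));

int connectionTimeout = 100; // ms -- fail fast instead of the 2 000 ms default
int soTimeout = 100;         // ms -- bound the time spent waiting on a reply
int maxAttempts = 2;         // retry once, then surface the error

JedisCluster cluster = new JedisCluster(nodes, connectionTimeout, soTimeout,
        maxAttempts, new JedisPoolConfig());
```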

By applying these settings (connection/reading timeout = 100 ms, max retry = 2), the author’s production environment reduced the worst‑case latency to about 1 second, effectively preventing service avalanches during Redis node failures.
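Mapped back onto the XML configuration shown earlier, the fix would look roughly like this (attribute names follow the earlier snippet; the exact schema depends on the configuration wrapper in use):

```xml
<redis-cluster name="redisCluster" timeout="100" maxRedirections="2">
    <properties>
        <property name="maxTotal" value="20" />
        <property name="maxIdle" value="20" />
        <property name="minIdle" value="2" />
    </properties>
</redis-cluster>
```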

Tags: backend development · Redis · Jedis · connection retry · service avalanche · cluster failover
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
