Analysis of Service Avalanche Caused by Misconfigured Jedis Parameters During Redis Cluster Master‑Slave Switch
A service‑wide avalanche occurred when a Redis 3.x master‑slave failover coincided with Jedis' default 2‑second connection timeout and six‑attempt retry budget, driving request latencies toward 60 seconds; setting connectionTimeout and soTimeout to 100 ms and reducing maxAttempts to two capped the latency of a failed Redis access at about one second and prevented cascading failures.
Redis is widely used as a remote cache in Internet services, and Jedis is a popular Java client for accessing Redis. This article analyzes a service‑wide avalanche that occurred when a Redis 3.x cluster experienced a master‑slave failover and Jedis was configured with inappropriate timeout and retry parameters.
Background
The author's project runs a Redis 3.x cluster in multi‑node mode with master‑slave pairs, using Jedis as the client. During a host failure, a master‑slave switch was triggered, which activated Jedis' retry mechanism and eventually caused a service avalanche.
Fault Record
Monitoring showed a massive message‑queue backlog alarm, a sharp drop in request volume, and average interface latency approaching 60 seconds. Thread dumps indicated a large increase in threads waiting, and Redis access volume dropped to near zero.
Further investigation revealed that the Redis cluster had performed a master‑slave switch at the same time as the outage.
Analysis of the Failure Process
1. Traffic drop: Nginx logs contained many "connection timed out" errors, leading Nginx to mark upstream services as unavailable ("no live upstreams"), which reduced request volume.
2. Latency issue: Jedis threw "connect timed out" exceptions while acquiring connections. The default connection timeout in Jedis is 2000 ms, and each of the default maxAttempts = 6 attempts can block for the full timeout, so a single Redis call could take roughly 12 s; with 5 Redis calls per request, total latency could approach 60 s.
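The worst‑case arithmetic above can be sketched directly (the constants mirror Jedis' defaults; the figure of 5 Redis calls per request comes from the affected service):

```java
public class WorstCaseLatency {
    public static void main(String[] args) {
        int connectionTimeoutMs = 2000; // Jedis DEFAULT_TIMEOUT
        int maxAttempts = 6;            // default retry budget in JedisCluster
        int redisCallsPerRequest = 5;   // observed in the affected service

        // Each attempt can block for the full connection timeout before failing,
        // so one Redis call can take up to maxAttempts * connectionTimeoutMs.
        int perCallMs = connectionTimeoutMs * maxAttempts;
        int perRequestMs = perCallMs * redisCallsPerRequest;

        System.out.println("Worst-case per Redis call: " + perCallMs + " ms");    // 12000 ms
        System.out.println("Worst-case per request:    " + perRequestMs + " ms"); // 60000 ms
    }
}
```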
Jedis Execution Flow
The client creates a JedisClusterCommand object for each command, which invokes runWithRetries to handle retries. The flow includes slot calculation, connection acquisition, command sending, and retry handling.
The cluster client in the project was configured as follows:

```xml
<redis-cluster name="redisCluster" timeout="3000" maxRedirections="6">
  <properties>
    <property name="maxTotal" value="20" />
    <property name="maxIdle" value="20" />
    <property name="minIdle" value="2" />
  </properties>
</redis-cluster>
```
Key source snippets from Jedis are reproduced below.
```java
public class JedisCluster extends BinaryJedisCluster implements JedisCommands,
    MultiKeyJedisClusterCommands, JedisClusterScriptingCommands {

  @Override
  public String set(final String key, final String value, final String nxxx,
      final String expx, final long time) {
    return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
      @Override
      public String execute(Jedis connection) {
        // actual command execution
        return connection.set(key, value, nxxx, expx, time);
      }
    }.run(key); // trigger execution
  }
}

public abstract class JedisClusterCommand<T> {

  public abstract T execute(Jedis connection);

  public T run(String key) {
    // execute with retry logic
    return runWithRetries(SafeEncoder.encode(key), this.maxAttempts, false, false);
  }
}
```
```java
public final class JedisClusterCRC16 {
  public static int getSlot(byte[] key) {
    int s = -1;
    int e = -1;
    boolean sFound = false;
    // look for a {hash tag}: only the content between the first '{' and the
    // next '}' participates in slot calculation
    for (int i = 0; i < key.length; i++) {
      if (key[i] == '{' && !sFound) {
        s = i;
        sFound = true;
      }
      if (key[i] == '}' && sFound) {
        e = i;
        break;
      }
    }
    if (s > -1 && e > -1 && e != s + 1) {
      return getCRC16(key, s + 1, e) & (16384 - 1);
    }
    return getCRC16(key) & (16384 - 1);
  }
}
```
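For intuition, here is a self‑contained sketch of the slot calculation. It uses CRC16‑CCITT (XMODEM, polynomial 0x1021), the variant the Redis Cluster specification mandates, and mirrors the hash‑tag logic above to show that keys sharing a {tag} land in the same slot; the key names are illustrative:

```java
public class SlotSketch {
  // CRC16-CCITT (XMODEM): poly 0x1021, init 0, no reflection --
  // the variant the Redis Cluster specification uses for slot hashing
  public static int crc16(byte[] bytes, int s, int e) {
    int crc = 0;
    for (int i = s; i < e; i++) {
      crc ^= (bytes[i] & 0xFF) << 8;
      for (int j = 0; j < 8; j++) {
        crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
        crc &= 0xFFFF;
      }
    }
    return crc;
  }

  // Mirrors JedisClusterCRC16.getSlot: hash only the {tag} content if present
  public static int getSlot(byte[] key) {
    int s = -1, e = -1;
    boolean sFound = false;
    for (int i = 0; i < key.length; i++) {
      if (key[i] == '{' && !sFound) { s = i; sFound = true; }
      if (key[i] == '}' && sFound) { e = i; break; }
    }
    if (s > -1 && e > -1 && e != s + 1) {
      return crc16(key, s + 1, e) & (16384 - 1);
    }
    return crc16(key, 0, key.length) & (16384 - 1);
  }

  public static void main(String[] args) {
    // Both keys hash only "user1", so they map to the same slot
    System.out.println(getSlot("{user1}.following".getBytes()));
    System.out.println(getSlot("{user1}.followers".getBytes()));
  }
}
```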
```java
public class JedisSlotBasedConnectionHandler extends JedisClusterConnectionHandler {
  @Override
  public Jedis getConnectionFromSlot(int slot) {
    JedisPool connectionPool = cache.getSlotPool(slot);
    if (connectionPool != null) {
      return connectionPool.getResource();
    } else {
      // slot cache may be stale: refresh it and try again
      renewSlotCache();
      connectionPool = cache.getSlotPool(slot);
      if (connectionPool != null) {
        return connectionPool.getResource();
      } else {
        // fall back to a connection from a random node
        return getConnection();
      }
    }
  }
}
```
```java
public class Connection implements Closeable {

  public void connect() {
    if (!isConnected()) {
      try {
        socket = new Socket();
        socket.setReuseAddress(true);
        socket.setKeepAlive(true);
        socket.setTcpNoDelay(true);
        socket.setSoLinger(true, 0);
        // connection-establishment timeout (DEFAULT_TIMEOUT = 2000 ms)
        socket.connect(new InetSocketAddress(host, port), connectionTimeout);
        // read timeout
        socket.setSoTimeout(soTimeout);
        outputStream = new RedisOutputStream(socket.getOutputStream());
        inputStream = new RedisInputStream(socket.getInputStream());
      } catch (IOException ex) {
        broken = true;
        throw new JedisConnectionException(ex);
      }
    }
  }

  protected Connection sendCommand(final Command cmd, final byte[]... args) {
    try {
      // ensure the socket is connected
      connect();
      // serialize the command in the Redis protocol format
      Protocol.sendCommand(outputStream, cmd, args);
      pipelinedCommands++;
      return this;
    } catch (JedisConnectionException ex) {
      broken = true;
      throw ex;
    }
  }
}
```
```java
private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
  Jedis connection = null;
  try {
    if (asking) {
      // omitted
    } else {
      if (tryRandomNode) {
        connection = connectionHandler.getConnection();
      } else {
        connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
      }
    }
    return execute(connection);
  } catch (JedisConnectionException jce) {
    // release the broken connection before retrying
    releaseConnection(connection);
    connection = null;
    if (attempts <= 1) {
      // retry budget exhausted: refresh the slot cache and give up
      connectionHandler.renewSlotCache();
      throw jce;
    }
    return runWithRetries(key, attempts - 1, tryRandomNode, asking);
  } finally {
    releaseConnection(connection);
  }
}
```
Key Findings
maxAttempts controls the maximum number of retry attempts.
connectionTimeout defines the connection establishment timeout.
soTimeout defines the read timeout.
Conclusion
The service avalanche was triggered by Jedis' default retry mechanism combined with a 2 s connection timeout during a Redis master‑slave switch. By tuning Jedis parameters—setting connectionTimeout and soTimeout to 100 ms and reducing maxAttempts to 2—the maximum latency for a failed Redis access can be limited to about 1 s, effectively preventing a cascade failure.
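The tuned configuration can be sketched as follows. The host and port are placeholders, and the five‑argument JedisCluster constructor shown here (connectionTimeout, soTimeout, maxAttempts, pool config) is the one available in Jedis 2.x/3.x; since building the client requires a live cluster, this is a configuration sketch rather than a runnable test:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class TunedJedisCluster {
    public static JedisCluster build() {
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("127.0.0.1", 7000)); // placeholder cluster node

        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
        poolConfig.setMaxTotal(20);
        poolConfig.setMaxIdle(20);
        poolConfig.setMinIdle(2);

        int connectionTimeout = 100; // ms: fail fast when a node is unreachable
        int soTimeout = 100;         // ms: fail fast on slow reads
        int maxAttempts = 2;         // at most one retry after the first failure

        // Worst case per request with 5 Redis calls:
        // 5 * maxAttempts * connectionTimeout = ~1 s, instead of ~60 s
        return new JedisCluster(nodes, connectionTimeout, soTimeout, maxAttempts, poolConfig);
    }
}
```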
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.