Analysis of Service Avalanche Caused by Misconfigured Jedis Parameters During Redis Cluster Master‑Slave Switch
A service‑wide avalanche occurred when a Redis 3.x master‑slave failover coincided with Jedis' default 2‑second connection timeout and six‑attempt retry budget, driving request latencies toward 60 seconds; setting connectionTimeout and soTimeout to 100 ms and reducing maxAttempts to two capped the latency of a failed Redis access at about one second and prevented cascading failures.
Redis is widely used as a remote cache in Internet services, and Jedis is a popular Java client for accessing Redis. This article analyzes a service‑wide avalanche that occurred when a Redis 3.x cluster experienced a master‑slave failover and Jedis was configured with inappropriate timeout and retry parameters.
Background
The author's project runs a Redis 3.x cluster in multi‑node mode with master‑slave pairs, using Jedis as the client. During a host failure, a master‑slave switch was triggered, which activated Jedis' retry mechanism and eventually caused a service avalanche.
Fault Record
Monitoring showed a massive message‑queue backlog alarm, a sharp drop in request volume, and average interface latency approaching 60 seconds. Thread dumps indicated a large increase in threads waiting, and Redis access volume dropped to near zero.
Further investigation revealed that the Redis cluster had performed a master‑slave switch at the same time as the outage.
Analysis of the Failure Process
1. Traffic drop: Nginx logs contained many "connection timed out" errors, leading Nginx to mark upstream services as unavailable ("no live upstreams"), which reduced request volume.
2. Latency issue: Jedis threw "connect timed out" exceptions while acquiring connections. The default connection timeout in Jedis is 2000 ms, and each of the default maxAttempts = 6 attempts can block for the full timeout, so a single Redis call could take roughly 12 s; with 5 Redis calls per request, total latency could approach 60 s.
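The worst‑case arithmetic above can be sketched directly (the constants mirror Jedis' defaults; the figure of 5 Redis calls per request comes from the affected service):

```java
public class WorstCaseLatency {
    public static void main(String[] args) {
        int connectionTimeoutMs = 2000; // Jedis DEFAULT_TIMEOUT
        int maxAttempts = 6;            // default retry budget in JedisCluster
        int redisCallsPerRequest = 5;   // observed in the affected service

        // Each attempt can block for the full connection timeout before failing,
        // so one Redis call can take up to maxAttempts * connectionTimeoutMs.
        int perCallMs = connectionTimeoutMs * maxAttempts;
        int perRequestMs = perCallMs * redisCallsPerRequest;

        System.out.println("Worst-case per Redis call: " + perCallMs + " ms");    // 12000 ms
        System.out.println("Worst-case per request:    " + perRequestMs + " ms"); // 60000 ms
    }
}
```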
Jedis Execution Flow
The client creates a JedisClusterCommand object for each command, which invokes runWithRetries to handle retries. The flow includes slot calculation, connection acquisition, command sending, and retry handling.
The cluster client in the project was configured as follows:

```xml
<redis-cluster name="redisCluster" timeout="3000" maxRedirections="6">
  <properties>
    <property name="maxTotal" value="20" />
    <property name="maxIdle" value="20" />
    <property name="minIdle" value="2" />
  </properties>
</redis-cluster>
```
Key source snippets from Jedis are reproduced below.
```java
public class JedisCluster extends BinaryJedisCluster implements JedisCommands,
    MultiKeyJedisClusterCommands, JedisClusterScriptingCommands {

  @Override
  public String set(final String key, final String value, final String nxxx,
      final String expx, final long time) {
    return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
      @Override
      public String execute(Jedis connection) {
        // actual command execution
        return connection.set(key, value, nxxx, expx, time);
      }
    }.run(key); // trigger execution
  }
}

public abstract class JedisClusterCommand<T> {

  public abstract T execute(Jedis connection);

  public T run(String key) {
    // execute with retry logic
    return runWithRetries(SafeEncoder.encode(key), this.maxAttempts, false, false);
  }
}
```
```java
public final class JedisClusterCRC16 {
  public static int getSlot(byte[] key) {
    int s = -1;
    int e = -1;
    boolean sFound = false;
    // look for a {hash tag}: only the content between the first '{' and the
    // next '}' participates in slot calculation
    for (int i = 0; i < key.length; i++) {
      if (key[i] == '{' && !sFound) {
        s = i;
        sFound = true;
      }
      if (key[i] == '}' && sFound) {
        e = i;
        break;
      }
    }
    if (s > -1 && e > -1 && e != s + 1) {
      return getCRC16(key, s + 1, e) & (16384 - 1);
    }
    return getCRC16(key) & (16384 - 1);
  }
}
```
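For intuition, here is a self‑contained sketch of the slot calculation. It uses CRC16‑CCITT (XMODEM, polynomial 0x1021), the variant the Redis Cluster specification mandates, and mirrors the hash‑tag logic above to show that keys sharing a {tag} land in the same slot; the key names are illustrative:

```java
public class SlotSketch {
  // CRC16-CCITT (XMODEM): poly 0x1021, init 0, no reflection --
  // the variant the Redis Cluster specification uses for slot hashing
  public static int crc16(byte[] bytes, int s, int e) {
    int crc = 0;
    for (int i = s; i < e; i++) {
      crc ^= (bytes[i] & 0xFF) << 8;
      for (int j = 0; j < 8; j++) {
        crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
        crc &= 0xFFFF;
      }
    }
    return crc;
  }

  // Mirrors JedisClusterCRC16.getSlot: hash only the {tag} content if present
  public static int getSlot(byte[] key) {
    int s = -1, e = -1;
    boolean sFound = false;
    for (int i = 0; i < key.length; i++) {
      if (key[i] == '{' && !sFound) { s = i; sFound = true; }
      if (key[i] == '}' && sFound) { e = i; break; }
    }
    if (s > -1 && e > -1 && e != s + 1) {
      return crc16(key, s + 1, e) & (16384 - 1);
    }
    return crc16(key, 0, key.length) & (16384 - 1);
  }

  public static void main(String[] args) {
    // Both keys hash only "user1", so they map to the same slot
    System.out.println(getSlot("{user1}.following".getBytes()));
    System.out.println(getSlot("{user1}.followers".getBytes()));
  }
}
```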
```java
public class JedisSlotBasedConnectionHandler extends JedisClusterConnectionHandler {
  @Override
  public Jedis getConnectionFromSlot(int slot) {
    JedisPool connectionPool = cache.getSlotPool(slot);
    if (connectionPool != null) {
      return connectionPool.getResource();
    } else {
      // slot cache may be stale: refresh it and try again
      renewSlotCache();
      connectionPool = cache.getSlotPool(slot);
      if (connectionPool != null) {
        return connectionPool.getResource();
      } else {
        // fall back to a connection from a random node
        return getConnection();
      }
    }
  }
}
```
```java
public class Connection implements Closeable {

  public void connect() {
    if (!isConnected()) {
      try {
        socket = new Socket();
        socket.setReuseAddress(true);
        socket.setKeepAlive(true);
        socket.setTcpNoDelay(true);
        socket.setSoLinger(true, 0);
        // connection-establishment timeout (DEFAULT_TIMEOUT = 2000 ms)
        socket.connect(new InetSocketAddress(host, port), connectionTimeout);
        // read timeout
        socket.setSoTimeout(soTimeout);
        outputStream = new RedisOutputStream(socket.getOutputStream());
        inputStream = new RedisInputStream(socket.getInputStream());
      } catch (IOException ex) {
        broken = true;
        throw new JedisConnectionException(ex);
      }
    }
  }

  protected Connection sendCommand(final Command cmd, final byte[]... args) {
    try {
      // ensure the socket is connected
      connect();
      // serialize the command in the Redis protocol format
      Protocol.sendCommand(outputStream, cmd, args);
      pipelinedCommands++;
      return this;
    } catch (JedisConnectionException ex) {
      broken = true;
      throw ex;
    }
  }
}
```
```java
private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
  Jedis connection = null;
  try {
    if (asking) {
      // omitted
    } else {
      if (tryRandomNode) {
        connection = connectionHandler.getConnection();
      } else {
        connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
      }
    }
    return execute(connection);
  } catch (JedisConnectionException jce) {
    // release the broken connection before retrying
    releaseConnection(connection);
    connection = null;
    if (attempts <= 1) {
      // retry budget exhausted: refresh the slot cache and give up
      connectionHandler.renewSlotCache();
      throw jce;
    }
    return runWithRetries(key, attempts - 1, tryRandomNode, asking);
  } finally {
    releaseConnection(connection);
  }
}
```
Key Findings
maxAttempts controls the maximum number of retry attempts.
connectionTimeout defines the connection establishment timeout.
soTimeout defines the read timeout.
Conclusion
The service avalanche was triggered by Jedis' default retry mechanism combined with a 2 s connection timeout during a Redis master‑slave switch. By tuning Jedis parameters—setting connectionTimeout and soTimeout to 100 ms and reducing maxAttempts to 2—the maximum latency for a failed Redis access can be limited to about 1 s, effectively preventing a cascade failure.
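The tuned configuration can be sketched as follows. The host and port are placeholders, and the five‑argument JedisCluster constructor shown here (connectionTimeout, soTimeout, maxAttempts, pool config) is the one available in Jedis 2.x/3.x; since building the client requires a live cluster, this is a configuration sketch rather than a runnable test:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class TunedJedisCluster {
    public static JedisCluster build() {
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("127.0.0.1", 7000)); // placeholder cluster node

        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
        poolConfig.setMaxTotal(20);
        poolConfig.setMaxIdle(20);
        poolConfig.setMinIdle(2);

        int connectionTimeout = 100; // ms: fail fast when a node is unreachable
        int soTimeout = 100;         // ms: fail fast on slow reads
        int maxAttempts = 2;         // at most one retry after the first failure

        // Worst case per request with 5 Redis calls:
        // 5 * maxAttempts * connectionTimeout = ~1 s, instead of ~60 s
        return new JedisCluster(nodes, connectionTimeout, soTimeout, maxAttempts, poolConfig);
    }
}
```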
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.