Analysis of Redis Sentinel Failover Issue in Redis 7.4.0 and Resolution via Pub/Sub ACL Adjustment
This article investigates a Redis Sentinel failover anomaly in Redis 7.4.0, in which the sentinel repeatedly elects the failed master. It explains the underlying s_down and o_down states, examines the network, configuration, and ACL settings, and resolves the issue by adjusting Pub/Sub permissions so that failover completes properly.
Background
The test environment is simplified to a one‑master, one‑slave, one‑sentinel topology, with the sentinel monitoring the master named ms. The nodes are:
master: 172.20.134.2
slave: 172.20.134.3
sentinel: 172.20.134.4
Problem Description
When running Redis 7.4.0 with sentinel, after the master instance fails, the sentinel makes abnormal decisions and continues to elect the failed master as the new leader.
21903:X 06 Dec 2024 15:53:04.164 * +slave slave 172.20.134.3:6379 172.20.134.3 6379 @ ms 172.20.134.2 6379
21903:X 06 Dec 2024 15:53:04.168 * Sentinel new configuration saved on disk
# +sdown master ms 172.20.134.2 6379
# +odown master ms 172.20.134.2 6379 #quorum 1/1
# +new-epoch 1
# +try-failover master ms 172.20.134.2 6379
... (subsequent failover attempts) ...
The logs show that after the master process is killed, the sentinel repeatedly executes the same failover steps.
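One way to spot this loop in a longer log is to count Sentinel event tokens. The following is a minimal sketch, not a tool used in the original investigation: the helper names sentinel_events and count_events are hypothetical, and the only assumption about the log format is that each event appears as a token starting with + or - (as in the excerpt above).

```python
from collections import Counter


def sentinel_events(log_lines):
    """Return the first event token (e.g. "+sdown") found on each log line."""
    events = []
    for line in log_lines:
        for token in line.split():
            # Event tokens look like "+sdown" or "-odown": a sign, then a letter.
            if len(token) > 1 and token[0] in "+-" and token[1].isalpha():
                events.append(token)
                break
    return events


def count_events(log_lines):
    """Count how often each Sentinel event occurs."""
    return Counter(sentinel_events(log_lines))


# Lines copied from the log excerpt above.
excerpt = [
    "21903:X 06 Dec 2024 15:53:04.164 * +slave slave 172.20.134.3:6379 172.20.134.3 6379 @ ms 172.20.134.2 6379",
    "21903:X 06 Dec 2024 15:53:04.168 * Sentinel new configuration saved on disk",
    "# +sdown master ms 172.20.134.2 6379",
    "# +odown master ms 172.20.134.2 6379 #quorum 1/1",
    "# +new-epoch 1",
    "# +try-failover master ms 172.20.134.2 6379",
]
counts = count_events(excerpt)
print(counts["+try-failover"])  # prints 1 for this excerpt
```

On a full log, a +try-failover count that keeps growing for the same master is a sign that failover attempts are being retried rather than completing.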
Investigation Process
Network Check
Ping and telnet tests between the sentinel node and Redis instances confirmed normal network connectivity.
Sentinel and Redis Configuration Review
The sentinel.conf and redis.conf files were examined. Connection tests using the ACL user markus succeeded.
Sentinel Master State
Running SENTINEL masters on the sentinel revealed that the master flags were s_down, o_down, and disconnected.
Sentinel Slave State
Running SENTINEL SLAVES ms showed that the slave also reported the disconnected flag.
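The SENTINEL commands return each instance as a flat list of alternating field names and values, with the flags packed into one comma-separated string. A minimal sketch of extracting the flags follows. The helper names are hypothetical, and sample_master is an illustrative reply assembled from the values reported in this article, not captured command output.

```python
def pairs_to_dict(flat_reply):
    """Turn a flat [field, value, field, value, ...] reply into a dict."""
    return dict(zip(flat_reply[0::2], flat_reply[1::2]))


def instance_flags(flat_reply):
    """Return the set of flags reported for one monitored instance."""
    info = pairs_to_dict(flat_reply)
    # An absent or empty "flags" field yields an empty set.
    return set(info.get("flags", "").split(",")) - {""}


# Illustrative reply (assumption): only a few fields are shown.
sample_master = [
    "name", "ms",
    "ip", "172.20.134.2",
    "port", "6379",
    "flags", "s_down,o_down,master,disconnected",
]

flags = instance_flags(sample_master)
print("disconnected" in flags)  # prints True
```

Checking for disconnected on the slave in the same way is what exposes the problem here, since the slave process itself was still running.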
State Definitions
s_down (Subjectively Down): a single sentinel’s view that the instance is unreachable.
o_down (Objectively Down): consensus among a quorum of sentinels, triggering failover.
disconnected: the sentinel cannot maintain a TCP connection to the instance.
Sentinel Configuration Items
down-after-milliseconds: time after which a missing reply marks the instance as s_down.
quorum: number of sentinels required to promote s_down to o_down.
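The relationship between these two settings and the s_down and o_down states can be sketched as a small model. This is a simplification for illustration, not Sentinel's actual implementation: the MasterView class and both functions are hypothetical names, and the 30000 ms value in the usage lines is only an example setting.

```python
from dataclasses import dataclass


@dataclass
class MasterView:
    """One sentinel's view of a monitored master (simplified)."""
    last_valid_reply_ms: int  # time of the last valid reply from the master
    down_after_ms: int        # the down-after-milliseconds setting
    quorum: int               # the quorum configured for this master


def is_s_down(view, now_ms):
    """Subjectively down: no valid reply for longer than down_after_ms."""
    return now_ms - view.last_valid_reply_ms > view.down_after_ms


def is_o_down(view, now_ms, other_sentinels_agreeing):
    """Objectively down: this sentinel plus agreeing peers reach the quorum."""
    if not is_s_down(view, now_ms):
        return False
    return 1 + other_sentinels_agreeing >= view.quorum


# A single sentinel with quorum 1, as in this test environment.
view = MasterView(last_valid_reply_ms=0, down_after_ms=30000, quorum=1)
print(is_s_down(view, now_ms=30001))                              # prints True
print(is_o_down(view, now_ms=30001, other_sentinels_agreeing=0))  # prints True
```

With a single sentinel and a quorum of 1, the sentinel's own vote is enough to reach o_down, which matches the "#quorum 1/1" entry in the log.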
Root Cause Explanation
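The slave’s flags=disconnected prevented proper failover, because the sentinel does not consider a disconnected slave a valid promotion candidate. Redis 6.2 added Pub/Sub channel permissions to the ACL system. In Redis 7.4.0 a user’s Pub/Sub permission defaults to resetchannels (the default value of acl-pubsub-default since Redis 7.0), which grants access to no channels and therefore blocks the sentinel from subscribing to the internal channels it uses for state detection, such as __sentinel__:hello.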
# Official documentation: https://redis.io/docs/latest/operate/rs/7.4/security/access-control/redis-acl-overview/#pubsub-channels
Pub/sub channels
The & prefix allows access to pub/sub channels (only supported for databases with Redis version 6.2 or later).
To limit access to specific channels, include resetchannels before the allowed channels:
Adjusting the Redis configuration to grant full Pub/Sub access (the &* rule) allowed the sentinel to receive the necessary messages, and the failover behavior returned to normal.
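The effect of resetchannels and & rules on a subscription can be illustrated with a small model. This is a sketch under stated assumptions, not Redis's ACL engine: the function names are hypothetical, the password in the rule strings is a placeholder, Python's fnmatchcase only approximates Redis glob matching, and the model ignores the restrictions Redis places on combining allchannels with further patterns.

```python
from fnmatch import fnmatchcase


def allowed_channel_patterns(acl_rules):
    """Replay channel-related ACL rules in order; return the allowed patterns."""
    patterns = []
    for rule in acl_rules.split():
        if rule == "resetchannels":
            patterns = []          # drop every previously allowed channel
        elif rule == "allchannels":
            patterns = ["*"]       # alias for &*
        elif rule.startswith("&"):
            patterns.append(rule[1:])
    return patterns


def can_subscribe(acl_rules, channel, default="resetchannels"):
    """Check a channel against the user's rules, applied on top of the default."""
    patterns = allowed_channel_patterns(f"{default} {acl_rules}")
    return any(fnmatchcase(channel, pattern) for pattern in patterns)


hello = "__sentinel__:hello"  # channel Sentinel uses on monitored instances

# User defined without any channel rule: the resetchannels default applies.
print(can_subscribe("on >placeholder ~* +@all", hello))     # prints False
# Same user with full Pub/Sub access granted.
print(can_subscribe("on >placeholder ~* &* +@all", hello))  # prints True
```

A narrower rule such as resetchannels &__sentinel__:* would also pass this check for the hello channel. This article uses the broader &* rule.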
Resolution
Modify redis.conf to set acl-pubsub-default allchannels, or add the equivalent &* rule to the ACL user that the sentinel authenticates as, so that the sentinel can subscribe to all channels. After this change, the sentinel correctly detects the master failure and promotes the slave.
Key Takeaways
Check sentinel and Redis ACL Pub/Sub permissions when failover does not occur.
Understand the progression disconnected → s_down → o_down in sentinel state handling.
Network connectivity alone may not reveal permission‑related issues.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.