How to Build and Understand a Redis Cluster: Setup, Mechanics, and Failover

This guide walks through installing a Redis cluster with three masters and three slaves using local ports, explains slot allocation, key hashing, gossip communication, failover, node addition, resharding, and best practices for high availability, while providing practical commands and configuration examples.

Ops Development Stories

Cluster Environment Setup

Redis Cluster requires at least three master nodes. In this example we create three masters and three slaves using local ports (7000‑7005). This method is for experimentation only and should not be used in production.

Define the ports for the nodes (7000-7005) and copy redis.conf to a separate file for each port.

Configuration files:

IP: 127.0.0.1 Port: 7000‑7005 Config: 7000/redis-7000.conf, 7001/redis-7001.conf, …, 7005/redis-7005.conf

Edit each copy of redis.conf to enable clustering and set the required options (for example, requirepass and masterauth if a password is needed).

<code>daemonize yes
# port must match the configuration above
port 7000
cluster-enabled yes
cluster-config-file nodes-7000.conf
cluster-node-timeout 5000
appendonly yes
</code>
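Editing six nearly identical files by hand is error-prone, so the per-port configs can also be generated from one template. The following is a minimal Python sketch, assuming the directory layout listed above (7000/redis-7000.conf and so on). The helper name write_node_configs is hypothetical, and the template contains only the options from the example config.

```python
from pathlib import Path

# Template mirrors the example config above; {port} is filled in per node.
TEMPLATE = """\
daemonize yes
port {port}
cluster-enabled yes
cluster-config-file nodes-{port}.conf
cluster-node-timeout 5000
appendonly yes
"""


def write_node_configs(base_dir, ports=range(7000, 7006)):
    """Create <base_dir>/<port>/redis-<port>.conf for each port and return the paths."""
    written = []
    for port in ports:
        node_dir = Path(base_dir) / str(port)
        node_dir.mkdir(parents=True, exist_ok=True)
        conf = node_dir / f"redis-{port}.conf"
        conf.write_text(TEMPLATE.format(port=port))
        written.append(conf)
    return written
```

Calling write_node_configs(".") in an empty working directory would create the six directories and files used in the rest of this walkthrough.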

Start all nodes:

<code># start every node, 7000-7005 (each config sets daemonize yes)
for port in 7000 7001 7002 7003 7004 7005; do
  (cd "$port" && redis-server "./redis-$port.conf")
done
</code>

Initialize the cluster:

<code>redis-cli --cluster create 127.0.0.1:7000 127.0.0.1:7001 \
127.0.0.1:7002 127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005 \
--cluster-replicas 1
</code>

Query cluster status:

<code>redis-cli -c -h 127.0.0.1 -p 7000
cluster info
</code>
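The reply to cluster info is a list of key:value lines, which makes it easy to check from a script. The sketch below parses that text and tests two fields that CLUSTER INFO reports, cluster_state and cluster_slots_assigned. The function names are hypothetical and the sample text is illustrative, not captured from a real run.

```python
def parse_cluster_info(text):
    """Turn the key:value lines printed by CLUSTER INFO into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Numeric fields become ints; everything else stays a string.
        info[key] = int(value) if value.isdigit() else value
    return info


def is_healthy(info, expected_slots=16384):
    """True when the cluster reports state ok and every slot is assigned."""
    return (
        info.get("cluster_state") == "ok"
        and info.get("cluster_slots_assigned") == expected_slots
    )


# Illustrative sample only; a real reply contains more fields.
SAMPLE = """\
cluster_state:ok
cluster_slots_assigned:16384
cluster_known_nodes:6
cluster_size:3
"""
```

With the sample above, is_healthy(parse_cluster_info(SAMPLE)) evaluates to True.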

Other creation methods are documented in the Redis manual (utils/create-cluster).

Cluster Principles

Slot Assignment Mechanism

Redis Cluster divides the key space into 16,384 slots. Each master node is responsible for a subset of slots, and its slaves replicate the same subset. Clients receive the slot map from the cluster and cache it locally, allowing direct routing of commands to the correct node.

Slot Location Algorithm

The key is hashed with CRC16, and the result is masked with 0x3FFF (equivalent to taking it modulo 16384) to obtain the slot number. The implementation resides in src/cluster.c (function keyHashSlot).

<code>crc16(key,keylen) & 0x3FFF
</code>
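The same calculation can be reproduced outside Redis. The sketch below is a minimal Python version, assuming keys without hash tags (the whole key is hashed). It relies on binascii.crc_hqx from the standard library, which computes CRC-CCITT with polynomial 0x1021; with an initial value of 0 this is the XMODEM variant that Redis Cluster uses. The function names are illustrative.

```python
import binascii

SLOT_MASK = 0x3FFF  # 16384 slots, so masking equals "mod 16384"


def crc16(data: bytes) -> int:
    """CRC16 (XMODEM variant): polynomial 0x1021, initial value 0."""
    return binascii.crc_hqx(data, 0)


def key_slot(key) -> int:
    """Slot for a key without a hash tag: crc16(key) & 0x3FFF."""
    if isinstance(key, str):
        key = key.encode()
    return crc16(key) & SLOT_MASK


if __name__ == "__main__":
    for name in ("foo", "hello", "mykey"):
        print(name, key_slot(name))
```

The result for any key can be compared against cluster keyslot on a running node.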

To find the slot of a key:

<code># query the slot of a key
127.0.0.1:7000> cluster keyslot mykey
(integer) 12318
# list all slot ranges
127.0.0.1:7000> cluster slots
…
</code>

When a key is accessed on the wrong node, that node replies with a redirection (MOVED or ASK) and the client retries the command on the node named in the reply.

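Redirection (MOVED and ASK)

If a node receives a command for a key whose slot it does not own, it replies with a MOVED redirection containing the slot number and the target node address. The client follows the redirect and updates its slot cache. While a slot is being migrated, the node may reply with ASK instead; the client retries that one command on the target node and leaves its slot cache unchanged.

In plain terms: if the key belongs to another node, the client is told to resend the request to that node.

<code># run inside redis-cli -c; a key owned by another node triggers a redirect
set abc sdl
set sbc sdl
</code>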
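On the client side, handling a redirection comes down to parsing the error text and deciding whether to update the cached slot map. The sketch below assumes the reply format "MOVED <slot> <host>:<port>" (and the same shape for ASK) and models the slot cache as a plain dict. The Redirect type and both function names are hypothetical and not taken from any client library.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str   # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse 'MOVED 3999 127.0.0.1:6381' style errors; return None for anything else."""
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


def apply_redirect(slot_cache: dict, redirect: Redirect) -> tuple:
    """Return the node to retry against; only MOVED changes the cached slot owner."""
    target = (redirect.host, redirect.port)
    if redirect.kind == "MOVED":
        slot_cache[redirect.slot] = target
    return target
```

A real client would also send ASKING before retrying a command after an ASK reply; that step is left out here.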

Cluster Communication Mechanism

Nodes communicate via a gossip protocol, exchanging messages such as PING, PONG, MEET, and FAIL. In general, cluster metadata can be maintained in one of two ways: centralized (e.g., stored in ZooKeeper) or fully distributed through gossip. Redis Cluster uses gossip.

Centralized

Metadata updates are immediate but can become a bottleneck.

Gossip

Nodes periodically send PING messages containing their state and metadata. MEET adds a new node to the cluster. FAIL notifies others that a node is down.

The gossip approach distributes load but introduces a small delay in metadata propagation.
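The propagation delay can be illustrated with a toy model. The sketch below is not Redis's actual algorithm: it assumes that in every round each node that already knows an update pushes it to one randomly chosen peer, and it counts the rounds until all nodes know it. The function name and parameters are hypothetical.

```python
import random


def gossip_rounds(node_count: int, seed: int = 0, max_rounds: int = 10_000) -> int:
    """Rounds until an update that starts on node 0 has reached every node."""
    rng = random.Random(seed)
    knows = [False] * node_count
    knows[0] = True
    rounds = 0
    while not all(knows):
        if rounds >= max_rounds:
            raise RuntimeError("update did not spread within max_rounds")
        rounds += 1
        # Only nodes informed before this round started get to push.
        snapshot = list(knows)
        for sender in range(node_count):
            if snapshot[sender]:
                peers = [n for n in range(node_count) if n != sender]
                knows[rng.choice(peers)] = True
    return rounds


if __name__ == "__main__":
    print("rounds for 6 nodes:", gossip_rounds(6))
```

Because the number of informed nodes can at most double per round, six nodes need at least three rounds in this model, whereas a centralized store would publish the update in a single step.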

Gossip Port

Each node uses port + 10000 for gossip communication (e.g., node 7001 uses 17001).

Cluster Election Principle

When a master fails, its slaves attempt a failover. The process involves broadcasting FAILOVER_AUTH_REQUEST, collecting acknowledgments from a majority of masters, and promoting a slave to master.

Slave detects master FAIL.

Slave increments its currentEpoch and broadcasts FAILOVER_AUTH_REQUEST.

Masters that have not yet voted in this epoch respond with FAILOVER_AUTH_ACK.

Slave collects ACKs; if it receives a majority, it becomes the new master.

New master broadcasts a PONG to inform the cluster.

The election requires at least three masters. With only two masters, the failure of one leaves a single voter, and one vote out of two is not a majority.
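The voting rule can be sketched in a few lines. This is a simplified model under stated assumptions: each master grants at most one vote per epoch, failed masters never answer, and a candidate wins with more than half of all masters (including the failed one). The class and function names are hypothetical and do not mirror Redis source code.

```python
def majority(master_count: int) -> int:
    """Smallest number of votes that is more than half of all masters."""
    return master_count // 2 + 1


class Master:
    def __init__(self, name: str, alive: bool = True):
        self.name = name
        self.alive = alive
        self.last_vote_epoch = 0

    def handle_auth_request(self, epoch: int) -> bool:
        """Grant a vote (FAILOVER_AUTH_ACK) at most once per epoch."""
        if not self.alive or epoch <= self.last_vote_epoch:
            return False
        self.last_vote_epoch = epoch
        return True


def run_election(masters, current_epoch: int) -> bool:
    """The candidate slave bumps the epoch, asks every master, and needs a majority."""
    epoch = current_epoch + 1
    acks = sum(m.handle_auth_request(epoch) for m in masters)
    return acks >= majority(len(masters))


if __name__ == "__main__":
    three = [Master("A", alive=False), Master("B"), Master("C")]
    two = [Master("A", alive=False), Master("B")]
    print("three masters, one down:", run_election(three, current_epoch=0))
    print("two masters, one down:", run_election(two, current_epoch=0))
```

With three masters and one down, two votes reach the majority of two; with two masters and one down, the single remaining vote does not.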

Split‑Brain and Data Loss

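If a network partition causes multiple masters to accept writes for the same slots, the writes accepted on the losing side are discarded when the partition heals. Setting min-replicas-to-write 1 mitigates the risk: a master that cannot reach at least one replica stops accepting writes, at the cost of availability.

<code># refuse writes unless at least this many replicas are connected and not lagging
min-replicas-to-write 1
</code>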
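The rule a master applies can be pictured as a simple check. The sketch below is a model, not Redis code: it assumes the master tracks, for each connected replica, how many seconds ago it last acknowledged, and that the companion setting min-replicas-max-lag (10 seconds by default) decides which replicas count. The function name is hypothetical.

```python
def accepts_writes(replica_lags, min_replicas=1, max_lag=10):
    """True if enough replicas acknowledged within max_lag seconds.

    replica_lags: seconds since the last acknowledgment, one entry per connected replica.
    min_replicas: models min-replicas-to-write (0 disables the check).
    max_lag:      models min-replicas-max-lag.
    """
    if min_replicas == 0:
        return True
    good = sum(1 for lag in replica_lags if lag <= max_lag)
    return good >= min_replicas


if __name__ == "__main__":
    print(accepts_writes([2]))    # one healthy replica
    print(accepts_writes([]))     # isolated master
    print(accepts_writes([30]))   # replica connected but lagging
```

An isolated master on the minority side of a partition would therefore start rejecting writes once its replicas stop acknowledging, which limits how much data can be lost.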

Full Coverage

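When cluster-require-full-coverage is set to no, the cluster keeps serving the slots that are still covered even if a master responsible for other slots goes down without a replica. With the default value of yes, the whole cluster stops serving requests until every slot is covered again.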

Batch Operations

Commands like MSET and MGET only work if all keys map to the same slot. Include a hash tag in the keys (e.g., {user1}) to force them into the same slot: when a key contains {...}, only the text inside the braces is hashed.

Example: mset {user1}:1:name zhangsan {user1}:1:age 18
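Whether a set of keys shares a slot can be verified before sending a multi-key command. The sketch below is a minimal Python version of the hash tag rule: if the key contains "{" followed later by "}", and at least one character sits between them, only that substring is hashed. It uses binascii.crc_hqx with an initial value of 0 as the CRC16 (XMODEM) function; the helper names are illustrative.

```python
import binascii


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed (the hash tag, if a valid one exists)."""
    start = key.find(b"{")
    if start == -1:
        return key
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:
        # No closing brace, or empty "{}": the whole key is hashed.
        return key
    return key[start + 1:end]


def key_slot(key: bytes) -> int:
    """crc16(hash tag or whole key) & 0x3FFF."""
    return binascii.crc_hqx(hash_tag(key), 0) & 0x3FFF


def same_slot(*keys: bytes) -> bool:
    """True if every key maps to one slot, so MSET/MGET can take them together."""
    return len({key_slot(k) for k in keys}) == 1


if __name__ == "__main__":
    print(same_slot(b"{user1}:1:name", b"{user1}:1:age"))
```

Both keys from the mset example hash only user1, so they land in the same slot.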

Sentinel vs. Cluster Leader Election

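Sentinel also elects a leader when a master is marked down: the Sentinel processes vote among themselves, using a similar majority‑vote mechanism based on Raft‑style epochs, and the elected Sentinel carries out the failover. In Cluster mode the masters vote and the winning slave promotes itself.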

Cluster Fault Tolerance

Failure Detection

Nodes periodically send PING messages. If a node does not reply within cluster-node-timeout, the sender marks it PFAIL (possible failure). When a majority of masters report the node as PFAIL, it is marked FAIL and is considered down.
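The two stages of detection can be written as two small predicates. This is a simplified sketch: it assumes timestamps in milliseconds, uses the cluster-node-timeout of 5000 from the example config, and represents the gossip reports as a set of master names that currently suspect the node. The function names are hypothetical.

```python
def is_pfail(last_pong_ms: int, now_ms: int, node_timeout_ms: int = 5000) -> bool:
    """Local suspicion: no reply for longer than cluster-node-timeout."""
    return now_ms - last_pong_ms > node_timeout_ms


def is_fail(suspecting_masters: set, master_count: int) -> bool:
    """Cluster-wide failure: more than half of all masters suspect the node."""
    return len(suspecting_masters) >= master_count // 2 + 1


if __name__ == "__main__":
    print(is_pfail(last_pong_ms=0, now_ms=6000))
    print(is_fail({"7001", "7002"}, master_count=3))
```

A single master's suspicion is never enough on its own, which keeps one node with a bad network link from triggering a failover.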

Failover Process

A slave of the failed master is selected.

The selected slave runs SLAVEOF NO ONE to become a master.

The new master takes over the slots of the failed node.

The new master broadcasts a PONG to inform the cluster.

Clients start sending commands to the new master.

Adding Nodes and Resharding

To expand the cluster, start new nodes and add them with redis-cli --cluster add-node. Then use redis-cli --cluster reshard to move slots.

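<code># start new nodes
redis-server redis-7006.conf
redis-server redis-7007.conf
# add node 7006 as a master (new node first, then any existing node)
redis-cli --cluster add-node 127.0.0.1:7006 127.0.0.1:7001
# add node 7007; it joins as an empty master until CLUSTER REPLICATE is run
redis-cli --cluster add-node 127.0.0.1:7007 127.0.0.1:7001
# reshard slots to the new master
redis-cli --cluster reshard 127.0.0.1:7001
</code>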
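The reshard command asks how many slots to move, so it helps to work out the numbers first. The sketch below assumes the goal is an even split and that the three original masters hold 5461, 5462, and 5461 slots, which is the usual result of creating a three-master cluster. The function name and the port labels used as dictionary keys are illustrative.

```python
TOTAL_SLOTS = 16384


def reshard_plan(current: dict, new_master_count: int = 1) -> dict:
    """Slots each existing master should give up so all masters end up near-even."""
    target = TOTAL_SLOTS // (len(current) + new_master_count)
    return {node: max(0, owned - target) for node, owned in current.items()}


if __name__ == "__main__":
    plan = reshard_plan({"7000": 5461, "7001": 5462, "7002": 5461})
    print(plan, "total:", sum(plan.values()))
```

For this layout the target is 4096 slots per master, so the plan moves 1365, 1366, and 1365 slots, 4096 in total, to the new node.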

After adding a slave, set its master with CLUSTER REPLICATE:

<code># on the slave (7007)
cluster replicate 2109c2832177e8514174c6ef8fefd681076e28df
</code>

Removing Nodes

Before removing a master, migrate its slots to other masters with redis-cli --cluster reshard, then remove the emptied node with redis-cli --cluster del-node. A slave holds no slots and can be removed directly.

<code># delete node 7007 (example)
redis-cli --cluster del-node 127.0.0.1:7007 8d935918d877a63283e1f3a1b220cdc8cb73c414
</code>


Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
