
Root Cause Analysis of a Redis Cluster Slot‑Migration Failure and Gossip‑Protocol Inconsistencies

This article analyzes a Redis Cluster outage caused by a slot-migration planning mistake: a node migrated slots out while simultaneously importing others, producing conflicting config epochs, gossip-protocol disagreements, and a flood of MOVED errors. It walks through the troubleshooting process and closes with preventive measures.


Background

This post documents a second large-scale Redis outage triggered during a cluster shrink. After a previous incident caused by a faulty shrink, the author hit another failure, again while migrating slots to reduce the cluster's size.

Fault Scene

During the shrink, while moving slot data, the service suddenly returned many MOVED errors, indicating that the cluster nodes disagreed on slot ownership.

Tip: Each Redis node stores the owner of every slot. When a client contacts a node, the node computes the slot from the key and, if it does not own that slot, returns a MOVED <slot> <ip:port> response so the client can retry against the correct node.
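The slot calculation itself is simple: Redis hashes the key with CRC16 (the XModem variant) and keeps the low 14 bits, honoring `{...}` hash tags. A minimal Python sketch (the function names are mine, not Redis's):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM, the checksum Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16384 cluster slots, honoring {hash tags}."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # only a non-empty tag between braces counts
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) & 16383
```

`key_slot("foo")` yields 12182, matching `CLUSTER KEYSLOT foo`, and keys sharing a `{tag}` land in the same slot, which is how multi-key operations stay cluster-safe.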

The specific error observed was MOVED 7940 10.20.22.92:6379. Running cluster nodes on the reported node showed that slot 7940 was indeed not served by it, while other nodes still believed it owned the slot, revealing an inconsistent view of the slot distribution.

root@sh2-arch-redis-product-prod-29 ~ $ redis-cli cluster nodes|grep self
7285477753679199e9238fbe94a00f1569661aea 10.20.22.92:6379@16379 myself,master - 0 1721140407000 35464 connected 7866-7929 ...

Further cluster nodes queries on other machines confirmed that slot 7940 was listed as belonging to 10.20.22.92 , creating a split‑brain situation.

root@sh1-arch-redis-product-1 ~ $ redis-cli cluster nodes | grep 10.20.22.92
7285477753679199e9238fbe94a00f1569661aea 10.20.22.92:6379@16379 master - 0 1721140402000 35464 connected 7866-7942 ...

Deep Dive into Redis Cluster Internal Communication

Redis nodes exchange state via a gossip protocol built on PING‑PONG messages. Each gossip packet carries two main pieces of information:

The slots the sender is responsible for and its config epoch (a version number for slot ownership).

A subset of known peer IPs and ports.

Every second a node picks the peer it has heard from least recently and sends it a PING. It also guarantees that any peer it has not contacted within cluster-node-timeout/2 gets pinged (7500 ms with the default 15 s timeout). In large clusters this mechanism spreads slot-ownership updates gradually rather than instantly.

💡 Knowledge 1: Slot changes propagate over time in big clusters.
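To see why propagation takes time, here is a toy epidemic-style simulation (deliberately simpler than Redis's actual scheduler): each round, every node pings one random peer, and the PING/PONG exchange shares the news in both directions. The function name and parameters are hypothetical.

```python
import random

def rounds_until_all_informed(n_nodes: int, seed: int = 42) -> int:
    """Count gossip rounds until every node has seen a slot-ownership change."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts out knowing the new slot mapping
    rounds = 0
    while len(informed) < n_nodes and rounds < 1000:
        rounds += 1
        newly = set()
        for node in range(n_nodes):
            peer = rng.randrange(n_nodes)
            # PING carries the sender's view; the PONG reply carries it back
            if node in informed or peer in informed:
                newly.update((node, peer))
        informed |= newly
    return rounds
```

For a 100-node cluster this takes a handful of rounds, i.e. several seconds of wall-clock time, and that window is exactly when two nodes can hold conflicting views of a slot.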

How Nodes Process Gossip Messages

When node B receives a gossip packet from node A, it follows these steps:

If the packet’s config epoch is larger than the stored epoch for A, B updates A’s epoch.

If the epochs are equal (a config-epoch collision between two masters), the node with the lexicographically smaller node ID increments its own currentEpoch and adopts it as its new configEpoch, breaking the tie.

If A reports a slot owner that differs from B’s view, B compares the epochs: If A’s epoch for that slot is higher, B updates the slot owner. If lower, B sends an UPDATE message containing the correct owner and epoch.

If B later receives an UPDATE , it applies the supplied ownership information.

💡 Knowledge 2: When two nodes claim the same slot, the one with the higher config epoch wins, and the loser must update its view.
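The ownership rules above can be condensed into a few lines. This is a simplified model of the comparison, not Redis's clusterUpdateSlotsConfigWith code verbatim; SlotView and handle_gossip are names I invented.

```python
from dataclasses import dataclass

@dataclass
class SlotView:
    owner: str   # node ID this node believes owns the slot
    epoch: int   # config epoch attached to that belief

def handle_gossip(view: dict, sender: str, sender_epoch: int, claimed: list) -> list:
    """Apply a sender's slot claims; return slots for which we must send UPDATE."""
    stale_slots = []
    for slot in claimed:
        current = view.get(slot)
        if current is None or sender_epoch > current.epoch:
            view[slot] = SlotView(sender, sender_epoch)  # higher epoch wins
        elif sender_epoch < current.epoch:
            stale_slots.append(slot)  # sender is behind: reply with UPDATE
        # equal epochs with a different owner fall into node-ID collision handling
    return stale_slots
```

Note the asymmetry: a higher epoch silently overwrites the local view, while a lower one only triggers a corrective UPDATE back to the sender. That asymmetry is what the migration bug below exploits.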

A subtle bug appears when a node migrates a slot out while simultaneously receiving a new slot assignment. The gossip protocol may still advertise the old ownership until the “new owner” sends its own ping, causing temporary split‑brain states.

💡 Knowledge 3: After a node migrates a slot away, other nodes only update that slot’s owner when the new owner’s gossip message arrives.

What Happens During Slot Migration

When moving a slot, the data is transferred first, and then CLUSTER SETSLOT <slot> NODE <node-id> is executed. The command performs two actions:

Sets the local slot owner to the target node.

If the node is still importing the slot and its configEpoch is not the highest, it increments currentEpoch and adopts the new epoch as its configEpoch .

The following log excerpt shows the epoch update after a successful import:

... 
1803:M 16 Jul 2024 22:23:17.617 # configEpoch updated after importing slot 7940
1803:M 16 Jul 2024 22:23:21.088 # New configEpoch set to 35279
...
💡 Knowledge 4: When a node imports a slot, its config epoch usually becomes the highest in the cluster.
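The two actions of CLUSTER SETSLOT can be sketched as follows. Here state is a hypothetical dict standing in for a node's cluster state, and current_epoch tracks the highest epoch the node has observed cluster-wide; this is a model of the behavior described above, not the Redis source.

```python
def cluster_setslot_node(state: dict, slot: int, target_id: str) -> None:
    """Sketch of what CLUSTER SETSLOT <slot> NODE <node-id> does locally."""
    # Action 1: point the slot at the target node in the local slot table.
    state["slots"][slot] = target_id
    # Action 2: if we are the importing target and our configEpoch is not
    # the highest we know of, bump currentEpoch and claim it as configEpoch.
    if target_id == state["myself"] and state["config_epoch"] < state["current_epoch"]:
        state["current_epoch"] += 1
        state["config_epoch"] = state["current_epoch"]
```

After the bump, the importing node's configEpoch outranks every epoch it has seen, so its gossip claims win. That is exactly Knowledge 4, and also why letting the wrong node import a slot mid-migration lets the wrong node win.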

Root Cause Summary

The failure was caused by a node being used as both source and destination in the migration plan. While it was migrating slots out, it also received a new slot assignment, causing its epoch to jump higher than the target node’s. Consequently, the target’s slot ownership was overwritten, leading to the massive MOVED errors.

How to Fix and Prevent

After executing setslot , wait for the topology changes to fully propagate before proceeding with additional migrations.

When a source node no longer claims certain slots, mark those slots as “unassigned” so that the target’s gossip messages can safely update ownership without being rejected due to a lower epoch. A related PR has been merged into Redis 7.0 (see https://github.com/redis/redis/pull/12344).

Operationally, forbid a node from being both a source and a destination in the same migration batch.
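That last rule is easy to enforce mechanically before a batch runs. A sketch of such a guard (the plan format, a list of (source, destination, slot) tuples, is my own invention, not part of any Redis tooling):

```python
def nodes_in_both_roles(plan):
    """Return node IDs that appear as both source and destination in one batch."""
    sources = {src for src, _dst, _slot in plan}
    destinations = {dst for _src, dst, _slot in plan}
    return sorted(sources & destinations)

def validate_plan(plan):
    """Reject a migration batch that would trigger the epoch-inversion bug."""
    offenders = nodes_in_both_roles(plan)
    if offenders:
        raise ValueError(f"nodes used as both source and destination: {offenders}")
```

Running this check in the migration tooling would have rejected the plan that caused this outage before any slot moved.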

Conclusion

The analysis demonstrates how a seemingly harmless planning mistake can trigger a cluster‑wide outage due to the intricacies of Redis’s gossip‑based slot propagation and epoch handling. Proper sequencing of migration steps and awareness of epoch dynamics are essential to avoid similar incidents.

Tags: Redis, Cluster, Troubleshooting, Slot Migration, Gossip Protocol, Config Epoch
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
