Understanding PFAIL and FAIL States in Redis Cluster Node Failure Detection
This article explains the PFAIL (possible fail) and FAIL (failed) states in Redis clusters, describes the state transition process, demonstrates node failure and automatic failover with command‑line examples, and provides practical insights into cluster health monitoring and recovery.
In a Redis cluster, node failure detection uses two core states: PFAIL (Possible Fail) indicating a node may be down, and FAIL indicating the node is confirmed down.
PFAIL (Possible Fail): Also called subjective offline. If a node (e.g., Node A) does not receive a PONG reply to its PING from another node (e.g., Node B) within the configured cluster-node-timeout (default 15000 ms, i.e. 15 seconds), Node A marks Node B as PFAIL. This flag reflects only Node A's own view and triggers no failover by itself.
FAIL (Failed): Also called objective offline. When a majority of master nodes (more than half) report the same node as PFAIL, the node is promoted to FAIL. With N masters in the cluster, at least floor(N/2) + 1 of them must have flagged the node as PFAIL before this promotion.
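The majority rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Redis source code; the names fail_quorum and should_promote_to_fail are hypothetical.

```python
from typing import Set


def fail_quorum(master_count: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    if master_count < 1:
        raise ValueError("a cluster needs at least one master")
    return master_count // 2 + 1


def should_promote_to_fail(reporting_masters: Set[str], master_count: int) -> bool:
    """True once enough distinct masters have flagged the node as PFAIL."""
    return len(reporting_masters) >= fail_quorum(master_count)


for n in (3, 4, 5):
    print(f"{n} masters -> quorum {fail_quorum(n)}")
# 3 masters -> quorum 2
# 4 masters -> quorum 3
# 5 masters -> quorum 3
```

In the three-master cluster used later in this article, two PFAIL reports are therefore enough to reach FAIL.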
State transition workflow:
1. Node B fails: Network issues or a crash prevent it from responding to heartbeats.
2. PFAIL marking: Node A detects the timeout and marks Node B as PFAIL, propagating this view via the Gossip protocol.
3. Other nodes verify: Other masters (e.g., C, D) receive the PFAIL report through gossip. If their own heartbeats to Node B also time out, they mark Node B as PFAIL as well.
4. FAIL state achieved: Once a majority of masters have marked Node B as PFAIL, its status is upgraded to FAIL and the FAIL state is broadcast to the rest of the cluster.
5. Failover: If Node B was a master, one of its slaves is elected as the new master and takes over the hash slots previously owned by Node B.
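Steps 2 through 4 can be modelled with a toy simulation. The sketch below is a simplification under stated assumptions: it ignores timing, gossip delivery, and report expiry, and the ClusterView class and its method names are hypothetical, not part of Redis.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ClusterView:
    """Toy model: tracks which masters have flagged each node as PFAIL."""
    masters: Set[str]
    pfail_reports: Dict[str, Set[str]] = field(default_factory=dict)
    failed: Set[str] = field(default_factory=set)

    def report_pfail(self, reporter: str, target: str) -> str:
        """Record one master's PFAIL report and return the resulting state."""
        if reporter not in self.masters:
            raise ValueError(f"{reporter} is not a master in this view")
        reports = self.pfail_reports.setdefault(target, set())
        reports.add(reporter)
        quorum = len(self.masters) // 2 + 1  # strict majority of masters
        if len(reports) >= quorum:
            self.failed.add(target)
            return "FAIL"
        return "PFAIL"


view = ClusterView(masters={"A", "B", "C"})
print(view.report_pfail("A", "B"))  # PFAIL: only one of three masters agrees
print(view.report_pfail("C", "B"))  # FAIL: two of three masters is a majority
```

A single report leaves Node B in PFAIL; the second report crosses the majority threshold and the node moves to FAIL, which is the point at which failover can begin.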
Below is the output of the CLUSTER NODES command before any failure, taken from node 7000 (flagged myself in the listing):
70400abbd555df3bc9615140eba3e2787182b94f 127.0.0.1:7002@17002 master - 0 1747295427508 3 connected 10923-16383
7e48838283b2fbde812630375b6a644ab2ea697b 127.0.0.1:7005@17005 slave 70400abbd555df3bc9615140eba3e2787182b94f 0 1747295427508 3 connected
de29900beed26aef053c5dacd42ebdc99adbfb8b 127.0.0.1:7001@17001 slave 249c320418481e9dcdc03d0bff5fdf6270d3abcb 0 1747295427000 7 connected
b44dd7a366c9f6a5526de6daaaeb6d5ef298f95f 127.0.0.1:7000@17000 myself,master - 0 1747295427000 1 connected 0-5460
249c320418481e9dcdc03d0bff5fdf6270d3abcb 127.0.0.1:7004@17004 master - 0 1747295428017 7 connected 5461-10922
bff364f29817846d88cf6baeeb77b88588719d9c 127.0.0.1:7003@17003 slave b44dd7a366c9f6a5526de6daaaeb6d5ef298f95f 0 1747295426995 1 connected
We then simulate a failure of master node 7002 by connecting with redis-cli and issuing shutdown. After the default 15‑second timeout, the cluster marks the node as PFAIL (displayed as fail? in the CLI):
70400abbd555df3bc9615140eba3e2787182b94f 127.0.0.1:7002@17002 master,fail? - 1747295482733 1747295480171 3 disconnected 10923-16383
Moments later the status changes to FAIL, confirming the node is down. The slave of the failed master (node 7005) is promoted to master and takes over the hash slots 10923‑16383.
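For monitoring, the flags column of each CLUSTER NODES line can be checked programmatically. The following is a minimal sketch that assumes the space-separated layout shown in the listings above (id, address, flags, master id, ping-sent, pong-recv, config epoch, link state, then slots); parse_node_line and failure_state are hypothetical helper names.

```python
from typing import Dict, List


def parse_node_line(line: str) -> Dict[str, object]:
    """Split one CLUSTER NODES line into the fields used for health checks."""
    parts = line.split()
    if len(parts) < 8:
        raise ValueError("unexpected CLUSTER NODES line: " + line)
    return {
        "id": parts[0],
        "addr": parts[1],
        "flags": parts[2].split(","),
        "master_id": None if parts[3] == "-" else parts[3],
        "link_state": parts[7],
        "slots": parts[8:],  # empty for nodes that own no slots
    }


def failure_state(flags: List[str]) -> str:
    """Map the flags to the states discussed here: fail? is PFAIL, fail is FAIL."""
    if "fail" in flags:
        return "FAIL"
    if "fail?" in flags:
        return "PFAIL"
    return "OK"


line = ("70400abbd555df3bc9615140eba3e2787182b94f 127.0.0.1:7002@17002 "
        "master,fail? - 1747295482733 1747295480171 3 disconnected 10923-16383")
node = parse_node_line(line)
print(node["addr"], failure_state(node["flags"]), node["slots"])
# 127.0.0.1:7002@17002 PFAIL ['10923-16383']
```

Running the same check against every line of the listing gives a quick view of which nodes are suspected or confirmed down.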
This demonstration illustrates how Redis clusters detect node failures, transition states from PFAIL to FAIL, and automatically perform failover to maintain availability.
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java