Understanding the Raft Consensus Algorithm: Roles, Elections, Log Replication, and Split‑Brain Scenarios
This article explains the Raft consensus algorithm in detail: its three node roles, the leader-election process, log replication to the state machine, and how Raft handles leader failures, competing candidates, and split-brain situations, with diagrams and step-by-step descriptions of the distributed-systems fundamentals involved.
Hello everyone, I'm Chen~
Raft is a distributed consensus algorithm designed to be easy to understand. Compared with the classic Paxos algorithm, Raft decomposes the consensus problem into simpler, relatively independent sub-problems while offering performance comparable to Multi-Paxos. Below we use animated diagrams to illustrate Raft's internal principles.
Previous article: Discussion on the Paxos distributed consistency algorithm
Raft Basics
Terminology
Raft defines three types of roles:
Leader: elected by the majority of nodes; only one leader exists at any time.
Candidate: when there is no leader, some nodes become candidates to compete for leadership.
Follower: the regular nodes that simply follow the leader.
Key concepts during elections:
Leader Election: the process of selecting a leader from candidates.
Term: a monotonically increasing number that acts as a logical clock; each new election starts a new term.
Election Timeout: the timeout after which a follower, not receiving a leader’s heartbeat, starts a new election.
Role Transitions
The following diagram shows how nodes switch roles. The transitions are:
Follower → Candidate: when the election timeout expires without hearing from a leader, the follower starts an election.
Candidate → Candidate: when the election timeout expires without a winner (e.g., a split vote), the candidate increments its term and starts a new election.
Candidate → Leader: when the candidate obtains a majority of votes.
Candidate → Follower: when another node establishes itself as leader or the candidate sees a higher term.
Leader → Follower: when a leader discovers another node with a higher term and steps down.
Note: each case will be explained in detail later.
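The transitions above can be sketched as a small state machine. This is a minimal illustration, not part of the Raft specification; the event names here are invented for readability:

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

def next_role(role, event):
    """Return the role a node transitions to on a given event.

    Event names are illustrative labels for the cases listed above,
    not actual Raft RPCs.
    """
    transitions = {
        (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
        (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # restart election
        (Role.CANDIDATE, "won_majority"): Role.LEADER,
        (Role.CANDIDATE, "saw_leader_or_higher_term"): Role.FOLLOWER,
        (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
    }
    # Any event not listed leaves the role unchanged.
    return transitions.get((role, event), role)
```

Note that a leader never transitions directly to candidate: it must first step down to follower after seeing a higher term.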
Election
Case 1: Leader Election
Each node has an "election timer" (its timeout). The timers are random within 150‑300 ms, so the probability of two nodes timing out simultaneously is low. The node whose timer expires first (e.g., node B) becomes a candidate.
Candidate B starts voting; followers A and C respond. When B receives a majority of votes, the election succeeds and B becomes the leader.
Heartbeat detection: the leader periodically sends heartbeat messages to followers. Each received heartbeat resets a follower's election timer to a fresh random timeout, preventing new elections.
The heartbeat interval must be shorter than the election timeout; otherwise, followers may time out and trigger unnecessary elections.
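The randomized timer and heartbeat reset can be sketched as follows. The 150-300 ms range is from the text above; the heartbeat interval of 50 ms is an assumed value chosen only to satisfy the "much shorter than the timeout" rule:

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # randomized range, as described above
HEARTBEAT_INTERVAL_MS = 50        # assumed; must be well below the timeout

def new_election_timeout():
    """Pick a fresh randomized timeout so nodes rarely expire together."""
    return random.uniform(*ELECTION_TIMEOUT_MS)

class FollowerTimer:
    def __init__(self):
        self.remaining_ms = new_election_timeout()

    def on_heartbeat(self):
        # A heartbeat from the leader resets the timer, suppressing elections.
        self.remaining_ms = new_election_timeout()

    def tick(self, elapsed_ms):
        """Advance time; returns True when the node should become a candidate."""
        self.remaining_ms -= elapsed_ms
        return self.remaining_ms <= 0
```

As long as heartbeats arrive every 50 ms, `on_heartbeat` fires well before the 150 ms minimum timeout, so a healthy leader keeps every follower's timer from expiring.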
Case 2: Leader Failure
If the leader (node B) crashes, followers A and C continue their election timers. When A’s timer expires first, it becomes a candidate and follows the same election flow: request votes → receive votes → become leader → start heartbeats.
Case 3: Multiple Candidates
When two candidates (A and D) time out simultaneously, both request votes for the same term. If one obtains a majority first, it becomes leader; if the votes split evenly, no one wins, the round times out, and a new round of voting is triggered.
Later, node C becomes the new candidate with term 5, initiates a new round, and after other nodes update their term values, C is elected the new leader.
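The rule that decides each round is simple: a candidate wins only with votes from a strict majority of the cluster. A minimal sketch (the vote counts below mirror the split-vote scenario, not any real RPC flow):

```python
def election_outcome(granted_votes, cluster_size):
    """True if a candidate's vote count constitutes a strict majority.

    With an even split (e.g. 2 vs 2 in a 4-node cluster) nobody reaches
    a majority, the round times out, and candidates retry in a new term.
    """
    majority = cluster_size // 2 + 1
    return granted_votes >= majority

# Two candidates each collect 2 votes in a 4-node cluster: split vote.
split_vote = election_outcome(2, 4)      # False -> retry with a higher term
# In the next term one candidate gathers 3 of 4 votes and wins.
next_round = election_outcome(3, 4)      # True -> becomes leader
```

Requiring a strict majority (not a plurality) is what guarantees at most one leader per term: two disjoint majorities cannot exist in the same cluster.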
Log Replication
State‑Machine Replication
The basic idea is a replicated state machine: each server keeps a log of commands and an identical state machine. The consensus module on each server receives commands from clients, appends them to its log, and communicates with the other servers to ensure that all logs eventually contain the same commands in the same order. Once a command is safely replicated, each server's state machine executes it in log order and the result is returned to the client.
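The core property is easy to demonstrate: if two servers apply the same log in the same order, they end up in the same state. A toy key-value state machine (all names here are illustrative):

```python
class KVStateMachine:
    """A toy key-value state machine for illustration: deterministic,
    so identical logs applied in order yield identical states."""
    def __init__(self):
        self.data = {}

    def apply(self, command):
        op, key, value = command
        if op == "set":
            self.data[key] = value

# Two replicas applying the same committed log in the same order...
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]
a, b = KVStateMachine(), KVStateMachine()
for entry in log:
    a.apply(entry)
    b.apply(entry)
# ...converge to the same state: {"x": 3, "y": 2} on both replicas.
```

Raft's job is precisely to make the `log` list identical on every server; the state machines then converge for free.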
Data Synchronization Process
When a client issues a data-update request, it first reaches the leader (node C). The leader appends the entry to its log and notifies followers to do the same. Once a majority of nodes have appended the entry and acknowledged it, the entry is committed. The leader then applies the entry to its local state machine, notifies followers to apply it as well, and finally returns success to the client. Subsequent client requests repeat this cycle.
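The write path above can be condensed into a sketch. This is deliberately simplified: synchronous calls, no RPC failures, no terms or indexes, and the `Node` class is invented for illustration:

```python
class Node:
    """Minimal stand-in for a Raft server (illustrative; no real RPCs)."""
    def __init__(self):
        self.log = []     # replicated log entries
        self.state = []   # commands applied to the state machine

    def append(self, cmd):
        self.log.append(cmd)
        return True       # acknowledge the append to the leader

    def apply(self, cmd):
        self.state.append(cmd)

def handle_client_write(leader, followers, command):
    leader.log.append(command)                         # 1. leader appends locally
    acks = sum(f.append(command) for f in followers)   # 2. replicate to followers
    majority = (len(followers) + 1) // 2 + 1
    if acks + 1 >= majority:                           # 3. majority acked (leader counts)
        leader.apply(command)                          # 4. commit: apply on the leader
        for f in followers:                            # 5. followers apply too
            f.apply(command)
        return "success"                               # 6. reply to the client
    return "retry"
```

A real implementation pipelines these steps and retries failed followers in the background; the point here is only the ordering: append, replicate to a majority, then apply.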
Log Replication Principle
Each log entry typically contains three fields: an integer index (Log Index), a term number (Term), and a command (Command). The index indicates the entry’s position in the log file; the term helps detect inconsistencies across servers; the command is the external operation to be executed by the state machine.
A leader may consider an entry “committed” when it has been replicated on a majority of nodes. For example, entry 9 is replicated on 4 out of 7 nodes and is therefore committed, whereas entry 10 is only on 3 nodes and is not yet committed.
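In practice the leader tracks, for each node, the highest log index known to be replicated there (`matchIndex` in the paper's terminology), and the commit index is the highest index replicated on a majority. A sketch matching the example above:

```python
def committed_index(match_index, cluster_size):
    """Highest log index replicated on a majority of nodes.

    match_index holds, for every node (leader included), the highest
    entry known to be stored there.
    """
    majority = cluster_size // 2 + 1
    for idx in sorted(set(match_index), reverse=True):
        if sum(1 for m in match_index if m >= idx) >= majority:
            return idx
    return 0

# Mirrors the example: 7 nodes, entry 9 on 4 of them, entry 10 on only 3,
# so the commit index is 9.
commit = committed_index([10, 10, 10, 9, 8, 8, 8], 7)
```

(Real Raft adds one more restriction: a leader only advances the commit index over entries from its own current term; that detail is omitted here.)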
Usually the leader and followers keep identical logs. If a leader crashes before fully replicating its log, inconsistencies may arise. Raft forces followers to match the leader’s log; conflicting entries on a follower are overwritten by the leader’s entries. To do this, the leader tracks a nextIndex for each follower, representing the index of the next log entry to send.
When a leader is elected, it optimistically assumes all followers are up to date and initializes each follower's nextIndex to lastLogIndex + 1. In the example, the leader's last log index is 10, so nextIndex starts at 11. The leader sends an AppendEntries RPC containing the pair (prevLogTerm, prevLogIndex), i.e., the term and index of the entry immediately preceding nextIndex. If the follower's log does not contain a matching entry at that position, it replies with failure, prompting the leader to decrement nextIndex and retry until the logs match. Once a match is found, the follower discards any conflicting entries after the match point and appends the leader's newer entries.
Example: the leader’s nextIndex is 11, so it sends AppendEntries RPC (6, 10) to follower b, which fails because b’s log has no matching entry there. The leader retries with (6, 9), (6, 8), (5, 7), (5, 6), (4, 5), and finally (4, 4), which succeeds. Follower b then deletes its entries after index 4 and appends the leader’s entries from index 5 onward.
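The backtracking loop can be sketched as follows. Logs are lists of `(term, command)` pairs, with list position 0 corresponding to log index 1; the helper names are invented for illustration:

```python
def find_match(leader_log, follower_log):
    """Back nextIndex down until the AppendEntries consistency check passes.

    The check compares the term at prevLogIndex = nextIndex - 1 on both
    sides; nextIndex == 1 means replication restarts from the very top.
    """
    next_index = len(leader_log) + 1
    while next_index > 1:
        prev = next_index - 1
        if (len(follower_log) >= prev
                and follower_log[prev - 1][0] == leader_log[prev - 1][0]):
            break  # terms match at prev -> logs agree up to here
        next_index -= 1
    return next_index

def reconcile(leader_log, follower_log):
    """Follower truncates conflicting entries and appends the leader's suffix."""
    n = find_match(leader_log, follower_log)
    return follower_log[: n - 1] + leader_log[n - 1 :]
```

Decrementing nextIndex one step per failed RPC is the basic scheme; real implementations often let the follower return a conflict term/index hint so the leader can skip whole terms at once.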
Split‑Brain Situation
If a network partition creates two separate clusters, each may elect its own leader (a “double‑leader” scenario). The leader in the isolated partition cannot replicate its writes to a majority, so those writes never become committed (as shown in the fourth diagram where SET 3 is not committed).
When the network heals, the old leader discovers that the new leader’s term is higher, steps down to follower, and synchronizes its data from the new leader, restoring cluster consistency as described in the log‑replication section.
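The step-down rule is the same term comparison used everywhere in Raft: any node that sees a message with a higher term adopts that term and reverts to follower. A minimal sketch (the dict-based node is purely illustrative):

```python
def on_message(node, msg_term):
    """Seeing a higher term forces any node -- including a stale
    leader from before the partition -- back to follower."""
    if msg_term > node["term"]:
        node["term"] = msg_term
        node["role"] = "follower"
    return node

# The old leader from the minority partition (term 3) hears from the
# new leader (term 4) once the network heals, and steps down.
old_leader = {"role": "leader", "term": 3}
on_message(old_leader, 4)
```

After stepping down, the old leader's uncommitted entries (like the SET 3 above) are overwritten through the normal log-replication mechanism, so no committed data is ever lost.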
Recommended Reading
Two weeks of collected big‑company interview experiences – a must‑read for job‑hoppers after the holiday
Alibaba final interview: differences between OAuth2.0 and Single Sign‑On
Practical guide: integrating Spring Cloud Gateway with OAuth2.0 for distributed authentication
Why is Nacos so powerful from an implementation perspective?
Alibaba rate‑limiting tool Sentinel – 17 tough questions
OpenFeign – 9 painful questions
Spring Cloud Gateway – 10 tough questions
Link tracing with SkyWalking – why it feels so good
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn