Understanding the Raft Consensus Algorithm: Roles, Elections, Log Replication, and Split‑Brain Scenarios
This article explains the Raft consensus algorithm in detail: its three node roles, the leader-election process, log replication to the state machine, and how Raft handles leader failures, competing candidates, and split-brain situations, with diagrams and step-by-step descriptions of the distributed-systems fundamentals involved.
Hello everyone, I'm Chen~
Raft is a distributed consensus algorithm designed to be easy to understand. Compared with the classic Paxos algorithm, Raft decomposes the consensus problem into simpler, relatively independent sub-problems while offering performance comparable to Multi-Paxos. Below we use animated diagrams to illustrate Raft's internal principles.
Previous article: Discussion on the Paxos distributed consistency algorithm
Raft Basics
Terminology
Raft defines three types of roles:
Leader: elected by the majority of nodes; only one leader exists at any time.
Candidate: when there is no leader, some nodes become candidates to compete for leadership.
Follower: the regular nodes that simply follow the leader.
Key concepts during elections:
Leader Election: the process of selecting a leader from candidates.
Term: a monotonically increasing number that acts as a logical clock; each new election starts a new term.
Election Timeout: the timeout after which a follower, not receiving a leader’s heartbeat, starts a new election.
Role Transitions
The following diagram shows how nodes switch roles. The transitions are:
Follower → Candidate: when the election timeout expires without hearing from a leader, the follower starts an election.
Candidate → Candidate: when the election timeout expires without a winner (e.g., a split vote), the candidate increments its term and starts a new election.
Candidate → Leader: when the candidate obtains a majority of votes.
Candidate → Follower: when another node establishes itself as leader or the candidate sees a higher term.
Leader → Follower: when a leader discovers another node with a higher term and steps down.
Note: each case will be explained in detail later.
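The transitions above can be sketched as a small state machine. This is a minimal illustration, not part of the Raft specification; the event names here are invented for readability:

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

def next_role(role, event):
    """Return the role a node transitions to on a given event.

    Event names are illustrative labels for the cases listed above,
    not actual Raft RPCs.
    """
    transitions = {
        (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
        (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # restart election
        (Role.CANDIDATE, "won_majority"): Role.LEADER,
        (Role.CANDIDATE, "saw_leader_or_higher_term"): Role.FOLLOWER,
        (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
    }
    # Any event not listed leaves the role unchanged.
    return transitions.get((role, event), role)
```

Note that a leader never transitions directly to candidate: it must first step down to follower after seeing a higher term.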
Election
Case 1: Leader Election
Each node has an "election timer" (its timeout). The timers are random within 150‑300 ms, so the probability of two nodes timing out simultaneously is low. The node whose timer expires first (e.g., node B) becomes a candidate.
Candidate B starts voting; followers A and C respond. When B receives a majority of votes, the election succeeds and B becomes the leader.
Heartbeat detection: the leader periodically sends heartbeat messages to followers. Each received heartbeat resets a follower's election timer to a fresh random timeout, preventing new elections.
The heartbeat interval must be shorter than the election timeout; otherwise, followers may time out and trigger unnecessary elections.
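The randomized timer and heartbeat reset can be sketched as follows. The 150-300 ms range is from the text above; the heartbeat interval of 50 ms is an assumed value chosen only to satisfy the "much shorter than the timeout" rule:

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # randomized range, as described above
HEARTBEAT_INTERVAL_MS = 50        # assumed; must be well below the timeout

def new_election_timeout():
    """Pick a fresh randomized timeout so nodes rarely expire together."""
    return random.uniform(*ELECTION_TIMEOUT_MS)

class FollowerTimer:
    def __init__(self):
        self.remaining_ms = new_election_timeout()

    def on_heartbeat(self):
        # A heartbeat from the leader resets the timer, suppressing elections.
        self.remaining_ms = new_election_timeout()

    def tick(self, elapsed_ms):
        """Advance time; returns True when the node should become a candidate."""
        self.remaining_ms -= elapsed_ms
        return self.remaining_ms <= 0
```

As long as heartbeats arrive every 50 ms, `on_heartbeat` fires well before the 150 ms minimum timeout, so a healthy leader keeps every follower's timer from expiring.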
Case 2: Leader Failure
If the leader (node B) crashes, followers A and C continue their election timers. When A’s timer expires first, it becomes a candidate and follows the same election flow: request votes → receive votes → become leader → start heartbeats.
Case 3: Multiple Candidates
When two candidates (A and D) time out simultaneously, both request votes for the same term. If one obtains a majority first, it becomes leader; if the votes split evenly, no one wins, the round times out, and a new round of voting is triggered.
Later, node C becomes the new candidate with term 5, initiates a new round, and after other nodes update their term values, C is elected the new leader.
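The rule that decides each round is simple: a candidate wins only with votes from a strict majority of the cluster. A minimal sketch (the vote counts below mirror the split-vote scenario, not any real RPC flow):

```python
def election_outcome(granted_votes, cluster_size):
    """True if a candidate's vote count constitutes a strict majority.

    With an even split (e.g. 2 vs 2 in a 4-node cluster) nobody reaches
    a majority, the round times out, and candidates retry in a new term.
    """
    majority = cluster_size // 2 + 1
    return granted_votes >= majority

# Two candidates each collect 2 votes in a 4-node cluster: split vote.
split_vote = election_outcome(2, 4)      # False -> retry with a higher term
# In the next term one candidate gathers 3 of 4 votes and wins.
next_round = election_outcome(3, 4)      # True -> becomes leader
```

Requiring a strict majority (not a plurality) is what guarantees at most one leader per term: two disjoint majorities cannot exist in the same cluster.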
Log Replication
State‑Machine Replication
The basic idea is a replicated state machine: each server keeps a log of commands and an identical state machine. The consensus module on each server receives commands from clients, appends them to its log, and communicates with the other servers to ensure that all logs eventually contain the same commands in the same order. Once a command is safely replicated, each server's state machine executes it in log order and the result is returned to the client.
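The core property is easy to demonstrate: if two servers apply the same log in the same order, they end up in the same state. A toy key-value state machine (all names here are illustrative):

```python
class KVStateMachine:
    """A toy key-value state machine for illustration: deterministic,
    so identical logs applied in order yield identical states."""
    def __init__(self):
        self.data = {}

    def apply(self, command):
        op, key, value = command
        if op == "set":
            self.data[key] = value

# Two replicas applying the same committed log in the same order...
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]
a, b = KVStateMachine(), KVStateMachine()
for entry in log:
    a.apply(entry)
    b.apply(entry)
# ...converge to the same state: {"x": 3, "y": 2} on both replicas.
```

Raft's job is precisely to make the `log` list identical on every server; the state machines then converge for free.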
Data Synchronization Process
When a client issues a data-update request, it first reaches the leader (node C). The leader appends the entry to its log and notifies followers to do the same. Once a majority of nodes have appended the entry and acknowledged it, the entry is committed. The leader then applies the entry to its local state machine, notifies followers to apply it as well, and finally returns success to the client. Subsequent client requests repeat this cycle.
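The write path above can be condensed into a sketch. This is deliberately simplified: synchronous calls, no RPC failures, no terms or indexes, and the `Node` class is invented for illustration:

```python
class Node:
    """Minimal stand-in for a Raft server (illustrative; no real RPCs)."""
    def __init__(self):
        self.log = []     # replicated log entries
        self.state = []   # commands applied to the state machine

    def append(self, cmd):
        self.log.append(cmd)
        return True       # acknowledge the append to the leader

    def apply(self, cmd):
        self.state.append(cmd)

def handle_client_write(leader, followers, command):
    leader.log.append(command)                         # 1. leader appends locally
    acks = sum(f.append(command) for f in followers)   # 2. replicate to followers
    majority = (len(followers) + 1) // 2 + 1
    if acks + 1 >= majority:                           # 3. majority acked (leader counts)
        leader.apply(command)                          # 4. commit: apply on the leader
        for f in followers:                            # 5. followers apply too
            f.apply(command)
        return "success"                               # 6. reply to the client
    return "retry"
```

A real implementation pipelines these steps and retries failed followers in the background; the point here is only the ordering: append, replicate to a majority, then apply.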
Log Replication Principle
Each log entry typically contains three fields: an integer index (Log Index), a term number (Term), and a command (Command). The index indicates the entry’s position in the log file; the term helps detect inconsistencies across servers; the command is the external operation to be executed by the state machine.
A leader may consider an entry “committed” when it has been replicated on a majority of nodes. For example, entry 9 is replicated on 4 out of 7 nodes and is therefore committed, whereas entry 10 is only on 3 nodes and is not yet committed.
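In practice the leader tracks, for each node, the highest log index known to be replicated there (`matchIndex` in the paper's terminology), and the commit index is the highest index replicated on a majority. A sketch matching the example above:

```python
def committed_index(match_index, cluster_size):
    """Highest log index replicated on a majority of nodes.

    match_index holds, for every node (leader included), the highest
    entry known to be stored there.
    """
    majority = cluster_size // 2 + 1
    for idx in sorted(set(match_index), reverse=True):
        if sum(1 for m in match_index if m >= idx) >= majority:
            return idx
    return 0

# Mirrors the example: 7 nodes, entry 9 on 4 of them, entry 10 on only 3,
# so the commit index is 9.
commit = committed_index([10, 10, 10, 9, 8, 8, 8], 7)
```

(Real Raft adds one more restriction: a leader only advances the commit index over entries from its own current term; that detail is omitted here.)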
Usually the leader and followers keep identical logs. If a leader crashes before fully replicating its log, inconsistencies may arise. Raft forces followers to match the leader’s log; conflicting entries on a follower are overwritten by the leader’s entries. To do this, the leader tracks a nextIndex for each follower, representing the index of the next log entry to send.
When a leader is elected, it optimistically assumes all followers are up to date and initializes each follower's nextIndex to lastLogIndex + 1. In the example, the leader's last log index is 10, so nextIndex starts at 11. The leader sends an AppendEntries RPC containing the pair (prevLogTerm, prevLogIndex), i.e., the term and index of the entry immediately preceding nextIndex. If the follower's log does not contain a matching entry at that position, it replies with failure, prompting the leader to decrement nextIndex and retry until the logs match. Once a match is found, the follower discards any conflicting entries after the match point and appends the leader's newer entries.
Example: the leader’s nextIndex is 11, so it sends AppendEntries RPC (6, 10) to follower b, which fails because b’s log has no matching entry there. The leader retries with (6, 9), (6, 8), (5, 7), (5, 6), (4, 5), and finally (4, 4), which succeeds. Follower b then deletes its entries after index 4 and appends the leader’s entries from index 5 onward.
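The backtracking loop can be sketched as follows. Logs are lists of `(term, command)` pairs, with list position 0 corresponding to log index 1; the helper names are invented for illustration:

```python
def find_match(leader_log, follower_log):
    """Back nextIndex down until the AppendEntries consistency check passes.

    The check compares the term at prevLogIndex = nextIndex - 1 on both
    sides; nextIndex == 1 means replication restarts from the very top.
    """
    next_index = len(leader_log) + 1
    while next_index > 1:
        prev = next_index - 1
        if (len(follower_log) >= prev
                and follower_log[prev - 1][0] == leader_log[prev - 1][0]):
            break  # terms match at prev -> logs agree up to here
        next_index -= 1
    return next_index

def reconcile(leader_log, follower_log):
    """Follower truncates conflicting entries and appends the leader's suffix."""
    n = find_match(leader_log, follower_log)
    return follower_log[: n - 1] + leader_log[n - 1 :]
```

Decrementing nextIndex one step per failed RPC is the basic scheme; real implementations often let the follower return a conflict term/index hint so the leader can skip whole terms at once.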
Split‑Brain Situation
If a network partition creates two separate clusters, each may elect its own leader (a “double‑leader” scenario). The leader in the isolated partition cannot replicate its writes to a majority, so those writes never become committed (as shown in the fourth diagram where SET 3 is not committed).
When the network heals, the old leader discovers that the new leader’s term is higher, steps down to follower, and synchronizes its data from the new leader, restoring cluster consistency as described in the log‑replication section.
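The step-down rule is the same term comparison used everywhere in Raft: any node that sees a message with a higher term adopts that term and reverts to follower. A minimal sketch (the dict-based node is purely illustrative):

```python
def on_message(node, msg_term):
    """Seeing a higher term forces any node -- including a stale
    leader from before the partition -- back to follower."""
    if msg_term > node["term"]:
        node["term"] = msg_term
        node["role"] = "follower"
    return node

# The old leader from the minority partition (term 3) hears from the
# new leader (term 4) once the network heals, and steps down.
old_leader = {"role": "leader", "term": 3}
on_message(old_leader, 4)
```

After stepping down, the old leader's uncommitted entries (like the SET 3 above) are overwritten through the normal log-replication mechanism, so no committed data is ever lost.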
Recommended Reading
Two weeks of collected big‑company interview experiences – a must‑read for job‑hoppers after the holiday
Alibaba final interview: differences between OAuth2.0 and Single Sign‑On
Practical guide: integrating Spring Cloud Gateway with OAuth2.0 for distributed authentication
Why is Nacos so powerful from an implementation perspective?
Alibaba rate‑limiting tool Sentinel – 17 tough questions
OpenFeign – 9 painful questions
Spring Cloud Gateway – 10 tough questions
Link tracing with SkyWalking – why it feels so good
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn