How Raft Achieves Consensus: Leader Election, Log Replication, and State Machine Explained
This article explains the core mechanisms of the Raft consensus algorithm—including leader election, log replication, safety guarantees, message structures, state transitions, and key Go implementations in etcd-raft—providing code examples and detailed analysis of functions such as becomeLeader, tickElection, and appendEntry.
In essence, Raft achieves consensus by having every node follow a single leader, which keeps the sequence of agreed values and each node's log consistent.
Leader election: after a leader fails, the cluster quickly selects a new leader.
Log replication: only the leader can write logs; it replicates them to followers and forces followers to stay identical.
Safety: only one leader per term, committed entries survive leader changes, and all state machines apply identical log entries.
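The third point can be made concrete with a minimal, self-contained sketch (illustrative types only, not etcd-raft code): each node applies committed entries, in index order, to its state machine, so identical logs plus the same commit index yield identical states everywhere.

```go
package main

import "fmt"

// entry is a hypothetical minimal log entry: an index plus an opaque command.
type entry struct {
	Index uint64
	Data  string
}

// apply feeds every committed entry, in index order, into the state machine.
// Because all nodes converge on identical logs and the same commit index,
// every node's state machine ends up in the same state.
func apply(log []entry, committed uint64) []string {
	var state []string
	for _, e := range log {
		if e.Index <= committed {
			state = append(state, e.Data)
		}
	}
	return state
}

func main() {
	log := []entry{{1, "set x=1"}, {2, "set y=2"}, {3, "set x=3"}}
	fmt.Println(apply(log, 2)) // entry 3 is not yet committed, so it is not applied
}
```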
Leader Election
Raft treats every operation—election, commit, etc.—as a message struct fed into the Raft state machine. The etcd‑raft library defines this struct as follows:
<code>type Message struct {
Type MessageType `protobuf:"varint,1,opt,name=type,enum=raftpb.MessageType" json:"type"` // message type
To uint64 `protobuf:"varint,2,opt,name=to" json:"to"` // receiver node ID
From uint64 `protobuf:"varint,3,opt,name=from" json:"from"` // sender node ID
Term uint64 `protobuf:"varint,4,opt,name=term" json:"term"` // sender's term (0 for local messages)
LogTerm uint64 `protobuf:"varint,5,opt,name=logTerm" json:"logTerm"` // term of the log entry at Index (append consistency check)
Index uint64 `protobuf:"varint,6,opt,name=index" json:"index"` // log index the message refers to (e.g., the entry preceding Entries in MsgApp)
Entries []Entry `protobuf:"bytes,7,rep,name=entries" json:"entries"` // entries to replicate (MsgApp)
Commit uint64 `protobuf:"varint,8,opt,name=commit" json:"commit"` // committed log index
Snapshot Snapshot `protobuf:"bytes,9,opt,name=snapshot" json:"snapshot"` // snapshot data
Reject bool `protobuf:"varint,10,opt,name=reject" json:"reject"` // whether the message is rejected
RejectHint uint64 `protobuf:"varint,11,opt,name=rejectHint" json:"rejectHint"` // hint for rejected messages
Context []byte `protobuf:"bytes,12,opt,name=context" json:"context,omitempty"` // optional context information
XXX_unrecognized []byte `json:"-"`
}
</code>
The MessageType enum contains 19 values, only a subset of which are used for any given message:
<code>MsgHup = 0 // follower election timeout
MsgBeat = 1 // local: tells the leader to send heartbeats
MsgProp = 2 // client write request
MsgApp = 3 // leader sends entries to follower
MsgAppResp = 4 // follower response to MsgApp
MsgVote = 5 // candidate requests vote
MsgVoteResp = 6 // vote response
MsgSnap = 7 // leader sends snapshot
MsgHeartbeat = 8 // heartbeat sent from leader to followers
MsgHeartbeatResp = 9 // follower heartbeat response
MsgUnreachable = 10 // message cannot be delivered
MsgSnapStatus = 11 // snapshot status after send failure
MsgCheckQuorum = 12 // leader checks quorum
MsgTransferLeader = 13 // leader transfer (local)
MsgTimeoutNow = 14 // force immediate election
MsgReadIndex = 15 // read‑only request
MsgReadIndexResp = 16 // read‑only response
MsgPreVote = 17 // pre‑candidate vote request
MsgPreVoteResp = 18 // pre‑candidate vote response
</code>
The node struct wraps the etcd‑raft library, exposing channels for different message categories:
<code>type node struct {
propc chan msgWithResult // MsgProp messages
recvc chan pb.Message // all other messages
confc chan pb.ConfChangeV2 // configuration changes
confstatec chan pb.ConfState // current cluster configuration
readyc chan Ready // Ready instances for the upper layer
advancec chan struct{} // signals that Ready has been processed
tickc chan struct{} // logical clock ticks
done chan struct{} // closed when node stops
stop chan struct{} // triggers node shutdown
status chan chan Status // status queries
rn *RawNode
}
</code>
Raft nodes can be in one of three states: Follower, Candidate, or Leader. Each state has its own tick and step functions, and transitions between the states follow fixed rules.
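The legal transitions can be summarized in a small lookup table (an illustrative sketch with invented names, not etcd-raft code): a leader must first step down to follower before it can campaign again.

```go
package main

import "fmt"

// allowed sketches the state graph, including etcd-raft's extra
// PreCandidate state described below.
var allowed = map[string][]string{
	"follower":      {"follower", "pre-candidate", "candidate"},
	"pre-candidate": {"follower", "candidate"},
	"candidate":     {"follower", "candidate", "leader"},
	"leader":        {"follower", "leader"},
}

// canTransition reports whether a transition is legal; leader -> candidate
// and leader -> pre-candidate are exactly the edges the becomeCandidate and
// becomePreCandidate functions panic on.
func canTransition(from, to string) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("candidate", "leader")) // true
	fmt.Println(canTransition("leader", "candidate")) // false: must step down first
}
```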
etcd‑raft adds a PreCandidate state to avoid unnecessary elections triggered by nodes whose logs are far behind. The state-transition functions are defined in raft.go:
<code>func (r *raft) becomeFollower(term uint64, lead uint64) {
r.step = stepFollower
r.reset(term)
r.tick = r.tickElection
r.lead = lead
r.state = StateFollower
}
func (r *raft) becomePreCandidate() {
if r.state == StateLeader {
panic("invalid transition [leader -> pre-candidate]")
}
r.step = stepCandidate
r.prs.ResetVotes()
r.tick = r.tickElection
r.lead = None
r.state = StatePreCandidate
}
func (r *raft) becomeCandidate() {
if r.state == StateLeader {
panic("invalid transition [leader -> candidate]")
}
r.step = stepCandidate
r.reset(r.Term + 1)
r.tick = r.tickElection
r.Vote = r.id
r.state = StateCandidate
}
func (r *raft) becomeLeader() {
if r.state == StateFollower {
panic("invalid transition [follower -> leader]")
}
r.step = stepLeader
r.reset(r.Term)
r.tick = r.tickHeartbeat
r.lead = r.id
r.state = StateLeader
r.prs.Progress[r.id].BecomeReplicate()
r.pendingConfIndex = r.raftLog.lastIndex()
emptyEnt := pb.Entry{Data: nil}
if !r.appendEntry(emptyEnt) {
// unreachable: uncommitted size was just reset by r.reset above
r.logger.Panic("empty entry was dropped")
}
r.reduceUncommittedSize([]pb.Entry{emptyEnt})
}
</code>
The tickElection and tickHeartbeat functions drive time-based actions:
<code>func (r *raft) tickElection() {
r.electionElapsed++
if r.promotable() && r.pastElectionTimeout() {
r.electionElapsed = 0
r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
}
}
func (r *raft) tickHeartbeat() {
r.heartbeatElapsed++
r.electionElapsed++
if r.electionElapsed >= r.electionTimeout {
r.electionElapsed = 0
if r.checkQuorum {
r.Step(pb.Message{From: r.id, Type: pb.MsgCheckQuorum})
}
if r.state == StateLeader && r.leadTransferee != None {
r.abortLeaderTransfer()
}
}
if r.state != StateLeader {
return
}
if r.heartbeatElapsed >= r.heartbeatTimeout {
r.heartbeatElapsed = 0
r.Step(pb.Message{From: r.id, Type: pb.MsgBeat})
}
}
</code>
Log replication is performed by appendEntry, which sets each entry's term and index, appends to the raft log, updates the leader's own progress, and attempts to commit (this listing elides the uncommitted-size check that can make it return false):
<code>func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
li := r.raftLog.lastIndex()
for i := range es {
es[i].Term = r.Term
es[i].Index = li + 1 + uint64(i)
}
li = r.raftLog.append(es...)
r.prs.Progress[r.id].MaybeUpdate(li)
r.maybeCommit()
return true
}
</code>
The Progress.MaybeUpdate method advances a peer's match and next indices (here it is called for the leader's own progress entry):
<code>func (pr *Progress) MaybeUpdate(n uint64) bool {
var updated bool
if pr.Match < n {
pr.Match = n
updated = true
pr.ProbeAcked()
}
pr.Next = max(pr.Next, n+1)
return updated
}
</code>
Commit decisions are made in maybeCommit, based on the highest index matched by a quorum:
<code>func (r *raft) maybeCommit() bool {
mci := r.prs.Committed()
return r.raftLog.maybeCommit(mci, r.Term)
}
</code>
All Raft messages are processed by the Step function, which first checks the term and then dispatches based on message type (e.g., MsgHup, MsgVote, MsgPreVote). Depending on the type, it may invoke handlers such as hup and campaign, or fall through to the current state-specific step implementation.
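The skeleton that follows calls r.raftLog.isUpToDate without showing it. The underlying rule is the Raft paper's election restriction, sketched here with plain parameters rather than etcd-raft's types:

```go
package main

import "fmt"

// isUpToDate implements the election restriction: a candidate's log is at
// least as up-to-date as the voter's if its last entry has a higher term,
// or the last terms are equal and its log is at least as long.
func isUpToDate(candLastIndex, candLastTerm, myLastIndex, myLastTerm uint64) bool {
	return candLastTerm > myLastTerm ||
		(candLastTerm == myLastTerm && candLastIndex >= myLastIndex)
}

func main() {
	fmt.Println(isUpToDate(5, 3, 7, 2)) // true: a higher last term beats a longer log
	fmt.Println(isUpToDate(5, 3, 7, 3)) // false: same last term, but a shorter log
}
```

This check is what guarantees that any elected leader already holds every committed entry, since a committed entry is present on a quorum and at least one quorum member will refuse to vote for a candidate missing it.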
<code>func (r *raft) Step(m pb.Message) error {
switch {
case m.Term == 0:
// local message, ignore
case m.Term > r.Term:
// handle higher term
case m.Term < r.Term:
// ignore stale term
}
switch m.Type {
case pb.MsgHup:
if r.preVote {
// start pre‑vote
} else {
r.hup(campaignElection)
}
case pb.MsgVote, pb.MsgPreVote:
canVote := r.Vote == m.From || (r.Vote == None && r.lead == None) || (m.Type == pb.MsgPreVote && m.Term > r.Term)
if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
if m.Type == pb.MsgVote {
r.electionElapsed = 0
r.Vote = m.From
}
} else {
r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
}
default:
if err := r.step(r, m); err != nil {
return err
}
}
return nil
}
</code>
When a follower's election timer expires, tickElection steps a MsgHup message, which moves the node to PreCandidate and broadcasts MsgPreVote to the rest of the cluster.
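The election timer is randomized per node: etcd-raft picks a fresh timeout in [electiontimeout, 2*electiontimeout) after every reset, so two followers rarely start campaigning in the same tick and split the vote. A minimal sketch of that idea (simplified, not the library's code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// resetRandomizedTimeout mirrors the idea behind etcd-raft's
// resetRandomizedElectionTimeout: draw a timeout uniformly from
// [electionTimeout, 2*electionTimeout) so election starts are staggered.
func resetRandomizedTimeout(electionTimeout int) int {
	return electionTimeout + rand.Intn(electionTimeout)
}

func main() {
	const electionTimeout = 10
	t := resetRandomizedTimeout(electionTimeout)
	fmt.Println(t >= electionTimeout && t < 2*electionTimeout) // always within range
}
```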
Reference: GeeksTime etcd practical course and the book "etcd Inside".
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.