How Raft Achieves Consensus: Leader Election, Log Replication, and State Machine Explained
This article explains the core mechanisms of the Raft consensus algorithm—including leader election, log replication, safety guarantees, message structures, state transitions, and key Go implementations in etcd-raft—providing code examples and detailed analysis of functions such as becomeLeader, tickElection, and appendEntry.
In essence, Raft achieves consensus by having every node follow a single leader, which keeps the sequence of agreed values and each node's log consistent.
Leader election: after a leader fails, the cluster quickly selects a new leader.
Log replication: only the leader can write logs; it replicates them to followers and forces followers to stay identical.
Safety: only one leader per term, committed entries survive leader changes, and all state machines apply identical log entries.
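The third point can be made concrete with a minimal, self-contained sketch (illustrative types only, not etcd-raft code): each node applies committed entries, in index order, to its state machine, so identical logs plus the same commit index yield identical states everywhere.

```go
package main

import "fmt"

// entry is a hypothetical minimal log entry: an index plus an opaque command.
type entry struct {
	Index uint64
	Data  string
}

// apply feeds every committed entry, in index order, into the state machine.
// Because all nodes converge on identical logs and the same commit index,
// every node's state machine ends up in the same state.
func apply(log []entry, committed uint64) []string {
	var state []string
	for _, e := range log {
		if e.Index <= committed {
			state = append(state, e.Data)
		}
	}
	return state
}

func main() {
	log := []entry{{1, "set x=1"}, {2, "set y=2"}, {3, "set x=3"}}
	fmt.Println(apply(log, 2)) // entry 3 is not yet committed, so it is not applied
}
```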
Leader Election
Raft treats every operation—election, commit, etc.—as a message struct fed into the Raft state machine. The etcd‑raft library defines this struct as follows:
<code>type Message struct {
Type MessageType `protobuf:"varint,1,opt,name=type,enum=raftpb.MessageType" json:"type"` // message type
To uint64 `protobuf:"varint,2,opt,name=to" json:"to"` // receiver node ID
From uint64 `protobuf:"varint,3,opt,name=from" json:"from"` // sender node ID
Term uint64 `protobuf:"varint,4,opt,name=term" json:"term"` // sender's term (0 for local messages)
LogTerm uint64 `protobuf:"varint,5,opt,name=logTerm" json:"logTerm"` // term of the log entry at Index (append consistency check)
Index uint64 `protobuf:"varint,6,opt,name=index" json:"index"` // log index the message refers to (e.g., the entry preceding Entries in MsgApp)
Entries []Entry `protobuf:"bytes,7,rep,name=entries" json:"entries"` // entries to replicate (MsgApp)
Commit uint64 `protobuf:"varint,8,opt,name=commit" json:"commit"` // committed log index
Snapshot Snapshot `protobuf:"bytes,9,opt,name=snapshot" json:"snapshot"` // snapshot data
Reject bool `protobuf:"varint,10,opt,name=reject" json:"reject"` // whether the message is rejected
RejectHint uint64 `protobuf:"varint,11,opt,name=rejectHint" json:"rejectHint"` // hint for rejected messages
Context []byte `protobuf:"bytes,12,opt,name=context" json:"context,omitempty"` // optional context information
XXX_unrecognized []byte `json:"-"`
}
</code>
The MessageType enum contains 19 values, only a subset of which are used for any given message:
<code>MsgHup = 0 // follower election timeout
MsgBeat = 1 // local: tells the leader to send heartbeats
MsgProp = 2 // client write request
MsgApp = 3 // leader sends entries to follower
MsgAppResp = 4 // follower response to MsgApp
MsgVote = 5 // candidate requests vote
MsgVoteResp = 6 // vote response
MsgSnap = 7 // leader sends snapshot
MsgHeartbeat = 8 // heartbeat sent from leader to followers
MsgHeartbeatResp = 9 // follower heartbeat response
MsgUnreachable = 10 // message cannot be delivered
MsgSnapStatus = 11 // snapshot status after send failure
MsgCheckQuorum = 12 // leader checks quorum
MsgTransferLeader = 13 // leader transfer (local)
MsgTimeoutNow = 14 // force immediate election
MsgReadIndex = 15 // read‑only request
MsgReadIndexResp = 16 // read‑only response
MsgPreVote = 17 // pre‑candidate vote request
MsgPreVoteResp = 18 // pre‑candidate vote response
</code>
The node struct wraps the etcd‑raft library, exposing channels for different message categories:
<code>type node struct {
propc chan msgWithResult // MsgProp messages
recvc chan pb.Message // all other messages
confc chan pb.ConfChangeV2 // configuration changes
confstatec chan pb.ConfState // current cluster configuration
readyc chan Ready // Ready instances for the upper layer
advancec chan struct{} // signals that Ready has been processed
tickc chan struct{} // logical clock ticks
done chan struct{} // closed when node stops
stop chan struct{} // triggers node shutdown
status chan chan Status // status queries
rn *RawNode
}
</code>
Raft nodes can be in one of three states: Follower, Candidate, or Leader. Each state has its own tick and step functions, and transitions between the states follow fixed rules.
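The legal transitions can be summarized in a small lookup table (an illustrative sketch with invented names, not etcd-raft code): a leader must first step down to follower before it can campaign again.

```go
package main

import "fmt"

// allowed sketches the state graph, including etcd-raft's extra
// PreCandidate state described below.
var allowed = map[string][]string{
	"follower":      {"follower", "pre-candidate", "candidate"},
	"pre-candidate": {"follower", "candidate"},
	"candidate":     {"follower", "candidate", "leader"},
	"leader":        {"follower", "leader"},
}

// canTransition reports whether a transition is legal; leader -> candidate
// and leader -> pre-candidate are exactly the edges the becomeCandidate and
// becomePreCandidate functions panic on.
func canTransition(from, to string) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("candidate", "leader")) // true
	fmt.Println(canTransition("leader", "candidate")) // false: must step down first
}
```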
etcd‑raft adds a PreCandidate state to avoid unnecessary elections triggered by nodes whose logs are far behind. The state-transition functions are defined in raft.go:
<code>func (r *raft) becomeFollower(term uint64, lead uint64) {
r.step = stepFollower
r.reset(term)
r.tick = r.tickElection
r.lead = lead
r.state = StateFollower
}
func (r *raft) becomePreCandidate() {
if r.state == StateLeader {
panic("invalid transition [leader -> pre-candidate]")
}
r.step = stepCandidate
r.prs.ResetVotes()
r.tick = r.tickElection
r.lead = None
r.state = StatePreCandidate
}
func (r *raft) becomeCandidate() {
if r.state == StateLeader {
panic("invalid transition [leader -> candidate]")
}
r.step = stepCandidate
r.reset(r.Term + 1)
r.tick = r.tickElection
r.Vote = r.id
r.state = StateCandidate
}
func (r *raft) becomeLeader() {
if r.state == StateFollower {
panic("invalid transition [follower -> leader]")
}
r.step = stepLeader
r.reset(r.Term)
r.tick = r.tickHeartbeat
r.lead = r.id
r.state = StateLeader
r.prs.Progress[r.id].BecomeReplicate()
r.pendingConfIndex = r.raftLog.lastIndex()
emptyEnt := pb.Entry{Data: nil}
if !r.appendEntry(emptyEnt) {
// unreachable: uncommitted size was just reset by r.reset above
r.logger.Panic("empty entry was dropped")
}
r.reduceUncommittedSize([]pb.Entry{emptyEnt})
}
</code>
The tickElection and tickHeartbeat functions drive time-based actions:
<code>func (r *raft) tickElection() {
r.electionElapsed++
if r.promotable() && r.pastElectionTimeout() {
r.electionElapsed = 0
r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
}
}
func (r *raft) tickHeartbeat() {
r.heartbeatElapsed++
r.electionElapsed++
if r.electionElapsed >= r.electionTimeout {
r.electionElapsed = 0
if r.checkQuorum {
r.Step(pb.Message{From: r.id, Type: pb.MsgCheckQuorum})
}
if r.state == StateLeader && r.leadTransferee != None {
r.abortLeaderTransfer()
}
}
if r.state != StateLeader {
return
}
if r.heartbeatElapsed >= r.heartbeatTimeout {
r.heartbeatElapsed = 0
r.Step(pb.Message{From: r.id, Type: pb.MsgBeat})
}
}
</code>
Log replication is performed by appendEntry, which sets each entry's term and index, appends to the raft log, updates the leader's own progress, and attempts to commit (this listing elides the uncommitted-size check that can make it return false):
<code>func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
li := r.raftLog.lastIndex()
for i := range es {
es[i].Term = r.Term
es[i].Index = li + 1 + uint64(i)
}
li = r.raftLog.append(es...)
r.prs.Progress[r.id].MaybeUpdate(li)
r.maybeCommit()
return true
}
</code>
The Progress.MaybeUpdate method advances a peer's match and next indices (here it is called for the leader's own progress entry):
<code>func (pr *Progress) MaybeUpdate(n uint64) bool {
var updated bool
if pr.Match < n {
pr.Match = n
updated = true
pr.ProbeAcked()
}
pr.Next = max(pr.Next, n+1)
return updated
}
</code>
Commit decisions are made in maybeCommit, based on the highest index matched by a quorum:
<code>func (r *raft) maybeCommit() bool {
mci := r.prs.Committed()
return r.raftLog.maybeCommit(mci, r.Term)
}
</code>
All Raft messages are processed by the Step function, which first checks the term and then dispatches based on message type (e.g., MsgHup, MsgVote, MsgPreVote). Depending on the type, it may invoke handlers such as hup and campaign, or fall through to the current state-specific step implementation.
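The skeleton that follows calls r.raftLog.isUpToDate without showing it. The underlying rule is the Raft paper's election restriction, sketched here with plain parameters rather than etcd-raft's types:

```go
package main

import "fmt"

// isUpToDate implements the election restriction: a candidate's log is at
// least as up-to-date as the voter's if its last entry has a higher term,
// or the last terms are equal and its log is at least as long.
func isUpToDate(candLastIndex, candLastTerm, myLastIndex, myLastTerm uint64) bool {
	return candLastTerm > myLastTerm ||
		(candLastTerm == myLastTerm && candLastIndex >= myLastIndex)
}

func main() {
	fmt.Println(isUpToDate(5, 3, 7, 2)) // true: a higher last term beats a longer log
	fmt.Println(isUpToDate(5, 3, 7, 3)) // false: same last term, but a shorter log
}
```

This check is what guarantees that any elected leader already holds every committed entry, since a committed entry is present on a quorum and at least one quorum member will refuse to vote for a candidate missing it.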
<code>func (r *raft) Step(m pb.Message) error {
switch {
case m.Term == 0:
// local message, ignore
case m.Term > r.Term:
// handle higher term
case m.Term < r.Term:
// ignore stale term
}
switch m.Type {
case pb.MsgHup:
if r.preVote {
// start pre‑vote
} else {
r.hup(campaignElection)
}
case pb.MsgVote, pb.MsgPreVote:
canVote := r.Vote == m.From || (r.Vote == None && r.lead == None) || (m.Type == pb.MsgPreVote && m.Term > r.Term)
if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
if m.Type == pb.MsgVote {
r.electionElapsed = 0
r.Vote = m.From
}
} else {
r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
}
default:
if err := r.step(r, m); err != nil {
return err
}
}
return nil
}
</code>
When a follower's election timer expires, tickElection steps a MsgHup message, which moves the node to PreCandidate and broadcasts MsgPreVote to the rest of the cluster.
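The election timer is randomized per node: etcd-raft picks a fresh timeout in [electiontimeout, 2*electiontimeout) after every reset, so two followers rarely start campaigning in the same tick and split the vote. A minimal sketch of that idea (simplified, not the library's code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// resetRandomizedTimeout mirrors the idea behind etcd-raft's
// resetRandomizedElectionTimeout: draw a timeout uniformly from
// [electionTimeout, 2*electionTimeout) so election starts are staggered.
func resetRandomizedTimeout(electionTimeout int) int {
	return electionTimeout + rand.Intn(electionTimeout)
}

func main() {
	const electionTimeout = 10
	t := resetRandomizedTimeout(electionTimeout)
	fmt.Println(t >= electionTimeout && t < 2*electionTimeout) // always within range
}
```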
Reference: GeeksTime etcd practical course and the book "etcd Inside".
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.