Operations 14 min read

Master Election in Elasticsearch Pre‑7.x: Bully Algorithm and Source Code Analysis

This article explains the Bully election algorithm used by Elasticsearch versions prior to 7.x, describes phenomena such as false death and split‑brain, and provides a detailed walkthrough of the master‑election source code including key Java methods and configuration parameters.

政采云技术
政采云技术
政采云技术
Master Election in Elasticsearch Pre‑7.x: Bully Algorithm and Source Code Analysis

In earlier Elasticsearch series we learned that the master node manages cluster state, shards, and indices, and only nodes with node.master=true can be elected as master. Before version 7.x Elasticsearch used the Bully algorithm for master election, while later versions switched to Raft.

Bully Algorithm

The Bully algorithm requires every participating node to have a unique ID and selects the alive node with the highest ID as the master. The election process involves three message types: Election (initiate election), Answer/Alive (response), and Coordinator/Victory (announce winner). Elections are triggered when a process recovers from an error or when the current leader fails.

The election steps are:

If the node has the highest ID, it broadcasts a Victory message; otherwise it sends Election messages to nodes with larger IDs.

If no Alive response is received, the node declares itself master with a Victory message.

If an Alive message from a higher‑ID node is received, the node stops sending messages and waits for a Victory.

If an Election message from a lower‑ID node arrives, the node replies with Alive and restarts the election.

Upon receiving a Victory message, the node accepts the sender as the leader.

False Death

When a minority of nodes think the master is down (e.g., only one node cannot ping the master while others can), a "false death" occurs, potentially causing frequent elections. Elasticsearch mitigates this by requiring more than half of the nodes to agree that the master is down before starting a new election.

Split‑Brain

Network partitions can cause each partition to elect its own master, leading to multiple masters. Elasticsearch resolves this by requiring a majority (⌊N/2⌋+1) of nodes to acknowledge a master before it is accepted.

Elasticsearch Master Election Source Code Analysis

Master election is triggered during node startup/shutdown, periodic master ping failures, and when other nodes cannot ping the master.

Main Process

The core logic resides in the ZenDiscovery module, specifically the innerJoinCluster method:

private void innerJoinCluster() {
    DiscoveryNode masterNode = null;
    final Thread currentThread = Thread.currentThread();
    // start a new election round and accept votes
    nodeJoinController.startElectionContext();
    // loop until a temporary master is found
    while (masterNode == null && joinThreadControl.joinThreadActive(currentThread)) {
        masterNode = findMaster();
    }
    if (transportService.getLocalNode().equals(masterNode)) {
        // if this node becomes temporary master, wait for other nodes to confirm
        final int requiredJoins = Math.max(0, electMaster.minimumMasterNodes() - 1);
        nodeJoinController.waitToBeElectedAsMaster(requiredJoins, masterElectionWaitForJoinsTimeout,
            new NodeJoinController.ElectionCallback() {
                @Override
                public void onElectedAsMaster(ClusterState state) {
                    // election succeeded, finish this round
                    synchronized (stateMutex) {
                        joinThreadControl.markThreadAsDone(currentThread);
                    }
                }
                @Override
                public void onFailure(Throwable t) {
                    // election failed, start a new round
                    synchronized (stateMutex) {
                        joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                    }
                }
            }
        );
    } else {
        // not elected as master, stop accepting votes and vote for the elected master
        nodeJoinController.stopElectionContext(masterNode + " elected");
        final boolean success = joinElectedMaster(masterNode);
        synchronized (stateMutex) {
            if (success) {
                DiscoveryNode currentMasterNode = this.clusterState().getNodes().getMasterNode();
                if (currentMasterNode == null) {
                    // still no master, start a new election
                    joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                } else if (!currentMasterNode.equals(masterNode)) {
                    // master switched, re‑join
                    joinThreadControl.stopRunningThreadAndRejoin("master_switched_while_finalizing_join");
                }
                joinThreadControl.markThreadAsDone(currentThread);
            } else {
                // voting failed, start a new election
                joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
            }
        }
    }
}

The configuration discovery.zen.minimum_master_nodes (usually nodeCount/2 + 1 ) prevents split‑brain.

Waiting for Other Nodes to Confirm

public void waitToBeElectedAsMaster(int requiredMasterJoins, TimeValue timeValue, final ElectionCallback callback) {
    final CountDownLatch done = new CountDownLatch(1);
    final ElectionCallback wrapperCallback = new ElectionCallback() {
        @Override
        public void onElectedAsMaster(ClusterState state) {
            done.countDown();
            callback.onElectedAsMaster(state);
        }
        @Override
        public void onFailure(Throwable t) {
            done.countDown();
            callback.onFailure(t);
        }
    };
    // set minimum votes and start election
    electionContext.onAttemptToBeElected(requiredMasterJoins, wrapperCallback);
    checkPendingJoinsAndElectIfNeeded();
    if (done.await(timeValue.millis(), TimeUnit.MILLISECONDS)) {
        return;
    }
    // timeout handling ...
}

Selecting a Temporary Master

The findMaster method pings other nodes, collects their view of the current master, and decides the master based on the gathered information:

private DiscoveryNode findMaster() {
    List
fullPingResponses = pingAndWait(pingTimeout).toList();
    final DiscoveryNode localNode = transportService.getLocalNode();
    fullPingResponses.add(new ZenPing.PingResponse(localNode, null, this.clusterState()));
    final List
pingResponses = filterPingResponses(fullPingResponses, masterElectionIgnoreNonMasters, logger);
    List
activeMasters = new ArrayList<>();
    for (ZenPing.PingResponse pingResponse : pingResponses) {
        if (pingResponse.master() != null && !localNode.equals(pingResponse.master())) {
            activeMasters.add(pingResponse.master());
        }
    }
    List
masterCandidates = new ArrayList<>();
    for (ZenPing.PingResponse pingResponse : pingResponses) {
        if (pingResponse.node().isMasterNode()) {
            masterCandidates.add(new ElectMasterService.MasterCandidate(pingResponse.node(), pingResponse.getClusterStateVersion()));
        }
    }
    if (activeMasters.isEmpty()) {
        if (electMaster.hasEnoughCandidates(masterCandidates)) {
            ElectMasterService.MasterCandidate winner = electMaster.electMaster(masterCandidates);
            return winner.getNode();
        } else {
            return null; // election failed
        }
    } else {
        return electMaster.tieBreakActiveMasters(activeMasters);
    }
}

When no active master exists, the algorithm elects the node with the highest cluster state version and, if tied, the smallest node ID.

Ping Mechanism

The pingAndWait method blocks until the configured ping timeout expires, collecting ping responses via a CompletableFuture :

private ZenPing.PingCollection pingAndWait(TimeValue timeout) {
    final CompletableFuture
response = new CompletableFuture<>();
    try {
        zenPing.ping(response::complete, timeout);
    } catch (Exception ex) {
        response.completeExceptionally(ex);
    }
    return response.get();
}

Unicast pings are sent concurrently using the unicastZenPingExecutorService . The number of concurrent connections is controlled by discovery.zen.ping.unicast.concurrent_connects , and the overall ping timeout by discovery.zen.ping_timeout .

Choosing the Master

After gathering ping results, the algorithm either elects a master from the qualified candidates when the cluster has no master, or breaks ties among active masters by selecting the node with the smallest ID:

public MasterCandidate electMaster(Collection
candidates) {
    assert hasEnoughCandidates(candidates);
    List
sortedCandidates = new ArrayList<>(candidates);
    sortedCandidates.sort(MasterCandidate::compare);
    return sortedCandidates.get(0);
}
public static int compare(MasterCandidate c1, MasterCandidate c2) {
    int ret = Long.compare(c2.clusterStateVersion, c1.clusterStateVersion);
    if (ret == 0) {
        ret = compareNodes(c1.getNode(), c2.getNode());
    }
    return ret;
}
private static int compareNodes(DiscoveryNode o1, DiscoveryNode o2) {
    if (o1.isMasterNode() && !o2.isMasterNode()) return -1;
    if (!o1.isMasterNode() && o2.isMasterNode()) return 1;
    return o1.getId().compareTo(o2.getId());
}

Thus, the master election speed depends largely on the duration of the pingAndWait call, which is bounded by the configured ping timeout.

References

https://zhuanlan.zhihu.com/p/110015509

https://baike.baidu.com/item/霸道选举算法

https://github.com/elastic/elasticsearch

distributed systemsJavaElasticsearchBully algorithmMaster Election
政采云技术
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.