Elasticsearch Distributed Consistency Analysis: Data Flow, PacificA Algorithm, Sequence Numbers and Checkpoints
This article provides a detailed examination of Elasticsearch's distributed consistency mechanisms, covering the shard write path, the PacificA replication algorithm, the role of SequenceNumber and Checkpoint, and a comparison of ES's implementation with the original algorithm, based on version 6.2.
Problem Background
Elasticsearch splits each index into multiple shards that are distributed across nodes, with each shard having a primary copy and one or more replica copies. Writes go to the primary first, which then replicates to replicas, while reads can be served by both primary and replicas.
Data Write Flow
The write process consists of checking the active shard count, writing to the primary, concurrently sending write requests to all replicas, and finally responding to the client after all replicas have replied or failed.
```java
public void execute() throws Exception {
    final String activeShardCountFailure = checkActiveShardCount();
    final ShardRouting primaryRouting = primary.routingEntry();
    final ShardId primaryId = primaryRouting.shardId();
    if (activeShardCountFailure != null) {
        finishAsFailed(new UnavailableShardsException(primaryId,
            "{} Timeout: [{}], request: [{}]", activeShardCountFailure, request.timeout(), request));
        return;
    }

    totalShards.incrementAndGet();
    pendingActions.incrementAndGet();
    primaryResult = primary.perform(request);
    primary.updateLocalCheckpointForShard(primaryRouting.allocationId().getId(), primary.localCheckpoint());
    final ReplicaRequest replicaRequest = primaryResult.replicaRequest();
    if (replicaRequest != null) {
        if (logger.isTraceEnabled()) {
            logger.trace("[{}] op [{}] completed on primary for request [{}]", primaryId, opType, request);
        }
        final long globalCheckpoint = primary.globalCheckpoint();
        final ReplicationGroup replicationGroup = primary.getReplicationGroup();
        markUnavailableShardsAsStale(replicaRequest, replicationGroup.getInSyncAllocationIds(),
            replicationGroup.getRoutingTable());
        performOnReplicas(replicaRequest, globalCheckpoint, replicationGroup.getRoutingTable());
    }

    successfulShards.incrementAndGet();
    decPendingAndFinishIfNeeded();
}

private void decPendingAndFinishIfNeeded() {
    assert pendingActions.get() > 0 : "pending action count goes below 0 for request [" + request + "]";
    if (pendingActions.decrementAndGet() == 0) {
        finish();
    }
}
```
Why Check Active Shard Count First?
The wait_for_active_shards index setting defines the minimum number of active shard copies required before a write is accepted; the default is 1 (only the primary). Setting it higher improves write reliability but does not guarantee that the write reaches that many replicas.
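The effect of this gate can be illustrated with a minimal sketch. This is not the actual ES source; the class and its parameters are hypothetical, mirroring only the shape of the `checkActiveShardCount()` call in the write path above, where a non-null return value fails the request.

```java
// Hypothetical sketch of the wait_for_active_shards gate, not the actual ES code:
// the write is rejected up front when fewer shard copies are active than required.
public class ActiveShardGate {
    // Returns null when enough copies are active, otherwise a failure message,
    // mirroring the shape of checkActiveShardCount() in the write path above.
    static String checkActiveShardCount(int activeCopies, int waitForActiveShards) {
        if (activeCopies >= waitForActiveShards) {
            return null; // enough active copies, proceed with the write
        }
        return "not enough active shard copies: have " + activeCopies
            + ", need " + waitForActiveShards;
    }

    public static void main(String[] args) {
        // Default wait_for_active_shards = 1: only the primary must be active.
        System.out.println(checkActiveShardCount(1, 1)); // null, so the write proceeds
        System.out.println(checkActiveShardCount(1, 2)); // failure message, write rejected
    }
}
```

Note that the check is only a pre-flight gate: passing it does not guarantee the write will actually reach that many copies, since replicas can fail mid-write.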
Why Wait for All Replicas Before Responding?
Earlier ES versions allowed asynchronous replication, but this risked data loss if the primary failed. Modern ES waits for all replicas to acknowledge before responding, which can increase latency but ensures stronger consistency.
Impact of Replica Write Failures
If a replica continuously fails, the primary reports the failure to the master, which removes the replica from the InSyncAllocationIds set, preventing it from serving reads. Until the metadata is updated, clients may still read stale data from that replica.
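The bookkeeping described above can be sketched as follows. This is a simplified illustration, not ES internals; the class and method names are hypothetical, standing in for the master's maintenance of the InSyncAllocationIds set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of in-sync set bookkeeping, not ES internals: when a
// replica keeps failing, the primary reports it and the master removes it
// from the in-sync set, so it no longer counts as a valid copy for reads.
public class InSyncSetSketch {
    private final Set<String> inSyncAllocationIds = new HashSet<>();

    void markInSync(String allocationId) {
        inSyncAllocationIds.add(allocationId);
    }

    // Primary reports a persistent failure; the copy leaves the in-sync set.
    void failShard(String allocationId) {
        inSyncAllocationIds.remove(allocationId);
    }

    boolean canServeReads(String allocationId) {
        return inSyncAllocationIds.contains(allocationId);
    }

    public static void main(String[] args) {
        InSyncSetSketch set = new InSyncSetSketch();
        set.markInSync("replica-1");
        set.failShard("replica-1");
        System.out.println(set.canServeReads("replica-1")); // false after removal
    }
}
```

The stale-read window mentioned above corresponds to the gap between the failure occurring and this set update propagating through cluster metadata.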
The `_shards` section of the write response reports how many shard copies the write succeeded or failed on:

```json
{
  "_shards" : {
    "total" : 2,
    "failed" : 0,
    "successful" : 2
  }
}
```
PacificA Algorithm Overview
PacificA, proposed by Microsoft Research, is a primary‑backup replication algorithm that provides strong consistency. ES’s replication model is based on this algorithm, using a master node as the configuration manager.
Key Terminology
Replica Group: a set of replicas with one primary.
Configuration: description of the replica group composition.
Configuration Version: monotonically increasing version number.
Configuration Manager: component that maintains configuration consistency.
Serial Number (sn): sequential identifier for each update.
Prepared List / Committed List: the prepared list holds updates a replica has received but not yet committed; the committed list is the prefix of the prepared list whose updates have been committed.
Primary Invariant
At any time, if a replica believes it is the primary, the configuration manager must also consider it the primary, and there can be at most one primary per replica group.
Query and Update Flow
Queries are sent only to the primary, which returns the latest committed data. Updates are assigned a serial number, added to the primary’s prepared list, propagated to replicas, and committed only after all replicas have prepared the update.
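The update flow can be sketched as a pair of lists on the primary. This is an illustrative model of PacificA's prepared/committed lists under simplifying assumptions (serial numbers are list positions, acknowledgements are counted per update), not ES or the paper's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of PacificA's prepared/committed lists on the primary:
// an update enters the prepared list with the next serial number and is
// committed only once every secondary has prepared it.
public class PreparedCommittedSketch {
    private final List<String> prepared = new ArrayList<>();    // ordered by sn
    private final List<Integer> preparedAcks = new ArrayList<>(); // acks per sn
    private int committedCount = 0; // committed list = prepared[0..committedCount)
    private final int secondaries;

    PreparedCommittedSketch(int secondaries) {
        this.secondaries = secondaries;
    }

    // Assign the next sn (here: the list index) and prepare locally.
    int prepare(String update) {
        prepared.add(update);
        preparedAcks.add(0);
        return prepared.size() - 1;
    }

    // A secondary acknowledges it has prepared sn; the commit point then
    // advances over every prefix entry acknowledged by all secondaries.
    void ackPrepare(int sn) {
        preparedAcks.set(sn, preparedAcks.get(sn) + 1);
        while (committedCount < prepared.size()
                && preparedAcks.get(committedCount) >= secondaries) {
            committedCount++;
        }
    }

    List<String> committedList() {
        return prepared.subList(0, committedCount);
    }

    public static void main(String[] args) {
        PreparedCommittedSketch primary = new PreparedCommittedSketch(2);
        int sn = primary.prepare("put x=1");
        primary.ackPrepare(sn); // first secondary prepared
        primary.ackPrepare(sn); // second secondary prepared -> committed
        System.out.println(primary.committedList());
    }
}
```

Because the commit point only moves over a fully acknowledged prefix, this construction directly yields the Committed Invariant described below: every committed update is prepared on every secondary.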
Committed Invariant
For any secondary, its committed list is a prefix of the primary’s committed list, and the primary’s committed list is a prefix of each secondary’s prepared list.
Reconfiguration Scenarios
Secondary failure: the primary removes the failed replica from the configuration.
Primary failure: a secondary obtains a lease and promotes itself to primary.
Adding a new node: the new node first becomes a secondary candidate, catches up via prepare requests, and is then added to the replica group.
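The primary-failure case relies on lease timing, which can be sketched as follows. The constants and method names here are hypothetical, illustrating only the standard PacificA argument: the old primary's lease expires strictly before any secondary's grace period does, so (assuming bounded clock drift) two primaries never coexist.

```java
// Hypothetical sketch of PacificA's lease timing, not ES code. The old
// primary stops acting as primary once its lease expires; a secondary only
// attempts promotion after a strictly longer grace period has elapsed.
public class LeaseSketch {
    static final long LEASE_PERIOD_MS = 10_000;
    static final long GRACE_PERIOD_MS = 15_000; // must exceed LEASE_PERIOD_MS

    // The primary may serve only while the lease from its last
    // acknowledged beacon is still valid.
    static boolean primaryMayServe(long lastBeaconAckMs, long nowMs) {
        return nowMs - lastBeaconAckMs < LEASE_PERIOD_MS;
    }

    // A secondary may attempt promotion only after the grace period has
    // elapsed since the last beacon it saw from the primary.
    static boolean secondaryMayPromote(long lastBeaconSeenMs, long nowMs) {
        return nowMs - lastBeaconSeenMs >= GRACE_PERIOD_MS;
    }

    public static void main(String[] args) {
        // At t=12s after the last beacon: the old primary has already
        // stepped down, but no secondary may promote itself yet.
        System.out.println(primaryMayServe(0, 12_000));      // false
        System.out.println(secondaryMayPromote(0, 12_000));  // false
    }
}
```

The window between lease expiry and grace expiry is exactly what upholds the Primary Invariant without requiring a perfectly consistent failure detector.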
SequenceNumber, Checkpoint and Fast Recovery
Each write receives a Term (incremented on each primary change, analogous to PacificA's configuration version) and a SequenceNumber (its position in the operation order). Each shard copy tracks a LocalCheckpoint, the highest sequence number below which all operations have been processed locally, and the primary tracks a GlobalCheckpoint, the highest sequence number known to have been processed on every in-sync copy. These checkpoints enable fast recovery: a restarted replica only needs to replay operations above the global checkpoint rather than copying entire files.
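The relationship between the two checkpoint types can be shown in a few lines. This is a conceptual sketch, not ES internals: the global checkpoint is simply the minimum of the local checkpoints of the in-sync copies.

```java
import java.util.Map;

// Hypothetical sketch of checkpoint tracking, not ES internals: the global
// checkpoint is the minimum local checkpoint across all in-sync copies, so
// every operation at or below it is known to be safe on every copy.
public class CheckpointSketch {
    static long globalCheckpoint(Map<String, Long> localCheckpoints) {
        return localCheckpoints.values().stream()
            .mapToLong(Long::longValue)
            .min()
            .orElse(-1L); // no copies tracked yet
    }

    public static void main(String[] args) {
        // Primary has processed up to seq 5, the replica only up to seq 3:
        // operations 4 and 5 are not yet globally safe.
        Map<String, Long> copies = Map.of("primary", 5L, "replica-1", 3L);
        System.out.println(globalCheckpoint(copies)); // 3
    }
}
```

On recovery, the lagging replica above would replay only operations 4 and 5 from the primary's translog, which is what makes recovery fast compared to a full file copy.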
Comparison Between ES and PacificA
Similarities
Separate handling of meta‑consistency (configuration manager vs. master).
Maintenance of a replica group (InSyncAllocationIds vs. Replica Group).
Use of sequence numbers for ordering.
Differences
ES does not fully guarantee meta‑consistency, affecting data consistency.
PacificA includes an explicit prepare phase; ES writes directly without it.
ES allows reads from any in‑sync replica, which may return stale data.
Conclusion
The article analyzes Elasticsearch’s data flow consistency, highlighting progress and remaining gaps relative to the PacificA algorithm. It concludes the series on ES distributed consistency, thanking readers and inviting further discussion.