
Elasticsearch Internals: Distributed Document Storage, Real‑time Search, and Translog Mechanics

This article explains the core Elasticsearch architecture—including shard routing, primary‑replica interaction, document CRUD workflows, multi‑document APIs, segment merging, translog durability, and storage file formats—providing a comprehensive view of how near‑real‑time search is achieved on large‑scale data.

Beike Product & Technology

1 Distributed Document Storage

Elasticsearch stores documents in shards; each shard is a complete, self-contained Lucene index and the unit of data distribution. Documents are assigned to shards by a routing hash: shard = hash(routing) % number_of_primary_shards. The routing value defaults to the document _id but can be customized. Because the formula depends on the primary shard count, that count cannot be changed after index creation.

Primary shards handle write operations; once a write succeeds on the primary, the change is replicated to the replica shards. The coordinating node routes client requests to the appropriate primary shard based on the routing value.
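The routing formula above can be sketched in a few lines. Elasticsearch itself hashes with murmur3; this illustrative sketch substitutes CRC32 so the behavior is easy to reproduce:

```python
import zlib

def route_to_shard(routing: str, number_of_primary_shards: int) -> int:
    """Sketch of the routing formula:
    shard = hash(routing) % number_of_primary_shards.
    Elasticsearch uses murmur3; CRC32 stands in here for illustration."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# The routing value defaults to the document _id.
shard = route_to_shard("doc-42", 5)
assert 0 <= shard < 5
# The same routing value always lands on the same shard, which is why
# number_of_primary_shards cannot change after index creation.
assert route_to_shard("doc-42", 5) == shard
```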

1.1 Write a Document

The client sends a write request to a node in the cluster, which acts as the coordinating node (any node can fill this role; it need not be the master). The request is routed to the primary shard that owns the document, executed there, and then propagated in parallel to the replica shards. Once the replicas acknowledge, success is reported back through the coordinating node to the client.

1.2 Retrieve a Document

The client sends a get request to a coordinating node, which forwards it to a shard containing the document; read requests are load-balanced across the primary and its replicas. The shard returns the document to the coordinating node, which returns it to the client. If the document exists on the primary but has not yet been replicated, a replica may report a miss while the primary returns the data.

1.3 Partial Update

Updates are performed by retrieving the _source JSON, modifying it, and re‑indexing the document on the primary shard. If a version conflict occurs, Elasticsearch retries up to retry_on_conflict times before aborting.
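The retrieve-modify-reindex loop with optimistic concurrency control can be modeled with a toy in-memory store (the store, reindex, and partial_update names below are illustrative, not Elasticsearch APIs):

```python
class VersionConflict(Exception):
    pass

# Toy document store keyed by _id; each entry carries _source and _version,
# mimicking the optimistic concurrency control used for partial updates.
store = {"1": {"_version": 1, "_source": {"views": 0}}}

def reindex(doc_id, new_source, expected_version):
    # Reindexing succeeds only if nobody changed the doc since we read it.
    doc = store[doc_id]
    if doc["_version"] != expected_version:
        raise VersionConflict
    store[doc_id] = {"_version": expected_version + 1, "_source": new_source}

def partial_update(doc_id, patch, retry_on_conflict=3):
    """Retrieve _source, apply the patch, and reindex; on a version
    conflict, re-read and retry up to retry_on_conflict times."""
    for _ in range(retry_on_conflict + 1):
        doc = store[doc_id]
        merged = {**doc["_source"], **patch(doc["_source"])}
        try:
            reindex(doc_id, merged, doc["_version"])
            return store[doc_id]
        except VersionConflict:
            continue  # another writer won the race; re-read and retry
    raise VersionConflict(f"update of {doc_id} failed after retries")

partial_update("1", lambda src: {"views": src["views"] + 1})
assert store["1"]["_source"]["views"] == 1 and store["1"]["_version"] == 2
```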

1.4 Multi‑Document (mget) Retrieval

The coordinating node splits an mget request into per‑shard sub‑requests, forwards them to the relevant primary or replica shards, aggregates the responses, and returns a single combined result to the client.

1.5 Bulk Operations

Bulk requests are broken down per shard, sent to the primary shards, executed sequentially, and each successful operation is replicated to the replicas. The coordinating node aggregates the responses and returns them to the client.

2 Using mget to Retrieve Multiple Documents

The client sends an mget request to a coordinating node, which builds per-shard requests and forwards them in parallel to the nodes holding the relevant primary or replica shards. Once all shard responses arrive, the coordinating node assembles them into a single response.
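The fan-out step — grouping requested document IDs by the shard that owns them — can be sketched as follows (CRC32 again stands in for murmur3; split_mget is an illustrative name, not an Elasticsearch API):

```python
import zlib

def shard_of(doc_id: str, n_shards: int) -> int:
    # Same routing formula as for single-document requests.
    return zlib.crc32(doc_id.encode("utf-8")) % n_shards

def split_mget(doc_ids, n_shards):
    """Group an mget request into per-shard sub-requests, as the
    coordinating node does before fanning them out in parallel."""
    per_shard = {}
    for doc_id in doc_ids:
        per_shard.setdefault(shard_of(doc_id, n_shards), []).append(doc_id)
    return per_shard

sub_requests = split_mget(["a", "b", "c", "d"], 2)
# Every requested document appears in exactly one per-shard sub-request.
assert sorted(i for ids in sub_requests.values() for i in ids) == ["a", "b", "c", "d"]
```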

3 Using bulk to Modify Multiple Documents

The client sends a bulk request to a coordinating node. The coordinating node creates per-node sub-requests and forwards them to the nodes hosting the relevant primary shards; each primary executes its operations in order, replicating every successful operation to its replicas.
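On the wire, a _bulk request body is NDJSON: each operation is an action line, optionally followed by a source line (index, create, and update carry a source; delete does not), and the body must end with a newline. A sketch of assembling such a body (build_bulk_body is an illustrative helper, not part of any client library):

```python
import json

def build_bulk_body(operations):
    """Assemble the NDJSON body of a _bulk request: an action line per
    operation, followed by a source line when the action carries one."""
    lines = []
    for action, meta, source in operations:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # trailing newline is required

body = build_bulk_body([
    ("index",  {"_index": "blog", "_id": "1"}, {"title": "hello"}),
    ("delete", {"_index": "blog", "_id": "2"}, None),
])
assert body.count("\n") == 3  # two action lines + one source line
```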

4 Shard Internals

Shards are the smallest working units in Elasticsearch. They achieve near‑real‑time performance through immutable inverted indexes, in‑memory buffers, and a write‑ahead translog.

4.1 Inverted Index

Lucene builds an inverted index that maps each unique term to a list of documents containing that term, along with term frequencies, positions, and other statistics. This structure enables fast full‑text search and relevance scoring.
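A toy version of this term-to-postings mapping makes the structure concrete (a minimal sketch; real Lucene postings also encode frequencies, offsets, and scoring statistics):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unique term to {doc_id: [positions]} — a toy version of
    the postings Lucene stores per term. Term frequency within a document
    is simply len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

index = build_inverted_index({
    1: "the quick brown fox",
    2: "the lazy dog",
})
# "the" occurs at position 0 in both documents.
assert index["the"] == {1: [0], 2: [0]}
assert index["fox"] == {1: [3]}
```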

4.2 Immutability

Once written to disk, an inverted index segment is immutable. New documents are added by creating new segments, avoiding locks and allowing concurrent reads without contention.

4.3 Dynamic Index Updates

Because segments are immutable, updates are handled by marking the old version of a document as deleted and indexing the new version into a fresh segment. At search time, Elasticsearch queries every segment, combines the results, and filters out documents marked deleted, effectively presenting a unified view.

5 Near‑Real‑Time Search

Newly indexed documents are first written to an in‑memory buffer, then flushed to a new segment visible to searches after a refresh. By default Elasticsearch refreshes every second, making the data effectively searchable in near real time. The refresh interval can be tuned via the refresh_interval setting.
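The buffer-then-refresh behavior can be modeled with a toy shard (the Shard class below is an illustrative model, not Elasticsearch code):

```python
class Shard:
    """Toy model of near-real-time search: writes land in an in-memory
    buffer and only become searchable once refresh() cuts a new segment."""
    def __init__(self):
        self.buffer = []     # indexed but not yet searchable
        self.segments = []   # searchable, immutable segments

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # By default this happens every 1s (tunable via refresh_interval).
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # new immutable segment
            self.buffer = []

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d]

shard = Shard()
shard.index("hello world")
assert shard.search("hello") == []   # not visible before refresh
shard.refresh()
assert shard.search("hello") == ["hello world"]
```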

6 Translog for Disk Synchronization

The transaction log (translog) records every write operation before it is committed to a segment. If a node crashes, Elasticsearch can replay the translog to recover uncommitted data, ensuring durability.
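Recovery by replay can be sketched as re-applying logged operations on top of the last committed state (replay_translog is an illustrative name, not an Elasticsearch API):

```python
def replay_translog(committed_docs, translog):
    """After a crash, re-apply operations recorded in the translog that
    never made it into a committed segment, restoring the latest state."""
    docs = dict(committed_docs)
    for op, doc_id, source in translog:
        if op == "index":
            docs[doc_id] = source
        elif op == "delete":
            docs.pop(doc_id, None)
    return docs

# The committed segment holds doc 1; the translog still records an
# uncommitted update to doc 1 and a brand-new doc 2.
recovered = replay_translog(
    {"1": {"v": "old"}},
    [("index", "1", {"v": "new"}), ("index", "2", {"v": "x"})],
)
assert recovered == {"1": {"v": "new"}, "2": {"v": "x"}}
```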

7 Translog Durability

In early releases the translog was fsynced only every 5 seconds, so a crash could lose up to 5 seconds of acknowledged writes. Since version 2.0, Elasticsearch fsyncs the translog after every index, bulk, delete, or update request before acknowledging it (index.translog.durability: request), trading some write throughput for durability; the async setting restores the periodic behavior for workloads that can tolerate the risk.

A flush — which commits a new segment to disk and truncates the translog — can be triggered by thresholds such as:

index.translog.flush_threshold_period
index.translog.flush_threshold_size
index.translog.flush_threshold_ops

8 Segment Merging

Elasticsearch periodically merges small segments into larger ones to reduce the number of open files and improve search performance. During merging, deleted documents are purged. The process runs in the background without interrupting indexing.
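The merge step — combining small segments and physically purging documents marked deleted — can be sketched as follows (merge_segments is an illustrative model of the policy, not Lucene's merge code):

```python
def merge_segments(segments, deleted_ids):
    """Merge several small segments into one larger one, dropping
    documents marked deleted — deletes are only physically purged
    from disk at merge time."""
    merged = {}
    for segment in segments:          # older segments first
        for doc_id, source in segment.items():
            if doc_id not in deleted_ids:
                merged[doc_id] = source  # newer segments overwrite older
    return merged

merged = merge_segments(
    [{"1": "a", "2": "b"}, {"3": "c"}],
    deleted_ids={"2"},
)
# Doc 2 was marked deleted and is gone from the merged segment.
assert merged == {"1": "a", "3": "c"}
```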

9 Storage Types and File Formats

Elasticsearch supports several Lucene directory implementations, configured via index.store.type: fs (the default, which picks the best implementation for the operating system), simplefs, niofs, and mmapfs. Each stores shard index files differently, with trade-offs in concurrency and memory usage.

Lucene index files have various extensions (e.g., .cfs, .fdt, .fdx) and are often combined into a single compound file for efficiency.

10 Files Generated at Elasticsearch Startup

Data is stored under the path configured by path.data , e.g., /data1/es/data/nodes/0 . Important files include node.lock , global state files like global-15.st , and per‑index UUID directories containing shard data.

11 Lucene File Naming

Each shard directory contains files named with a common prefix (the segment name) and different extensions indicating their purpose (e.g., postings, term dictionary, stored fields). Compound files (.cfs) may bundle several of these together.

12 Summary

The article dissects the fundamental components of Elasticsearch that enable real‑time, large‑scale search: distributed shard routing, primary‑replica coordination, immutable segment architecture, translog durability, and efficient storage formats. Understanding these mechanisms helps engineers design and troubleshoot high‑performance search solutions.

Elasticsearch · Lucene · Distributed Storage · segment merging · translog · real-time search
Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
