
Elasticsearch Write, Read, and Search Processes: Underlying Mechanisms and Lucene Inverted Index

This article explains Elasticsearch’s write, read, and search workflows, detailing the roles of coordinating nodes, primary and replica shards, refresh and commit cycles, translog handling, and the underlying Lucene inverted index mechanism.

Selected Java Interview Questions

Elasticsearch stores data by first writing it to an in‑memory buffer and a transaction log (translog). When the buffer fills, or after the default 1‑second interval, a refresh writes the buffered documents into a new segment in the OS cache, making them searchable (near‑real‑time).
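The buffer/translog/refresh cycle can be sketched as a tiny in‑memory model (purely illustrative, not Elasticsearch's actual implementation): writes land in the buffer and translog, but only a refresh makes them visible to search.

```python
# Illustrative model of the near-real-time refresh cycle: writes go to
# an in-memory buffer plus a translog; a refresh moves buffered docs
# into a searchable "segment" while the translog is kept for durability.
class NearRealTimeIndex:
    def __init__(self):
        self.buffer = []        # in-memory indexing buffer
        self.translog = []      # durability log, kept until flush/commit
        self.segments = []      # refreshed data, visible to search

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(("index", doc))

    def refresh(self):
        # Move buffered docs into a new searchable segment.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d]

idx = NearRealTimeIndex()
idx.index("hello elasticsearch")
print(idx.search("hello"))  # [] -- not yet refreshed, so not searchable
idx.refresh()
print(idx.search("hello"))  # ['hello elasticsearch']
```

The gap between `index` and `refresh` is exactly why a just‑written document may not appear in an immediately following search.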

During a write request, the client contacts a coordinating node, which routes the document to the appropriate primary shard. The primary shard processes the request and replicates the data to its replica shards. Once the replicas acknowledge, the coordinating node returns a success response to the client.
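The routing step follows a simple modulo rule: Elasticsearch computes murmur3 of the routing value (the doc id by default) modulo the number of primary shards. A dependency‑free sketch, with crc32 standing in for murmur3:

```python
import zlib

# Sketch of document-to-shard routing. Elasticsearch actually uses
# murmur3(_routing) % number_of_primary_shards; crc32 is used here only
# to keep the sketch deterministic without extra dependencies.
def route(doc_id: str, num_primary_shards: int) -> int:
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# The same doc id always routes to the same primary shard.
print(route("user-42", 5))
print(route("user-42", 5))  # same shard number as the line above
```

This determinism is also why the number of primary shards cannot be changed after index creation: a different modulus would route existing doc ids to the wrong shards.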

For a read request, the client again contacts any node, which acts as the coordinating node. That node hashes the doc id to determine the target shard, then uses a round‑robin algorithm to select either the primary or a replica copy, achieving load‑balanced reads.
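The round‑robin selection among a shard's copies can be sketched as follows (an illustrative model; the shard names are made up):

```python
from itertools import cycle

# Sketch of round-robin load balancing across a shard's copies
# (primary plus replicas): successive reads rotate through the copies.
def round_robin_reader(copies):
    chooser = cycle(copies)
    def next_copy():
        return next(chooser)
    return next_copy

pick = round_robin_reader(["primary", "replica-1", "replica-2"])
print([pick() for _ in range(4)])
# ['primary', 'replica-1', 'replica-2', 'primary']
```

Because every copy holds the same documents, spreading reads this way multiplies read throughput by the number of copies.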

Search operations send the query to all relevant shards (primary or replica). Each shard returns its matching doc ids and scores (query phase). The coordinating node then fetches the full documents from the owning shards (fetch phase), and merges, sorts, and paginates the results before returning them to the client.
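This query‑then‑fetch pattern can be sketched with made‑up shard results: each shard contributes scored doc ids, the coordinating node merges and sorts globally, and only the requested page is fetched in full.

```python
# Sketch of query-then-fetch. The shard hits and doc store below are
# made-up example data, not anything Elasticsearch returns verbatim.
shard_hits = {
    "shard-0": [("d1", 3.2), ("d4", 1.1)],  # (doc id, score) pairs
    "shard-1": [("d7", 2.8), ("d2", 0.9)],
}
doc_store = {"d1": "...", "d2": "...", "d4": "...", "d7": "..."}

def query_then_fetch(shard_hits, doc_store, page_from=0, page_size=2):
    # Merge all shards' hits and sort globally by score (descending).
    merged = sorted(
        (hit for hits in shard_hits.values() for hit in hits),
        key=lambda h: h[1], reverse=True,
    )
    # Fetch full documents only for the requested page.
    page = merged[page_from:page_from + page_size]
    return [(doc_id, doc_store[doc_id]) for doc_id, _ in page]

print([doc_id for doc_id, _ in query_then_fetch(shard_hits, doc_store)])
# ['d1', 'd7'] -- the two highest-scoring hits across both shards
```

Deferring the document fetch to the second phase is the point of the design: the query phase moves only small (id, score) pairs over the network, not full documents that would then be discarded during pagination.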

Internally, Elasticsearch relies on Lucene. When a segment file is created (after each refresh), Lucene builds an inverted index that maps terms to document IDs, enabling fast full‑text search. Deletions generate a .del file that marks documents as deleted; updates are implemented as a delete followed by an insert.

Periodically, Elasticsearch performs a commit (also called a flush), which writes the in‑memory buffer to a new segment file, fsyncs the OS cache to disk, and clears the translog. By default, an automatic flush occurs every 30 minutes, or sooner when the translog grows too large.
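Extending the earlier buffer/translog idea, the flush step can be sketched like this (an illustrative model; the "disk" list stands in for fsynced segment files):

```python
# Illustrative model of flush/commit: buffered data becomes a durable
# segment, and only then is it safe to truncate the translog.
class CommitModel:
    def __init__(self):
        self.buffer = []    # in-memory indexing buffer
        self.translog = []  # replayed on restart if we crash before flush
        self.disk = []      # stands in for fsynced segment files

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def flush(self):
        if self.buffer:
            self.disk.append(list(self.buffer))  # new segment, fsynced
            self.buffer.clear()
        self.translog.clear()  # safe: the data is durable on disk now

m = CommitModel()
m.index("doc-1")
m.flush()
print(m.disk, m.translog)  # [['doc-1']] []
```

The ordering matters: the translog is cleared only after the segment data is durably on disk, so a crash at any point loses nothing that was acknowledged with a fsynced translog entry.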

Because data first becomes visible after a refresh (default 1 second), Elasticsearch is considered near‑real‑time rather than real‑time. In addition, with asynchronous translog settings, up to 5 seconds of data may reside only in the buffer or in the OS cache of the translog, and could be lost in a crash unless an fsync is forced.

Example documents for a simple full‑text search scenario:

Java is really fun
Java is so hard to learn
J2EE is absolutely awesome

Finally, the article provides a concrete example of an inverted index built from a small document set, showing how terms map to document IDs and how the index supports keyword queries such as Facebook.
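As a concrete sketch, here is a minimal inverted index built over a small made‑up document set, mapping each term to the set of document ids that contain it:

```python
# Minimal inverted index: term -> set of ids of documents containing it.
# The documents are made up for illustration.
docs = {
    1: "java is really fun",
    2: "java is so hard to learn",
    3: "j2ee is absolutely awesome",
}

def build_inverted_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_inverted_index(docs)
print(sorted(index["java"]))  # [1, 2] -- the documents containing "java"
```

A keyword query is then a single dictionary lookup rather than a scan over every document, which is what makes full‑text search over large corpora fast.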

Tags: Search Engine, Elasticsearch, Lucene, Inverted Index, Near Real-Time, Write Process
Written by

Selected Java Interview Questions

A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
