
Elasticsearch Write, Read, and Search Processes: Underlying Mechanisms and Lucene Inverted Index

This article explains how Elasticsearch handles data ingestion, retrieval, and full‑text search by describing the roles of coordinating, primary, and replica nodes, the refresh‑commit‑flush cycle, segment files, translog, and the Lucene‑based inverted index that powers its near‑real‑time capabilities.

Architecture Digest

Elasticsearch (ES) is a distributed search engine built on Lucene; interview questions often probe whether candidates understand its core write, read, and search mechanisms rather than just using its APIs.

Write Process

The client sends a request to a coordinating node, which routes the document to the node holding the appropriate primary shard.

The primary shard processes the request and replicates the data to its replica shards on other nodes.

After the primary and all replicas acknowledge, the coordinating node returns a response to the client.
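The routing step above can be sketched in a few lines. Elasticsearch derives the target shard from a hash of the routing value (the document id by default) modulo the number of primary shards; this toy version substitutes Python's built-in `hash` for the murmur3 hash the real engine uses, purely for illustration.

```python
# Toy model of how a coordinating node picks the primary shard for a doc.
# Assumption: Python's hash() stands in for Elasticsearch's murmur3 hash.
NUM_PRIMARY_SHARDS = 5  # fixed when the index is created

def route_to_shard(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Return the number of the primary shard that owns this document."""
    return hash(doc_id) % num_shards

# Every node computes the same mapping, so any coordinating node can
# forward a write directly to the node holding that primary shard.
shard = route_to_shard("user-42")
```

Because the shard count appears in the modulo, it cannot change after index creation without re-routing every document, which is why the number of primary shards is fixed.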

Read Process

The client can contact any node, which then acts as the coordinating node.

The coordinating node hashes the doc id to locate the shard that owns the document, then uses a round‑robin algorithm to pick one of that shard's copies (the primary or a replica) for load balancing.

The selected shard returns the document to the coordinating node, which forwards it to the client.
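The round‑robin selection among a shard's copies can be sketched with a simple rotating iterator. The copy names here are hypothetical; the point is only that successive reads cycle through the primary and its replicas so each serves an equal share of requests.

```python
from itertools import cycle

# Hypothetical copies (primary + two replicas) of the shard that owns the doc.
shard_copies = ["primary", "replica-1", "replica-2"]
picker = cycle(shard_copies)  # the coordinating node rotates through copies

def pick_copy() -> str:
    """Round-robin choice so reads spread evenly across all copies."""
    return next(picker)

# Six successive reads hit each copy exactly twice.
reads = [pick_copy() for _ in range(6)]
```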

Search Process

A client query is sent to a coordinating node, which forwards it to all relevant shards (primary or replica copies).

During the query phase, each shard returns the matching doc ids to the coordinating node.

In the fetch phase, the coordinating node retrieves the actual documents from the shards and returns the final result set.
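The two phases above can be sketched as a toy "query then fetch". In the query phase each shard sends back only lightweight (doc id, score) pairs; the coordinating node merges them into a global ranking and, in the fetch phase, retrieves the full documents for the winners only. The shard names, ids, and scores below are made up for illustration.

```python
# Query phase results: each shard returns only (doc_id, score) pairs.
shard_results = {
    "shard-0": [("d1", 2.3), ("d4", 1.1)],
    "shard-1": [("d7", 3.0), ("d2", 0.9)],
}
# Stand-in for the stored documents living on the shards.
docs = {"d1": "...", "d2": "...", "d4": "...", "d7": "..."}

# Coordinating node merges per-shard hits and keeps the global top-k.
merged = sorted(
    (hit for hits in shard_results.values() for hit in hits),
    key=lambda hit: hit[1],
    reverse=True,
)[:3]

# Fetch phase: retrieve the full documents for the winning ids only.
top_docs = [(doc_id, docs[doc_id]) for doc_id, _ in merged]
```

Deferring document retrieval to the fetch phase keeps the query phase cheap: only ids and scores cross the network until the final ranking is known.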

In short: write requests go to the primary shard and are synchronized to replicas; read requests can be served by the primary or any replica, balanced round‑robin.

Underlying Write Mechanics

Documents first enter an in‑memory buffer and are appended to the translog. Every second by default (or when the buffer is nearly full), a refresh writes the buffered documents as a new segment file in the OS cache, making them searchable.

Periodically (every 30 min by default) or when the translog grows large, a flush (a Lucene commit) fsyncs the segments in the OS cache to disk, writes a commit point, and clears the translog.

If the node crashes, data held only in the buffer or OS cache is recovered by replaying the translog on restart; because the translog itself is fsynced only every 5 s by default, up to ~5 seconds of recent writes can still be lost.
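The buffer → OS cache → disk life cycle described above can be modeled with a small toy class. The class and method names are invented for illustration; the invariant they demonstrate is the real one: documents become searchable at refresh, but durable only at flush.

```python
class ShardWriter:
    """Toy model of the buffer / translog / segment life cycle."""

    def __init__(self):
        self.buffer = []    # in-memory indexing buffer (not yet searchable)
        self.translog = []  # append-only log used for crash recovery
        self.os_cache = []  # refreshed segments: searchable but volatile
        self.disk = []      # flushed segments: durable

    def write(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Runs every ~1 s: buffer contents become a new segment in OS cache."""
        if self.buffer:
            self.os_cache.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        """Runs every ~30 min or on a large translog: fsync and clear the log."""
        self.refresh()
        self.disk.extend(self.os_cache)
        self.os_cache.clear()
        self.translog.clear()

    def searchable(self):
        return [doc for seg in self.disk + self.os_cache for doc in seg]
```

A quick walk-through: after `write("a")` the doc is invisible to search; after `refresh()` it is searchable but would need the translog to survive a crash; after `flush()` it is on disk and the translog is empty.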

Delete/Update Mechanics

Delete operations generate a .del file marking documents as deleted; updates are implemented as a delete followed by a new write. Regular merges combine segment files, physically removing deleted docs.
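A segment merge can be sketched as follows. Deletes only add a doc id to a per‑segment tombstone set (standing in for the .del file), and an update appears as a fresh copy of the doc in a newer segment; the merge rewrites live documents into one segment and drops tombstoned ones for good. The segment layout below is a made‑up miniature.

```python
# Two toy segments: d1 was updated (tombstoned in the old segment,
# rewritten in the new one), so the merge keeps only d1's new version.
segments = [
    {"docs": {"d1": "v1", "d2": "v1"}, "deleted": {"d1"}},  # older segment
    {"docs": {"d1": "v2", "d3": "v1"}, "deleted": set()},   # newer segment
]

def merge(segs):
    """Combine segments, physically dropping tombstoned documents."""
    merged = {}
    for seg in segs:  # later (newer) segments win for the same doc id
        for doc_id, body in seg["docs"].items():
            if doc_id not in seg["deleted"]:
                merged[doc_id] = body
    return {"docs": merged, "deleted": set()}

new_segment = merge(segments)
```

Until such a merge runs, "deleted" documents still occupy disk space, which is why merges are what actually reclaim it.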

Lucene and Inverted Index

Lucene is a Java library that provides the algorithms for building inverted indexes. An inverted index maps each term to the list of document IDs containing that term, enabling fast full‑text search.

In practice, each document is tokenized into terms, and the index stores, for each term, the list of documents containing it along with metadata such as term frequency.
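The tokenize-then-map process can be shown with a minimal inverted index. The documents and the naive whitespace tokenizer are simplifications (Lucene's analyzers also lowercase, strip stop words, stem, and so on), but the core structure is the same: a dictionary from term to a postings list with per-document term frequencies.

```python
from collections import defaultdict

# Two toy documents, keyed by doc id.
docs = {
    1: "elasticsearch is built on lucene",
    2: "lucene builds the inverted index",
}

# term -> {doc_id: term_frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():  # naive whitespace tokenizer
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def search(term: str) -> list[int]:
    """Full-text lookup is one dict access, not a scan over every document."""
    return sorted(index.get(term, {}))
```

Looking up a term costs a single hash lookup regardless of corpus size, which is exactly why the inverted index makes full‑text search fast.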

Each segment file written during refresh contains the inverted index for the documents it holds.

Understanding these low‑level details helps explain why ES is near‑real‑time (data becomes searchable after the 1 s refresh) and why a small amount of data can be lost in a crash.

Elasticsearch · Lucene · Inverted Index · refresh · Search · commit · Read Process · Write Process
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
