
How Elasticsearch Leverages Lucene’s Inverted Index for Real‑Time Distributed Search

This article explains the fundamentals of structured and unstructured data, introduces Lucene’s inverted index, and details how Elasticsearch builds on Lucene to provide distributed, near‑real‑time search with concepts such as clusters, shards, replicas, routing, and performance optimizations.


1. Data in Everyday Life

Data can be divided into structured data (tables stored in relational databases) and unstructured data (documents, HTML, images, audio, video, etc.). Structured data is searchable via SQL, while unstructured data requires full‑text search.

2. Introduction to Lucene

Lucene is an open‑source library that provides inverted‑index search capabilities. It is the core engine behind Solr and Elasticsearch.

Inverted indexing works by tokenizing each document into terms, then building a term dictionary and a postings list that maps each term to the documents containing it.

Key terms:

Term: the smallest searchable unit (a word in English, a token in Chinese).

Term Dictionary: the sorted collection of all unique terms.

Postings List: for each term, the list of document IDs and positions where it occurs.

Inverted File: the physical file storing the postings lists on disk.
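The term-to-documents mapping can be sketched in a few lines of Python. This is a toy in-memory index that ignores analyzers, stemming, and Lucene's on-disk structures, but it shows the shape of a postings list:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)  # term -> postings list
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index


docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)
# "quick" occurs in doc 1 (position 1) and doc 3 (position 0),
# so a search for "quick" only touches those two documents.
```

Looking up a term is now a dictionary access instead of a scan over every document, which is exactly what makes full-text search over unstructured data fast.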

3. Core Concepts of Elasticsearch (ES)

ES is a Java‑based distributed search and analytics engine built on top of Lucene. It provides a RESTful API that hides Lucene’s complexity.

Key characteristics:

Distributed real‑time document store where every field can be indexed and searched.

Scalable to hundreds of nodes and capable of handling petabytes of structured or unstructured data.

Cluster

A cluster consists of one or more nodes sharing the same <code>cluster.name</code>. Nodes can be master-eligible, data, or coordinating nodes.

Discovery (Zen Discovery)

Nodes discover each other via unicast hosts defined in <code>discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]</code>. During master election, the IDs of the master-eligible nodes are sorted lexicographically and the first is chosen. The setting <code>discovery.zen.minimum_master_nodes</code> defines the quorum required to avoid split-brain scenarios.
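Putting the two settings together, a minimal elasticsearch.yml for a cluster with three master-eligible nodes might look like this (hostnames are placeholders):

<code>cluster.name: my-cluster
discovery.zen.ping.unicast.hosts: ["host1", "host2", "host3"]
# quorum = (master_eligible_nodes / 2) + 1 = (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2</code>

With a quorum of 2, a partitioned minority node cannot elect itself master, which is what prevents split-brain.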

Node Roles

<code>node.master: true  # whether the node can be elected master</code>
<code>node.data: true    # whether the node stores data</code>

Data nodes handle storage and heavy operations; master nodes manage cluster state and shard allocation.

Shards and Replicas

Each index is divided into a fixed number of primary shards (defined at index creation) and optional replica shards for high availability.

<code>PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}</code>

Writes go to the primary shard and are then replicated to its replicas. The routing formula determines the target shard:

<code>shard = hash(routing) % number_of_primary_shards</code>

Routing defaults to the document <code>_id</code> but can be customized.
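The effect of the routing formula can be simulated in Python. Note this is only an illustration: Elasticsearch actually hashes the routing value with Murmur3, and CRC-32 merely stands in for it here as a deterministic hash.

```python
import zlib


def select_shard(routing: str, number_of_primary_shards: int) -> int:
    """Pick the target primary shard: hash(routing) % number_of_primary_shards."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# With the default routing (= document _id), the same _id always lands
# on the same shard, so index, get, and update operations stay consistent.
shard_a = select_shard("doc-1", 5)
shard_b = select_shard("doc-1", 5)
```

Because the shard count appears in the modulo, changing it would re-map almost every existing document to a different shard. That is why the number of primary shards is fixed at index creation.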

Mapping

Mapping defines field types, analyzers, and storage options, similar to a database schema. Example:

<code>PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name":  {"type": "text"},
        "age":   {"type": "integer"},
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}</code>

Basic Usage

Start ES with <code>bin/elasticsearch</code> (default HTTP port 9200). Verify with <code>curl http://localhost:9200/</code>, which returns cluster information:

<code>{
  "name": "U7fp3O9",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "-Rj8jGQvRIelGd9ckicUOA",
  "version": {
    "number": "6.8.1",
    "build_flavor": "default",
    "build_type": "zip",
    "build_hash": "1fad4e1",
    "build_date": "2019-06-18T13:16:52.517138Z",
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}</code>

Cluster Health
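The health summary below can be retrieved with a single request against any node:

<code>GET /_cluster/health</code>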

<code>{
  "cluster_name": "lzj",
  "status": "yellow",
  "number_of_nodes": 1,
  "active_primary_shards": 9,
  "active_shards": 9,
  "unassigned_shards": 5,
  "active_shards_percent_as_number": 64.28
}</code>

Health colors: green (fully functional), yellow (primary shards OK, some replicas missing), red (some primary shards unavailable).

Write Path

Documents are first written to memory and a transaction log (translog). Periodic

refresh

creates a new segment visible to search;

flush

persists segments and clears the translog.

<code>POST /_refresh       // refresh all indices</code>
<code>POST /nba/_refresh   // refresh a specific index</code>

The refresh interval can be tuned, e.g., <code>refresh_interval: "30s"</code>, or disabled entirely with <code>-1</code>.
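For example, the interval can be raised for a bulk load via the index settings API and restored afterwards (the index name nba is taken from the refresh example above):

<code>PUT /nba/_settings
{
  "index": { "refresh_interval": "-1" }
}</code>

<code>PUT /nba/_settings
{
  "index": { "refresh_interval": "1s" }
}</code>

Setting <code>-1</code> suspends automatic refresh during the load; restoring a normal interval makes the newly indexed documents searchable again.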

Segments and Merging

Segments are immutable on-disk files. New data creates new segments; deletions are recorded in a <code>.del</code> file rather than modifying existing segments. Periodic background merges combine small segments into larger ones, physically reclaiming the space of deleted documents and improving search performance.
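Merging can also be triggered manually, e.g. after a large bulk load on an index that will no longer change, via the force-merge API:

<code>POST /my_index/_forcemerge?max_num_segments=1</code>

This is expensive and should only be run on indices that have stopped receiving writes.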

Performance Optimizations

Use SSDs, RAID10/5, and multiple data paths; avoid remote mounts.

Choose appropriate shard count at index creation; avoid changing it later.

Prefer <code>keyword</code> over <code>text</code> for fields that do not need full-text search.

Disable <code>doc_values</code> on fields not used for sorting or aggregations.

Increase <code>index.refresh_interval</code> and reduce <code>number_of_replicas</code> during bulk indexing, restoring both afterwards.

Use <code>scroll</code> for deep pagination instead of large <code>from + size</code> queries.

Set routing values to target specific shards when possible.
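A scroll search keeps a snapshot cursor open instead of recomputing ever-deeper offsets. A typical exchange (using the index name my_index from the mapping example) looks like this:

<code>GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}</code>

<code>POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id from the previous response>"
}</code>

Each response returns the next batch plus a <code>scroll_id</code> for the following request, so the cost per page stays constant instead of growing with the offset.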

JVM Tuning

Set <code>Xms</code> and <code>Xmx</code> to the same value (typically ≤50% of physical RAM and ≤32 GB, to keep compressed object pointers). Consider G1GC over CMS. Leave the remaining RAM to the operating system's file-system cache, which Lucene relies on heavily.

Tags: Elasticsearch, Sharding, Lucene, Replication, Inverted Index, Distributed Search

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
