
Elasticsearch Overview: Data Types, Lucene Foundations, Core Concepts, Cluster Architecture, Indexing, Storage, and Performance Optimization

This article provides a comprehensive introduction to Elasticsearch, covering the distinction between structured and unstructured data, Lucene’s inverted index, ES core concepts such as clusters, nodes, shards and replicas, mapping, basic usage, storage mechanisms, and practical performance‑tuning tips for large‑scale search deployments.

Architect's Guide

Oct 27, 2022

1. Data in Everyday Life

Search engines retrieve data, which can be divided into structured data (row‑based tables stored in relational databases) and unstructured data (documents, emails, images, videos, etc.). Structured data can be searched via SQL, while unstructured data requires full‑text search.

Structured data: fixed format, stored in relational databases.

Unstructured data: variable length, not suitable for tabular representation; includes XML, HTML, Word, PDFs, images, audio, video.

Search for these two data types follows the same division: structured‑data search and unstructured‑data search.

2. Introduction to Lucene

Lucene is an open‑source Java library that provides the core inverted‑index functionality for full‑text search. It is not a complete search engine by itself; higher‑level engines such as Solr and Elasticsearch are built on top of Lucene.

The inverted index consists of a term dictionary (the list of unique terms) and, for each term, a postings list (the list of documents containing it). The term-document matrix below shows an example for three short documents.

Term      Doc_1  Doc_2  Doc_3
--------------------------------
Java        X               
is          X      X      X
the         X      X      X
best        X      X      X
programming X      X      X
language    X      X      X
PHP                X      
Javascript               X
--------------------------------

Key terminology:

Term : smallest searchable unit (a word in English, a token after Chinese segmentation).

Term Dictionary : collection of all terms.

Postings List : list of document IDs (and optionally positions and frequencies) for each term.

Inverted File : physical file that stores the postings lists.
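The terminology above can be made concrete with a small sketch. It is not how Lucene is implemented; it only builds a term dictionary and postings lists in memory. The three document texts are an assumption, reconstructed from the matrix, and the `tokenize` helper is a toy stand-in for a real analyzer.

```python
from collections import defaultdict

# Assumed documents, reconstructed from the term-document matrix.
DOCS = {
    1: "Java is the best programming language",
    2: "PHP is the best programming language",
    3: "Javascript is the best programming language",
}


def tokenize(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return IDs of documents that contain every query term."""
    term_sets = [set(index.get(term, [])) for term in tokenize(query)]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))


index = build_inverted_index(DOCS)
term_dictionary = sorted(index)        # the unique terms
print(index["java"])                   # [1]
print(search(index, "best language"))  # [1, 2, 3]
print(search(index, "php language"))   # [2]
```

A lookup touches only the postings lists of the query terms, which is why the structure scales better than scanning every document.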

3. Core Concepts of Elasticsearch

Elasticsearch is a distributed, near‑real‑time search and analytics engine built in Java and powered by Lucene. It provides a RESTful API that hides Lucene’s complexity.

Key characteristics:

Distributed real‑time document store where every field can be indexed and searched.

Near-real-time search and analytics engine.

Scales to hundreds of nodes and supports petabyte‑scale structured or unstructured data.

Cluster

A cluster is a group of one or more Elasticsearch nodes that share the same cluster.name. In versions before 7.0, nodes discover each other via the built-in Zen Discovery module (unicast or file-based discovery); 7.0 replaced it with a new cluster coordination layer. The master-eligible nodes elect a master that manages cluster state, index creation, shard allocation, and health monitoring.

Node Roles

node.master: true – the node is a candidate for master election.

node.data: true – the node stores data and performs CRUD and aggregation operations.

Coordinating node: any node that receives a client request acts as the coordinator for that request and forwards it to the shards that hold the data (the primary shard for writes).

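Split-Brain Problem

When a network partition causes more than one master to be elected, the two halves of the cluster can accept conflicting writes and their data diverges. In versions before 7.0 the usual mitigation is to set discovery.zen.minimum_master_nodes to a quorum, (number of master-eligible nodes / 2) + 1, so that a master is elected only when a majority of master-eligible nodes are reachable. From 7.0 on the setting is no longer used and the cluster manages the quorum itself.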
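The quorum rule is simple arithmetic. The sketch below uses hypothetical helper names (they are not Elasticsearch APIs) to show why a majority requirement prevents two masters after a partition.

```python
def minimum_master_nodes(master_eligible):
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(reachable, master_eligible):
    """A partition may elect a master only if it can see a quorum."""
    return reachable >= minimum_master_nodes(master_eligible)


print(minimum_master_nodes(3))  # 2
print(minimum_master_nodes(5))  # 3

# Three master-eligible nodes split 2 | 1 by a network partition:
print(can_elect_master(2, 3))   # True  (majority side keeps a master)
print(can_elect_master(1, 3))   # False (minority side cannot elect one)
```

Because two disjoint partitions can never both hold a majority, at most one side elects a master. With only two master-eligible nodes the quorum is 2, so losing either node blocks elections; this is why three is the usual minimum.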

Shards and Replicas

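Each index is divided into a configurable number of primary shards; each primary can have zero or more replica shards. Shards enable horizontal scaling. The number of primary shards is fixed when the index is created, while the number of replicas can be changed at any time. Versions before 7.0 create 5 primary shards and 1 replica per primary by default; from 7.0 on the default is 1 primary shard and 1 replica.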

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Replica shards provide high availability and increase read throughput. The cluster health status can be green (all primary and replica shards active), yellow (all primaries active but at least one replica unassigned), or red (at least one primary unassigned).
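The three health colors follow directly from the shard counts. The function below is an illustrative sketch with a hypothetical name; it mirrors the rule stated above rather than Elasticsearch's own implementation.

```python
def cluster_health(active_primaries, total_primaries,
                   active_replicas, total_replicas):
    """Derive the health color from how many shard copies are active."""
    if active_primaries < total_primaries:
        return "red"      # some data is unavailable
    if active_replicas < total_replicas:
        return "yellow"   # all data available, but redundancy is reduced
    return "green"


# 5 primaries with 1 replica each means 5 replica shards in total.
print(cluster_health(5, 5, 5, 5))  # green
print(cluster_health(5, 5, 0, 5))  # yellow
print(cluster_health(4, 5, 4, 5))  # red
```

The yellow case is what a single-node cluster reports for an index with replicas, since a replica is never allocated on the same node as its primary.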

Mapping

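Mapping defines how each field is stored and indexed (similar to a database schema). Field types include text (analyzed for full-text search) and keyword (stored as an exact value, used for filtering, sorting, and aggregations). Mapping can be dynamic (Elasticsearch guesses the type from the first value it sees) or explicit (user-defined). The example below uses the 6.x syntax with a _doc mapping type; from 7.0 on, mapping types are removed and properties sits directly under mappings.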

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title":   {"type": "text"},
        "name":    {"type": "text"},
        "age":     {"type": "integer"},
        "created": {"type": "date", "format": "strict_date_optional_time||epoch_millis"}
      }
    }
  }
}
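To see why the choice between text and keyword matters, the toy sketch below compares the terms each type would put into the index. `analyze_text` is a rough stand-in for an analyzer (lowercasing and splitting on non-alphanumeric characters); real analyzers do considerably more, and both function names are hypothetical.

```python
import re


def analyze_text(value):
    """Rough stand-in for an analyzer on a text field."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


def index_keyword(value):
    """A keyword field indexes the whole value as one exact term."""
    return [value]


title = "Elasticsearch Core Concepts"
text_terms = analyze_text(title)
keyword_terms = index_keyword(title)

print(text_terms)               # ['elasticsearch', 'core', 'concepts']
print("core" in text_terms)     # True: a single word matches the text field
print("core" in keyword_terms)  # False: the keyword field needs the exact value
print(title in keyword_terms)   # True
```

This is the reason text suits free-form search while keyword suits filters, sorting, and aggregations on exact values.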

4. Basic Usage

Download and unpack Elasticsearch; no installation is required. Important directories: bin (executables), config, data, logs, plugins, etc.

Start the node with bin/elasticsearch. By default it listens on port 9200. A simple curl http://localhost:9200/ returns cluster information.

5. Mechanisms and Principles

Routing and Shard Allocation

Document routing determines the target primary shard using the formula:

shard = hash(routing) % number_of_primary_shards

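By default routing is the document _id. Every node applies the same formula, so any node can act as a coordinator. It is also why the number of primary shards cannot be changed after index creation: a different divisor would send existing documents to different shards.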
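A small sketch of the routing formula follows. Elasticsearch uses its own hash function internally; `zlib.crc32` is used here only as a deterministic stand-in, and `route` is a hypothetical helper name.

```python
import zlib


def route(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards (CRC32 as stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# The same routing value always lands on the same shard.
print(route("doc-42", 5) == route("doc-42", 5))  # True

# Changing the shard count moves most documents to a different shard.
moved = sum(1 for d in doc_ids if route(d, 5) != route(d, 6))
print(f"{moved} of {len(doc_ids)} documents would change shard")
```

The second part shows the cost of changing the divisor: most documents would no longer be found where the formula says they should be.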

Write Path

Client sends a write request to a coordinator node.

Coordinator computes the target primary shard and forwards the request.

Primary shard writes the document to its transaction log (translog) and to an in‑memory buffer.

The primary then forwards the operation to its replica shards; the write is reported as successful to the client only after the in-sync replicas have confirmed it.
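The steps above can be sketched as a toy in-memory model. The class and method names are hypothetical, CRC32 stands in for the real routing hash, and failure handling is left out entirely.

```python
import zlib


class ShardCopy:
    """One copy of a shard (primary or replica) with a translog and a buffer."""

    def __init__(self, name):
        self.name = name
        self.translog = []
        self.buffer = {}

    def apply(self, doc_id, doc):
        self.translog.append(("index", doc_id, doc))  # log the operation
        self.buffer[doc_id] = doc                     # stage it in memory
        return True                                   # acknowledge


class ToyIndex:
    def __init__(self, primaries, replicas_per_primary):
        self.primaries = [ShardCopy(f"p{i}") for i in range(primaries)]
        self.replicas = [
            [ShardCopy(f"r{i}.{j}") for j in range(replicas_per_primary)]
            for i in range(primaries)
        ]

    def index(self, doc_id, doc):
        """Coordinator role: route, write to the primary, then to its replicas."""
        shard = zlib.crc32(doc_id.encode("utf-8")) % len(self.primaries)
        primary_ok = self.primaries[shard].apply(doc_id, doc)
        replica_acks = [r.apply(doc_id, doc) for r in self.replicas[shard]]
        return {"shard": shard, "acknowledged": primary_ok and all(replica_acks)}


idx = ToyIndex(primaries=5, replicas_per_primary=1)
result = idx.index("doc-1", {"title": "hello"})
print(result["acknowledged"])  # True
```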

Storage Model

Elasticsearch stores data on disk as immutable segments. A segment is a small, self-contained Lucene inverted index; once written it never changes. New documents go into new segments; deletions are recorded in a .del file rather than applied in place, and an update is a delete followed by an add.

Segments are periodically merged in the background to reduce the number of files, reclaim space from deleted documents, and improve search performance.

Refresh, Flush, and Translog

Elasticsearch uses a delayed‑write strategy:

Refresh (default every 1 s) makes newly indexed documents visible for search by writing a new segment to the file‑system cache.

Flush (triggered when the translog reaches 512 MB or after 30 min) fsyncs the segments held in the file-system cache to disk, writes a commit point, and clears the translog.

The translog guarantees durability; on restart ES replays any uncommitted operations from the translog.
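The interplay of refresh, flush, and the translog can be followed in a toy model. It is a sketch under simplifying assumptions: documents are plain strings, the translog is assumed to be safely on disk, and a crash loses everything that was only in memory or in the file-system cache.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []              # indexed, not yet searchable
        self.translog = []            # operations since the last commit
        self.cache_segments = []      # searchable, not yet fsynced
        self.committed_segments = []  # fsynced and covered by a commit point

    def index(self, doc):
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.cache_segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Persist all segments, then clear the translog."""
        self.refresh()
        self.committed_segments.extend(self.cache_segments)
        self.cache_segments = []
        self.translog = []

    def search(self):
        segments = self.committed_segments + self.cache_segments
        return [doc for segment in segments for doc in segment]

    def crash_and_recover(self):
        """Lose volatile state, then replay the translog."""
        pending = list(self.translog)
        self.buffer, self.cache_segments, self.translog = [], [], []
        for doc in pending:
            self.index(doc)


shard = ToyShard()
shard.index("a")
print(shard.search())  # []  (not visible before a refresh)
shard.refresh()
print(shard.search())  # ['a']
shard.flush()
shard.index("b")
shard.crash_and_recover()
shard.refresh()
print(shard.search())  # ['a', 'b']  ('b' was recovered from the translog)
```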

Segment Merging

Because each refresh creates a new segment, the number of segments can explode. Background merge threads combine small segments into larger ones, discarding deleted documents and reducing file‑handle, memory, and CPU overhead.
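A merge can be pictured as copying only the live documents of several segments into one new segment. The sketch below uses a hypothetical dictionary layout for a segment (`docs` plus a `deleted` set standing in for the .del file) and ignores merge policy and scheduling.

```python
def merge_segments(segments):
    """Combine segments (oldest first) into one, dropping deleted documents."""
    live = {}
    for segment in segments:
        for doc_id, doc in segment["docs"].items():
            if doc_id not in segment["deleted"]:
                live[doc_id] = doc
    return {"docs": live, "deleted": set()}


# Document 1 was updated: marked deleted in the old segment, re-added in the new one.
# Document 3 was deleted after being indexed.
old_segment = {"docs": {1: "v1", 2: "x"}, "deleted": {1}}
new_segment = {"docs": {1: "v2", 3: "y"}, "deleted": {3}}

merged = merge_segments([old_segment, new_segment])
print(sorted(merged["docs"].items()))  # [(1, 'v2'), (2, 'x')]
```

Two segments with four stored documents become one segment with two, which is where the space and file-handle savings come from.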

6. Performance Optimization

Storage Devices

Use SSDs and, if possible, RAID 0 for maximum I/O throughput.

Avoid remote mounts (NFS, SMB) and be cautious with cloud block storage (e.g., AWS EBS).

Index Settings

Prefer sequential, compressible document IDs over random UUIDs.

Disable doc_values on fields that are never used for sorting or aggregations.

Use keyword instead of text for fields that do not need full‑text analysis.

Increase index.refresh_interval (e.g., to 30s) or set it to -1 during bulk indexing, and temporarily set number_of_replicas: 0.

Prefer the scroll API (or search_after) over deep from + size pagination, which forces every shard to collect and sort from + size hits for each request.

Limit the number of mapped fields to only those required for search, aggregation, or sorting.

Provide explicit routing values when possible to target specific shards.
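As a sketch of the bulk-indexing tip above, the snippet below only builds the request bodies for the index settings update endpoint (PUT /my_index/_settings); it does not send them. The helper names and the values restored afterwards (1s, 1 replica) are assumptions to adapt to your own index.

```python
import json


def bulk_load_settings():
    """Settings to apply before a large bulk import."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", replicas=1):
    """Settings to put back once the import has finished."""
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}


for label, body in (("before bulk load", bulk_load_settings()),
                    ("after bulk load", restore_settings())):
    print(f"# {label}")
    print("PUT /my_index/_settings")
    print(json.dumps(body, indent=2))
```

Restoring the settings afterwards matters: with refresh disabled new documents never become searchable, and with zero replicas a single node failure loses data.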

JVM Tuning

Set -Xms and -Xmx to the same value: no more than 50 % of physical RAM, and below 32 GB so that the JVM can keep using compressed object pointers.

Consider using the G1 garbage collector instead of the default CMS.

Allocate sufficient RAM for the operating system’s file‑system cache (at least half of total memory).
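The heap rule can be written down as a small calculation. The function name is hypothetical, and the 31 GB cap is an assumption chosen to stay safely under the 32 GB limit mentioned above; check the threshold for your own JVM.

```python
def recommended_heap_gb(physical_ram_gb, heap_cap_gb=31):
    """Half of physical RAM, capped just below 32 GB."""
    if physical_ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM for this rule of thumb")
    return min(physical_ram_gb // 2, heap_cap_gb)


def jvm_heap_flags(physical_ram_gb):
    """Return matching -Xms/-Xmx flags so the heap never resizes."""
    heap = recommended_heap_gb(physical_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


print(jvm_heap_flags(16))   # ['-Xms8g', '-Xmx8g']
print(jvm_heap_flags(128))  # ['-Xms31g', '-Xmx31g']
```

On the 128 GB machine the remaining memory is left to the operating system's file-system cache, which Lucene relies on for fast segment reads.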

By following these guidelines, Elasticsearch can deliver fast, reliable, and scalable search capabilities for both structured and unstructured data.

