Elasticsearch Overview: Architecture, Core Concepts, and Performance Optimization

This article provides a comprehensive introduction to Elasticsearch, covering data types, Lucene fundamentals, inverted indexes, cluster components, node roles, shard and replica mechanisms, mapping, installation, health monitoring, write path, storage strategies, segment management, refresh and translog processes, as well as practical performance and JVM tuning tips.

Architect

Feb 6, 2022

Elasticsearch is an open‑source, distributed, near‑real‑time search and analytics engine built on top of Apache Lucene. It abstracts Lucene’s complexity and offers a simple RESTful API for indexing and querying large volumes of structured and unstructured data.

Data Types in Real Life

Data can be classified as structured (e.g., relational tables) or unstructured (e.g., documents, images, videos). Structured data is typically searched via relational databases, while unstructured data requires full‑text search.

Full‑Text Search Foundations

Lucene provides the core full‑text capabilities through an inverted index. The index consists of a term dictionary (the list of unique terms) and, for each term, a postings list (the documents that contain the term).

Term        | Doc_1 | Doc_2 | Doc_3
-------------------------------------
Java        |   X   |       |
is          |   X   |   X   |
the         |   X   |   X   |
best        |   X   |   X   |
programming |   X   |   X   |
language    |   X   |   X   |
PHP         |       |   X   |
Javascript  |       |       |   X
-------------------------------------
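
As a rough illustration of the idea, the following Python sketch builds a term dictionary and postings lists for a few documents and answers an AND query from them. The sample documents, the whitespace tokenizer, and the function names are assumptions made for this example; Lucene's real data structures are far more compact and sophisticated.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the terms (term dictionary) and each postings list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, unrelated to the table above.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dog",
}
index = build_inverted_index(docs)
print(index["quick"])
print(search(index, "the dog"))
```

A lookup only touches the postings lists of the query terms, which is why a search does not need to scan every document.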

Elasticsearch Core Concepts

An Elasticsearch cluster consists of one or more nodes that share the same cluster.name. A node can be master‑eligible (it takes part in master elections), a data node (it stores and processes documents), or both. In Elasticsearch 6.x and earlier, the built‑in Zen Discovery module handles node discovery and master election using a unicast or file‑based host list; version 7.0 replaced it with a new cluster coordination layer.

Node Roles

Master node: creates/deletes indices, tracks cluster state, allocates shards.

Data node: stores primary and replica shards, handles CRUD and aggregations.

Coordinating node: any node that receives client requests, routes them to the appropriate shards, and merges results.

Shards and Replicas

Indices are split into a configurable number of primary shards. Each primary shard can have zero or more replica shards for high availability. The number of primary shards is fixed at index creation; the number of replicas can be changed at any time.

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
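
To make the arithmetic behind these two settings concrete, here is a small Python sketch. The function names are hypothetical. It relies on the fact that Elasticsearch never places a replica on the same node as its primary (or as another copy of the same shard), so every copy can only be assigned if there are at least number_of_replicas + 1 data nodes.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Primaries plus all of their replicas."""
    return number_of_shards * (1 + number_of_replicas)


def min_data_nodes_for_all_copies(number_of_replicas):
    """Copies of the same shard must live on different nodes."""
    return number_of_replicas + 1


# Values taken from the request above: 5 primaries, 1 replica each.
print(total_shard_copies(5, 1))            # 10
print(min_data_nodes_for_all_copies(1))    # 2
```

On a single‑node cluster the five replicas from this example would therefore stay unassigned.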

Mapping

Mapping defines field types, analyzers, and storage options, similar to a database schema. Fields can be text (analyzed) or keyword (exact value). Explicit mapping is preferred for predictable behavior.

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name":  {"type": "text"},
        "age":   {"type": "integer"},
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}
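
The practical difference between text and keyword fields can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch code: the regular expression is only a rough stand‑in for the standard analyzer, and all function names are invented for the example.

```python
import re


def analyze_text(value):
    """Rough stand-in for an analyzer: lowercase, then split into word tokens."""
    return re.findall(r"\w+", value.lower())


def index_keyword(value):
    """A keyword field indexes the whole value as one unmodified term."""
    return [value]


def term_matches(indexed_terms, term):
    """An exact term lookup against the indexed terms."""
    return term in indexed_terms


title = "Quick Brown Fox"
print(analyze_text(title))                               # ['quick', 'brown', 'fox']
print(term_matches(analyze_text(title), "quick"))        # True
print(term_matches(index_keyword(title), "quick"))       # False
print(term_matches(index_keyword(title), title))         # True
```

This is why text suits full‑text search over sentences, while keyword suits filtering, sorting, and aggregating on exact values.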

Installation and Basic Usage

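Elasticsearch is distributed as a zip/tar archive, so no installer is needed. After extracting it, run bin/elasticsearch. The default HTTP port is 9200; a GET request to that port on the local machine returns basic node information similar to the following (abridged):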

{
  "name" : "U7fp3O9",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "6.8.1",
    "build_flavor" : "default",
    "lucene_version" : "7.7.0"
  },
  "tagline" : "You Know, for Search"
}

Cluster Health

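Health can be queried via GET /_cluster/health and is reported as green (all primary and replica shards are allocated), yellow (all primaries are allocated, but at least one replica is not), or red (at least one primary shard is unassigned).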

{
  "cluster_name" : "wujiajian",
  "status" : "yellow",
  "number_of_nodes" : 1,
  "active_primary_shards" : 9,
  "active_shards" : 9,
  "unassigned_shards" : 5
}
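
The three colors follow a simple rule: red if any primary shard is unassigned, yellow if all primaries are assigned but some replica is not, green otherwise. The Python sketch below models only that rule; the tuple representation and the function name are assumptions for illustration and are not part of any Elasticsearch API.

```python
def cluster_status(shards):
    """shards: iterable of (is_primary, is_assigned) pairs."""
    shards = list(shards)
    if any(is_primary and not assigned for is_primary, assigned in shards):
        return "red"
    if any(not assigned for _, assigned in shards):
        return "yellow"
    return "green"


# A single node holding 2 primaries whose replicas cannot be placed anywhere.
single_node = [(True, True), (True, True), (False, False), (False, False)]
print(cluster_status(single_node))  # yellow
```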

Write Path and Routing

Documents are routed to a primary shard using the formula shard = hash(routing) % number_of_primary_shards. By default, routing is the document _id. The coordinating node forwards the request to the target primary shard, which then replicates to its replicas.
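
The routing formula can be tried out with a short Python sketch. Note the assumption: Elasticsearch uses its own Murmur3‑based hash internally, while this example substitutes CRC32 from the standard library purely so that it runs without dependencies. The function name and document ids are hypothetical.

```python
import zlib


def route(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards (CRC32 as a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# The same _id always lands on the same shard.
print(route("doc-42", 5) == route("doc-42", 5))  # True

# With a different shard count, many documents would map to another shard.
moved = sum(route(d, 5) != route(d, 6) for d in doc_ids)
print(f"{moved} of {len(doc_ids)} documents would change shards")
```

The second part shows why the number of primary shards cannot simply be changed after index creation: existing documents would no longer be found on the shard the formula points to.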

Storage Model

Elasticsearch stores data on disk as immutable segments. New documents are first written to an in‑memory indexing buffer on the JVM heap. A refresh turns the buffer into a new segment in the file‑system cache, which makes the documents searchable, and a later flush fsyncs the segments to disk. Deletions are only recorded as markers in a .del file; the space is reclaimed when segments are merged.

Refresh and Translog

A refresh occurs every second by default, making newly indexed documents searchable. The transaction log (translog) records every operation that has not yet been flushed, so that it can be replayed after a crash. By default, when the translog reaches 512 MB or 30 minutes have passed, a flush writes a new commit point and clears the translog.
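
The interplay of buffer, refresh, translog, and flush can be modeled with a toy class. This is a teaching sketch under heavy simplification, not how Lucene or Elasticsearch is implemented; the class and method names are invented for the example.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []     # indexed but not yet searchable
        self.segments = []   # immutable, searchable segments
        self.committed = 0   # number of segments covered by the last commit point
        self.translog = []   # operations recorded since the last flush

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(("index", doc))

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Write a commit point covering all segments and clear the translog."""
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self):
        return [doc for segment in self.segments for doc in segment]

    def recover(self):
        """After a crash: keep committed segments, replay the translog."""
        self.segments = self.segments[: self.committed]
        self.buffer = []
        operations, self.translog = self.translog, []
        for _, doc in operations:
            self.index(doc)


shard = ToyShard()
shard.index({"id": 1})
print(shard.search())   # [] because no refresh has happened yet
shard.refresh()
print(shard.search())   # [{'id': 1}]
```

In this model a document indexed after the last flush survives recover() only because it is still in the translog, which is the durability guarantee described above.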

Segment Merging

Background merges combine small segments into larger ones, removing deleted documents and reducing the number of file handles. Merges are throttled to avoid impacting indexing throughput.
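
A merge can be pictured as copying the live documents of several segments into one new segment and then dropping the old ones. The sketch below assumes a hypothetical representation in which a segment is a tuple of (doc_id, document) pairs and deletions are a set of ids; the real merge policy also decides which segments to combine, which is not modeled here.

```python
def merge_segments(segments, deleted_ids):
    """Return one new segment containing only documents that are not marked deleted."""
    return tuple(
        (doc_id, doc)
        for segment in segments
        for doc_id, doc in segment
        if doc_id not in deleted_ids
    )


segments = [
    ((1, "first"), (2, "second")),
    ((3, "third"),),
]
deleted_ids = {2}  # recorded as a deletion marker, still stored in its segment

merged = merge_segments(segments, deleted_ids)
print(merged)  # ((1, 'first'), (3, 'third'))
```

Only at this point is the space of document 2 actually reclaimed, and two segments (and their file handles) become one.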

Performance Optimizations

Use SSDs and RAID‑0 or multiple path.data directories for higher I/O.

Avoid remote mounts (NFS, SMB) for data directories.

Prefer sequential, compressible document IDs over random UUIDs.

Disable doc_values on fields that are never aggregated or sorted.

Use keyword instead of text for exact‑match fields.

Increase index.refresh_interval (e.g., to 30s) for bulk indexing, or set it to -1 to disable automatic refresh.

During massive imports, set number_of_replicas to 0 and re‑enable after the load.

Prefer the scroll API over deep from/size pagination to avoid large in‑memory priority queues.
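
The cost of deep pagination mentioned in the last tip can be estimated with simple arithmetic: for a from/size request, every shard builds a priority queue of from + size entries, and the coordinating node then has to sort the entries returned by all shards. The helper below is a hypothetical back‑of‑the‑envelope calculation, not an Elasticsearch API.

```python
def deep_pagination_cost(from_, size, number_of_shards):
    """Entries held per shard and entries the coordinating node must sort."""
    per_shard_queue = from_ + size
    return {
        "per_shard_queue": per_shard_queue,
        "coordinator_entries": per_shard_queue * number_of_shards,
    }


# Page 1 versus a page that starts at result 10,000, on an index with 5 shards.
print(deep_pagination_cost(0, 10, 5))
print(deep_pagination_cost(10_000, 10, 5))
```

For the second request the coordinating node handles 50,050 entries only to return 10 of them, which is the overhead the scroll API avoids.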

JVM Tuning

Set -Xms and -Xmx to the same value: no more than 50% of physical RAM, and below about 32 GB so that the JVM can keep using compressed object pointers.

Consider using the G1 garbage collector instead of CMS.

Allocate sufficient RAM for the operating system’s file‑system cache (at least half of total memory).
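
The sizing rules above can be turned into a small calculation. This Python sketch is only a starting point under stated assumptions: the function names are hypothetical, and the 31 GB cap is a conservative value chosen here to stay below the 32 GB limit, not a number taken from this article.

```python
def suggested_heap_gb(physical_ram_gb, cap_gb=31):
    """Half of physical RAM, capped to stay below the ~32 GB threshold."""
    return min(physical_ram_gb // 2, cap_gb)


def jvm_heap_options(heap_gb):
    """Use the same value for initial and maximum heap size."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


for ram_gb in (16, 64, 128):
    heap_gb = suggested_heap_gb(ram_gb)
    print(ram_gb, "GB RAM ->", " ".join(jvm_heap_options(heap_gb)))
```

Whatever is not given to the heap stays available to the operating system's file‑system cache, which Lucene relies on heavily.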

By understanding these concepts and applying the recommended configurations, users can deploy, scale, and maintain Elasticsearch clusters efficiently.
