Overview of Elasticsearch Architecture and Optimization Strategies
This article explains Elasticsearch's architecture, including its reliance on Apache Lucene, shard and replica design, routing optimization, JVM garbage‑collection tuning, memory‑locking, and index‑merge control, offering practical guidance for building and operating high‑performance search clusters.
Elasticsearch Architecture Overview
Elasticsearch is a leading big‑data engine, often combined with Logstash and Kibana to form a mature logging system; Logstash acts as an ETL tool and Kibana as a data‑analysis and visualization platform. Elasticsearch’s strength lies in its powerful search capabilities, disaster‑recovery strategies, extensible plugin interfaces, and Chinese‑tokenizer plugins that boost search and analysis. It builds on the open‑source full‑text search library Apache Lucene for indexing and searching, so its architecture must interact with Lucene internals.
Apache Lucene organizes all indexed information into an inverted index, a data structure that maps terms to documents. Unlike traditional relational databases, an inverted index is term‑oriented. Lucene indexes also store additional data such as term vectors; each index consists of multiple immutable segments that are created once and queried many times. Segments are merged according to Lucene’s internal mechanisms, a process that is I/O‑intensive but frees unused data. Analysis—performed by an Analyzer composed of a Tokenizer, Filters, and Character Mappers—converts raw text into searchable terms, and Lucene provides its own query language for search and read/write operations.
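The term-to-documents mapping described above can be sketched in a few lines of Python. This is a deliberately minimal model (a whitespace/lowercase "analyzer" and a dict of sets), not Lucene's actual segment format:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Trivial analyzer: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "Lucene builds an inverted index",
    2: "Elasticsearch builds on Lucene",
}
index = build_inverted_index(docs)
print(sorted(index["lucene"]))    # -> [1, 2]
print(sorted(index["inverted"]))  # -> [1]
```

Looking up a term is a single dictionary access, which is exactly why an inverted index is term-oriented: the expensive work (analysis) happens once at index time.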
Key Design Principles of Elasticsearch
Reasonable default configuration: a simple edit of the YAML file is enough to get a cluster up, similar to Spring’s configuration simplification.
Distributed operation mode: the powerful Zen discovery mechanism supports both multicast and unicast, embodying the “know one, know all” principle.
Peer‑to‑peer architecture: shards are automatically replicated across nodes, and master and data nodes are almost equivalent, reducing single‑point failures.
Easy cluster expansion: adding new nodes to a cluster is straightforward for developers and operators.
No restrictions on index data structures: a single index can hold multiple data types.
Near‑real‑time search and version synchronization: despite the consistency challenges inherent to any distributed system, Elasticsearch handles both well.
Shard Strategy
Choose appropriate numbers of primary shards and replicas. By default, Elasticsearch creates five primary shards per index (pre‑7.x). In a single‑node environment this over‑allocation adds unnecessary complexity; the optimal practice is to use the minimum number of shards required.
The relationship between node count, primary shards, and replicas is:
nodeCount <= primaryShards * (replicas + 1)
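The inequality can be read as an upper bound on useful cluster size for a single index: each node can hold at most one copy of a given shard, so the total number of shard copies caps the number of data nodes that index can spread across. A one-line helper makes the arithmetic concrete:

```python
def max_useful_nodes(primary_shards: int, replicas: int) -> int:
    """Upper bound on data nodes that can each hold at least one
    shard copy of a single index: primaries plus all replica copies."""
    return primary_shards * (replicas + 1)

# Pre-7.x defaults: 5 primaries, 1 replica -> 10 shard copies total,
# so an 11th data node gains nothing for this index.
print(max_useful_nodes(5, 1))  # -> 10
```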
Shard allocation can be tuned after index creation by setting cluster.routing.allocation.type to even_shard (balanced number of shards per node) or balanced (weight‑based allocation).
Shard rebalancing occurs when the cluster topology changes, such as when new data nodes join. Elasticsearch has eleven built‑in deciders that decide when to trigger reallocation; these settings can be updated at runtime.
Routing Optimization
Routing in Elasticsearch is a tag‑like attribute attached to a document at index time. Documents sharing the same routing value are stored on the same shard, allowing queries that specify the routing value to target a single shard directly, reducing distributed coordination and improving performance. It also provides resilience: if a node holding other routing values fails, queries for the specified routing continue unaffected.
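The routing-to-shard mapping follows the pattern shard = hash(_routing) % number_of_primary_shards. The sketch below illustrates the idea; real Elasticsearch uses a murmur3 hash internally, and md5 is used here only because it is a deterministic stand-in from the standard library:

```python
import hashlib

def shard_for(routing_value: str, num_primary_shards: int) -> int:
    """Simplified model of Elasticsearch routing:
    shard = hash(_routing) % number_of_primary_shards.
    (Elasticsearch actually uses murmur3, not md5.)"""
    h = int(hashlib.md5(routing_value.encode("utf-8")).hexdigest(), 16)
    return h % num_primary_shards

# The same routing value always maps to the same shard, so a query
# that specifies this routing value can be served by a single shard
# instead of fanning out to all of them.
assert shard_for("user42", 5) == shard_for("user42", 5)
```

This determinism is also why the number of primary shards cannot be changed after index creation: changing the modulus would invalidate every previously computed document placement.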
GC Tuning on Elasticsearch
Since Elasticsearch runs on the JVM, configuring the garbage collector is essential. The Xms and Xmx settings define heap size; insufficient heap leads to OutOfMemoryError. Common troubleshooting steps include enabling GC logs, using the jstat command to inspect heap usage and GC times, and generating heap dumps for analysis.
GC‑related settings can also be adjusted via elasticsearch.yml or JVM startup parameters.
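As an illustration, heap size and GC logging are typically set through JVM startup flags. The values below are placeholders, not recommendations, and the logging flags shown are the JDK 8 style (later JDKs use the unified `-Xlog:gc*` syntax):

```
# JVM startup parameters (illustrative values -- size the heap for
# your host, keeping Xms and Xmx equal and well under physical RAM)
-Xms4g
-Xmx4g

# Enable GC logging for troubleshooting (JDK 8 style flags)
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/var/log/elasticsearch/gc.log
```

With logging enabled, long or frequent collection pauses become visible in the log, which is usually the first signal that the heap is undersized.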
Avoiding Memory Swapping
Operating‑system swap can severely degrade performance. Setting bootstrap.mlockall: true in elasticsearch.yml locks the JVM heap in memory, preventing it from being swapped out, but this requires operating‑system configuration (for example, raising the memlock ulimit), which typically needs root privileges.
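A minimal configuration sketch, using the legacy setting name discussed in this article (later Elasticsearch releases renamed it bootstrap.memory_lock):

```
# elasticsearch.yml
bootstrap.mlockall: true

# On the OS side, the Elasticsearch user must be allowed to lock
# memory, e.g. via /etc/security/limits.conf:
#   elasticsearch  -  memlock  unlimited
```

If the lock fails at startup, Elasticsearch logs a warning rather than refusing to start, so it is worth verifying the setting took effect.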
Controlling Index Merges
Elasticsearch shards and replicas are Lucene indexes composed of multiple immutable segments. When many small segments accumulate, Lucene merges them into larger segments, reducing the number of files and improving query performance. Elasticsearch exposes three merge policies (tiered, log_byte_size, log_doc) and two merge schedulers (concurrent, serial) that can be tuned.
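A configuration sketch using the legacy index-level setting names from the same era as this article; the values are illustrative, not tuning advice:

```
# Index-level merge settings (legacy names; values are examples only)
index.merge.policy.type: tiered
index.merge.policy.segments_per_tier: 10
index.merge.policy.max_merged_segment: 5gb
index.merge.scheduler.type: concurrent
index.merge.scheduler.max_thread_count: 2
```

Because merging is I/O-intensive, the usual trade-off is between fewer, larger segments (better query performance) and the disk bandwidth consumed to produce them; on spinning disks in particular, lowering the merge thread count limits that contention.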