Elasticsearch Fundamentals: Architecture, Indexing, Query DSL and Search Mechanics

Elasticsearch is a distributed, schemaless search engine built on Lucene that stores JSON documents in sharded indexes, uses immutable segments and merges, provides a flexible Query DSL with aggregations and relevance scoring, and executes distributed query‑then‑fetch searches with features like scrolling, optimistic locking, and zero‑downtime reindexing.

Tencent Music Tech Team

Jan 4, 2022

Elasticsearch is a distributed, schemaless search engine built on the Lucene library. It provides HTTP APIs and stores data as JSON documents, making it accessible from many programming languages.

Core concepts include:

Index – a container for documents of a similar type.

Type – deprecated since 7.0 and removed in 8.0; in 7.x the only remaining type is _doc.

Document – the smallest searchable unit, stored as JSON with fields of various data types.

Mapping – defines the schema of a document; can be created dynamically or explicitly.

Settings – control shard count, replica count and other index‑level properties.
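
As a rough illustration of how settings and an explicit mapping fit together, the sketch below builds the JSON body for an index-creation request. The index name songs and all field names are hypothetical and not taken from the original article.

```python
import json

# Hypothetical index name and fields, chosen only for illustration.
INDEX_NAME = "songs"

create_index_body = {
    "settings": {
        "number_of_shards": 3,    # primary shard count, set at creation time
        "number_of_replicas": 1,  # replica copies per primary
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},      # analyzed for full-text search
            "artist": {"type": "keyword"},  # stored as a single exact value
            "released": {"type": "date"},
            "plays": {"type": "long"},
        }
    },
}

# The body would be sent with: PUT /songs
print(f"PUT /{INDEX_NAME}")
print(json.dumps(create_index_body, indent=2))
```

Leaving out the mappings block would let Elasticsearch infer field types dynamically from the first documents it sees.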

Elasticsearch stores data in shards (primary and replica). A shard is essentially a Lucene Index composed of immutable Segments. The immutable design of Lucene’s inverted index enables fast concurrent reads and simplifies recovery.

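Indexing process:

The document is written to an in‑memory index buffer, where it is not yet searchable.

At the same time, the operation is appended to the transaction log (translog, a write‑ahead log) for durability.

A refresh (by default every 1 s, controlled by index.refresh_interval) turns the buffer into a new segment and makes its documents searchable.

A flush persists the segments to disk and clears the translog; by default it runs when the translog reaches 512 MB, and older releases also flushed every 30 min.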
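
The write path above can be pictured with a small toy model. This is only a sketch of the buffer, translog, refresh and flush steps; the ToyShard class and its attributes are invented for illustration and do not mirror Lucene's real data structures.

```python
class ToyShard:
    """Toy model of the write path; not how Lucene is implemented."""

    def __init__(self):
        self.buffer = []    # in-memory index buffer (not searchable)
        self.translog = []  # operations not yet committed to disk
        self.segments = []  # searchable, immutable segments (tuples)
        self.committed = 0  # number of segments persisted by a flush

    def index(self, doc):
        # A write goes to the buffer and to the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new immutable, searchable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Persist all segments and clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self, predicate):
        # Only documents that already live in a segment are visible.
        return [d for seg in self.segments for d in seg if predicate(d)]


shard = ToyShard()
shard.index({"id": 1, "title": "hello"})
before = shard.search(lambda d: True)  # empty: not refreshed yet
shard.refresh()
after = shard.search(lambda d: True)   # the document is now visible
shard.flush()
```

The gap between before and after is why Elasticsearch is usually described as near real‑time rather than real‑time.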

When many small segments accumulate, Elasticsearch merges them into larger ones in the background; the merge also physically removes documents that were only marked as deleted (recorded in .del files). A manual merge can be triggered with POST /index_name/_forcemerge.
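
A merge can be sketched in a few lines under the simplifying assumption that a segment is a tuple of documents and deletions are a set of IDs. The function name merge_segments and the data are hypothetical.

```python
def merge_segments(segments, deleted_ids):
    """Combine small segments into one, dropping documents marked deleted."""
    merged = [
        doc
        for segment in segments
        for doc in segment
        if doc["id"] not in deleted_ids
    ]
    return [tuple(merged)]


segments = [
    ({"id": 1}, {"id": 2}),
    ({"id": 3},),
    ({"id": 4}, {"id": 5}),
]
deleted_ids = {2, 5}  # tombstones; the documents still occupy space until merged

merged = merge_segments(segments, deleted_ids)
```

Because segments are immutable, a delete can only be recorded as a marker; the space is reclaimed when a merge rewrites the surviving documents into a new segment.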

Query DSL offers two ways to search:

URI Search – simple URL parameters (e.g., GET /_search?q=name:John).

Request Body Search – full JSON DSL for complex queries.

A minimal request body that matches every document:

{
  "query": { "match_all": {} }
}
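
A slightly richer request body combines a scored full‑text clause with a non‑scoring filter inside a bool query. The sketch below builds such a body; the field names title and artist are assumptions made for the example.

```python
import json


def build_search(text, artist, size=10):
    """Build a request body: full-text match on title, exact filter on artist."""
    return {
        "size": size,
        "query": {
            "bool": {
                # must clauses contribute to the relevance score
                "must": [{"match": {"title": text}}],
                # filter clauses only include or exclude; they do not score
                "filter": [{"term": {"artist": artist}}],
            }
        },
    }


body = build_search("love song", "some-artist")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of GET /index_name/_search.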

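Aggregations compute analytics over the matching documents: bucket aggregations group documents (e.g., terms, date_histogram), metric aggregations calculate values such as min, max and avg over each group, and pipeline aggregations operate on the output of other aggregations rather than on documents.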
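
The three kinds of aggregation can appear together in one request. The following sketch assumes hypothetical fields artist (keyword) and plays (numeric): a terms bucket, an avg metric inside each bucket, and a max_bucket pipeline that reads the metric through buckets_path.

```python
import json

agg_body = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        # bucket aggregation: one bucket per distinct artist
        "by_artist": {
            "terms": {"field": "artist"},
            "aggs": {
                # metric aggregation computed inside each bucket
                "avg_plays": {"avg": {"field": "plays"}}
            },
        },
        # pipeline aggregation: works on the buckets above, not on documents
        "top_artist_avg": {
            "max_bucket": {"buckets_path": "by_artist>avg_plays"}
        },
    },
}

print(json.dumps(agg_body, indent=2))
```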

Relevance scoring:

TF‑IDF – term frequency multiplied by inverse document frequency.

BM25 – the default since Elasticsearch 5.0; the term‑frequency contribution saturates, so very high term frequencies stop raising the score, and document length is normalized.

Scoring can be customized with function_score, script_score, or by adjusting field boosts.
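
To make the saturation effect concrete, the sketch below compares a textbook TF‑IDF with the textbook BM25 formula, using the commonly cited defaults k1 = 1.2 and b = 0.75. It is an illustration of the formulas only; Lucene's actual implementations differ in details, and the corpus numbers are made up.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency in the form used by BM25."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_idf(tf, num_docs, doc_freq):
    """Textbook TF-IDF: grows linearly with term frequency."""
    return tf * idf(num_docs, doc_freq)


def bm25(tf, num_docs, doc_freq, doc_len, avg_len, k1=1.2, b=0.75):
    """Textbook BM25: the term-frequency part approaches (k1 + 1) as tf grows."""
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf(num_docs, doc_freq) * tf * (k1 + 1) / (tf + norm)


# Made-up corpus statistics: 1000 documents, the term occurs in 10 of them,
# and the scored document has exactly average length.
N, DF, LEN, AVG = 1000, 10, 100, 100

for tf in (1, 5, 50, 1000):
    print(tf, round(tf_idf(tf, N, DF), 2), round(bm25(tf, N, DF, LEN, AVG), 2))
```

However large tf becomes, the BM25 score stays below idf × (k1 + 1), while the TF‑IDF score keeps growing.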

Analyzers and tokenizers break text into terms. Elasticsearch ships with standard, whitespace, and language‑specific analyzers, and supports plugins such as ICU, IK, and THULAC for Chinese tokenization. An analyzer consists of character filters, a tokenizer, and token filters.
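
The three analyzer stages can be mimicked in plain Python. This is a toy pipeline with an invented stopword list and regex tokenizer, not a reimplementation of any built‑in Elasticsearch analyzer.

```python
import re

STOPWORDS = {"the", "a", "an", "of"}  # tiny illustrative list


def strip_html(text):
    """Character filter: replace HTML tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Run character filter, then tokenizer, then token filters in order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<b>The</b> Sound of Silence")
print(tokens)  # ['sound', 'silence']
```

The order matters: lowercasing runs before the stopword filter so that "The" is recognized as a stopword.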

Distributed search execution follows a two‑phase query‑then‑fetch model:

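Query phase: the coordinating node forwards the query to one copy (primary or replica) of every relevant shard; each shard returns a sorted list of document IDs and scores for its top from + size hits.

Fetch phase: the coordinating node merges the per‑shard lists, selects the final page, and fetches the full documents for just those hits via a multi‑get request to the shards that hold them.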
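
The two phases can be simulated in memory. In this sketch each shard is a plain dictionary of document ID to (score, source), with scores precomputed; the shard contents and function names are invented for illustration.

```python
import heapq

# Three toy shards: doc_id -> (precomputed score, stored document)
SHARDS = [
    {"d1": (2.5, {"title": "one"}), "d2": (0.7, {"title": "two"})},
    {"d3": (3.1, {"title": "three"}), "d4": (1.9, {"title": "four"})},
    {"d5": (2.8, {"title": "five"})},
]


def query_phase(shard, from_, size):
    """Each shard returns only (score, id) for its own top from + size hits."""
    hits = [(score, doc_id) for doc_id, (score, _source) in shard.items()]
    return heapq.nlargest(from_ + size, hits)


def search(shards, from_=0, size=2):
    # Query phase: gather lightweight candidates from every shard.
    candidates = []
    for shard_no, shard in enumerate(shards):
        for score, doc_id in query_phase(shard, from_, size):
            candidates.append((score, doc_id, shard_no))

    # Merge on the coordinating node and cut out the requested page.
    candidates.sort(key=lambda c: c[0], reverse=True)
    page = candidates[from_:from_ + size]

    # Fetch phase: load full documents only for the hits on the page.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_no][doc_id][1]}
        for score, doc_id, shard_no in page
    ]


first_page = search(SHARDS, from_=0, size=2)
second_page = search(SHARDS, from_=2, size=2)
```

Note that for the second page every shard already has to return four candidates, which is the cost that grows with deep pagination.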

Deep pagination can cause performance issues, because every shard has to collect and return from + size candidates for each page. Elasticsearch mitigates this with:

search_after – uses the last hit’s sort values to fetch the next page.

scroll – creates a point‑in‑time snapshot for efficient scrolling through large result sets.
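
A search_after loop only needs the sort array of the last hit from the previous page. The sketch below builds the request bodies; the sort fields released and song_id are hypothetical, with song_id standing in for a unique tiebreaker field.

```python
def first_page(size):
    """First request: a deterministic sort with a unique tiebreaker field."""
    return {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"released": "desc"}, {"song_id": "asc"}],
    }


def next_page(prev_body, last_hit):
    """Next request: same query and sort, plus the last hit's sort values."""
    body = dict(prev_body)
    body["search_after"] = last_hit["sort"]
    return body


# A hit as it might appear at the end of a response page (made-up values).
last_hit = {"_id": "42", "sort": [1641254400000, "s-42"]}

page_one = first_page(20)
page_two = next_page(page_one, last_hit)
```

Because the request carries a position instead of an offset, each shard only has to return size hits regardless of how deep the page is.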

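Concurrency control relies on optimistic locking. A write can be made conditional on the document's internal sequence number (_seq_no) and primary term (_primary_term) through the if_seq_no and if_primary_term request parameters, or on an externally managed version through version and version_type=external.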
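
The compare‑and‑set behaviour can be sketched with an in‑memory stand‑in for an index. The ToyIndex class and VersionConflict exception are invented for this example; only the parameter names if_seq_no and if_primary_term correspond to real request parameters.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected sequence number is stale."""


class ToyIndex:
    """In-memory stand-in for an index; illustrates the check only."""

    def __init__(self):
        self.docs = {}         # doc_id -> (source, seq_no, primary_term)
        self.next_seq_no = 0
        self.primary_term = 1

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        if if_seq_no is not None:
            current = self.docs.get(doc_id)
            expected = (if_seq_no, if_primary_term)
            if current is None or (current[1], current[2]) != expected:
                raise VersionConflict(doc_id)
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.docs[doc_id] = (source, seq_no, self.primary_term)
        return {"_seq_no": seq_no, "_primary_term": self.primary_term}


idx = ToyIndex()
first = idx.index("1", {"plays": 1})

# Writer A updates using the values it read; this succeeds.
second = idx.index(
    "1", {"plays": 2},
    if_seq_no=first["_seq_no"], if_primary_term=first["_primary_term"],
)

# Writer B still holds the old values; its write is rejected.
try:
    idx.index(
        "1", {"plays": 99},
        if_seq_no=first["_seq_no"], if_primary_term=first["_primary_term"],
    )
    stale_write_rejected = False
except VersionConflict:
    stale_write_rejected = True
```

The rejected writer is expected to re‑read the document, reapply its change and retry with the fresh values.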

Additional useful features include index templates, dynamic templates, index aliases for zero‑downtime reindexing, and completion suggester (FST‑based) for fast autocomplete.
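
Zero‑downtime reindexing with an alias usually comes down to two request bodies: one for POST /_reindex and one for POST /_aliases. The sketch below builds both; the index names songs_v1 and songs_v2 and the alias songs are hypothetical.

```python
import json


def reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy documents into the new index."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def swap_alias_body(alias, old_index, new_index):
    """Body for POST /_aliases: both actions are applied in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


copy_request = reindex_body("songs_v1", "songs_v2")
swap_request = swap_alias_body("songs", "songs_v1", "songs_v2")

print(json.dumps(copy_request, indent=2))
print(json.dumps(swap_request, indent=2))
```

Clients that always query the alias never need to know which physical index is behind it, so the swap is invisible to them.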

Overall, Elasticsearch combines a powerful inverted‑index engine with distributed scalability, making it suitable for real‑time search, analytics, and log processing workloads.
