Elasticsearch Fundamentals: Architecture, Indexing, Query DSL and Search Mechanics

Elasticsearch is a distributed, schemaless search engine built on Lucene that stores JSON documents in sharded indexes. This article covers its immutable segments and merges, the Query DSL with aggregations and relevance scoring, distributed query‑then‑fetch search, and features such as scrolling, optimistic locking, and zero‑downtime reindexing.

Tencent Music Tech Team

Elasticsearch is a distributed, schemaless search engine built on the Lucene library. It provides HTTP APIs and stores data as JSON documents, making it accessible from many programming languages.

Core concepts include:

Index – a container for documents of a similar type.

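Type – a logical grouping of documents inside an index in older versions; deprecated since 7.0, where the only remaining type is _doc, and removed in 8.0.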

Document – the smallest searchable unit, stored as JSON with fields of various data types.

Mapping – defines the fields of the documents in an index and their data types; can be created dynamically or explicitly.

Settings – control shard count, replica count and other index‑level properties.
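As a rough illustration of how mapping and settings fit together, the sketch below builds the JSON body of a create-index request as a plain Python dict. The index name songs and all field names are hypothetical, chosen only for this example, and nothing is sent to a cluster.

```python
import json


def build_index_body(shards: int, replicas: int) -> dict:
    """Return a create-index body with explicit settings and mappings."""
    return {
        # Index-level properties: shard and replica counts.
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        # Explicit mapping: field names and their data types (hypothetical).
        "mappings": {
            "properties": {
                "title": {"type": "text"},        # analyzed full-text field
                "artist": {"type": "keyword"},    # exact-value field
                "duration_sec": {"type": "integer"},
                "released": {"type": "date"},
            }
        },
    }


body = build_index_body(shards=3, replicas=1)
print("PUT /songs")
print(json.dumps(body, indent=2))
```

Fields that are not listed would still be accepted under dynamic mapping, with Elasticsearch inferring a type from the first value it sees.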

Elasticsearch stores data in shards: primaries and their replica copies. A shard is essentially a Lucene index composed of immutable segments. Because the segments of Lucene’s inverted index never change once written, they can be read concurrently at high speed, and recovery is simpler.

Indexing process:

A document is written to the in‑memory Index Buffer.

Each change is also appended to the transaction log (the translog, a write‑ahead log) for durability.

On refresh (by default every 1 s, controlled by index.refresh_interval), the buffer is written out as a new segment and its documents become searchable.

On flush, segments are persisted to disk and the translog is cleared (by default every 30 min or when the translog reaches 512 MB; exact defaults depend on the Elasticsearch version).

When many small segments accumulate, Elasticsearch merges them into larger ones. Because segments are immutable, a delete only marks the document as deleted (recorded in .del files); the merge is what physically removes it. A manual merge can be triggered with POST /index_name/_forcemerge.
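The write path described above can be sketched with a toy in-memory model. This is a teaching aid under simplifying assumptions, not how Lucene or Elasticsearch is implemented: the class ToyShard and its methods are hypothetical, segments are tuples of document IDs, and "disk" is not modeled at all.

```python
class ToyShard:
    """Toy model of a shard's write path (illustrative only)."""

    def __init__(self):
        self.buffer = []      # in-memory index buffer, not yet searchable
        self.translog = []    # append-only log of operations
        self.segments = []    # immutable segments (tuples of doc ids)
        self.deleted = set()  # ids marked as deleted, still inside segments

    def index(self, doc_id):
        self.buffer.append(doc_id)
        self.translog.append(("index", doc_id))

    def delete(self, doc_id):
        # Segments are immutable, so a delete is only a marker.
        self.deleted.add(doc_id)
        self.translog.append(("delete", doc_id))

    def refresh(self):
        # Turn the buffer into a new searchable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Make everything durable, then the translog is no longer needed.
        self.refresh()
        self.translog.clear()

    def search(self):
        # Only refreshed segments are visible; deleted ids are filtered out.
        return [d for seg in self.segments for d in seg if d not in self.deleted]

    def force_merge(self):
        # Rewrite all segments into one and drop deleted documents for real.
        merged = [d for seg in self.segments for d in seg]
        live = tuple(d for d in merged if d not in self.deleted)
        self.segments = [live] if live else []
        self.deleted -= set(merged)


shard = ToyShard()
shard.index("a")
shard.index("b")
print(shard.search())                       # [] : nothing refreshed yet
shard.refresh()
shard.index("c")
shard.refresh()
shard.delete("a")
print(len(shard.segments), shard.search())  # 2 ['b', 'c']
shard.force_merge()
print(len(shard.segments), shard.search())  # 1 ['b', 'c']
```

The model makes one point visible: a document is durable (in the translog) before it is searchable (in a segment), and a deleted document keeps occupying space until a merge rewrites the segment.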

Elasticsearch offers two ways to search:

URI Search – simple URL parameters (e.g., GET /_search?q=name:John ).

Request Body Search – full JSON DSL for complex queries.

A minimal request body that matches every document:

{
  "query": { "match_all": {} }
}
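For something closer to a real request, the sketch below assembles a bool query that combines a scored full-text clause with a non-scoring filter. The field names title and artist are assumptions made for the example; the body is only printed, not sent.

```python
import json


def build_search_body(text: str, artist: str, size: int = 10) -> dict:
    """Build a request-body search combining full-text match and a filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"title": text}}],
                # "filter" clauses only include or exclude; they are not scored.
                "filter": [{"term": {"artist": artist}}],
            }
        },
    }


body = build_search_body("blue sky", "some-artist", size=5)
print("GET /songs/_search")
print(json.dumps(body, indent=2))
```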

Aggregations come in three kinds: bucket aggregations group documents, metric aggregations compute values such as min, max, and avg over them, and pipeline aggregations operate on the output of other aggregations.
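A small sketch of how the aggregation kinds nest in one request body: a terms bucket, an avg metric inside each bucket, and a max_bucket pipeline aggregation that reads the metric through buckets_path. The field names genre and duration_sec and the aggregation names are hypothetical.

```python
import json


def build_agg_body() -> dict:
    """Build an aggregation-only request (size 0 returns no hits)."""
    return {
        "size": 0,
        "aggs": {
            # Bucket aggregation: one bucket per distinct genre.
            "by_genre": {
                "terms": {"field": "genre"},
                "aggs": {
                    # Metric aggregation computed inside each bucket.
                    "avg_duration": {"avg": {"field": "duration_sec"}}
                },
            },
            # Pipeline aggregation: picks the bucket with the highest average.
            "longest_genre": {
                "max_bucket": {"buckets_path": "by_genre>avg_duration"}
            },
        },
    }


agg_body = build_agg_body()
print(json.dumps(agg_body, indent=2))
```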

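Relevance scoring:

TF‑IDF – the classic model: term frequency multiplied by inverse document frequency, so the score keeps growing as a term repeats.

BM25 – the default since Elasticsearch 5.0; it saturates, so additional occurrences of a term add less and less to the score, and it normalizes for document length.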

Scoring can be customized with function_score , script_score , or by adjusting field boosts.
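To make the difference between the two models concrete, the sketch below compares a simplified TF‑IDF (tf multiplied by idf, as described above) with the textbook BM25 formula for a single term. The parameters k1 = 1.2 and b = 0.75 are the usual defaults; the corpus numbers are made up, and Lucene's actual implementation differs in details such as constant factors.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """BM25-style inverse document frequency (always positive)."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def tfidf_score(tf: int, num_docs: int, doc_freq: int) -> float:
    """Simplified TF-IDF: grows linearly with term frequency."""
    return tf * idf(num_docs, doc_freq)


def bm25_score(tf: int, doc_len: float, avg_doc_len: float,
               num_docs: int, doc_freq: int,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 for one term: saturating tf plus length normalization."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(num_docs, doc_freq) * tf * (k1 + 1) / (tf + norm)


# Made-up corpus statistics for the example.
NUM_DOCS, DOC_FREQ, AVG_LEN = 1000, 50, 100.0

for tf in (1, 2, 5, 20, 100):
    classic = tfidf_score(tf, NUM_DOCS, DOC_FREQ)
    bm25 = bm25_score(tf, AVG_LEN, AVG_LEN, NUM_DOCS, DOC_FREQ)
    print(f"tf={tf:3d}  tf-idf={classic:8.2f}  bm25={bm25:5.2f}")
```

In this formulation the BM25 score for a term can never exceed idf × (k1 + 1), no matter how often the term repeats, which is the saturation the list above refers to.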

Analyzers break text into terms. An analyzer runs in three stages: character filters clean the raw text, a tokenizer splits it into tokens, and token filters transform those tokens. Elasticsearch ships with standard, whitespace, and language‑specific analyzers, and supports plugins such as ICU, IK, and THULAC, which are commonly used for Chinese tokenization.
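The three-stage pipeline can be imitated in a few lines of plain Python. This is only a sketch of the idea: the function names, the regular expressions, and the stop-word list are all assumptions for the example, and real analyzers handle far more cases.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of"}  # tiny illustrative list


def strip_html(text: str) -> str:
    """Character filter: replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text: str) -> list:
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: list) -> list:
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens: list) -> list:
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text: str) -> list:
    """Run character filters, then the tokenizer, then token filters."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The Quick Brown Fox is FAST</p>"))
# ['quick', 'brown', 'fox', 'fast']
```

The order of the token filters matters: lowercasing runs first so that "The" is recognized as a stop word.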

Distributed search execution follows a two‑phase query‑then‑fetch model:

Query phase – the coordinating node forwards the query to one copy (primary or replica) of each relevant shard; each shard returns a sorted list of document IDs and scores for its top from + size hits.

Fetch phase – the coordinating node merges the per‑shard results, selects the final page, and fetches the full documents via a multi‑get request.
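The two phases can be simulated with in-memory "shards". The data, the function names, and the pre-computed scores below are all hypothetical; the point is only to show that each shard returns lightweight (score, id) pairs first and that full documents are fetched for the final page alone.

```python
import heapq

# Hypothetical shards: doc id -> (score for the current query, source document).
SHARDS = [
    {"a1": (2.5, {"title": "A1"}), "a2": (0.7, {"title": "A2"}), "a3": (1.9, {"title": "A3"})},
    {"b1": (3.1, {"title": "B1"}), "b2": (1.2, {"title": "B2"})},
    {"c1": (0.4, {"title": "C1"}), "c2": (2.8, {"title": "C2"})},
]


def query_phase(shard: dict, k: int) -> list:
    """Return the shard's top-k (score, doc_id) pairs, without the documents."""
    pairs = ((score, doc_id) for doc_id, (score, _source) in shard.items())
    return heapq.nlargest(k, pairs)


def search(shards: list, from_: int = 0, size: int = 2) -> list:
    """Coordinating-node logic: scatter the query, merge, then fetch."""
    k = from_ + size  # every shard must return enough hits to cover the page
    candidates = []
    for shard_no, shard in enumerate(shards):
        for score, doc_id in query_phase(shard, k):
            candidates.append((score, doc_id, shard_no))

    # Merge: global ordering by score, then cut out the requested page.
    candidates.sort(key=lambda c: c[0], reverse=True)
    page = candidates[from_:from_ + size]

    # Fetch phase: load full documents only for the hits on the final page.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_no][doc_id][1]}
        for score, doc_id, shard_no in page
    ]


print([hit["_id"] for hit in search(SHARDS, from_=0, size=2)])  # ['b1', 'c2']
print([hit["_id"] for hit in search(SHARDS, from_=2, size=2)])  # ['a1', 'a3']
```

Note that the second call asks every shard for four hits to return two, which is why large from values become expensive.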

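Deep pagination is expensive: to serve a page at offset from, every shard must produce its top from + size hits and the coordinating node must sort all of them. Elasticsearch mitigates this with: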

search_after – uses the last hit’s sort values to fetch the next page.

scroll – creates a point‑in‑time snapshot for efficient scrolling through large result sets.
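A minimal sketch of search_after paging, built as plain request bodies. The sort fields released and song_id are hypothetical (the second acts as a unique tiebreaker so the ordering is total), and the example hit is invented to stand in for the last hit of a real response.

```python
import json


def first_page(size: int) -> dict:
    """Request body for the first page, with a deterministic sort."""
    return {
        "size": size,
        "query": {"match_all": {}},
        # A unique tiebreaker field keeps the order stable between pages.
        "sort": [{"released": "desc"}, {"song_id": "asc"}],
    }


def next_page(previous_body: dict, last_hit: dict) -> dict:
    """Request body for the following page, resuming after the last hit."""
    body = dict(previous_body)
    # Each hit of a sorted search carries its sort values in "sort".
    body["search_after"] = last_hit["sort"]
    return body


page1 = first_page(size=100)
# Invented example of the last hit returned for page 1.
last_hit = {"_id": "42", "sort": [1577836800000, "song-42"]}
page2 = next_page(page1, last_hit)
print(json.dumps(page2, indent=2))
```

Because the request carries the position explicitly, each shard only needs to return size hits after that position instead of from + size.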

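Concurrency control relies on optimistic locking. A write can be made conditional on the document’s internal sequence number ( _seq_no ) and primary term ( _primary_term ) through the if_seq_no and if_primary_term request parameters, and it is rejected if the document has changed since it was read. Alternatively, external versioning is available via version and version_type=external .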
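The compare-then-write behavior can be sketched with a toy in-memory store. ToyDocStore and VersionConflict are hypothetical names for this example; the keyword arguments mirror the if_seq_no and if_primary_term request parameters, but the code does not talk to Elasticsearch.

```python
class VersionConflict(Exception):
    """Raised when the document changed since the caller last read it."""


class ToyDocStore:
    """Toy single-shard store that assigns sequence numbers to writes."""

    def __init__(self):
        self.primary_term = 1
        self.next_seq_no = 0
        self.docs = {}  # doc_id -> (source, seq_no, primary_term)

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None:
            # Conditional write: only proceed if nobody wrote in between.
            seen = (if_seq_no, if_primary_term)
            if current is None or (current[1], current[2]) != seen:
                raise VersionConflict(f"{doc_id}: expected {seen}")
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.docs[doc_id] = (source, seq_no, self.primary_term)
        return {"_seq_no": seq_no, "_primary_term": self.primary_term}


store = ToyDocStore()
created = store.index("1", {"plays": 1})

# Two clients read the same version, then both try to update it.
store.index("1", {"plays": 2},
            if_seq_no=created["_seq_no"], if_primary_term=created["_primary_term"])
try:
    store.index("1", {"plays": 99},
                if_seq_no=created["_seq_no"], if_primary_term=created["_primary_term"])
    conflict = False
except VersionConflict:
    conflict = True  # the second client must re-read and retry

print(conflict, store.docs["1"][0])  # True {'plays': 2}
```

No lock is held between the read and the write; the losing client simply discovers the conflict at write time and decides whether to retry.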

Additional useful features include index templates, dynamic templates, index aliases for zero‑downtime reindexing, and the completion suggester (FST‑based) for fast autocomplete.
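Zero-downtime reindexing with an alias usually comes down to one atomic alias swap once the new index is populated. The sketch below builds the body of such a POST /_aliases request; the alias and index names (songs, songs_v1, songs_v2) are hypothetical.

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases body that moves an alias from one index to another."""
    return {
        "actions": [
            # Both actions are applied together, so readers of the alias
            # never see a moment where it points at no index.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = build_alias_swap("songs", "songs_v1", "songs_v2")
print("POST /_aliases")
print(json.dumps(swap, indent=2))
```

Applications keep querying the alias name throughout, so the switch to the reindexed data needs no client-side change.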

Overall, Elasticsearch combines a powerful inverted‑index engine with distributed scalability, making it suitable for real‑time search, analytics, and log processing workloads.


0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.