Master ElasticSearch: Core Concepts, Architecture, and Search Workflow Explained
This article provides a comprehensive overview of ElasticSearch, covering its definition, core components such as indexes, shards and replicas, the analysis pipeline, inverted index mechanics, and the two‑stage search process that enables scalable, fault‑tolerant full‑text search in big‑data environments.
ElasticSearch Overview
ElasticSearch is a distributed full‑text search engine built on Apache Lucene, widely used in big‑data scenarios for fast, scalable search and analytics.
Core Components
Index: a collection of documents with similar characteristics, together with their mapping and inverted‑index files; its data may reside on one or many nodes.
Type: a logical grouping of similar documents within an index, analogous to a table in a relational database. Mapping types were deprecated starting in the 6.x releases and removed in later major versions, so a current index holds a single document type.
Document: the basic searchable unit, represented as JSON, similar to a row.
Field: the smallest unit within a document, comparable to a column.
Shard: a partition of an index that enables horizontal scaling; each shard is a self‑contained Lucene index.
There are two shard types: primary shards and replica shards. A replica is a copy of a primary shard held on a different node; replicas provide redundancy and let query load be balanced across copies.
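As a rough illustration of how shards and replicas are declared, the sketch below builds the JSON body that could be sent when creating an index. The field names `title` and `body` and the shard counts are assumptions chosen for the example; the helper `total_shard_copies` is hypothetical and only does the arithmetic.

```python
import json

# Hypothetical index body; "title" and "body" are illustrative field names.
index_body = {
    "settings": {
        "number_of_shards": 3,    # primary shards, chosen at index creation
        "number_of_replicas": 1,  # replica copies kept for each primary
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "body": {"type": "text"},
        }
    },
}


def total_shard_copies(body: dict) -> int:
    """Count primaries plus all replica copies for one index."""
    settings = body["settings"]
    primaries = settings["number_of_shards"]
    replicas = settings["number_of_replicas"]
    return primaries * (1 + replicas)


# Serialize the body as it would appear in a create-index request.
request_json = json.dumps(index_body, indent=2)
print(request_json)
print("shard copies in the cluster:", total_shard_copies(index_body))
```

With three primaries and one replica each, the cluster has to place six shard copies, which is why adding replicas raises both storage cost and read capacity.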
ElasticSearch Workflow
Search consists of two stages:
1. Query Phase
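The client sends a request to a coordinating node, which forwards it to one copy (primary or replica) of every shard in the target index.
Each shard executes the query locally and builds a priority queue of its top matching documents, sized from + size, returning only document IDs and sort values.
The coordinating node merges and sorts the per‑shard results into one global ranking and selects the requested page.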
2. Fetch Phase
The coordinating node requests the full document source from the shards that hold the document IDs selected in the query phase, assembles the response, and returns it to the client.
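The two phases can be sketched with in-memory data. This is a toy model, not ElasticSearch code: the `SHARDS` list, the scores, and the function names `query_phase` and `search` are all assumptions made up for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# Toy "shards": each maps a document ID to (relevance score, source).
# Scores are invented here; a real shard computes them per query.
SHARDS: List[Dict[str, Tuple[float, dict]]] = [
    {"a1": (2.5, {"title": "intro"}), "a2": (0.7, {"title": "setup"})},
    {"b1": (1.9, {"title": "shards"}), "b2": (3.1, {"title": "replicas"})},
]


def query_phase(shard_id: int, size: int) -> List[Tuple[float, int, str]]:
    """Return (score, shard_id, doc_id) for the shard's local top hits."""
    shard = SHARDS[shard_id]
    hits = [(score, shard_id, doc_id) for doc_id, (score, _) in shard.items()]
    return heapq.nlargest(size, hits)


def search(from_: int, size: int) -> List[dict]:
    """Run the query phase on every shard, then fetch the final page."""
    # Query phase: each shard returns from_ + size candidates, because
    # any of them could end up on the requested page.
    candidates: List[Tuple[float, int, str]] = []
    for shard_id in range(len(SHARDS)):
        candidates.extend(query_phase(shard_id, from_ + size))

    # Coordinating node: global sort by score, then pagination.
    page = sorted(candidates, reverse=True)[from_:from_ + size]

    # Fetch phase: load the source only for documents on the final page.
    return [
        {"_id": doc_id, "_score": score, "_source": SHARDS[sid][doc_id][1]}
        for score, sid, doc_id in page
    ]


for hit in search(from_=0, size=2):
    print(hit)
```

Only IDs and scores travel in the query phase; document sources are read once, for the hits that survive the merge. The `from_ + size` queue per shard also hints at why deep pagination gets expensive.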
Text Analysis and Inverted Index
ElasticSearch uses analyzers to turn raw text into the terms stored in an inverted index. An analyzer chains zero or more character filters, exactly one tokenizer, and zero or more token filters, applied in that order.
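For example, a character filter such as html_strip removes HTML tags, so the following input is reduced to its text content before tokenization:
<div>
<span>mikechen's Internet Architecture</span>
</div>
Built‑in analyzers include Standard, Simple, Stop, Whitespace, Keyword, Pattern, and language‑specific analyzers.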
Token filters further process tokens (e.g., lower‑casing, stop‑word removal).
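The three analysis stages can be imitated in a few lines of plain Python. This is a simplified sketch under stated assumptions, not the behaviour of any built-in analyzer: the regular expressions, the tiny stop-word list, and all function names are illustrative.

```python
import re
from typing import List

STOP_WORDS = {"a", "an", "the", "of", "and"}  # tiny illustrative list


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text: str) -> List[str]:
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: normalize case."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Token filter: drop very common words."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text: str) -> List[str]:
    """Apply character filter, tokenizer, then token filters, in that order."""
    tokens = tokenize(strip_html(text))
    return remove_stop_words(lowercase(tokens))


terms = analyze("<div><span>The Architecture of Elasticsearch</span></div>")
print(terms)
```

The order matters: stop-word removal runs after lower-casing, so "The" is matched against the lower-case list.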
Inverted Index
The inverted index maps terms to the list of document IDs containing those terms, enabling rapid full‑text search across massive data sets.
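A minimal sketch of the idea, using plain dictionaries rather than Lucene's on-disk structures. The sample documents and the names `build_inverted_index` and `search_all` are assumptions for illustration, and the documents are taken to be already analyzed into terms.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, doc_terms in docs.items():
        for term in doc_terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index: Dict[str, List[int]], query_terms: List[str]) -> List[int]:
    """Return IDs of documents that contain every query term (AND semantics)."""
    if not query_terms:
        return []
    result = set(index.get(query_terms[0], []))
    for term in query_terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical documents, already split into terms.
docs = {
    1: ["distributed", "search", "engine"],
    2: ["search", "analytics"],
    3: ["distributed", "analytics"],
}
index = build_inverted_index(docs)
print(index["search"])
print(search_all(index, ["distributed", "search"]))
```

A lookup touches only the posting lists of the query terms, never the documents themselves, which is what keeps full-text search fast as the data set grows.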
Search Process Summary
Search is executed in the two‑phase query and fetch stages, leveraging distributed shards and replicas to achieve high throughput, low latency, and fault tolerance.
Mike Chen's Internet Architecture
Over ten years of BAT architecture experience, shared generously!