How Elasticsearch Delivers Lightning‑Fast Search with Inverted Indexes and Compression
This article explains how Elasticsearch uses inverted indexes, term dictionaries, and compression techniques such as Frame of Reference and Roaring Bitmaps to achieve rapid search while minimizing memory and disk usage. It closes with practical indexing tips for production use.
Search
Imagine searching a collection of ancient poems for verses containing the character "前". A traditional relational database would require a sequential scan using a SQL query like
<code>select name from poems where content like '%前%';</code>, which forces a full table scan, because a LIKE pattern with a leading wildcard cannot use a B‑tree index. This is slow and inefficient for large datasets.
Search Engine Principles
Search engines follow four main steps: crawling and stop‑word filtering, tokenization to extract keywords, building an inverted index from those keywords, and finally matching user queries against the index.
The core of Elasticsearch’s speed lies in its inverted index implementation, which is provided by the underlying Lucene library.
Inverted Index
After building the inverted index, a query for "前" can instantly locate matching poems. The index consists of three key concepts:
term: the keyword extracted from a document (called a "term" in ES).
postings list: the sorted list of IDs of the documents that contain a given term.
segment: each shard stores its data in multiple segments; each segment holds up to 2^31 documents, identified by unique integer IDs.
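As a rough sketch (in Python, with whitespace tokenization standing in for a real analyzer and made-up documents), building an inverted index looks like this:

```python
# Minimal sketch of an inverted index: map every term to the sorted
# postings list of document IDs that contain it. Whitespace
# tokenization stands in for a real analyzer.
def build_inverted_index(docs):
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):          # dedupe terms within a doc
            index.setdefault(term, []).append(doc_id)
    return index  # doc IDs arrive in increasing order, so lists stay sorted

docs = [
    "bright moon before my bed",
    "frost upon the ground",
    "raise my head to the bright moon",
]
index = build_inverted_index(docs)
print(index["moon"])   # -> [0, 2]
print(index["frost"])  # -> [1]
```

A query for a term now costs one dictionary lookup instead of a scan over every document.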
Terms are stored in a term dictionary, which is sorted so it can be binary‑searched, much like a B‑tree index in MySQL. Because the full dictionary is too large to keep in memory, a compact term index (implemented as a Finite State Transducer, FST) maps term prefixes to offsets in the on‑disk dictionary, enabling rapid lookup.
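A toy illustration of the idea (plain binary search over a sorted term list, with a simple first‑character table standing in for the FST; the terms are made up):

```python
import bisect

# Toy term dictionary: sorted terms support O(log n) binary search,
# and a tiny prefix table (a stand-in for the FST term index) lets a
# lookup jump straight to the right region of the dictionary.
terms = sorted(["apple", "apply", "banana", "band", "cherry"])
prefix_offsets = {t[0]: bisect.bisect_left(terms, t[0]) for t in terms}

def lookup(term):
    start = prefix_offsets.get(term[0], 0)        # jump via the prefix index
    i = bisect.bisect_left(terms, term, lo=start)
    return i if i < len(terms) and terms[i] == term else -1

print(lookup("band"))   # -> 3 (position in the dictionary)
print(lookup("grape"))  # -> -1 (not present)
```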
Techniques for Optimizing Postings Lists
Postings lists can become large, so Elasticsearch applies several optimizations:
Compression using Frame of Reference (FOR): postings are stored as sorted integer arrays and delta‑encoded, so consecutive document IDs are replaced by small gaps that need far fewer bits than full integers.
Further block‑level compression: postings are cut into blocks (typically 256 documents each), and each block stores a header recording the bit width needed for its deltas.
Use of Roaring Bitmaps for filter caches, enabling fast bitmap‑based AND/OR operations for frequent filter queries.
FOR compression replaces fixed‑width integers with a variable per‑block width of 1‑64 bits, chosen to fit the block's largest delta, dramatically shrinking disk usage.
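The delta‑plus‑bit‑width scheme can be sketched like this (the doc IDs are illustrative; a real implementation then bit‑packs the deltas, which this sketch omits):

```python
# Sketch of Frame-of-Reference (FOR) compression: store the gaps
# between consecutive doc IDs, plus the bit width needed by the
# largest gap in the block.
def for_encode(postings):
    deltas = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    bit_width = max(d.bit_length() for d in deltas)
    return bit_width, deltas

def for_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)   # prefix sums restore the original IDs
    return out

postings = [73, 300, 302, 332, 343, 372]
width, deltas = for_encode(postings)
print(width, deltas)   # -> 8 [73, 227, 2, 30, 11, 29]
assert for_decode(deltas) == postings
```

Here six doc IDs that would take 6 × 32 bits as plain integers fit in 6 × 8 bits plus one small header.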
Joint Queries
When multiple postings lists must be intersected, Elasticsearch first selects the shortest list and then uses a skip‑list algorithm to efficiently find common document IDs, skipping over non‑matching entries and avoiding unnecessary decompression of blocks.
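A simplified version of that intersection (binary search stands in for the skip pointers, and the lists are made up):

```python
import bisect

# Sketch of joint-query intersection: iterate the shortest postings
# list and binary-search ("skip") into the longer ones, instead of
# scanning every entry of every list.
def contains(postings, doc_id):
    i = bisect.bisect_left(postings, doc_id)
    return i < len(postings) and postings[i] == doc_id

def intersect(*lists):
    ordered = sorted(lists, key=len)              # shortest list drives the loop
    return [doc for doc in ordered[0]
            if all(contains(lst, doc) for lst in ordered[1:])]

a = [2, 5, 9, 14]
b = [1, 2, 3, 5, 8, 9, 13, 14, 21]
c = [2, 9, 14, 30]
print(intersect(a, b, c))  # -> [2, 9, 14]
```

Driving the loop from the shortest list bounds the number of probes, which is why Elasticsearch picks it first.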
Summary
Elasticsearch achieves fast search by:
Using an inverted index (term → dictionary → postings list) with FST‑compressed term dictionaries.
Applying FOR compression to postings lists to minimize disk space.
Caching filter results with Roaring Bitmaps for high‑frequency queries.
Leveraging skip‑list based intersection for joint queries when filter caches are unavailable.
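The Roaring Bitmap layout mentioned above can be sketched as follows (the 4096‑value threshold matches the published format; the container types are simplified and the doc IDs are made up):

```python
# Sketch of the Roaring Bitmap layout: split each 32-bit doc ID into a
# 16-bit "high" chunk key and a 16-bit "low" value. Chunks with more
# than 4096 values become bitmap containers; sparse chunks stay as
# plain sorted arrays.
def roaring_containers(doc_ids, threshold=4096):
    chunks = {}
    for doc_id in sorted(doc_ids):
        chunks.setdefault(doc_id >> 16, []).append(doc_id & 0xFFFF)
    return {high: ("bitmap", set(lows)) if len(lows) > threshold else ("array", lows)
            for high, lows in chunks.items()}

containers = roaring_containers([3, 70000, 70001, 200000])
print(containers)
# -> {0: ('array', [3]), 1: ('array', [4464, 4465]), 3: ('array', [3392])}
```

AND/OR between two such bitmaps only has to touch chunks whose keys match, which is what makes cached filters cheap to combine.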
Elasticsearch Indexing Guidelines
Explicitly disable indexing for fields that do not need it, as ES indexes everything by default.
For string fields that do not require full‑text analysis, map them as not_analyzed (in Elasticsearch 5.0 and later, the keyword type).
Avoid highly random IDs (e.g., UUIDs); use monotonically increasing or patterned IDs to improve query performance.
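Applied to a hypothetical poems index, these guidelines translate into a mapping roughly like the following (shown as a Python dict; the field names are made up, and the keyword type replaces not_analyzed in Elasticsearch 5.0+):

```python
import json

# Hedged sketch of a mapping that applies the guidelines above (the
# index and field names are made up). "keyword" skips analysis, and
# "index": false disables indexing for a field entirely.
mapping = {
    "mappings": {
        "properties": {
            "content":  {"type": "text"},                  # analyzed full-text field
            "author":   {"type": "keyword"},               # exact match, no analysis
            "raw_blob": {"type": "text", "index": False},  # never searched: skip indexing
        }
    }
}
print(json.dumps(mapping, indent=2))  # body for e.g. PUT /poems
```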
Understanding these mechanisms helps you compare Elasticsearch's indexing strategy with traditional databases like MySQL: B‑tree indexes balance reads and writes, while ES accepts immutable segments and heavier indexing costs to optimize read‑heavy search workloads.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career as we grow together.