How Elasticsearch Achieves Fast Retrieval: Inverted Index, Term Dictionary, and Compression Techniques
This article explains how Elasticsearch leverages Lucene's inverted index, term dictionary, term index, and compression methods such as Frame‑of‑Reference and Roaring Bitmaps to enable rapid search, efficient storage, and fast set operations for large‑scale data retrieval.
Recently I have been working on several projects that use Elasticsearch (ES) for data storage and search analytics, so I spent some time studying how it works internally. This article is adapted from a technical sharing session.
The article does not discuss ES's distributed architecture or API usage; it focuses on the topic of "how ES performs fast retrieval," which was my main interest before learning ES.
About Search
Consider a scenario where we need to find ancient poems containing the character "前". Using a traditional relational database such as MySQL, we would write a SQL query like:
select name from poems where content like "%前%";

This LIKE query forces a sequential scan over every record, which is inefficient and does not meet typical search expectations.
Search engines, like the one behind ES, are designed to handle such queries efficiently.
Search Engine Principles
The basic steps of a search engine are:
Content crawling and stop‑word filtering.
Tokenization to extract keywords.
Building an inverted index based on the keywords.
The user enters keywords and the engine searches against the index.
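The indexing steps above can be sketched in a few lines of Python. This is a toy illustration only: the whitespace tokenizer and the stop-word set are placeholders, not Lucene's analyzers.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of"}  # toy stop-word list, not a real analyzer's

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():   # naive whitespace tokenization
            if token not in STOP_WORDS:      # stop-word filtering
                index[token].add(doc_id)
    # sorted doc IDs per term -- this is the postings list
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "quiet night thoughts", 2: "spring dawn", 3: "night rain"}
index = build_inverted_index(docs)
print(index["night"])  # -> [1, 3]
```

A query for "night" now reads one postings list instead of scanning every document, which is the whole point of the inverted index.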
Elasticsearch is essentially a wrapper around Lucene; the implementation details of the inverted index are provided by Lucene's APIs.
Inverted Index
After building an inverted index, a query like "前" can directly locate matching poems without scanning the entire dataset.
The core components are:
Term: the keyword (called a "term" in ES).
Postings list: the list of document IDs that contain a given term.
Term dictionary: an ordered list of all terms, enabling binary search.
Term index: a trie-like structure (implemented as an FST) that maps term prefixes to offsets in the term dictionary, allowing fast lookup.
Lucene stores the term dictionary on disk in blocks with prefix compression, and keeps the term index in memory using a Finite State Transducer (FST), which provides small memory footprint and O(len(str)) lookup time.
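The two-level lookup (term index in memory, term dictionary on disk) can be sketched as follows. This is not Lucene's FST; the "index" here is just a dict mapping a one-character prefix to the offset of its first term, which stands in for the FST's prefix-to-offset mapping.

```python
import bisect

# Sorted term dictionary (in Lucene: on disk, in prefix-compressed blocks).
term_dict = ["cat", "catalog", "cattle", "dog", "dogma", "door"]

# Toy "term index": first-character prefix -> offset of the first term
# with that prefix. A stand-in for Lucene's in-memory FST.
term_index = {}
for i, term in enumerate(term_dict):
    term_index.setdefault(term[0], i)

def lookup(term):
    """Jump to the right region via the index, then binary-search it."""
    start = term_index.get(term[0], 0)
    pos = bisect.bisect_left(term_dict, term, lo=start)
    return pos if pos < len(term_dict) and term_dict[pos] == term else -1

print(lookup("dog"))  # -> 3 (index jumps straight past all the "c" terms)
```

The real FST is far more compact than a dict and maps arbitrary-length prefixes, but the shape of the lookup, index first, then a narrow search in the dictionary, is the same.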
Compression Techniques
To reduce the storage cost of postings lists, Lucene employs two main compression methods.
Frame of Reference (FOR)
Postings are stored as ordered integer arrays. By delta‑encoding (storing the difference between successive IDs) each ID becomes small, often fitting in a single byte. Documents are grouped into blocks of 256; each block records the maximum bit width needed for its IDs in a header, allowing the IDs to be packed into the minimal number of bits.
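A minimal sketch of the delta-encoding step: compute the gaps between successive doc IDs and the bit width each block needs. (Real Lucene additionally bit-packs the deltas on disk; here we only report the width.)

```python
def for_encode(doc_ids, block_size=256):
    """Delta-encode sorted doc IDs and record each block's bit width
    (a sketch of Frame of Reference; the actual bit packing is omitted)."""
    blocks = []
    prev = 0
    for i in range(0, len(doc_ids), block_size):
        block = doc_ids[i:i + block_size]
        deltas = []
        for doc_id in block:
            deltas.append(doc_id - prev)  # gap to the previous ID
            prev = doc_id
        # header value: bits needed to store the largest delta in the block
        width = max(d.bit_length() for d in deltas)
        blocks.append((width, deltas))
    return blocks

ids = [73, 300, 302, 332, 343, 372]
print(for_encode(ids))
# -> [(8, [73, 227, 2, 30, 11, 29])]: every gap fits in 8 bits,
#    versus 32 bits per raw doc ID
```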
Roaring Bitmaps (for filter cache)
Filters cache the result sets of frequent queries. Instead of storing a full bitmap (which is memory‑intensive for sparse data), Lucene uses Roaring Bitmaps, which split the 32‑bit document IDs into high‑16‑bit and low‑16‑bit parts. The high bits act as keys, and the low bits are stored in either an ArrayContainer (for < 4096 entries) or a BitmapContainer (for larger sets), achieving good compression while retaining fast bitwise operations.
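The high/low split can be sketched like this. It is a simplification: the containers are plain Python sets tagged with a kind, and only the 4096-entry threshold check is modeled, not the actual array/bitmap storage or demotion logic.

```python
def roaring_add(containers, doc_id):
    """Insert a 32-bit doc ID: the high 16 bits pick the container,
    the low 16 bits are stored inside it (simplified sketch)."""
    high, low = doc_id >> 16, doc_id & 0xFFFF
    container = containers.setdefault(high, {"kind": "array", "values": set()})
    container["values"].add(low)
    # Past 4096 entries, a sorted array of 16-bit values is no longer
    # smaller than a fixed 8 KB bitmap, so Roaring switches container types.
    if container["kind"] == "array" and len(container["values"]) > 4096:
        container["kind"] = "bitmap"
    return containers

containers = {}
roaring_add(containers, 65538)   # high = 1, low = 2
print(containers[1]["values"])   # -> {2}
```

The 4096 cutoff is where the two representations break even: 4096 values x 16 bits = 8 KB, exactly the size of a 65536-bit bitmap.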
Union Queries
When multiple postings lists need to be intersected (AND) or united (OR), ES first checks if a filter cache is available. If so, the cached bitmap is used directly. Otherwise, a skip‑list algorithm traverses the postings lists, selecting the shortest list and skipping over non‑matching IDs, which also avoids decompressing unnecessary blocks.
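The skip idea for an AND query can be sketched as intersecting two sorted postings lists, driving from the shorter one and advancing a cursor in the longer. Real Lucene uses multi-level skip lists over compressed blocks (via DocIdSetIterator.advance), which is what lets it jump over blocks without decompressing them; this sketch only shows the traversal pattern.

```python
def intersect(short_list, long_list):
    """AND two sorted postings lists, skipping ahead in the longer one."""
    result, j = [], 0
    for doc_id in short_list:
        # advance the cursor until long_list[j] >= doc_id
        while j < len(long_list) and long_list[j] < doc_id:
            j += 1
        if j < len(long_list) and long_list[j] == doc_id:
            result.append(doc_id)
    return result

print(intersect([2, 7, 13], [1, 2, 3, 5, 7, 8, 13, 21]))  # -> [2, 7, 13]
```

Starting from the shortest list bounds the result size immediately, and because both lists are sorted, each cursor only ever moves forward.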
Summary
ES uses an inverted index (term index → term dictionary → postings list) to locate target documents quickly, trading higher memory usage for dramatically improved search speed.
Term dictionary and term index enable fast term lookup while minimizing memory and disk I/O.
FOR compression reduces the storage size of postings lists and speeds up queries.
Roaring Bitmaps cache filter results efficiently, balancing space and speed.
For union queries, ES leverages cached bitmaps when possible; otherwise it uses skip‑list traversal to intersect postings lists while skipping unnecessary decompression.
When configuring ES, explicitly disable indexing for fields that do not need it, define non‑analyzed string fields, and choose IDs with low randomness to improve query performance.
Overall, the article demonstrates how Lucene implements an inverted index, optimizes memory and disk usage, and employs clever bit‑level tricks to accelerate search in Elasticsearch.
Author: Richard_Yi Source: https://juejin.cn/post/6889020742366920712