Understanding Lucene Inverted Index: Principles and Implementation
This article explains the concept of inverted indexes, their role in full‑text search, and provides a detailed overview of how Apache Lucene implements inverted indexing, including term dictionaries, posting lists, query processing, and numeric handling with BKDTree.
1 Introduction
Lucene, an Apache open‑source search library, powers Solr and Elasticsearch; our team uses it to build an index for hotel, ticket and large‑scale search services.
The core of Lucene’s power is the inverted index.
This article explains the concept of inverted indexes and how Lucene implements them.
2 Basic Principles
2.1 What is an Inverted Index
Full‑text search requires locating words in many documents. Traditional relational databases use LIKE , which forces full table scans and cannot provide relevance scoring.
Cannot use DB indexes → poor performance
Only simple pattern matching → limited search capabilities
No relevance scores
Inverted index stores each term (word) as a key and the list of document IDs containing that term as the value, opposite to a forward index that maps document IDs to their content.
Example generation steps with two sample documents are shown, including a forward index table and the resulting inverted index table.
select * from hotel_table where hotel_name like '%公寓%';2.2 Inverted Index Structure
The index consists of a Term Dictionary (the term table) and a Postings List (the record table). Each posting may contain docId, term frequency (TF), position, and offset.
DocId – unique identifier of a document
TF – number of occurrences of the term in the document
Position – token positions for phrase queries
Offset – start/end character offsets for highlighting
3 Lucene Inverted Index Implementation
3.1 Posting List Implementation
Lucene splits the postings into three files: .doc (docId and TF), .pay (payload and offset), and .pos (position). Most queries only need the .doc file; positional queries also read .pos and .pay.
The .doc file stores a series of TermFreqs and SkipData structures. TermFreqs hold docId/TF pairs, compressed with PackedBlock (using PackedInts) for dense blocks and VIntBlock for sparse blocks.
SkipData is a skip‑list that enables fast jumps over large docId lists, reducing the cost of intersecting posting lists.
3.2 Term Dictionary Implementation
The dictionary is stored in .tim files using NodeBlock compression with shared prefixes. An FST (Finite State Transducer) index ( .tip ) maps term prefixes to file pointers, allowing O(length) lookup and supporting automaton‑based queries such as wildcards and regex.
3.3 Query Process
Obtain the FST for the field from the Term Index.
Use the FST to locate the block that may contain the target term.
Load the block, scan entries, and verify the term suffix.
If found, retrieve posting pointers (doc, pos, pay) from the entry.
Traverse TermFreqs for all docIds or use SkipData for targeted docId access.
4 Numeric Types in Lucene
String‑based inverted indexes are inefficient for numeric or multidimensional data. Lucene introduces BKDTree, a hybrid of KD‑Tree and B+‑Tree, to index numeric fields and support fast range and nearest‑neighbor queries.
5 Conclusion
The article introduced the concept and structure of inverted indexes, detailed Lucene’s Term Dictionary and Posting List designs, explained query execution, and described numeric handling via BKDTree. Lucene’s compact storage and clever data structures make it a valuable reference for building high‑performance search systems.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.