Fundamentals 15 min read

Understanding Lucene Inverted Index: Principles and Implementation

This article explains the concept of inverted indexes, their role in full‑text search, and provides a detailed overview of how Apache Lucene implements inverted indexing, including term dictionaries, posting lists, query processing, and numeric handling with BKDTree.

Qunar Tech Salon

Feb 4, 2021

Understanding Lucene Inverted Index: Principles and Implementation

1 Introduction

Lucene, an Apache open‑source search library, powers Solr and Elasticsearch; our team uses it to build an index for hotel, ticket and large‑scale search services.

The core of Lucene’s power is the inverted index.

This article explains the concept of inverted indexes and how Lucene implements them.

2 Basic Principles

2.1 What is an Inverted Index

Full‑text search requires locating words in many documents. Traditional relational databases use LIKE, which forces full table scans and cannot provide relevance scoring.

Cannot use DB indexes → poor performance

Only simple pattern matching → limited search capabilities

No relevance scores

Inverted index stores each term (word) as a key and the list of document IDs containing that term as the value, opposite to a forward index that maps document IDs to their content.

Example generation steps with two sample documents are shown, including a forward index table and the resulting inverted index table.

<code>select * from hotel_table where hotel_name like '%公寓%';</code>

2.2 Inverted Index Structure

The index consists of a Term Dictionary (the term table) and a Postings List (the record table). Each posting may contain docId, term frequency (TF), position, and offset.

DocId – unique identifier of a document

TF – number of occurrences of the term in the document

Position – token positions for phrase queries

Offset – start/end character offsets for highlighting

3 Lucene Inverted Index Implementation

3.1 Posting List Implementation

Lucene splits the postings into three files: .doc (docId and TF), .pay (payload and offset), and .pos (position). Most queries only need the .doc file; positional queries also read .pos and .pay.

The .doc file stores a series of TermFreqs and SkipData structures. TermFreqs hold docId/TF pairs, compressed with PackedBlock (using PackedInts) for dense blocks and VIntBlock for sparse blocks.

SkipData is a skip‑list that enables fast jumps over large docId lists, reducing the cost of intersecting posting lists.

3.2 Term Dictionary Implementation

The dictionary is stored in .tim files using NodeBlock compression with shared prefixes. An FST (Finite State Transducer) index ( .tip) maps term prefixes to file pointers, allowing O(length) lookup and supporting automaton‑based queries such as wildcards and regex.

3.3 Query Process

Obtain the FST for the field from the Term Index.

Use the FST to locate the block that may contain the target term.

Load the block, scan entries, and verify the term suffix.

If found, retrieve posting pointers (doc, pos, pay) from the entry.

Traverse TermFreqs for all docIds or use SkipData for targeted docId access.

4 Numeric Types in Lucene

String‑based inverted indexes are inefficient for numeric or multidimensional data. Lucene introduces BKDTree, a hybrid of KD‑Tree and B+‑Tree, to index numeric fields and support fast range and nearest‑neighbor queries.

5 Conclusion

The article introduced the concept and structure of inverted indexes, detailed Lucene’s Term Dictionary and Posting List designs, explained query execution, and described numeric handling via BKDTree. Lucene’s compact storage and clever data structures make it a valuable reference for building high‑performance search systems.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

search engine lucene inverted index Term Dictionary BKDTree Posting List

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.