Introduction to Whoosh: A Lightweight Python Search Library with Example Code

This article introduces Whoosh, a lightweight Python search library. It outlines the library's features, shows with example code how to define a schema, build an index, and run queries, and relates its concepts to those of larger search engines. Whoosh is a good fit for small search projects.

Python Programming Learning Circle

May 25, 2024

Whoosh is a lightweight search engine library written in pure Python. Matt Chaput originally created it for the Houdini documentation, and it has since become a mature open-source tool that supports both Python 2 and 3. Its main advantages are that it needs no compilation, ranks results with Okapi BM25F by default, produces small index files, works with Unicode text throughout (all indexed text must be Unicode), and can store arbitrary Python objects alongside indexed documents.
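To give a feel for what BM25-style ranking rewards, here is a minimal sketch of textbook single-field BM25 in plain Python. It is not Whoosh's implementation: BM25F additionally weights matches across multiple fields, and Whoosh's exact IDF formula may differ. The function name `bm25_score`, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with textbook BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        # Document frequency: how many documents contain the term
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        # Rare terms get a higher inverse document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: long documents are penalized via b
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Toy corpus of already-tokenized documents (illustrative only)
corpus = [
    ["bright", "moon", "over", "the", "mountain"],
    ["moon", "moon", "river"],
    ["spring", "wind", "and", "rain"],
]
scores = [bm25_score(["moon"], doc, corpus) for doc in corpus]
```

The second document mentions the term twice and is shorter than average, so it scores highest; the third does not contain the term and scores zero.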

The core concepts of Whoosh revolve around building an index (defining fields and storing documents) and executing queries (searching those fields). Users familiar with Elasticsearch will find the mapping and query ideas similar.

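The example below defines a schema with four fields (title, dynasty, poet, content). The two TEXT fields, title and content, use jieba's ChineseAnalyzer for Chinese word segmentation, while the two ID fields, dynasty and poet, are each indexed as a single untokenized value: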

# -*- coding: utf-8 -*-
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer
import json

# Create the schema; stored=True keeps a field's value in the index
# so that it can be returned with search results
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer())
)

To build the index, the script reads poem.csv, parses each line into the four fields, creates the index directory, and adds each document:

# Parse poem.csv, keeping only lines that split into exactly four comma-separated fields
with open('poem.csv', 'r', encoding='utf-8') as f:
    texts = [line.strip().split(',') for line in f if len(line.strip().split(',')) == 4]

indexdir = 'indexdir/'
os.makedirs(indexdir, exist_ok=True)
# create_in starts a fresh index and clears any index already in this directory
ix = create_in(indexdir, schema)

writer = ix.writer()
# Skip texts[0], presumably the CSV header row
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()
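Splitting each line on commas silently drops any row whose content itself contains an ASCII comma. As a hedged alternative, the parsing step could use the standard-library `csv` module, which understands quoted fields. The sketch below is an assumption-laden illustration: the helper name `read_poems`, the header names, and the sample rows are hypothetical stand-ins for the real poem.csv, whose exact layout the article does not show.

```python
import csv
import io

# Hypothetical stand-in for poem.csv; the header names are an assumption.
sample = io.StringIO(
    'title,dynasty,poet,content\n'
    '静夜思,唐,李白,"床前明月光,疑是地上霜。"\n'
    'broken row,唐\n'
)


def read_poems(fileobj):
    """Yield (title, dynasty, poet, content) tuples from a CSV file object."""
    reader = csv.reader(fileobj)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) == 4:  # ignore malformed rows, as the original script does
            yield tuple(field.strip() for field in row)


poems = list(read_poems(sample))
```

With a real file, the same helper would be called as `read_poems(open('poem.csv', encoding='utf-8', newline=''))`, and each tuple could be passed to `writer.add_document` as before.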

After the index is created, a searcher can be used to find documents containing a specific term, such as "明月" ("bright moon") in the content field. The find method parses the query string against the given default field and runs the search:

# Open a searcher; the with-block closes it when done
with ix.searcher() as searcher:
    # Search for '明月' in the content field
    results = searcher.find("content", "明月")
    print('Found %d documents in total.' % len(results))
    # find() returns at most 10 hits by default, so this prints the top 10
    for hit in results:
        print(json.dumps(hit.fields(), ensure_ascii=False))

The output shows the total number of matching documents (44 in the example) and prints the first ten matching poems with their title, dynasty, poet, and content.
