Introduction to Whoosh: A Lightweight Python Search Library with Example Code
This article introduces Whoosh, a lightweight Python search library. It outlines the library's features, explains how to define a schema, create an index, and run queries with example code, and compares Whoosh to larger search engines. Whoosh is a good fit for small search projects.
Whoosh is a pure‑Python, lightweight search engine library created by Matt Chaput, originally for Houdini documentation and now a mature open‑source tool supporting both Python 2 and 3. Its main advantages include no compilation required, default Okapi BM25F ranking, small index files, Unicode‑encoded indexes, and the ability to store arbitrary Python objects.
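To give a feel for what BM25-style ranking does, here is a minimal sketch of plain, single-field Okapi BM25 in pure Python. It is not Whoosh's implementation: BM25F additionally weights several fields, and the function name `bm25_scores`, the whitespace tokenizer, the IDF variant, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with plain (single-field) BM25.

    Illustrative sketch only: whitespace tokenization, no stemming,
    and assumed values for k1 and b.
    """
    if not docs:
        return []

    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: long documents are penalized via b
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["moon moon river", "moon over the quiet river tonight", "spring rain"]
for doc, score in zip(docs, bm25_scores("moon", docs)):
    print(f"{score:.3f}  {doc}")
```

In this sketch the short document that repeats the query term ranks above the longer one that mentions it once, and a document without the term scores zero.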
The core concepts of Whoosh revolve around building an index (defining fields and storing documents) and executing queries (searching those fields). Users familiar with Elasticsearch will find the mapping and query ideas similar.
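The two phases can be sketched with a toy inverted index in plain Python. This is only a conceptual illustration under simple assumptions (whitespace tokenization, AND semantics for multi-word queries); the helper names `build_index` and `search` are hypothetical and do not reflect how Whoosh stores or queries data internally.

```python
from collections import defaultdict


def build_index(docs):
    """Indexing phase: map each lowercase token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Query phase: return the ids of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Copy the first posting set so the index itself is never modified
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up sample documents for illustration
docs = {
    1: "bright moon over the river",
    2: "spring rain on the river",
    3: "bright lantern at night",
}
index = build_index(docs)
print(sorted(search(index, "bright")))        # [1, 3]
print(sorted(search(index, "river bright")))  # [1]
```

A real search library adds analysis (tokenizing, stemming, stop words), on-disk storage, and relevance scoring on top of this basic lookup structure.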
Below is a minimal example that defines a schema with four fields (title, dynasty, poet, content), using the ChineseAnalyzer from jieba for the two text fields:
# -*- coding: utf-8 -*-
import json
import os

from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from jieba.analyse import ChineseAnalyzer

# Create the schema; stored=True makes a field's value retrievable from search results
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer())
)

To build the index, the script reads poem.csv, parses each line into the four fields, creates the index directory, and adds each document:
# Parse poem.csv, keeping only lines that split into exactly four fields
with open('poem.csv', 'r', encoding='utf-8') as f:
    rows = [line.strip().split(',') for line in f]
    texts = [row for row in rows if len(row) == 4]

# Create the index directory if it does not exist yet
indexdir = 'indexdir/'
if not os.path.exists(indexdir):
    os.mkdir(indexdir)

# Create the index and add one document per row
ix = create_in(indexdir, schema)
writer = ix.writer()
# Start at the second row; the first row is skipped (presumably the CSV header)
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()

After the index is created, a searcher can be used to find documents containing a specific term, such as the Chinese word "明月" ("bright moon") in the content field:
# Create a searcher
searcher = ix.searcher()
# Search for '明月' in the content field
results = searcher.find("content", "明月")
print('Found %d documents in total.' % len(results))
# Print at most the first ten hits as JSON, keeping Chinese characters readable
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))

The output reports the total number of matching documents (44 in the example), followed by the first ten matching poems with their title, dynasty, poet, and content.