Using Whoosh for Lightweight Full-Text Search in Python
This article introduces Whoosh, a lightweight Python search library. It covers the library's main features, how to define a schema, how to build an index from a CSV dataset, and how to run full‑text queries, with example code.
Whoosh is a pure‑Python search engine library created by Matt Chaput. It is lightweight, supports Python 2 and 3, uses the BM25F ranking algorithm by default, produces fairly small indexes, requires indexed text to be Unicode, and can store arbitrary Python objects alongside indexed documents.
The official documentation is at https://whoosh.readthedocs.io/en/latest/intro.html. Compared with heavyweight solutions like Elasticsearch or Solr, Whoosh is easier to set up for small‑scale search projects.
Index & query – Similar to Elasticsearch, Whoosh requires defining a mapping (schema) and then performing queries. The following example shows how to build an index from a CSV file of poems and query the content field.
Data
The example uses poem.csv, a CSV file containing four columns: title, dynasty, poet, and content.
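For orientation, here is a minimal sketch of reading such a file with the standard `csv` module. It assumes the file is UTF-8 encoded and that its first row is a header naming the four columns. The helper name `load_poems` is hypothetical and not part of the original article.

```python
import csv

EXPECTED_COLUMNS = ["title", "dynasty", "poet", "content"]


def load_poems(path):
    """Return a list of dicts, one per well-formed data row of the CSV file."""
    poems = []
    with open(path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed header row
        for row in reader:
            # Skip blank or malformed rows instead of failing on them.
            if len(row) != len(EXPECTED_COLUMNS):
                continue
            cells = (cell.strip() for cell in row)
            poems.append(dict(zip(EXPECTED_COLUMNS, cells)))
    return poems
```

Unlike a plain `split(',')`, `csv.reader` also copes with quoted cells that contain commas, which may matter if the poem text is ever quoted.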
Fields
Define a schema with four fields, using TEXT for searchable text (with a Chinese analyzer) and ID for exact values.
<code># -*- coding: utf-8 -*-
import json
import os

from jieba.analyse import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in

# Create the schema. stored=True keeps the field's value in the index
# so that it can be returned with search results.
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer()),
)
</code>
Create index files
Parse poem.csv, skip its header row, write the remaining rows as documents into an index directory indexdir/, and commit the writer.
<code># Parse poem.csv, keeping only rows that split into exactly four columns.
with open('poem.csv', 'r', encoding='utf-8') as f:
    texts = [line.strip().split(',') for line in f]
    texts = [row for row in texts if len(row) == 4]

# Create the index directory and store the schema in it.
indexdir = 'indexdir/'
os.makedirs(indexdir, exist_ok=True)
ix = create_in(indexdir, schema)

# Add one document per row, following the schema; texts[0] is the header row.
writer = ix.writer()
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()
</code>
After committing, the indexdir folder contains the generated index files.
Query
Open a searcher and find documents where the content field contains the term "明月". The example prints the total number of matches and the first ten results.
<code># Create a searcher.
searcher = ix.searcher()
# Find documents whose content field contains '明月'.
results = searcher.find("content", "明月")
print('Found %d documents in total.' % len(results))
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))
searcher.close()
</code>
The output reports 44 matching poems and prints the first ten as JSON objects with the title, dynasty, poet, and content fields.