Using Whoosh for Lightweight Full-Text Search in Python

This article introduces Whoosh, a lightweight Python search library, outlines its features, and shows with example code how to define a schema, build an index from a CSV dataset, and run full‑text queries.

Python Programming Learning Circle

Jun 28, 2022

This article provides a concise introduction to Whoosh, a pure‑Python search engine library created by Matt Chaput. Whoosh is lightweight, runs on Python 2 and 3, ranks results with the BM25F algorithm by default, produces fairly small index files, requires indexed text to be Unicode, and can store arbitrary Python objects alongside indexed documents.
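To give a feel for what BM25F ranking rewards, here is a simplified single‑field BM25 term score in plain Python. BM25F extends this idea to several weighted fields. This is a sketch of the general formula, not Whoosh's actual implementation: the function name `bm25_term_score` and the sample numbers are assumptions, and `k1=1.2`, `b=0.75` are the commonly used default parameters.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with simplified BM25.

    tf: occurrences of the term in the document
    doc_len / avg_doc_len: document length and average length, in tokens
    n_docs: documents in the collection; doc_freq: documents containing the term
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates, and longer documents are penalized through b.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# Hypothetical numbers: the same term count in a short and a long document.
short_doc = bm25_term_score(tf=2, doc_len=20, avg_doc_len=40, n_docs=1000, doc_freq=50)
long_doc = bm25_term_score(tf=2, doc_len=80, avg_doc_len=40, n_docs=1000, doc_freq=50)
print(short_doc > long_doc)
```

With equal term counts, the shorter document scores higher, which is the length normalization that distinguishes BM25‑style ranking from a raw term count.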

The official documentation is at https://whoosh.readthedocs.io/en/latest/intro.html . Compared with heavyweight solutions like Elasticsearch or Solr, Whoosh is easier to set up for small‑scale search projects.

Index & query

As in Elasticsearch, you first define a schema (what Elasticsearch calls a mapping), index documents against it, and then run queries. The following example builds an index from a CSV file of poems and queries the content field.

Data

The example uses poem.csv, a CSV file containing four columns: title, dynasty, poet, and content.
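The dataset itself is not reproduced in the article, so the following is only a minimal sketch of the expected layout. The two sample rows and the helper name `load_poems` are illustrative assumptions, not rows taken from the original poem.csv.

```python
import csv
import io

# Hypothetical sample in the four-column layout described above.
# These rows are illustrative and are not taken from the article's poem.csv.
SAMPLE_CSV = (
    "title,dynasty,poet,content\n"
    "静夜思,唐,李白,床前明月光，疑是地上霜。举头望明月，低头思故乡。\n"
    "春晓,唐,孟浩然,春眠不觉晓，处处闻啼鸟。夜来风雨声，花落知多少。\n"
)

def load_poems(stream):
    """Read poems from a CSV stream into a list of dicts keyed by column name."""
    return list(csv.DictReader(stream))

poems = load_poems(io.StringIO(SAMPLE_CSV))
for poem in poems:
    print(poem["title"], poem["dynasty"], poem["poet"], len(poem["content"]))
```

The verses use full‑width commas (，), which the CSV parser does not treat as delimiters. A poem containing an ASCII comma would need to be quoted in the file.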

Fields

Define a schema with four fields, using TEXT for searchable text (with a Chinese analyzer) and ID for exact values.

# -*- coding: utf-8 -*-
import json
import os

from jieba.analyse import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in

# Create the schema. stored=True keeps the field's value in the index
# so that it can be returned with search results.
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer()),
)

Create index files

Parse poem.csv, write the documents into an index directory indexdir/, and commit the writer.

import csv

# Parse poem.csv, keeping only rows that have exactly four fields
with open('poem.csv', 'r', encoding='utf-8', newline='') as f:
    texts = [row for row in csv.reader(f) if len(row) == 4]

# Create the index directory and store the schema in it
indexdir = 'indexdir/'
os.makedirs(indexdir, exist_ok=True)
ix = create_in(indexdir, schema)

# Add each poem to the index according to the schema,
# skipping the first row (assumed to be the header)
writer = ix.writer()
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()

After committing, the indexdir folder contains the generated index files.

Query

Open a searcher and find documents where the content field contains the term "明月". The example prints the total number of matches and the first ten results.

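# Create a searcher
searcher = ix.searcher()

# Find documents whose content field contains '明月'
results = searcher.find("content", "明月")
print('Found %d documents in total.' % len(results))
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))

# Release the searcher's resources when done
searcher.close()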

The output shows 44 matching poems, with fields such as title, dynasty, poet, and content displayed in JSON format.
