Introduction to Whoosh: A Lightweight Python Search Library with Example Code

This article introduces the lightweight Python search library Whoosh, outlines its key features, demonstrates how to define a schema, create an index from a CSV dataset, and perform full‑text queries with example code, making it suitable for small‑scale search projects.

Python Programming Learning Circle

Mar 25, 2023

Whoosh was created by Matt Chaput, initially to provide simple, fast search for Houdini 3D documentation, and has since grown into a mature, open‑source search solution written entirely in Python, supporting both Python 2 and Python 3.

Key advantages include: pure‑Python implementation requiring only a Python environment; default use of the Okapi BM25F ranking algorithm with support for alternatives; compact index files; Unicode‑encoded index data; and the ability to store arbitrary Python objects.
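To give a feel for what a BM25-style ranking function computes, here is a minimal pure-Python sketch of the classic single-field Okapi BM25 score for one query term. It is not Whoosh's implementation: the function name `bm25_term_score`, the particular IDF variant, and the sample numbers are assumptions chosen for illustration, and BM25F additionally weights scores per field.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document (hypothetical helper).

    tf          -- how often the term occurs in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents that contain the term
    k1, b       -- commonly used BM25 tuning constants
    """
    # One common IDF variant (an assumption; libraries differ in the exact form)
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: longer-than-average documents are penalised
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


# Same term frequency, different document lengths (made-up numbers)
short_doc = bm25_term_score(tf=2, doc_len=20, avg_doc_len=40, n_docs=1000, doc_freq=44)
long_doc = bm25_term_score(tf=2, doc_len=80, avg_doc_len=40, n_docs=1000, doc_freq=44)
print(short_doc > long_doc)
```

The shorter document scores higher for the same term frequency, which is the length normalisation at work. In Whoosh itself, an alternative scorer can be selected through the `weighting` argument of `ix.searcher()`, for example `ix.searcher(weighting=scoring.TF_IDF())`.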

The official introduction is at https://whoosh.readthedocs.io/en/latest/intro.html. Compared with heavyweight engines such as Elasticsearch or Solr, Whoosh is lighter and easier to use, making it a good choice for small‑scale search projects.

In Whoosh, the two core aspects of search are the index (mapping) and the query, similar to the corresponding concepts in Elasticsearch. Building an index defines how each field is stored and analyzed, while a query parses the search string and applies a ranking algorithm to retrieve results.
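The index/query split can be illustrated with a toy inverted index in plain Python. This is only a conceptual sketch, not how Whoosh stores data: the helpers `build_index` and `search`, the whitespace tokenizer, and the sample documents are all hypothetical, and no ranking is applied.

```python
from collections import defaultdict


def build_index(docs):
    """Indexing step: map every token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Query step: return ids of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up sample documents
docs = {
    1: "bright moon over the river",
    2: "moon and frost",
    3: "spring river at dawn",
}
index = build_index(docs)
moon_docs = search(index, "moon")
river_moon_docs = search(index, "river moon")
print(sorted(moon_docs), sorted(river_moon_docs))
```

A real engine adds analysis (tokenization, lowercasing, stop words) at indexing time and scoring at query time, which is what the schema and the ranking algorithm control in Whoosh.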

The example dataset used is poem.csv, a CSV file containing four columns: title, dynasty, poet, and content. The first ten rows of the file are shown in the original article.

Schema definition (Python code):

# -*- coding: utf-8 -*-
import json
import os

from jieba.analyse import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in

# Create the schema. stored=True keeps the original field value in the index
# so that it can be returned with the search results.
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer())
)

In this schema, ID fields store a single token (useful for paths, URLs, dates, categories) while TEXT fields store full text and support tokenized search; the ChineseAnalyzer provides Chinese word segmentation.

Index creation (Python code):

# Parse poem.csv, keeping only lines that split into exactly four fields
with open('poem.csv', 'r', encoding='utf-8') as f:
    texts = [line.strip().split(',') for line in f]
texts = [fields for fields in texts if len(fields) == 4]

# Create the index directory and an empty index that uses the schema
indexdir = 'indexdir/'
os.makedirs(indexdir, exist_ok=True)
ix = create_in(indexdir, schema)

# Add one document per data row, following the schema (texts[0] is the header row)
writer = ix.writer()
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()

After running this code, an indexdir directory is created containing the index files for all fields of poem.csv.
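The parsing step in the article splits each line on commas, which silently drops any row whose text contains an ASCII comma. If the dataset might contain quoted fields, the standard `csv` module is a safer choice. The sketch below reads from an in-memory string instead of poem.csv; the sample rows are invented for illustration and only the column names come from the article.

```python
import csv
import io

# Made-up sample in the same four-column layout as poem.csv
sample = io.StringIO(
    'title,dynasty,poet,content\n'
    'Quiet Night Thought,Tang,Li Bai,"Bright moonlight before my bed, like frost on the ground"\n'
    'Broken row,Tang\n'
)

rows = []
for record in csv.DictReader(sample):
    # DictReader fills missing columns with None and collects surplus
    # columns under the key None; skip rows of either kind.
    if None in record or None in record.values():
        continue
    rows.append(record)

print(len(rows), rows[0]["poet"])
```

Each surviving `record` is a dict keyed by column name, so it could be passed to the writer as `writer.add_document(**record)` when the columns match the schema fields.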

Querying the index (Python code):

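# Open a searcher; the with block closes it when the search is done
with ix.searcher() as searcher:
    # Find documents whose content contains '明月' (bright moon)
    results = searcher.find("content", "明月")
    print('Found %d documents in total.' % len(results))
    # find() returns at most 10 scored hits by default
    for hit in results:
        print(json.dumps(hit.fields(), ensure_ascii=False))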

The sample output shows that 44 documents contain the term "明月", and the first ten matching poems are printed with their title, dynasty, poet, and content.

