How to Build a Distributed Full‑Text Search System Using a Distributed Database
This article explains the design, table schema, indexing workflow, and query processing of a distributed full‑text search system that stores documents and token information separately in a distributed database, improving scalability and performance over traditional Lucene‑based solutions.
Preface
Open‑source full‑text search projects such as Lucene are popular, but real‑world deployments usually require multiple Lucene instances (e.g., Solr, Elasticsearch). These systems store documents and their token information together, forcing a query to scan all nodes to obtain complete results.
To avoid this, the storage and distribution strategy for documents and tokens must be separated. By using a distributed database to store token information independently, a full‑text search system can be built on top of the database instead of Lucene.
Design Idea
Store data in a distributed database rather than in Lucene, using three tables: a document table, a token table (the tokens produced by analyzing each document), and a corpus table (global statistics for each token).
The document and token tables use different sharding keys, so a query can be routed directly to the nodes that hold the relevant rows: doc_id for the document table and term_id for the token table.
SQL is used to express logical operators (AND, OR, NOT) in search conditions.
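The AND/OR/NOT combination can be expressed directly as SQL set operations over the token table. A minimal sketch, using in-memory SQLite as a stand-in for the distributed database; the table and column names follow the article's tab_term schema, and the sample rows are invented for illustration:

```python
import sqlite3

# Token table per the article's schema; data is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab_term (term_id INTEGER, term TEXT, doc_id INTEGER, freq INTEGER)")
rows = [
    (1, "machine-learning", 101, 50),
    (1, "machine-learning", 102, 12),
    (2, "open-source",      101, 7),
    (3, "database",         102, 3),
]
conn.executemany("INSERT INTO tab_term VALUES (?, ?, ?, ?)", rows)

# AND: documents containing both terms.
and_docs = [r[0] for r in conn.execute("""
    SELECT doc_id FROM tab_term WHERE term = 'machine-learning'
    INTERSECT
    SELECT doc_id FROM tab_term WHERE term = 'open-source'
""")]

# NOT: documents containing the first term but not the second.
not_docs = [r[0] for r in conn.execute("""
    SELECT doc_id FROM tab_term WHERE term = 'machine-learning'
    EXCEPT
    SELECT doc_id FROM tab_term WHERE term = 'database'
""")]

print(and_docs)  # [101]
print(not_docs)  # [101]
```

OR would use `UNION` in the same pattern. In the distributed setting, each per-term SELECT hits only the shard owning that term_id.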
Distributed Table Design
Document Table
Table name: tab_doc (document table)
Sharding key: doc_id
Remark: Stores document metadata and content, with status fields that record whether each document is newly added, updated, or deleted.
Token Table
Table name: tab_term (token table)
Sharding key: term_id
Remark: Stores token information.
Corpus Table
Table name: tab_corpus (corpus table)
Sharding key: term_id
Remark: Stores the inverse document frequency of each token.
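The three table designs can be sketched as SQL DDL. This is an illustrative reconstruction, not the article's actual DDL: the core column names (doc_id, doc_content, doc_parse_status, term_id, term, freq, idf) come from the article, while the status column names and types are assumptions. Sharding itself is handled by the distributed database; comments note each table's designated sharding key.

```python
import sqlite3

# Illustrative DDL for the three tables, using SQLite as a stand-in.
# Status column names (download_status) are assumptions; doc_parse_status
# is named in the article's indexing workflow.
DDL = """
-- tab_doc: document metadata and content; sharded by doc_id
CREATE TABLE tab_doc (
    doc_id           INTEGER PRIMARY KEY,
    doc_content      TEXT,
    download_status  INTEGER,  -- 0 = pending, 1 = downloading, 2 = done
    doc_parse_status INTEGER   -- 0 = pending, 1 = analyzing, 2 = done
);

-- tab_term: one row per (term, document) pair; sharded by term_id
CREATE TABLE tab_term (
    term_id INTEGER,
    term    TEXT,
    doc_id  INTEGER,
    freq    INTEGER
);

-- tab_corpus: inverse document frequency per term; sharded by term_id
CREATE TABLE tab_corpus (
    term_id INTEGER PRIMARY KEY,
    term    TEXT,
    idf     REAL
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)  # ['tab_corpus', 'tab_doc', 'tab_term']
```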
Architecture
Document Indexing Process
1. Multi‑threaded download
Derive doc_id from the URL and look it up in the document table. If no record exists, insert an initial row, set the download status to 1 (downloading), and start the download; on success, update the status to 2 (done) and record the download time, and on failure revert the status to 0 (pending). If a record already exists, its current status and last download time determine whether to skip, retry, or re-download the document.
2. Multi‑threaded analysis and storage
Read doc_parse_status: if it is 0 (pending), set it to 1 (analyzing) and start analysis. On success, set the status to 2 (done), store the analyzed content, and insert the token results (term_id, term_name, doc_id, term_freq) into the token table; on failure, reset the status to 0. A status of 1 means another worker is analyzing the document, so it is skipped; a status of 2 may trigger re-analysis after a defined interval.
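Both steps follow the same claim/finish pattern driven by a status field (0 = pending, 1 = in progress, 2 = done). A minimal sketch of that pattern, with the persistence layer simplified to a dict; field names such as download_time and the retry interval are placeholders, not values from the article:

```python
import time

RETRY_INTERVAL = 3600  # seconds before a finished record may be redone (placeholder)

def try_claim(record, status_field, time_field):
    """Return True if this worker may process the record now."""
    status = record.get(status_field, 0)
    if status == 0:                       # pending: claim it
        record[status_field] = 1
        return True
    if status == 1:                       # another worker is on it: skip
        return False
    # status == 2: redo only after the configured interval has elapsed
    if time.time() - record.get(time_field, 0) >= RETRY_INTERVAL:
        record[status_field] = 1
        return True
    return False

def finish(record, status_field, time_field, ok):
    """On success mark done and stamp the time; on failure revert to pending."""
    if ok:
        record[status_field] = 2
        record[time_field] = time.time()
    else:
        record[status_field] = 0

doc = {"doc_id": 8913640109}
assert try_claim(doc, "download_status", "download_time")      # pending -> claimed
assert not try_claim(doc, "download_status", "download_time")  # in progress -> skip
finish(doc, "download_status", "download_time", ok=True)
print(doc["download_status"])  # 2
```

In the real system the claim would be a conditional UPDATE on the document table so that concurrent workers cannot claim the same row.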
Document Search Process
Tokenize the user query, derive the logical relationships (AND, OR, NOT) among the tokens, and look up each token in the corresponding shard of the token table.
Combine document IDs according to the logical relationships, retrieve document contents, and compute a score for each document as the sum of (term frequency × inverse document frequency) for all query tokens.
Rank documents by score and return the highest‑scoring results.
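The scoring and ranking steps above can be sketched as follows; the data structures and sample values are illustrative, and the formula is the article's score = Σ(term frequency × inverse document frequency):

```python
# term_rows maps term -> {doc_id: term frequency}; idf maps term -> IDF.
# Sample values are invented for illustration.
def score_documents(query_terms, term_rows, idf):
    """Score each candidate document as sum(freq * idf) over query terms."""
    scores = {}
    for term in query_terms:
        for doc_id, freq in term_rows.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + freq * idf[term]
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

term_rows = {"a": {1: 50, 2: 10}, "b": {1: 7}}
idf = {"a": 9.9, "b": 6.7}
ranking = score_documents(["a", "b"], term_rows, idf)
print([doc_id for doc_id, _ in ranking])  # [1, 2]
```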
Example
Assumptions
100 nodes (0‑99)
10 million crawled documents
Corpus built from 1 billion documents, containing 1 trillion token records
Document ID = sum of four 32‑bit integers of the URL’s MD5
Token ID = sum of four 32‑bit integers of the token’s MD5
Three sample documents and eight sample tokens are used for demonstration.
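The ID scheme in the assumptions can be sketched directly: split the 128-bit MD5 into four 32-bit integers and sum them. Routing a record to one of the 100 nodes by `id % 100` is an inference from the sample data (each node_id matches the last two decimal digits of its doc_id), not something the article states explicitly; the URL below is illustrative.

```python
import hashlib
import struct

NUM_NODES = 100  # nodes 0-99, per the assumptions

def id_from_md5(text):
    """Sum the four 32-bit words of the MD5 digest of `text`."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return sum(struct.unpack(">4I", digest))  # four big-endian 32-bit words

url = "http://www.infoq.com/example"  # illustrative URL
doc_id = id_from_md5(url)
node_id = doc_id % NUM_NODES  # inferred routing rule, see lead-in
print(doc_id, node_id)
```

The same `id_from_md5` would be applied to a token string to produce term_id.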
Indexing
Document table data:
| doc_id | doc_content | node_id |
| --- | --- | --- |
| 8913640109 | http://www.infoq.com/.../machine-expert-not-value content | 9 |
| 9783095263 | http://www.infoq.com/.../transform-machine-learn-70 content | 63 |
| 4104526513 | http://www.infoq.com/.../Big-data-machine-learning-2016 content | 13 |
Token table data (sample):
| term_id | term | doc_id | freq |
| --- | --- | --- | --- |
| 7426744410 | 机器学习 (machine learning) | 8913640109 | 50 |
| 10471356074 | 数据 (data) | 8913640109 | 25 |
| 11125465229 | 问题 (problem) | 8913640109 | 22 |
| 14126505064 | 时间 (time) | 8913640109 | 9 |
Corpus table (inverse document frequency):
| term_id | term | idf |
| --- | --- | --- |
| 7426744410 | 机器学习 (machine learning) | 9.9035 |
| 10471356074 | 数据 (data) | 6.7254 |
| 11125465229 | 问题 (problem) | 5.9915 |
| 14126505064 | 时间 (time) | 5.7446 |
Search Example
User query: “机器学习 开源” (“machine learning” and “open source”). The system tokenizes the query into two terms joined by an AND relationship, retrieves the matching rows from the token table, merges the document IDs, computes each document's score as the sum of freq × idf, and ranks the documents.
Resulting scores:
| doc_id | score |
| --- | --- |
| 8913640109 | 495.175 |
| 4104526513 | 419.2228 |
| 9783095263 | 364.3631 |
The corresponding document contents are then returned to the user with highlighted query terms.
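The top score can be checked against the sample data: document 8913640109 has freq = 50 for 机器学习, whose idf is 9.9035, and 50 × 9.9035 = 495.175 matches the reported score exactly. (The sample token table omits rows for 开源, so its contribution cannot be verified from the data shown.)

```python
# Verify the top-ranked document's score from the sample tables:
# freq(机器学习) = 50, idf(机器学习) = 9.9035.
freq, idf = 50, 9.9035
score = freq * idf
print(round(score, 3))  # 495.175
```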
Conclusion
Current popular frameworks (Solr, Elasticsearch) store data in inverted indexes distributed across multiple index instances, requiring queries to hit every node.
Using a distributed database separates document and token storage, allowing precise node targeting and improving performance.
The architecture decouples download, analysis, storage, and query modules, providing configurability and flexibility.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. It focuses on operations transformation and aims to accompany readers throughout their operations careers.