How to Build a Distributed Full‑Text Search System Using a Distributed Database
This article explains the design, table schema, indexing workflow, and query processing of a distributed full‑text search system that stores documents and token information separately in a distributed database, improving scalability and performance over traditional Lucene‑based solutions.
Preface
Open‑source full‑text search projects such as Lucene are popular, but real‑world deployments usually require multiple Lucene instances (e.g., Solr, Elasticsearch). These systems store documents and their token information together, forcing a query to scan all nodes to obtain complete results.
To avoid this, the storage and distribution strategy for documents and tokens must be separated. By using a distributed database to store token information independently, a full‑text search system can be built on top of the database instead of Lucene.
Design Idea
Store data in a distributed database rather than in Lucene, using three tables: a document table, a token table (the tokens produced by analyzing each document), and a corpus table (global statistics for each token).
The document and token tables use different sharding keys, so a query can be routed directly to the nodes that hold the relevant rows: doc_id for the document table and term_id for the token table.
SQL is used to express logical operators (AND, OR, NOT) in search conditions.
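The AND/OR/NOT combination can be expressed directly as SQL set operations over the token table. A minimal sketch, using in-memory SQLite as a stand-in for the distributed database; the table and column names follow the article's tab_term schema, and the sample rows are invented for illustration:

```python
import sqlite3

# Token table per the article's schema; data is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab_term (term_id INTEGER, term TEXT, doc_id INTEGER, freq INTEGER)")
rows = [
    (1, "machine-learning", 101, 50),
    (1, "machine-learning", 102, 12),
    (2, "open-source",      101, 7),
    (3, "database",         102, 3),
]
conn.executemany("INSERT INTO tab_term VALUES (?, ?, ?, ?)", rows)

# AND: documents containing both terms.
and_docs = [r[0] for r in conn.execute("""
    SELECT doc_id FROM tab_term WHERE term = 'machine-learning'
    INTERSECT
    SELECT doc_id FROM tab_term WHERE term = 'open-source'
""")]

# NOT: documents containing the first term but not the second.
not_docs = [r[0] for r in conn.execute("""
    SELECT doc_id FROM tab_term WHERE term = 'machine-learning'
    EXCEPT
    SELECT doc_id FROM tab_term WHERE term = 'database'
""")]

print(and_docs)  # [101]
print(not_docs)  # [101]
```

OR would use `UNION` in the same pattern. In the distributed setting, each per-term SELECT hits only the shard owning that term_id.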
Distributed Table Design
Document Table
Table name: tab_doc (document table)
Sharding key: doc_id
Remark: Stores document metadata and content, with status fields that record whether each document is newly added, updated, or deleted.
Token Table
Table name: tab_term (token table)
Sharding key: term_id
Remark: Stores token information.
Corpus Table
Table name: tab_corpus (corpus table)
Sharding key: term_id
Remark: Stores the inverse document frequency of each token.
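The three table designs can be sketched as SQL DDL. This is an illustrative reconstruction, not the article's actual DDL: the core column names (doc_id, doc_content, doc_parse_status, term_id, term, freq, idf) come from the article, while the status column names and types are assumptions. Sharding itself is handled by the distributed database; comments note each table's designated sharding key.

```python
import sqlite3

# Illustrative DDL for the three tables, using SQLite as a stand-in.
# Status column names (download_status) are assumptions; doc_parse_status
# is named in the article's indexing workflow.
DDL = """
-- tab_doc: document metadata and content; sharded by doc_id
CREATE TABLE tab_doc (
    doc_id           INTEGER PRIMARY KEY,
    doc_content      TEXT,
    download_status  INTEGER,  -- 0 = pending, 1 = downloading, 2 = done
    doc_parse_status INTEGER   -- 0 = pending, 1 = analyzing, 2 = done
);

-- tab_term: one row per (term, document) pair; sharded by term_id
CREATE TABLE tab_term (
    term_id INTEGER,
    term    TEXT,
    doc_id  INTEGER,
    freq    INTEGER
);

-- tab_corpus: inverse document frequency per term; sharded by term_id
CREATE TABLE tab_corpus (
    term_id INTEGER PRIMARY KEY,
    term    TEXT,
    idf     REAL
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)  # ['tab_corpus', 'tab_doc', 'tab_term']
```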
Architecture
Document Indexing Process
1. Multi‑threaded download
Derive doc_id from the URL and look it up in the document table. If no record exists, insert an initial row, set the download status to 1 (downloading), and start the download; on success, update the status to 2 (done) and record the download time, and on failure revert the status to 0 (pending). If a record already exists, its current status and last download time determine whether to skip, retry, or re-download the document.
2. Multi‑threaded analysis and storage
Read doc_parse_status: if it is 0 (pending), set it to 1 (analyzing) and start analysis. On success, set the status to 2 (done), store the analyzed content, and insert the token results (term_id, term_name, doc_id, term_freq) into the token table; on failure, reset the status to 0. A status of 1 means another worker is analyzing the document, so it is skipped; a status of 2 may trigger re-analysis after a defined interval.
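Both steps follow the same claim/finish pattern driven by a status field (0 = pending, 1 = in progress, 2 = done). A minimal sketch of that pattern, with the persistence layer simplified to a dict; field names such as download_time and the retry interval are placeholders, not values from the article:

```python
import time

RETRY_INTERVAL = 3600  # seconds before a finished record may be redone (placeholder)

def try_claim(record, status_field, time_field):
    """Return True if this worker may process the record now."""
    status = record.get(status_field, 0)
    if status == 0:                       # pending: claim it
        record[status_field] = 1
        return True
    if status == 1:                       # another worker is on it: skip
        return False
    # status == 2: redo only after the configured interval has elapsed
    if time.time() - record.get(time_field, 0) >= RETRY_INTERVAL:
        record[status_field] = 1
        return True
    return False

def finish(record, status_field, time_field, ok):
    """On success mark done and stamp the time; on failure revert to pending."""
    if ok:
        record[status_field] = 2
        record[time_field] = time.time()
    else:
        record[status_field] = 0

doc = {"doc_id": 8913640109}
assert try_claim(doc, "download_status", "download_time")      # pending -> claimed
assert not try_claim(doc, "download_status", "download_time")  # in progress -> skip
finish(doc, "download_status", "download_time", ok=True)
print(doc["download_status"])  # 2
```

In the real system the claim would be a conditional UPDATE on the document table so that concurrent workers cannot claim the same row.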
Document Search Process
Tokenize the user query, derive the logical relationships (AND, OR, NOT) among the tokens, and look up each token in the corresponding shard of the token table.
Combine document IDs according to the logical relationships, retrieve document contents, and compute a score for each document as the sum of (term frequency × inverse document frequency) for all query tokens.
Rank documents by score and return the highest‑scoring results.
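The scoring and ranking steps above can be sketched as follows; the data structures and sample values are illustrative, and the formula is the article's score = Σ(term frequency × inverse document frequency):

```python
# term_rows maps term -> {doc_id: term frequency}; idf maps term -> IDF.
# Sample values are invented for illustration.
def score_documents(query_terms, term_rows, idf):
    """Score each candidate document as sum(freq * idf) over query terms."""
    scores = {}
    for term in query_terms:
        for doc_id, freq in term_rows.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + freq * idf[term]
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

term_rows = {"a": {1: 50, 2: 10}, "b": {1: 7}}
idf = {"a": 9.9, "b": 6.7}
ranking = score_documents(["a", "b"], term_rows, idf)
print([doc_id for doc_id, _ in ranking])  # [1, 2]
```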
Example
Assumptions
100 nodes (0‑99)
10 million crawled documents
Corpus built from 1 billion documents, containing 1 trillion token records
Document ID = sum of four 32‑bit integers of the URL’s MD5
Token ID = sum of four 32‑bit integers of the token’s MD5
Three sample documents and eight sample tokens are used for demonstration.
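The ID scheme in the assumptions can be sketched directly: split the 128-bit MD5 into four 32-bit integers and sum them. Routing a record to one of the 100 nodes by `id % 100` is an inference from the sample data (each node_id matches the last two decimal digits of its doc_id), not something the article states explicitly; the URL below is illustrative.

```python
import hashlib
import struct

NUM_NODES = 100  # nodes 0-99, per the assumptions

def id_from_md5(text):
    """Sum the four 32-bit words of the MD5 digest of `text`."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return sum(struct.unpack(">4I", digest))  # four big-endian 32-bit words

url = "http://www.infoq.com/example"  # illustrative URL
doc_id = id_from_md5(url)
node_id = doc_id % NUM_NODES  # inferred routing rule, see lead-in
print(doc_id, node_id)
```

The same `id_from_md5` would be applied to a token string to produce term_id.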
Indexing
Document table data:
| doc_id | doc_content | node_id |
| --- | --- | --- |
| 8913640109 | http://www.infoq.com/.../machine-expert-not-value content | 9 |
| 9783095263 | http://www.infoq.com/.../transform-machine-learn-70 content | 63 |
| 4104526513 | http://www.infoq.com/.../Big-data-machine-learning-2016 content | 13 |
Token table data (sample):
| term_id | term | doc_id | freq |
| --- | --- | --- | --- |
| 7426744410 | 机器学习 (machine learning) | 8913640109 | 50 |
| 10471356074 | 数据 (data) | 8913640109 | 25 |
| 11125465229 | 问题 (problem) | 8913640109 | 22 |
| 14126505064 | 时间 (time) | 8913640109 | 9 |
Corpus table (inverse document frequency):
| term_id | term | idf |
| --- | --- | --- |
| 7426744410 | 机器学习 (machine learning) | 9.9035 |
| 10471356074 | 数据 (data) | 6.7254 |
| 11125465229 | 问题 (problem) | 5.9915 |
| 14126505064 | 时间 (time) | 5.7446 |
Search Example
User query: “机器学习 开源” (“machine learning” and “open source”). The system tokenizes the query into two terms joined by an AND relationship, retrieves the matching rows from the token table, merges the document IDs, computes each document's score as the sum of freq × idf, and ranks the documents.
Resulting scores:
| doc_id | score |
| --- | --- |
| 8913640109 | 495.175 |
| 4104526513 | 419.2228 |
| 9783095263 | 364.3631 |
The corresponding document contents are then returned to the user with highlighted query terms.
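The top score can be checked against the sample data: document 8913640109 has freq = 50 for 机器学习, whose idf is 9.9035, and 50 × 9.9035 = 495.175 matches the reported score exactly. (The sample token table omits rows for 开源, so its contribution cannot be verified from the data shown.)

```python
# Verify the top-ranked document's score from the sample tables:
# freq(机器学习) = 50, idf(机器学习) = 9.9035.
freq, idf = 50, 9.9035
score = freq * idf
print(round(score, 3))  # 495.175
```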
Conclusion
Current popular frameworks (Solr, Elasticsearch) store data in inverted indexes distributed across multiple index instances, requiring queries to hit every node.
Using a distributed database separates document and token storage, allowing precise node targeting and improving performance.
The architecture decouples download, analysis, storage, and query modules, providing configurability and flexibility.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. It focuses on operations transformation and aims to accompany readers throughout their operations careers.