Design and Optimization of an In-Memory Search Engine
This article shares the author's exploration of in‑memory search engine design, covering system understanding, core architecture, thread and task models, intersection algorithms, lookup optimizations, and componentization, aiming to fill a gap in the scarce public documentation on memory‑based retrieval engines.
The author, Kaelhua, writes about the lack of public material on in‑memory retrieval engine design and presents his experience from the ZeroSearch project, assuming readers already understand basic search concepts such as inverted indexes and scoring.
System Understanding: The article starts by contrasting disk‑based search engines (e.g., Elasticsearch) with in‑memory engines, highlighting the shift from IO‑bound to compute‑bound workloads and the need for effective task scheduling.
Core Design: It describes the index‑sharding and library model, the decision to avoid embedding an RPC framework, and the prioritization of usability over raw performance. The design emphasizes a simple library interface with an initialization function and a single retrieval entry point.
Thread Model: The current thread model uses five thread pools (main, JoinThreadPool, ScoreThreadPool, CleanThreadPool, and a resource pool). Each search request creates intersection and scoring tasks, which are scheduled across these pools. The author discusses drawbacks such as configuration complexity, potential thread starvation, and uneven task costs.
Potential improvements include:
Maintaining a single FIFO queue for all tasks.
Prioritizing lighter tasks (clean → L1 → L2 → intersection) via a priority queue.
Adopting time‑slice scheduling where tasks are re‑queued after a fixed quantum.
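The priority‑queue variant can be sketched in a few lines of C++. The task kinds and their strict lightest‑first ordering follow the list above, but the numeric priority values and type names are illustrative, not taken from the project:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Illustrative task kinds, ordered lightest-first as suggested above
// (clean -> L1 -> L2 -> intersection); exact values are assumptions.
enum class TaskKind { kClean = 0, kL1Score = 1, kL2Score = 2, kIntersect = 3 };

struct Task {
  TaskKind kind;
  std::function<void()> run;  // the actual work to execute
};

// Smaller kind value == higher priority, so light tasks never wait
// behind heavy intersection work.
struct LighterFirst {
  bool operator()(const Task& a, const Task& b) const {
    return static_cast<int>(a.kind) > static_cast<int>(b.kind);
  }
};

using TaskQueue = std::priority_queue<Task, std::vector<Task>, LighterFirst>;
```

A single shared queue of this shape would replace the per-pool queues; whether starvation of intersection tasks then needs aging or time-slicing is exactly the trade-off the article raises.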
Task Model: The task model separates task creation from execution. Intersection tasks produce L1 scoring tasks, which in turn generate L2 scoring tasks once thresholds are met. The pipeline aims to increase CPU utilization without changing overall throughput.
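A minimal, single-threaded sketch of that chaining follows. The batch size and the L1 pass threshold are invented for illustration, and the real engine runs these stages as separate tasks across thread pools rather than in one loop:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative thresholds; the article does not publish the real values.
constexpr std::size_t kL1BatchSize = 4;  // candidates per L1 scoring task
constexpr double kL1PassScore = 0.5;     // minimum L1 score to reach L2

struct ScoredDoc { uint32_t doc; double score; };

// One L1 task: cheap scoring over a batch, keeping only docs that clear
// the threshold so the expensive L2 stage sees fewer candidates.
std::vector<ScoredDoc> RunL1Task(const std::vector<uint32_t>& batch,
                                 double (*l1_score)(uint32_t)) {
  std::vector<ScoredDoc> survivors;
  for (uint32_t doc : batch) {
    double s = l1_score(doc);
    if (s >= kL1PassScore) survivors.push_back({doc, s});
  }
  return survivors;
}

// Drives the pipeline inline: intersection output is chopped into batches
// (one L1 task each); every L1 survivor then gets an L2 score.
std::vector<ScoredDoc> RunPipeline(const std::vector<uint32_t>& intersected,
                                   double (*l1_score)(uint32_t),
                                   double (*l2_score)(uint32_t)) {
  std::vector<ScoredDoc> results;
  for (std::size_t i = 0; i < intersected.size(); i += kL1BatchSize) {
    std::size_t end = std::min(i + kL1BatchSize, intersected.size());
    std::vector<uint32_t> batch(intersected.begin() + i,
                                intersected.begin() + end);
    for (const ScoredDoc& d : RunL1Task(batch, l1_score)) {
      results.push_back({d.doc, l2_score(d.doc)});
    }
  }
  return results;
}
```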
Intersection Design: The article explains a three‑stage pipeline (intersection, L1 scoring, L2 scoring) and presents a logical post‑hoc intersection algorithm that selects a candidate baseline N and advances all posting lists to the first entry ≥ N. It also discusses flattening the syntax tree (Cartesian product) and why the original three‑level tree with the post‑hoc algorithm remains the most efficient.
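The baseline-advancing idea is essentially the classic leapfrog (zig-zag) intersection. A self-contained sketch over plain sorted arrays, with `std::lower_bound` standing in for the engine's block-based seek primitive:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Intersect sorted posting lists: pick a candidate baseline doc id N and
// advance every list to its first entry >= N; when all lists land exactly
// on N, emit it and move on. Stops as soon as any list is exhausted.
std::vector<uint32_t> Intersect(const std::vector<std::vector<uint32_t>>& lists) {
  std::vector<uint32_t> result;
  if (lists.empty()) return result;
  for (const auto& l : lists) if (l.empty()) return result;

  std::vector<std::size_t> pos(lists.size(), 0);
  uint32_t candidate = lists[0][0];
  std::size_t agreed = 0;  // lists currently positioned exactly at candidate
  std::size_t i = 0;
  while (true) {
    // Seek list i to the first entry >= candidate (linear lower_bound here;
    // the engine uses galloping plus in-block binary search instead).
    auto it = std::lower_bound(lists[i].begin() + pos[i], lists[i].end(),
                               candidate);
    if (it == lists[i].end()) break;  // one list exhausted: done
    pos[i] = static_cast<std::size_t>(it - lists[i].begin());
    if (*it == candidate) {
      if (++agreed == lists.size()) {  // every list contains candidate
        result.push_back(candidate);
        ++candidate;                   // move past it and restart the round
        agreed = 0;
      }
    } else {
      candidate = *it;                 // new, larger baseline
      agreed = 1;
    }
    i = (i + 1) % lists.size();        // round-robin over the lists
  }
  return result;
}
```

Because every seek only moves forward, each list is traversed at most once regardless of how often the baseline changes.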
Lookup Algorithm: The author analyzes block‑based inverted lists, short vs. long chains, and proposes using galloping (exponential) search for nearby blocks, combined with binary search within a block. Short chains are processed first; synonym sub‑trees are deferred. For very long chains, bitmap representations are considered.
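Galloping search itself fits in a dozen lines. A sketch over a single sorted array; the engine applies the same idea across neighboring blocks and then binary-searches inside the bracketed block:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Galloping (exponential) search: probe offsets 1, 2, 4, ... from `start`
// until we pass `target` or run off the end, then binary-search the
// bracketed range. Cheap when the target is near the current position,
// the common case when posting lists advance in small steps.
// Returns the index of the first entry >= target, or list.size().
std::size_t GallopTo(const std::vector<uint32_t>& list, std::size_t start,
                     uint32_t target) {
  std::size_t n = list.size();
  if (start >= n || list[start] >= target) return start;
  std::size_t step = 1;
  std::size_t lo = start;  // invariant: list[lo] < target
  while (lo + step < n && list[lo + step] < target) {
    lo += step;
    step *= 2;
  }
  std::size_t hi = std::min(lo + step, n);  // list[hi] >= target (or hi == n)
  auto it = std::lower_bound(list.begin() + lo + 1, list.begin() + hi, target);
  return static_cast<std::size_t>(it - list.begin());
}
```

The doubling phase costs O(log d) probes where d is the distance actually travelled, so a seek that lands one block away stays near-constant instead of paying a full binary search over the whole chain.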
Chain Classification: Chains are classified as short, long, or ultra‑long based on block count (K = 15). Short chains search only the current block, while long chains examine up to 15 neighboring blocks before falling back to binary search.
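A hedged sketch of that classification. The article fixes K = 15 but not the exact cutoffs, so the boundaries below are assumptions:

```cpp
#include <cstddef>

enum class ChainClass { kShort, kLong, kUltraLong };

constexpr std::size_t kNeighborBlocks = 15;  // K from the article

// Assumed cutoffs: a short chain fits in one block; a long chain spans up
// to K blocks (searched by scanning neighbors, then binary search); anything
// larger is ultra-long and a candidate for a bitmap representation.
ChainClass ClassifyChain(std::size_t block_count) {
  if (block_count <= 1) return ChainClass::kShort;
  if (block_count <= kNeighborBlocks) return ChainClass::kLong;
  return ChainClass::kUltraLong;
}
```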
Engine Componentization: The retrieval engine is packaged as a library exposing a single int32_t Retrieve(const RetrieveOptions* retrieve_options, void* business_session) function. The design includes a SearcherStage class for lifecycle hooks (pre‑process, post‑process, KPI reporting) and a modular relevance scoring subsystem split into independent classes.
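That surface can be sketched as a header plus a stub. Only the `Retrieve` signature and the three hook roles come from the article; the `RetrieveOptions` fields, method names, and the return-code convention are assumptions:

```cpp
#include <cstdint>

// Field names here are assumptions; the article only fixes the Retrieve
// signature below.
struct RetrieveOptions {
  const char* query = nullptr;
  int32_t max_results = 0;
};

// Lifecycle hooks around a retrieval call (pre-process, post-process,
// KPI reporting), as described for the SearcherStage class. Method names
// are illustrative.
class SearcherStage {
 public:
  virtual ~SearcherStage() = default;
  virtual void PreProcess(void* business_session) {}
  virtual void PostProcess(void* business_session) {}
  virtual void ReportKpi(void* business_session) {}
};

// The single retrieval entry point. Stub body; the 0-on-success /
// negative-error-code convention is an assumption.
int32_t Retrieve(const RetrieveOptions* retrieve_options,
                 void* business_session) {
  if (retrieve_options == nullptr) return -1;
  (void)business_session;  // would carry per-request business context
  // ... intersection and L1/L2 scoring would run here ...
  return 0;
}
```

Keeping the entry point to one C-style function with an opaque session pointer is what lets the engine stay a library: the host process owns threading, RPC, and session state, matching the usability-first stance described above.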
In conclusion, the author emphasizes usability as the primary priority, while still adhering to the fundamental requirements of an in‑memory search engine, and outlines future work such as distributed indexing, quality‑standard evaluation, and further performance optimizations.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.