Recent Advances in Vector Databases Presented at SIGMOD 2024
This article reviews the latest vector database research showcased at SIGMOD 2024, covering system designs such as Starling, Vexless, RaBitQ, and ACORN, and discusses current academic hotspots including query processing, index structures, optimization techniques, and hardware acceleration for large‑scale similarity search.
Author Bio: Shen Zhitiao, Ph.D., senior technical expert at Ant Group's Data Department, leads the memory database TBase project and serves as the technical officer of the storage team, with prior experience at Cisco and Ant Group, focusing on translating cutting‑edge research into practice.
From June 9‑14, 2024, the premier database conference SIGMOD was held in Chile. Against the backdrop of rapid AI and large language model development, vector databases and retrieval have become a focal point, especially with the rise of Retrieval‑Augmented Generation (RAG). This article introduces the newest vector database research from SIGMOD 2024 and analyzes future directions in the field.
1. Current Development of Vector Databases
With the explosion of unstructured data and large models' reliance on vector retrieval, researchers are exploring increasingly complex vector-search scenarios. Classic vector retrieval has many mature index algorithms (graph-, table-, and tree-based), but mixed, batch, and hardware-dependent queries still lack mature solutions. The figure illustrates how scenario complexity and hardware affect algorithm performance.
2. SIGMOD 2024 Vector‑Related Works
Starling: Disk‑Based Vector Retrieval
Starling, proposed by Milvus, optimizes DiskANN by (1) improving disk layout to increase I/O hits and (2) pre‑selecting entry points to reduce I/O operations. It groups nearby vectors in the same disk block, storing an ID‑to‑block map in memory, which loads neighboring vectors together, and selects suitable entry points to further cut I/O.
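The in-memory ID-to-block mapping can be pictured with a small sketch (a toy illustration, not Starling's actual code; the block size, placement policy, and `read_block` helper are all invented here):

```python
import numpy as np

BLOCK_SIZE = 4  # vectors per disk block (toy value; real blocks hold many more)

# Toy dataset: 12 vectors in 4 dimensions.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((12, 4)).astype(np.float32)

# Group nearby vectors into the same "block" (consecutive IDs stand in for
# neighborhood-aware placement) and keep an in-memory ID -> block map.
id_to_block = {vid: vid // BLOCK_SIZE for vid in range(len(vectors))}
blocks = {b: [vid for vid in range(len(vectors)) if vid // BLOCK_SIZE == b]
          for b in range(len(vectors) // BLOCK_SIZE)}

def read_block(block_id):
    """Simulate one disk I/O: fetch every vector co-located in the block."""
    return {vid: vectors[vid] for vid in blocks[block_id]}

# Fetching node 5 also brings in its block-mates 4, 6, 7 "for free",
# which is how co-locating neighbors raises the I/O hit rate.
loaded = read_block(id_to_block[5])
print(sorted(loaded))  # -> [4, 5, 6, 7]
```

One disk read now serves several graph hops, which is the intuition behind Starling's layout optimization.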
Vexless: Low‑Cost Vector Service on Cloud Functions
Vexless is a vector database designed for Cloud Function environments, targeting bursty and sparse workloads. It uses a global coordinator for workload distribution and sharding, employs stateful cloud functions to reduce synchronization overhead, and introduces workload‑aware lifecycle management to optimize cold‑start performance. Implemented on Azure Functions, experiments show cost‑effectiveness and elasticity for bursty, sparse workloads, offering solutions for small‑scale, low‑cost applications.
RaBitQ: Efficient Vector Quantization
RaBitQ, from NTU, compresses each D‑dimensional vector into a D‑bit string and uses an unbiased estimator to approximate distances between the query and the base data. Compression normalizes each vector, applies a shared random orthogonal rotation, and stores the sign of each rotated coordinate as the quantized bit string. At query time the same transformation is applied, and the distance estimator reduces most of the Euclidean computation to a few XOR and popcount operations, achieving up to a three‑fold QPS improvement over PQ on IVF at comparable recall.
ACORN: Predicate‑Aware Search on HNSW
ACORN builds on HNSW but relaxes its edge‑pruning rule (if a is a neighbor of b and b is a neighbor of c, then a should not also be a neighbor of c), keeping denser neighbor lists, and introduces a two‑hop expansion that filters candidates by the query predicate during each expansion. The denser graph plus predicate‑filtered traversal yields a larger search horizon and higher throughput at the same recall.
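The predicate-filtered two-hop expansion can be illustrated on a toy graph (hypothetical node IDs and a made-up "color" attribute standing in for the scalar predicate; this is not ACORN's implementation):

```python
# Toy graph: adjacency lists over node IDs, each node tagged with a scalar
# attribute; the predicate keeps only "red" nodes.
graph = {
    0: [1, 2], 1: [0, 3], 2: [0, 4],
    3: [1, 5], 4: [2, 5], 5: [3, 4],
}
color = {0: "red", 1: "blue", 2: "red", 3: "red", 4: "blue", 5: "red"}
predicate = lambda v: color[v] == "red"

def expand_2hop(node):
    """Predicate-filtered two-hop expansion: gather neighbors and
    neighbors-of-neighbors, then keep only nodes passing the predicate.
    The wider (2-hop) horizon compensates for nodes the filter removes."""
    one_hop = set(graph[node])
    two_hop = {w for v in one_hop for w in graph[v]} - {node}
    return sorted(v for v in one_hop | two_hop if predicate(v))

# From node 0, both direct neighbors are reachable but node 1 fails the
# predicate; the two-hop step still surfaces the matching node 3 behind it.
print(expand_2hop(0))  # -> [2, 3]
```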
3. Current Academic Research Hotspots
Beyond the works above, research on vectors can be grouped into several areas:
Query Processing
Distance Measures
Research explores end‑to‑end training to maximize object recall rather than vector recall, and designing distance metrics that best serve object retrieval, including non‑standard and asymmetric metrics.
Query Types
Model‑embedding queries integrate model inference with vector search. Nearest‑neighbor (K‑NN) and range queries are the main modes; K‑NN varies with K size, while range queries can exploit triangle inequality for pruning. Mixed queries combine vector and scalar conditions, and batch/multi‑vector queries address scenarios like face recognition or multimodal retrieval.
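The triangle-inequality pruning mentioned for range queries can be sketched with a single pivot (a toy example; production systems use many pivots or full index structures rather than one):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.standard_normal((1000, 8))
query = rng.standard_normal(8)
radius = 2.0

# Index time: precompute distances from every point to one fixed pivot.
pivot = data[0]
d_pivot = np.linalg.norm(data - pivot, axis=1)

# Query time: a single distance to the pivot gives a lower bound per point
# via the triangle inequality  d(q, x) >= |d(q, p) - d(p, x)|.
d_q_pivot = np.linalg.norm(query - pivot)
lower_bound = np.abs(d_q_pivot - d_pivot)

# Only points whose lower bound is within the radius need an exact check;
# the bound guarantees no true result is pruned.
survivors = np.nonzero(lower_bound <= radius)[0]
exact = np.linalg.norm(data[survivors] - query, axis=1)
result = survivors[exact <= radius]

print(len(survivors), "exact checks instead of", len(data))
```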
Index Structures
Table‑Based Indexes
These map high‑dimensional vectors to lower dimensions using random (LSH) or learned methods (IVF, PQ, SQ).
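A minimal IVF-style table index might look like the following sketch (random points stand in for trained k-means centroids, and the `n_lists` / `n_probe` values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((2000, 16)).astype(np.float32)

# Index time: a crude IVF, with random points standing in for k-means
# centroids (real systems train centroids with k-means).
n_lists = 16
centroids = data[rng.choice(len(data), n_lists, replace=False)]
assign = np.linalg.norm(data[:, None] - centroids[None], axis=2).argmin(axis=1)
inverted_lists = {c: np.nonzero(assign == c)[0] for c in range(n_lists)}

def ivf_search(query, n_probe=4, k=5):
    """Query time: probe only the n_probe nearest lists, then scan them."""
    probe = np.linalg.norm(centroids - query, axis=1).argsort()[:n_probe]
    cand = np.concatenate([inverted_lists[c] for c in probe])
    d = np.linalg.norm(data[cand] - query, axis=1)
    return cand[d.argsort()[:k]]

print(ivf_search(rng.standard_normal(16)))
```

Raising `n_probe` trades throughput for recall, the central knob in table-based indexes.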
Tree‑Based Indexes
Recursive space partitioning (e.g., KD‑tree) works well for low dimensions; random‑projection trees handle higher dimensions but face index‑size challenges.
Graph‑Based Indexes
Include k‑NN graphs, monotonic search networks (MSN), and small‑world graphs (NSW). Recent work focuses on efficient graph paradigms and edge‑pruning strategies such as RNG, MRNG, NSG, and Vamana.
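The RNG-style edge-pruning idea shared by NSG- and Vamana-like builds can be sketched as follows (a simplified rule without Vamana's α relaxation; the point set and helper names are invented):

```python
import numpy as np

def rng_prune(node, candidates, dist):
    """RNG-style pruning: scan candidates by increasing distance and keep
    candidate c only if no already-kept neighbor b is closer to c than
    `node` is — i.e. the edge (node, c) is not covered by an existing edge."""
    kept = []
    for c in sorted(candidates, key=lambda c: dist(node, c)):
        if all(dist(b, c) >= dist(node, c) for b in kept):
            kept.append(c)
    return kept

# Toy 2-D points: node 0 with candidate neighbors 1..3.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 1.5]])
dist = lambda a, b: float(np.linalg.norm(pts[a] - pts[b]))

# Node 2 is pruned: kept neighbor 1 is closer to 2 than node 0 is,
# so the search can still reach 2 through 1.
print(rng_prune(0, [1, 2, 3], dist))  # -> [1, 3]
```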
Query Optimization & Execution
Hybrid Query Execution
Techniques include brute‑force, pre‑filtering, single‑stage, and post‑filtering, each suited to different selectivity scenarios. Systems like KGraph and ACORN provide predicate‑aware structures to accelerate hybrid queries.
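The pre- versus post-filtering trade-off can be made concrete with a sketch (brute-force distance scans stand in for an ANN index; the `overfetch` factor is an invented knob):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.standard_normal((5000, 8))
labels = rng.integers(0, 100, size=5000)  # scalar attribute per vector

def pre_filter(query, pred_mask, k):
    """Pre-filtering: evaluate the predicate first, then search only the
    matching vectors. Wins when selectivity is low (few matches)."""
    ids = np.nonzero(pred_mask)[0]
    d = np.linalg.norm(data[ids] - query, axis=1)
    return ids[d.argsort()[:k]]

def post_filter(query, pred_mask, k, overfetch=4):
    """Post-filtering: run a top-(k * overfetch) search (exact here),
    then drop non-matching results. Wins when most vectors match."""
    d = np.linalg.norm(data - query, axis=1)
    cand = d.argsort()[:k * overfetch]
    hits = cand[pred_mask[cand]]
    return hits[:k]

query = rng.standard_normal(8)
mask = labels == 7  # ~1% selectivity: pre-filtering is the better plan here
print(len(pre_filter(query, mask, 10)))
```

With ~1% selectivity, post-filtering would need a very large overfetch to return k matches, which is why selectivity drives the choice of execution strategy.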
Hardware‑Related Optimizations
Distance calculations can be accelerated with GPUs or SIMD instructions. For massive datasets, storage optimizations include distributed sharding and disk‑resident indexes (e.g., over a distributed file system).
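As a CPU-side stand-in for SIMD/GPU acceleration, the classic trick is to turn batch distance computation into one matrix multiply via ||q − x||² = ||q||² − 2q·x + ||x||², which BLAS (or a GPU running the same expression) executes with vectorized kernels (a sketch, not tied to any particular system):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((10_000, 128)).astype(np.float32)
queries = rng.standard_normal((32, 128)).astype(np.float32)

# Squared distances for all 32 x 10,000 pairs at once: the dominant cost
# becomes a single matrix multiply instead of a per-pair Python loop.
sq = (queries ** 2).sum(axis=1, keepdims=True)  # ||q||^2, shape (32, 1)
sx = (base ** 2).sum(axis=1)                    # ||x||^2, shape (10000,)
d2 = sq - 2.0 * queries @ base.T + sx           # shape (32, 10000)
nearest = d2.argmin(axis=1)                     # one NN id per query
print(nearest.shape)  # (32,)
```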
Distance Computation Optimizations
Methods such as distance decomposition, early termination, scalar quantization, product quantization, and distance‑comparison‑operation (DCO) techniques reduce computational overhead while preserving accuracy.
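Early termination, for instance, can be sketched as a chunked distance computation that aborts once the partial sum exceeds the current pruning threshold (a toy illustration; the chunk size and helper name are invented):

```python
import numpy as np

def dist2_early_stop(x, y, threshold):
    """Early-termination distance: accumulate squared differences one
    chunk at a time and bail out as soon as the partial sum already
    exceeds the best distance seen so far (the pruning threshold).
    Returns the (possibly partial) sum and whether it was cut short."""
    acc = 0.0
    chunk = 16
    for i in range(0, len(x), chunk):
        diff = x[i:i + chunk] - y[i:i + chunk]
        acc += float(diff @ diff)
        if acc > threshold:          # cannot become the new nearest
            return acc, True
    return acc, False

rng = np.random.default_rng(5)
a, b = rng.standard_normal(128), rng.standard_normal(128)
full = float(((a - b) ** 2).sum())

d, stopped = dist2_early_stop(a, b, threshold=full / 4)  # tight threshold
print(stopped)  # True: computation aborted partway
```

In a k-NN scan the threshold is the distance of the current k-th best candidate, so most non-competitive points are rejected after only a fraction of their dimensions.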
Recruitment Note: Ant Group's Data Department is seeking experts in vector retrieval and databases; resumes can be sent to [email protected].
4. References
1. Pan, James Jie, Jianguo Wang, and Guoliang Li. "Vector Database Management Techniques and Systems." Companion of the 2024 International Conference on Management of Data. 2024.
2. Wang, Mengzhao, et al. "Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment." Proceedings of the ACM on Management of Data 2.1 (2024): 1-27.
3. Su, Yongye, et al. "Vexless: A Serverless Vector Data Management System Using Cloud Functions." Proceedings of the ACM on Management of Data 2.3 (2024): 1-26.
4. Gao, Jianyang, and Cheng Long. "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search." Proceedings of the ACM on Management of Data 2.3 (2024): 1-27.
5. Patel, Liana, et al. "ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data." Proceedings of the ACM on Management of Data 2.3 (2024): 1-27.