
Recent Advances in Vector Databases Presented at SIGMOD 2024

This article reviews the latest vector database research showcased at SIGMOD 2024, covering system designs such as Starling, Vexless, RaBitQ, and ACORN, and discusses current academic hotspots including query processing, index structures, optimization techniques, and hardware acceleration for large‑scale similarity search.


Author Bio: Shen Zhitiao, Ph.D., is a senior technical expert in Ant Group's Data Department, where he leads the in‑memory database TBase project and serves as technical lead of the storage team. With prior experience at Cisco and Ant Group, he focuses on translating cutting‑edge research into practice.

From June 9‑14, 2024, SIGMOD, the premier database conference, was held in Chile. Against the backdrop of rapid progress in AI and large language models, vector databases and vector retrieval have become a focal point, especially with the rise of Retrieval‑Augmented Generation (RAG). This article introduces the newest vector database research from SIGMOD 2024 and analyzes future directions in the field.

1. Current Development of Vector Databases

With the explosion of unstructured data and the reliance of large models on vector retrieval, researchers are exploring complex scenarios for vector search. Classic vector retrieval has many index algorithms (graph, table, tree), but mixed, batch, and hardware‑dependent queries lack mature solutions. The figure illustrates how scenario complexity and hardware affect algorithm performance.

2. SIGMOD 2024 Vector‑Related Works

Starling: Disk‑Based Vector Retrieval

Starling, proposed by Milvus, optimizes DiskANN along two axes: (1) an improved disk layout that raises the hit rate per I/O, and (2) pre‑selected entry points that reduce the number of I/O operations. It groups nearby vectors into the same disk block and keeps an ID‑to‑block map in memory, so that a single block read loads a vector together with its graph neighbors; well‑chosen entry points then shorten search paths and cut I/O further.
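The layout idea can be illustrated with a minimal sketch (illustrative only; `build_block_layout` is a hypothetical helper, and Starling's actual layout algorithm is more sophisticated): greedily co‑locate each vector with as many of its graph neighbors as fit in one block, so one disk read fetches vectors likely to be visited together.

```python
def build_block_layout(num_vectors, neighbors, block_size=4):
    """Greedy sketch of a Starling-style disk layout.

    neighbors: dict mapping vector id -> list of graph-neighbor ids.
    Returns the blocks and an in-memory ID-to-block map.
    """
    blocks, id_to_block, placed = [], {}, set()
    for vid in range(num_vectors):
        if vid in placed:
            continue
        # Start a new block with this vector, then pull in unplaced neighbors.
        block = [vid]
        placed.add(vid)
        for nb in neighbors.get(vid, []):
            if nb not in placed and len(block) < block_size:
                block.append(nb)
                placed.add(nb)
        for v in block:
            id_to_block[v] = len(blocks)  # the map kept in memory
        blocks.append(block)
    return blocks, id_to_block
```

With this layout, resolving a vector's block via `id_to_block` and reading that block returns several of its neighbors "for free" in the same I/O.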

Vexless: Low‑Cost Vector Service on Cloud Functions

Vexless is a vector database designed for Cloud Function environments, targeting bursty and sparse workloads. It uses a global coordinator for workload distribution and sharding, employs stateful cloud functions to reduce synchronization overhead, and introduces workload‑aware lifecycle management to optimize cold‑start performance. Implemented on Azure Functions, experiments show cost‑effectiveness and elasticity for bursty, sparse workloads, offering solutions for small‑scale, low‑cost applications.

RaBitQ: Efficient Vector Quantization

RaBitQ, from NTU, compresses D‑dimensional vectors to D‑bit strings and uses an estimator to approximate distances to base data. Compression normalizes vectors, applies random orthogonal rotation, and stores the sign of the product as the quantized vector. During query, the same steps are applied, and distance estimation reduces Euclidean distance computation to a few XOR operations, achieving up to three‑fold QPS improvement over PQ on IVF with comparable recall.
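The core compression step can be sketched as follows (a simplified illustration: real RaBitQ pairs the bit codes with an unbiased, error‑bounded distance estimator with correction terms, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def quantize(x, R):
    # Normalize, apply the random orthogonal rotation, keep only the signs:
    # D floats are compressed to a D-bit string.
    z = R @ (x / np.linalg.norm(x))
    return z > 0  # boolean array standing in for the D-bit code

def hamming(a, b):
    # XOR + popcount between two codes; RaBitQ's estimator converts this
    # count into an approximate Euclidean distance.
    return int(np.count_nonzero(a ^ b))
```

Because the query is quantized with the same rotation, the expensive Euclidean distance is replaced at scan time by a handful of XOR/popcount operations over bit strings.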

ACORN: Predicate‑Agnostic Search on HNSW

ACORN builds on HNSW but relaxes its edge‑pruning rule (which drops the edge a–c when a is a neighbor of b and b is a neighbor of c), keeping a denser graph. On top of this it introduces a 2‑hop expansion that filters candidates by the structured predicate during each expansion step, yielding a larger effective search horizon and higher throughput at the same recall.
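The 2‑hop, predicate‑filtered expansion can be sketched in a few lines (`two_hop_candidates` is a hypothetical helper, not ACORN's API; the real system integrates this into the layered HNSW search):

```python
def two_hop_candidates(graph, node, predicate):
    """Gather 1-hop and 2-hop neighbors of `node`, then keep only those
    satisfying the structured predicate. Expanding two hops enlarges the
    search horizon when many direct neighbors are filtered out."""
    cands = set(graph.get(node, []))
    for nb in graph.get(node, []):
        cands.update(graph.get(nb, []))
    cands.discard(node)
    return {c for c in cands if predicate(c)}
```

Even if every 1‑hop neighbor fails the predicate, the 2‑hop frontier usually still yields viable candidates, which is what keeps recall high under selective filters.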

3. Current Academic Research Hotspots

Beyond the works above, research on vectors can be grouped into several areas:

Query Processing

Distance Measures

Research explores end‑to‑end training to maximize object recall rather than vector recall, and designing distance metrics that best serve object retrieval, including non‑standard and asymmetric metrics.

Query Types

Model‑embedding queries integrate model inference with vector search. Nearest‑neighbor (K‑NN) and range queries are the main modes; K‑NN varies with K size, while range queries can exploit triangle inequality for pruning. Mixed queries combine vector and scalar conditions, and batch/multi‑vector queries address scenarios like face recognition or multimodal retrieval.
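The triangle‑inequality pruning mentioned for range queries can be demonstrated with a single pivot (a minimal sketch; `range_query_with_pivot` is a hypothetical function, and practical metric indexes use many pivots):

```python
import numpy as np

def range_query_with_pivot(q, data, pivot, radius):
    """Range query with triangle-inequality pruning: since
    d(q, x) >= |d(q, pivot) - d(pivot, x)|, a point x can be skipped
    without computing d(q, x) whenever that lower bound exceeds radius."""
    d_qp = np.linalg.norm(q - pivot)
    results = []
    for x in data:
        d_px = np.linalg.norm(pivot - x)  # typically precomputed at build time
        if abs(d_qp - d_px) > radius:
            continue  # pruned: cannot be within the query radius
        if np.linalg.norm(q - x) <= radius:
            results.append(x)
    return results
```

The distances from each point to the pivot are computed once at index‑build time, so each pruned point costs only one subtraction and comparison at query time.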

Index Structures

Table‑Based Indexes

These map high‑dimensional vectors to compact buckets or codes, using either random projections (LSH) or learned, quantization‑based methods (IVF, PQ, SQ).
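A minimal IVF sketch makes the idea concrete (`build_ivf` and `ivf_search` are hypothetical helpers; production systems obtain the centroids via k‑means and combine IVF with PQ compression):

```python
import numpy as np

def build_ivf(data, centroids):
    """Assign each vector to its nearest centroid, forming one inverted
    list per cell."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, x in enumerate(data):
        cell = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        lists[cell].append(vid)
    return lists

def ivf_search(q, data, centroids, lists, nprobe=1):
    """Scan only the nprobe cells whose centroids are closest to the query,
    instead of the full dataset."""
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    cand = [vid for c in order for vid in lists[int(c)]]
    return min(cand, key=lambda vid: np.linalg.norm(data[vid] - q))
```

Raising `nprobe` trades query latency for recall, which is the central tuning knob in table‑based indexes.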

Tree‑Based Indexes

Recursive space partitioning (e.g., KD‑tree) works well for low dimensions; random‑projection trees handle higher dimensions but face index‑size challenges.

Graph‑Based Indexes

These include k‑NN graphs, monotonic search networks (MSN), and small‑world graphs (NSW). Recent work focuses on efficient graph paradigms and edge‑pruning strategies such as RNG, MRNG, NSG, and Vamana.
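All of these families share the same core query routine, greedy best‑first descent on a proximity graph, which can be sketched as follows (a simplified single‑candidate version; real indexes keep a beam of candidates and, in HNSW, run this per layer):

```python
import numpy as np

def greedy_search(graph, data, q, entry):
    """Greedy descent on a proximity graph: repeatedly move to the
    neighbor closest to the query; stop at a local minimum."""
    cur = entry
    cur_d = np.linalg.norm(data[cur] - q)
    improved = True
    while improved:
        improved = False
        for nb in graph[cur]:
            d = np.linalg.norm(data[nb] - q)
            if d < cur_d:
                cur, cur_d, improved = nb, d, True
    return cur
```

The edge‑pruning strategies listed above (RNG, MRNG, NSG, Vamana) all aim to keep this walk short and monotone while bounding each node's degree.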

Query Optimization & Execution

Hybrid Query Execution

Techniques include brute‑force, pre‑filtering, single‑stage, and post‑filtering, each suited to different selectivity scenarios. Systems like KGraph and ACORN provide predicate‑aware structures to accelerate hybrid queries.
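The pre‑ versus post‑filtering trade‑off can be shown with two brute‑force sketches (hypothetical helpers; real systems replace the exhaustive ranking with an ANN index):

```python
import numpy as np

def post_filter_knn(q, data, labels, predicate, k):
    # Post-filtering: rank all vectors by distance first, then drop rows
    # failing the scalar predicate -- wasteful when selectivity is high.
    order = np.argsort(np.linalg.norm(data - q, axis=1))
    return [int(i) for i in order if predicate(labels[i])][:k]

def pre_filter_knn(q, data, labels, predicate, k):
    # Pre-filtering: evaluate the scalar predicate first, then rank only
    # the matching subset -- best when few rows pass the predicate.
    idx = [i for i in range(len(data)) if predicate(labels[i])]
    idx.sort(key=lambda i: float(np.linalg.norm(data[i] - q)))
    return idx[:k]
```

Both return the same answer; which is cheaper depends on predicate selectivity, which is exactly the decision a hybrid‑query optimizer must make.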

Hardware‑Related Optimizations

Distance calculations can be accelerated with GPUs or SIMD instructions. Storage optimizations include distributed sharding and disk‑based or distributed‑file‑system (DFS) solutions for massive datasets.

Distance Computation Optimizations

Methods such as distance decomposition, early termination, scalar quantization, product quantization, and distance comparison operator (DCO) techniques reduce computational overhead while preserving accuracy.
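Early termination, for instance, abandons a distance computation as soon as the partial sum can no longer beat the current best candidate (a minimal sketch; `pruned_sq_l2` is a hypothetical helper, and real implementations check the bound per SIMD block rather than per dimension):

```python
def pruned_sq_l2(x, y, best_so_far):
    """Squared L2 distance with early termination: stop accumulating as
    soon as the partial sum already exceeds the best distance found so
    far during the k-NN scan."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += (a - b) ** 2
        if acc > best_so_far:
            return None  # pruned: cannot beat the current best
    return acc
```

During a top‑k scan, most candidates are pruned after only a fraction of their dimensions have been visited, which is where the savings come from.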

Recruitment Note: Ant Group's Data Department is seeking experts in vector retrieval and databases; resumes can be sent to [email protected].

4. References

1. Pan, James Jie, Jianguo Wang, and Guoliang Li. "Vector Database Management Techniques and Systems." Companion of the 2024 International Conference on Management of Data. 2024.
2. Wang, Mengzhao, et al. "Starling: An I/O‑Efficient Disk‑Resident Graph Index Framework for High‑Dimensional Vector Similarity Search on Data Segment." Proceedings of the ACM on Management of Data 2.1 (2024): 1‑27.
3. Su, Yongye, et al. "Vexless: A Serverless Vector Data Management System Using Cloud Functions." Proceedings of the ACM on Management of Data 2.3 (2024): 1‑26.
4. Gao, Jianyang, and Cheng Long. "RaBitQ: Quantizing High‑Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search." Proceedings of the ACM on Management of Data 2.3 (2024): 1‑27.
5. Patel, Liana, et al. "ACORN: Performant and Predicate‑Agnostic Search Over Vector Embeddings and Structured Data." Proceedings of the ACM on Management of Data 2.3 (2024): 1‑27.