
Improving Document Search with Vector Search: From Elasticsearch Limitations to Milvus Integration

This article explains how traditional keyword search with Elasticsearch often yields inaccurate or incomplete results for document retrieval, introduces vectorization and semantic search using NLP embeddings, and demonstrates a practical workflow that combines these techniques with the Milvus vector database to achieve more accurate and efficient document search.

Rare Earth Juejin Tech Community

Mar 22, 2024


The author, a front‑end developer, notes the hype around AI and his recent need to build a document search feature, which sparked his interest in applying AI techniques to improve search quality.

Document search is typically built on keyword matching with Elasticsearch. Although Elasticsearch is widely used, the author identifies two main pain points: inaccurate results and high search latency.

Using a simple example of three Chinese sentences that all mean roughly "I am happy" ("我很开心", "我很快乐", "我很高兴"), the author shows how Elasticsearch's match query tokenizes the input and scores each document by how many tokens it shares with the query. The documents end up with very different _score values (e.g., 4.0 vs 1.21) even though their meanings are nearly identical.

When the query is shortened to the single word "开心" ("happy"), only one document is returned. The author calls this the "semantic ambiguity" problem: pure keyword search misses synonyms such as "高兴" and "快乐" because they share no tokens with the query.
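
The synonym problem can be reproduced with a toy keyword matcher. This is a minimal sketch, not Elasticsearch: it assumes one token per character (the source only says the input is tokenized) and scores by plain token overlap rather than Elasticsearch's real relevance scoring, so the numbers will not match the _score values quoted above. The function names are hypothetical.

```python
# Toy keyword matcher: NOT Elasticsearch, just token-overlap counting.
DOCS = ["我很开心", "我很快乐", "我很高兴"]

def tokenize(text):
    # Assumption: one token per character.
    return list(text)

def keyword_search(query, docs):
    """Return (doc, score) pairs for docs sharing at least one token with the query."""
    query_tokens = set(tokenize(query))
    hits = []
    for doc in docs:
        score = len(query_tokens & set(tokenize(doc)))
        if score > 0:
            hits.append((doc, score))
    # Highest overlap first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

full_query_hits = keyword_search("我很开心", DOCS)
short_query_hits = keyword_search("开心", DOCS)
print(full_query_hits)
print(short_query_hits)
```

With the full sentence as the query, all three documents match but the exact match scores highest; with "开心" alone, only the document containing those characters is returned, which mirrors the behavior described above.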

To address this, the author proposes text vectorization: converting words into dense vectors (e.g., mapping "开心", "高兴", "快乐" to the same vector A) so that semantically similar terms are close in vector space, enabling retrieval of all relevant documents regardless of the exact keyword used.

He briefly explains what vectors are, using 2‑D and 3‑D coordinate examples ([0,0], [1,1], [0,0,0]) and likening them to arrays, emphasizing that a vector simply represents a point in an n‑dimensional space.
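
As a small illustration of the "vector as array" idea, the coordinates mentioned above can be written as Python lists, and the distance between two points computed directly. The helper name `euclidean_distance` is an assumption for this sketch, not something from the original article.

```python
import math

# The example coordinates from the text, stored as plain lists.
origin_2d = [0, 0]
point_2d = [1, 1]
origin_3d = [0, 0, 0]

def euclidean_distance(a, b):
    """Straight-line distance between two points of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

distance_2d = euclidean_distance(origin_2d, point_2d)
print(len(origin_2d), len(origin_3d), distance_2d)
```

The length of the list is the number of dimensions, and "closeness" between two vectors is just a number computed from their coordinates.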

The vectorization process relies on an NLP embedding model that transforms natural‑language text into an n‑dimensional vector such as [0.xx, 1.xx, ..., 3.xx]. These embeddings capture semantic similarity, allowing "开心" and its synonyms to be retrieved together.
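
To make "semantic similarity" concrete, the sketch below compares hand-picked toy vectors with cosine similarity. Everything here is an assumption for illustration: the 3-dimensional values are invented rather than produced by a real embedding model, and the contrasting word "难过" ("sad") does not appear in the original article.

```python
import math

# Hypothetical embeddings, hand-picked for illustration only.
# A real embedding model would produce far more dimensions.
TOY_EMBEDDINGS = {
    "开心": [0.90, 0.80, 0.10],
    "高兴": [0.85, 0.82, 0.12],
    "快乐": [0.88, 0.79, 0.15],
    "难过": [-0.80, -0.70, 0.20],  # "sad", added as a contrast
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = TOY_EMBEDDINGS["开心"]
similarities = {
    word: cosine_similarity(query, vector)
    for word, vector in TOY_EMBEDDINGS.items()
}
for word, score in similarities.items():
    print(word, round(score, 3))
```

With these invented values the three synonyms score close to 1.0 against "开心" while the contrasting word scores below zero, which is the property a vector search relies on.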

For storing and searching these vectors, the author selects Milvus, a dedicated vector database that can index vectors and compute similarity distances between them, which sets it apart from traditional relational databases.
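
Conceptually, a vector database stores vectors alongside their payloads and returns the stored items nearest to a query vector. The sketch below is an in-memory stand-in that shows that idea with a brute-force scan. The class `TinyVectorIndex` and its methods are hypothetical and are not Milvus's API; a real deployment would use the Milvus client and its indexes instead of scanning every row.

```python
import math

class TinyVectorIndex:
    """In-memory stand-in for a vector database. Not Milvus's API."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.rows = []  # each row is (row_id, vector, payload)

    def insert(self, row_id, vector, payload):
        if len(vector) != self.dimension:
            raise ValueError("vector has the wrong dimension")
        self.rows.append((row_id, list(vector), payload))

    def search(self, query, limit=3):
        # Brute force: compute the L2 distance to every stored vector.
        scored = [
            (math.dist(query, vector), row_id, payload)
            for row_id, vector, payload in self.rows
        ]
        scored.sort(key=lambda item: item[0])  # smallest distance first
        return scored[:limit]

index = TinyVectorIndex(dimension=2)
index.insert(1, [0.0, 0.0], "doc at origin")
index.insert(2, [1.0, 1.0], "doc at (1, 1)")
index.insert(3, [5.0, 5.0], "far-away doc")

nearest = index.search([0.9, 1.1], limit=2)
for distance, row_id, payload in nearest:
    print(row_id, payload, round(distance, 3))
```

A full scan like this costs time proportional to the number of stored vectors, which is one reason a dedicated system with purpose-built indexes is attractive once the collection grows.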

The proposed vector search workflow includes: (1) preparing an NLP model for embedding, (2) processing document metadata (splitting, classification, cleaning), (3) vectorizing documents and inserting them into Milvus, (4) embedding the query and performing a vector similarity search in Milvus, and (5) post‑processing and ranking the results according to business needs.
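
The five steps can be sketched end to end as follows. This is a minimal illustration under loud assumptions: the `embed` function is a hypothetical lookup table standing in for a real NLP model, a plain Python list stands in for Milvus, the sample sentence "我很难过" ("I am sad") is added for contrast, and the post-processing rule (a minimum score) is just one example of a business rule.

```python
import math

# Step 1 (assumption): a stand-in "model". A real system would load an NLP
# embedding model here; this lookup table is purely hypothetical.
CONCEPT_AXES = {
    "开心": (1.0, 0.0), "高兴": (1.0, 0.0), "快乐": (1.0, 0.0),  # happy
    "难过": (0.0, 1.0),                                          # sad
}

def embed(text):
    vector = [0.0, 0.0]
    for word, axes in CONCEPT_AXES.items():
        if word in text:
            vector[0] += axes[0]
            vector[1] += axes[1]
    return vector

def preprocess(raw_document):
    # Step 2: split into sentences and drop empty fragments.
    parts = raw_document.replace("\n", "").split("。")
    return [part.strip() for part in parts if part.strip()]

def cosine(a, b):
    norm = math.hypot(*a) * math.hypot(*b)
    return 0.0 if norm == 0 else sum(x * y for x, y in zip(a, b)) / norm

def build_store(raw_documents):
    # Step 3: vectorize every chunk and store it (a list stands in for Milvus).
    store = []
    for raw in raw_documents:
        for chunk in preprocess(raw):
            store.append({"text": chunk, "vector": embed(chunk)})
    return store

def search(store, query, limit=5, min_score=0.5):
    # Step 4: embed the query and score every stored chunk.
    query_vector = embed(query)
    scored = [(cosine(query_vector, row["vector"]), row["text"]) for row in store]
    # Step 5: post-process; here, drop weak matches and rank by score.
    kept = [item for item in scored if item[0] >= min_score]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:limit]

store = build_store(["我很开心。我很快乐。", "我很高兴。\n我很难过。"])
results = search(store, "开心")
print(results)
```

Unlike the keyword approach, the query "开心" here retrieves all three "happy" sentences and leaves out the unrelated one, because matching happens on vectors rather than on shared characters.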

In conclusion, Elasticsearch remains powerful for keyword search, but it cannot handle semantically fuzzy queries where the user's wording differs from the document's. That gap makes vector search with AI embedding models and Milvus a compelling alternative for more accurate document retrieval, and the author encourages readers to explore these techniques further.
