
Improving Document Search with Vector Search: From Elasticsearch Limitations to Milvus Integration

This article explains how traditional keyword search with Elasticsearch often yields inaccurate or incomplete results for document retrieval, introduces vectorization and semantic search using NLP embeddings, and demonstrates a practical workflow that combines these techniques with the Milvus vector database to achieve more accurate and efficient document search.

Rare Earth Juejin Tech Community

The author, a front‑end developer, notes the hype around AI and his recent need to build a document search feature, which sparked his interest in applying AI techniques to improve search quality.

Document search is typically built on keyword matching with Elasticsearch. Although Elasticsearch is widely used, the author identifies two main pain points: inaccurate results and high search latency.

Using a simple example of three Chinese sentences ("我很开心", "我很快乐", "我很高兴", each meaning roughly "I am very happy"), the author shows how Elasticsearch's match query tokenizes the input and scores each document by its token matches. The resulting _score values differ widely (e.g., 4.0 vs 1.21) even though the sentences mean nearly the same thing.

When the query is shortened to the single word "开心" ("happy"), only one document is returned. This illustrates the "semantic ambiguity" problem: pure keyword search misses synonyms such as "高兴" and "快乐".
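
The keyword-matching behaviour described above can be imitated with a small toy scorer. This is only a sketch and not Elasticsearch's real scoring: it assumes each Chinese character is one token (roughly what happens without a Chinese word segmenter) and simply counts shared tokens, so the numbers will not match the _score values quoted above. The names `tokenize` and `keyword_search` are made up for illustration.

```python
# Toy keyword matcher. NOT Elasticsearch's real relevance scoring.
# Assumption: every character is treated as one token.

DOCS = ["我很开心", "我很快乐", "我很高兴"]  # each roughly "I am very happy"


def tokenize(text: str) -> set[str]:
    """Split text into a set of single-character tokens."""
    return set(text)


def keyword_search(query: str, docs: list[str]) -> list[tuple[str, int]]:
    """Return (doc, score) pairs, best first; score = number of shared tokens."""
    query_tokens = tokenize(query)
    scored = [(doc, len(query_tokens & tokenize(doc))) for doc in docs]
    # Documents that share no token with the query are not returned at all.
    hits = [(doc, score) for doc, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(keyword_search("我很开心", DOCS))  # all three match, with unequal scores
    print(keyword_search("开心", DOCS))      # only the exact-keyword document matches
```

With the full sentence as the query, all three documents are returned but ranked unequally; with the single word, the two synonym sentences share no token with the query and drop out entirely.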

To address this, the author proposes text vectorization: converting words into dense vectors so that semantically similar terms lie close together in vector space. In the author's illustration, "开心", "高兴", and "快乐" all map to the same vector A, so a query for any one of them retrieves all relevant documents regardless of the exact keyword used.

He briefly explains what vectors are, using 2‑D and 3‑D coordinate examples ([0,0], [1,1], [0,0,0]) and likening them to arrays, emphasizing that a vector simply represents a point in an n‑dimensional space.
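
To make the "vector as a point" idea concrete, the sketch below writes the coordinate examples as plain Python lists and measures how far apart two points are. The helper name `euclidean_distance` is an illustrative choice, and Euclidean distance is just one common way to compare points; the summary does not say which distance the author uses.

```python
import math

# The coordinate examples from the text, written as plain lists.
point_2d_a = [0, 0]
point_2d_b = [1, 1]
origin_3d = [0, 0, 0]


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points of the same dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.dist(a, b)


if __name__ == "__main__":
    # Distance between [0,0] and [1,1] is the square root of 2.
    print(euclidean_distance(point_2d_a, point_2d_b))
```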

The vectorization process relies on an NLP embedding model that transforms natural‑language text into an n‑dimensional vector such as [0.xx, 1.xx, ..., 3.xx]. These embeddings capture semantic similarity, allowing "开心" and its synonyms to be retrieved together.
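
The sketch below illustrates what "capturing semantic similarity" could look like numerically. The three-dimensional vectors are invented by hand for illustration and do not come from any real embedding model (real models produce hundreds of dimensions). Cosine similarity is used here as one common similarity measure, which is an assumption rather than something stated in the source.

```python
import math

# Hypothetical embeddings, made up for illustration only.
TOY_EMBEDDINGS = {
    "开心": [0.90, 0.10, 0.05],
    "高兴": [0.88, 0.12, 0.07],
    "快乐": [0.91, 0.08, 0.04],
    "下雨": [0.05, 0.10, 0.95],  # "raining", an unrelated word
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word: str, top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k other words ranked by similarity to `word`."""
    query = TOY_EMBEDDINGS[word]
    others = [
        (other, cosine_similarity(query, vector))
        for other, vector in TOY_EMBEDDINGS.items()
        if other != word
    ]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    print(most_similar("开心"))
```

With these made-up numbers, the two synonyms rank far above the unrelated word, which is the property a real embedding model is expected to provide.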

For storing and searching these vectors, the author selects Milvus, a dedicated vector database. Unlike a traditional relational database, it indexes vectors and computes similarity distances between them.
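
As a rough idea of what storing and searching vectors in Milvus can look like, here is a minimal sketch using the `MilvusClient` interface from the `pymilvus` package. It assumes `pymilvus` is installed with Milvus Lite support (a local file path as the connection target); the collection name, file name, and both helper functions are hypothetical, and the summary does not show the author's actual code.

```python
def to_rows(texts: list[str], vectors: list[list[float]]) -> list[dict]:
    """Pair each text with its vector as one row (a dict) ready for insertion."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    return [
        {"id": i, "vector": vector, "text": text}
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]


def index_and_search(rows, query_vector, db_path="milvus_demo.db",
                     collection="docs", limit=3):
    """Insert rows into a fresh collection and return (text, distance) hits."""
    # Imported here so to_rows() can be used without pymilvus installed.
    from pymilvus import MilvusClient

    client = MilvusClient(db_path)  # assumption: a local file selects Milvus Lite
    if client.has_collection(collection_name=collection):
        client.drop_collection(collection_name=collection)
    # Quick-setup collection with an "id" primary key and a "vector" field.
    client.create_collection(collection_name=collection,
                             dimension=len(query_vector))
    client.insert(collection_name=collection, data=rows)
    results = client.search(collection_name=collection, data=[query_vector],
                            limit=limit, output_fields=["text"])
    # One result list per query vector; only one query vector was sent.
    return [(hit["entity"]["text"], hit["distance"]) for hit in results[0]]
```

The vectors passed to `to_rows` would come from whatever embedding model is chosen; the dimension of the collection has to match the length of those vectors.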

The proposed vector search workflow includes: (1) preparing an NLP model for embedding, (2) processing document metadata (splitting, classification, cleaning), (3) vectorizing documents and inserting them into Milvus, (4) embedding the query and performing a vector similarity search in Milvus, and (5) post‑processing and ranking the results according to business needs.
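
The five steps can be sketched end to end in plain Python. Everything here is a stand-in chosen for illustration: `embed` is a hand-written lexicon lookup rather than a real NLP model, `InMemoryIndex` is a brute-force list rather than Milvus, and the `min_score` threshold is a hypothetical post-processing rule.

```python
import math
import re

# Step 1 stand-in: a hypothetical lexicon mapping words to a concept dimension.
CONCEPT_OF = {"开心": 0, "高兴": 0, "快乐": 0, "下雨": 1, "天气": 1}
DIMENSIONS = 2


def embed(text: str) -> list[float]:
    """Count lexicon words per concept dimension (toy replacement for a model)."""
    vector = [0.0] * DIMENSIONS
    for word, dim in CONCEPT_OF.items():
        vector[dim] += text.count(word)
    return vector


def preprocess(document: str) -> list[str]:
    """Step 2: split on sentence-ending punctuation and drop empty pieces."""
    pieces = re.split(r"[。！？.!?\n]+", document)
    return [piece.strip() for piece in pieces if piece.strip()]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class InMemoryIndex:
    """Brute-force stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []

    def insert(self, text: str, vector: list[float]) -> None:
        self.rows.append((text, vector))

    def search(self, query_vector: list[float], limit: int):
        scored = [(text, cosine(query_vector, vec)) for text, vec in self.rows]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


def search_documents(document: str, query: str, limit: int = 5,
                     min_score: float = 0.5) -> list[str]:
    index = InMemoryIndex()
    for chunk in preprocess(document):            # step 2: split and clean
        index.insert(chunk, embed(chunk))         # step 3: vectorize and insert
    hits = index.search(embed(query), limit)      # step 4: embed query and search
    return [text for text, score in hits if score >= min_score]  # step 5: filter


if __name__ == "__main__":
    print(search_documents("我很开心。我很快乐。我很高兴。今天下雨。", "开心"))
```

Unlike pure keyword matching, the query "开心" here retrieves all three "happy" sentences and leaves out the unrelated one, because the toy lexicon places the synonyms on the same dimension.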

In conclusion, while Elasticsearch remains powerful for keyword search, it cannot handle fuzzy semantic matching, which makes vector search with AI models and Milvus a compelling alternative for more accurate document retrieval. The author encourages further exploration of these techniques.

Tags: AI, Elasticsearch, Milvus, Vector Search, NLP, Semantic Search, document retrieval