
Vector Retrieval for Community Forum Search Using Milvus at Dingxiangyuan

This article describes how Dingxiangyuan's algorithm team adopted Milvus for distributed vector indexing to improve semantic search in their community forum, detailing the background, retrieval workflow, various embedding models—including Bi‑Encoder, Spherical Embedding, and Knowledge Embedding—and summarizing the benefits and future applications.

DataFunTalk

Background

With the development of models such as BERT and GNN, the semantic extraction capability of DNN models has improved, raising expectations for text semantic vectors to play a larger role in recall. Around early 2019, we tried Faiss‑based vector recall in recommendation, but its lack of distributed solutions and data persistence made it unsuitable for large‑scale scenarios. Milvus addressed these gaps with distributed architecture, persistent storage, rich SDKs, and an active community, leading us to adopt Milvus as the vector index component for the community forum search in 2020.

Recall Process

Typical search recall relies on components like Solr/Elasticsearch using probabilistic models such as BM25. In Dingxiangyuan's forum, queries are often vague and cover professional knowledge, exam/job seeking, and news topics, making BM25 insufficient for handling fuzzy and complex semantics. Therefore, we explored various text vectorization models to improve keyword and semantic recall.
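In contrast to BM25's term matching, semantic recall scores documents by similarity between dense vectors. As an illustrative sketch (toy vectors and hypothetical names, not our production pipeline), cosine-similarity recall over pre-computed document vectors looks like:

```python
import numpy as np

def cosine_recall(query_vec, doc_vecs, top_k=2):
    """Return indices of the top_k documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    return np.argsort(-scores)[:top_k]   # best-scoring documents first

# Toy example: 3 document vectors and one query vector.
docs = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query = np.array([0.9, 0.1])
print(cosine_recall(query, docs))  # prints [0 1]
```

In production, a vector index such as Milvus replaces the brute-force scan above with approximate nearest-neighbor search over millions of document vectors.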

Vector Models

Bi‑Encoder[1]

BERT is typically used as a Cross-Encoder, which scores the query and document jointly; recasting it as a Bi-Encoder costs some accuracy, but vector recall requires Bi-Encoder-type models, which pre-encode candidate documents offline and compute only the query vector online.

The choice of loss function and negative-sampling strategy is crucial; we use a triplet loss and divide negative samples into easy, middle, and hard tiers mixed in a 2:2:1 ratio.
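The objective above can be sketched as follows (numpy, with an illustrative margin value; implementing the 2:2:1 mix as sampling weights over the three pools is one plausible reading of the ratio, and the names are hypothetical):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss: the positive should sit closer to the
    anchor than the negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def sample_negative(easy, middle, hard, rng):
    """Draw one negative, weighting the easy/middle/hard pools 2:2:1."""
    pools = [easy, middle, hard]
    probs = np.array([2, 2, 1]) / 5.0
    pool = pools[rng.choice(3, p=probs)]
    return pool[rng.integers(len(pool))]

anchor, positive = np.array([0.0, 0.0]), np.array([0.1, 0.0])
print(triplet_loss(anchor, positive, np.array([1.0, 0.0])))  # easy negative: zero loss
print(triplet_loss(anchor, positive, np.array([0.2, 0.0])))  # hard negative: positive loss
```

Hard negatives dominate the gradient signal because easy negatives quickly satisfy the margin and contribute zero loss, which is why the mix matters.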

Spherical Embedding[2]

Forum topics are highly imbalanced: professional keyword content appears far less often than general topical content, which makes it hard for BERT alone to capture these keywords. We therefore first considered word2vec-style models; Spherical Text Embedding learns document and word vectors jointly and performs well at capturing both keywords and documents.

Traditional word2vec focuses on the relationship between a center word and its context, ignoring the whole document. The authors propose generating word vectors based on the document vector: first generate a center word vector tgt using the document vector as the mean, then generate surrounding word vectors src with tgt as the mean.

The three vectors doc, tgt, and src each follow a von Mises-Fisher (vMF) distribution, and training maximizes the joint probability conditioned on doc.

Training follows a max‑margin loss similar to word2vec, maximizing the distance between positive and negative samples.
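On the unit sphere, cosine similarity plays the role of the vMF score, so the max-margin objective can be sketched as (illustrative margin value, hypothetical names):

```python
import numpy as np

def unit(v):
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

def spherical_margin_loss(src, tgt_pos, tgt_neg, margin=0.25):
    """Hinge loss pushing cos(src, positive target) above
    cos(src, negative target) by at least `margin`."""
    s_pos = unit(src) @ unit(tgt_pos)
    s_neg = unit(src) @ unit(tgt_neg)
    return max(0.0, margin - (s_pos - s_neg))
```

Because all vectors are constrained to the sphere, optimization in the original paper uses Riemannian gradient steps rather than plain SGD; the loss itself, however, is the familiar max-margin form above.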

Knowledge Embedding[3]

A more direct way to vectorize keywords leverages existing graph structure with models such as TransE or node2vec; we take ProNE as an example. ProNE has two stages. The first performs sparse matrix factorization on the adjacency matrix of the undirected graph, with a loss that maximizes the co-occurrence frequency of node r and neighbor c while minimizing the negative-sampling probability.

Minimizing this loss yields a distance formula between node r and neighbor c, from which a distance matrix is constructed.

Applying truncated SVD (tSVD) to the distance matrix and taking the top‑d singular values yields node embeddings; a randomized tSVD accelerates this step. The second part adopts a GCN‑style propagation strategy to capture local smoothing and global clustering information, using Cheeger constants to relate subgraph cohesion to eigenvalues.
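The factorization step can be sketched with a small numpy randomized tSVD (toy matrix and hypothetical names; ProNE factorizes a sparse, log-scaled proximity matrix at far larger scale):

```python
import numpy as np

def randomized_tsvd(M, d, n_oversample=10, seed=0):
    """Randomized truncated SVD: approximate the top-d factors of M."""
    rng = np.random.default_rng(seed)
    # A random projection sketches the column space of M.
    Y = M @ rng.standard_normal((M.shape[1], d + n_oversample))
    Q, _ = np.linalg.qr(Y)
    # An exact SVD of the much smaller projected matrix.
    U_small, s, Vt = np.linalg.svd(Q.T @ M, full_matrices=False)
    U = Q @ U_small
    return U[:, :d], s[:d], Vt[:d]

# Toy symmetric "distance" matrix over 4 nodes.
M = np.array([[0., 2., 1., 0.],
              [2., 0., 2., 1.],
              [1., 2., 0., 2.],
              [0., 1., 2., 0.]])
U, s, Vt = randomized_tsvd(M, d=2)
emb = U * np.sqrt(s)  # one d-dimensional embedding per node
```

The random projection reduces the expensive SVD to a matrix only slightly wider than d columns, which is what makes this stage fast enough for large graphs.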

The Cheeger constant bounds the eigenvalues: smaller constants correspond to smaller eigenvalues and tighter clustering, while larger constants indicate smoother subgraphs. A function g(λ) controls the eigenvalue range, and a GCN-like approximation using Chebyshev polynomials computes a modified Laplacian L̃. Finally, an additional SVD ensures orthogonality of the embeddings.
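A minimal dense sketch of this second stage (direct eigendecomposition on a toy graph; ProNE avoids that cost with the Chebyshev-polynomial approximation, and the band-pass filter parameters here are illustrative):

```python
import numpy as np

def spectral_propagate(A, emb, mu=0.2, theta=1.0):
    """Propagate embeddings through a spectrally modulated Laplacian."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    lam, V = np.linalg.eigh(L)
    # Band-pass filter g(lambda) emphasizes a chosen eigenvalue range.
    g = np.exp(-0.5 * ((lam - mu) ** 2 - 1.0) * theta)
    L_mod = V @ np.diag(g) @ V.T                       # modified Laplacian
    smoothed = (np.eye(len(A)) - L_mod) @ emb
    # A final SVD re-orthogonalizes the propagated embeddings.
    U, _, _ = np.linalg.svd(smoothed, full_matrices=False)
    return U

# Toy path graph over 4 nodes with stand-in first-stage embeddings.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
emb = np.arange(8.0).reshape(4, 2)
out = spectral_propagate(A, emb)
```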

Conclusion

This article outlines Dingxiangyuan's exploration of vector-based recall in search, progressively handling complex semantic expressions with various embedding models. Leveraging Milvus provides strong performance and mature deployment solutions, allowing us to focus on model tuning. We also apply Milvus in other scenarios, such as converting long texts to binary vectors for Hamming-distance queries. Milvus's comprehensive SDK enables rapid deployment across teams. Vector recall is becoming a trend in NLP and recommendation, and Milvus's reliability and technical support significantly improve development and deployment efficiency. We continue to introduce vector representation models in more business lines.
