Engineering Practice of Online Vector Recall Service at iQIYI
iQIYI’s engineering team built an online vector-recall service on Milvus, wrapping it in a Dubbo/gRPC interface. The service holds 6 million 64-dimensional embeddings and serves roughly 3,000 QPS at a p99 latency of about 20 ms. It also integrates query-embedding generation, simplifying recommendation pipelines and demonstrating the performance and operational advantages of a platformized ANN-based recall layer.
With the rise of deep learning, embedding technology has rapidly advanced, making it feasible to generate recommendation lists directly from embeddings. Leveraging embedding similarity for the recall layer of recommendation systems has become increasingly popular.
The article introduces iQIYI's engineering practice of an online vector recall service, covering background, architecture, engine selection, implementation details, and performance considerations.
Background: Recommendation systems consist of modules such as the recommendation pool, user profiles, feature engineering, recall, ranking, and strategy. Recall, as the first stage, determines the quality of the candidate set and thus heavily influences overall recommendation performance.
Embedding vectors, enriched by deep learning, provide strong expressive power, low‑dimensional dense representations, and enable similarity calculations. YouTube’s use of embeddings for candidate recall is presented as an example.
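As a concrete illustration of similarity-based recall (a minimal sketch, not iQIYI’s production code), the candidate set can be produced by ranking item embeddings by cosine similarity to a query embedding:

```python
import numpy as np

def top_n_recall(query, item_embeddings, n=10):
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    items = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    scores = items @ q
    # Indices of the n most similar items, best first.
    return np.argsort(-scores)[:n].tolist()

# Toy example: 5 items in 4 dimensions; the query is a slightly
# perturbed copy of item 2, so item 2 should rank first.
rng = np.random.default_rng(0)
items = rng.normal(size=(5, 4))
query = items[2] + 0.01 * rng.normal(size=4)
print(top_n_recall(query, items, n=2))
```

In production the item matrix would be 6 million × 64 and the brute-force scan above is exactly what ANN indexes exist to avoid.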
Engine Selection: Approximate nearest neighbor (ANN) algorithms are the mainstream approach for large-scale, high-dimensional vector retrieval. Popular choices include Facebook’s C++ library Faiss, as well as the open-source services Milvus and Vearch. A comparison (Figure 5) shows that Milvus offers a complete framework for index building, data versioning, and querying, while Vearch provides a distributed search system with Elasticsearch-like RESTful APIs.
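When comparing ANN engines such as Faiss, Milvus, or Vearch, accuracy is usually reported as recall@k against exact brute-force results. A minimal sketch of that metric (hypothetical helper name, not part of any of these libraries):

```python
def recall_at_k(approx_ids, exact_ids, k):
    # Fraction of the exact top-k neighbors that the ANN engine also returned.
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# An engine that returns 2 of the true top-3 scores recall@3 = 2/3.
print(recall_at_k([1, 2, 5], [1, 2, 4], k=3))
```

Engines trade this recall against query latency and index-build cost, which is the axis along which the Figure 5 comparison is made.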
Considering iQIYI’s Java‑centric recommendation engine, milvus was selected as the underlying vector engine.
Service Architecture: The online vector recall service is built on top of Milvus, with a schema for embedding models and a self-service portal that lets algorithm engineers upload models. The service exposes a Dubbo-wrapped gRPC interface for easy integration with Java services, and supports multi-data-center deployment and health checks (Figures 6–8).
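The real interface is a Dubbo-wrapped gRPC service in Java; as a language-neutral sketch of the facade it presents (hypothetical names, with the index stubbed as in-memory matrices):

```python
import numpy as np

class VectorRecallService:
    def __init__(self):
        self._indexes = {}  # model name -> item embedding matrix

    def register_model(self, name, item_embeddings):
        # In production this corresponds to the self-service model upload.
        self._indexes[name] = np.asarray(item_embeddings, dtype=float)

    def health_check(self):
        # Multi-data-center deployments probe this before routing traffic.
        return len(self._indexes) > 0

    def search(self, name, query, n=10):
        items = self._indexes[name]
        scores = items @ np.asarray(query, dtype=float)
        return np.argsort(-scores)[:n].tolist()

svc = VectorRecallService()
svc.register_model("demo", [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(svc.health_check(), svc.search("demo", [1.0, 0.1], n=2))
```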
Performance: In a deployment of Milvus (2 CPU / 6 GB) and the Dubbo query service (4 CPU / 12 GB) with 6 million 64-dimensional vectors, the system achieves roughly 3,000 QPS with p99 latency around 20 ms. Issues such as high CPU usage during index building were mitigated by performing index updates on a separate service and keeping the two most recent data versions online (Figure 7).
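The version-keeping trick can be sketched as a double buffer: rebuilds happen off the query path, and a finished index is swapped in atomically while the previous version is retained (a minimal sketch, assuming index versions are opaque objects):

```python
import threading

class DoubleBufferedIndex:
    # Keeps two recent data versions online, as the article describes,
    # so queries never touch an index that is mid-rebuild.
    def __init__(self, initial):
        self._lock = threading.Lock()
        self._active = initial   # version currently served to queries
        self._previous = None    # last version, kept as fallback

    def query(self):
        with self._lock:
            return self._active

    def publish(self, new_version):
        # Atomically swap in a freshly built index; the old one stays
        # around until the next publish.
        with self._lock:
            self._previous, self._active = self._active, new_version

idx = DoubleBufferedIndex("v1")
idx.publish("v2")  # built on a separate service, then swapped in
print(idx.query())
```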
Online Inference Integration: To further streamline the workflow, query-embedding generation is encapsulated within the service. Initially, a mature ANN service (Milvus) was used, but for tighter integration in the style of YouTube’s DNN recall, the team wrapped an HNSW library provided by iQIYI’s deep-learning platform, achieving lower CPU load during indexing and better ANN performance (Figure 10).
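Each HNSW layer runs a greedy best-first search over a neighborhood graph; the sketch below shows that core single-layer step (toy adjacency dict, not the wrapped library’s API):

```python
import heapq
import numpy as np

def greedy_graph_search(graph, vectors, query, entry, ef=4):
    # Single-layer best-first search, the step HNSW repeats per layer.
    # graph: node id -> list of neighbor ids; ef: result-set width.
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap frontier
    best = [(-dist(entry), entry)]        # max-heap of current top-ef
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the frontier cannot improve the current top-ef.
        if d > -best[0][0] and len(best) >= ef:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-d, i) for d, i in best)  # (distance, id), nearest first

vectors = np.array([[0.0], [1.0], [2.0], [3.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
result = greedy_graph_search(graph, vectors, np.array([2.9]), entry=0, ef=2)
print([i for _, i in result])
```

The real library adds the multi-layer hierarchy and tuned graph construction on top of this step; the sketch only conveys why search cost grows logarithmically rather than linearly with the corpus.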
The final design unifies query embedding inference and top‑N recall, allowing recommendation engineers to focus solely on feature provision without handling embedding generation (Figures 11‑12).
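The unified call can be sketched end-to-end: the recommendation engineer supplies raw feature IDs, and the service both infers the query embedding (here a stand-in mean pooling, not the real inference model) and runs the top-N recall:

```python
import numpy as np

def recall_from_features(feature_ids, feature_table, item_embeddings, n=2):
    # Stage 1: infer the query embedding from raw features
    # (mean pooling stands in for the real model inference).
    query = np.mean([feature_table[f] for f in feature_ids], axis=0)
    # Stage 2: top-N recall by inner-product score.
    scores = item_embeddings @ query
    return np.argsort(-scores)[:n].tolist()

feature_table = {"watched_a": np.array([1.0, 0.0]),
                 "watched_b": np.array([0.0, 1.0])}
items = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
print(recall_from_features(["watched_a", "watched_b"], feature_table, items))
```

The caller never sees the query embedding, which is exactly the division of labor the article describes: features in, candidate list out.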
Conclusion and Outlook: The engineering practice demonstrates the benefits of abstracting and platformizing vector recall services: faster service creation, reduced duplication, improved performance, and easier model updates. Future work includes expanding the set of supported embedding-generation algorithms, optimizing response under high QPS, and handling real-time embedding ingestion.
iQIYI Technical Product Team