Pretrained Models for First-Stage Information Retrieval: A Comprehensive Review
This review by Dr. Fan Yixing surveys how pretrained language models have transformed first‑stage information retrieval, tracing the shift from traditional term‑based methods to neural sparse, dense, and hybrid approaches, and discussing key challenges such as hard‑negative mining, joint indexing‑and‑representation learning, and joint generative‑discriminative training.
Overview: In recent years, pretrained models have achieved great success in various NLP tasks and have also made significant progress in information retrieval (IR). This article, presented by Dr. Fan Yixing from the Chinese Academy of Sciences, focuses on the application of pretrained models in the first-stage (recall) of IR and provides a systematic review of recent research.
1. Development of Information Retrieval
Three perspectives: (1) relevance measurement between query and document, (2) efficiency of retrieving and ranking from large corpora, (3) system-level issues such as ambiguous intent, noisy input, heterogeneous document structures.
Evolution of relevance modeling: traditional models → Learning-to-Rank (LTR) → Neural IR (NeuIR). Traditional models (e.g., BM25, query-likelihood language models) rely on term overlap and term weighting. LTR introduces feature-based learning (e.g., RankNet, LambdaMART). NeuIR goes further by learning representations with neural networks, via representation-based, interaction-based, and hybrid architectures.
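As a concrete reference point for the traditional term-based models above, here is a minimal BM25 scorer over a toy corpus. The corpus, query, and pre-tokenized inputs are illustrative stand-ins; `k1` and `b` use their common defaults, and real engines precompute document frequencies in an inverted index rather than scanning the corpus per term.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # document frequency of t
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

# Toy pre-tokenized corpus and query (illustrative only)
corpus = [
    ["neural", "retrieval", "models"],
    ["bm25", "term", "weighting", "retrieval"],
    ["cooking", "pasta", "recipes"],
]
query = ["retrieval", "weighting"]
scores = [bm25_score(query, d, corpus) for d in corpus]
best = max(range(len(corpus)), key=lambda i: scores[i])
```

Note how the score is driven entirely by exact term overlap: the third document, sharing no terms with the query, scores zero, which is exactly the vocabulary-mismatch weakness discussed later.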
Retrieval frameworks have moved from single‑stage to multi‑stage pipelines. Early engines (e.g., Indri, Elasticsearch) scored with BM25 or language models; later stages incorporate LTR and neural components, typically as rerankers.
System architectures have shifted from symbolic (sparse inverted indexes) to vector‑based (dense embeddings) and increasingly combine both.
2. Pretrained Models in First‑Stage Retrieval
Term‑based models suffer from vocabulary mismatch and loss of semantic dependencies. Three main remedies are Sparse Retrieval, Dense Retrieval, and Hybrid Retrieval.
Sparse Retrieval: Preserves sparsity so representations remain compatible with inverted indexes. Examples: DeepCT and HDCT (re‑weight term importance with BERT), doc2query/docTTTTTquery (expand documents with generated queries), UED, and SparTerm/SPLADE (learn sparse lexical representations end‑to‑end).
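The core idea shared by these sparse methods can be sketched with a tiny impact-style inverted index: each posting stores a learned term weight, and scoring is just a weight sum over query terms. The hand-set weights below are stand-ins for what a contextual model such as DeepCT would predict; the document IDs and numbers are invented for illustration.

```python
from collections import defaultdict

# Per-document term weights, standing in for model-predicted term importance
# (a real system would produce these with a BERT-style re-weighting model
# and quantize them into integer impacts).
docs = {
    "d1": {"pretrained": 0.9, "retrieval": 0.7, "models": 0.4},
    "d2": {"retrieval": 0.8, "index": 0.6},
    "d3": {"cooking": 0.9, "recipes": 0.8},
}

# Build the inverted index: term -> list of (doc_id, weight) postings.
index = defaultdict(list)
for doc_id, term_weights in docs.items():
    for term, w in term_weights.items():
        index[term].append((doc_id, w))

def search(query_terms):
    """Score documents by summing stored term weights over query terms."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, w in index.get(t, []):
            scores[doc_id] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

results = search(["pretrained", "retrieval"])
```

Because retrieval still traverses posting lists, all the efficiency machinery built for BM25-style indexes carries over unchanged; only the stored weights are learned.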
Dense Retrieval: Maps queries and documents into a shared semantic space and retrieves with approximate nearest neighbor (ANN) search. Examples: DPR and RepBERT (single‑vector bi‑encoders), ColBERT (late interaction over token‑level vectors), TCT‑ColBERT (distills ColBERT into a single‑vector model), and ME‑BERT (multi‑vector document representations).
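A minimal sketch of the bi-encoder pattern follows. The toy encoder is a fixed-vocabulary bag-of-words with L2 normalization, standing in for a BERT-style encoder, and brute-force dot products stand in for an ANN index (e.g., FAISS); the vocabulary and documents are invented for illustration.

```python
import math

# Toy fixed vocabulary; a real system would use a learned subword tokenizer.
VOCAB = {"pretrained": 0, "models": 1, "retrieval": 2, "dense": 3,
         "passage": 4, "pasta": 5, "recipes": 6, "for": 7}

def encode(text):
    """Toy stand-in for a neural bi-encoder: bag-of-words, L2-normalized."""
    vec = [0.0] * len(VOCAB)
    for tok in text.split():
        if tok in VOCAB:
            vec[VOCAB[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

docs = ["pretrained models for retrieval", "dense passage retrieval", "pasta recipes"]
doc_vecs = [encode(d) for d in docs]   # offline: encode and index all documents

query_vec = encode("dense retrieval")  # online: encode the query once
# Brute-force search; production systems replace this with ANN search.
ranked = sorted(range(len(docs)), key=lambda i: -dot(query_vec, doc_vecs[i]))
```

The key property is that documents are encoded once offline, so online cost is one query encoding plus a nearest-neighbor lookup, which is what makes dense models viable for first-stage retrieval.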
Hybrid Retrieval: Combines sparse and dense signals (e.g., CLEAR, COIL).
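One simple fusion strategy is linear interpolation of normalized sparse and dense scores; note this is only a sketch of the general idea, as CLEAR and COIL use more integrated designs (residual training and contextualized exact-match, respectively). All scores and document IDs below are invented.

```python
def minmax(scores):
    """Min-max normalize a {doc_id: score} dict so lists are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid(sparse, dense, alpha=0.5):
    """Fuse sparse (e.g., BM25) and dense scores by linear interpolation;
    alpha is the (assumed, tunable) weight on the sparse signal."""
    s, d = minmax(sparse), minmax(dense)
    docs = set(s) | set(d)
    return {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
            for doc in docs}

sparse = {"d1": 12.0, "d2": 3.0, "d3": 0.5}   # e.g., raw BM25 scores
dense  = {"d1": 0.2,  "d2": 0.9, "d3": 0.4}   # e.g., cosine similarities
fused = hybrid(sparse, dense, alpha=0.5)
top = max(fused, key=fused.get)
```

The example shows why fusion helps: the top sparse document and the top dense document differ, and interpolation lets the semantic signal override a lexical near-miss (or vice versa, depending on `alpha`).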
3. Challenges in First‑Stage Retrieval
Hard-negative mining: methods such as ANCE (mines negatives with the retriever being trained), RocketQA (denoises hard negatives with a cross-encoder), and TAS‑Balanced (topic-aware sampling) improve the quality of training negatives.
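The common core of these methods can be sketched in a few lines: rank the corpus with the current retriever and keep the top-scoring documents that are not labeled positive as hard negatives. The scores and IDs below are invented; ANCE additionally refreshes the index asynchronously during training, and RocketQA would further filter these candidates with a cross-encoder to remove false negatives.

```python
def mine_hard_negatives(scores, positive_ids, k=2):
    """scores: {doc_id: relevance score from the current retriever}.
    Return the k highest-scoring documents not labeled positive;
    these 'near misses' are far more informative for training
    than randomly sampled negatives."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if d not in positive_ids][:k]

# Invented retriever scores for one query; d1 is the labeled positive.
scores = {"d1": 0.91, "d2": 0.88, "d3": 0.40, "d4": 0.87}
negatives = mine_hard_negatives(scores, positive_ids={"d1"}, k=2)
```

The caveat this sketch omits is exactly what RocketQA addresses: a high-scoring "negative" like d2 may actually be relevant but unlabeled, so training on it without denoising injects label noise.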
Joint learning of indexing and representation: recent work from JD makes product quantization differentiable, so the quantized index and the dense representations are optimized jointly rather than quantizing fixed embeddings after training.
Joint generation‑discrimination models: approaches that train both a discriminative relevance model and a generative query‑generation model (e.g., Mixed Attention).
Guest Introduction
Dr. Fan Yixing, associate researcher at the Institute of Computing Technology, Chinese Academy of Sciences, focuses on IR and NLP, has published over 30 papers in top conferences, and developed the MatchZoo toolkit.
Baidu Geek Talk