Pretrained Models for First-Stage Information Retrieval: A Comprehensive Review
This review by Dr. Fan Yixing surveys how pretrained language models have transformed first‑stage information retrieval, tracing the shift from traditional term‑based methods to neural sparse, dense, and hybrid approaches, and discussing key challenges such as hard‑negative mining, joint indexing‑and‑representation learning, and joint generative‑discriminative training.
Overview: In recent years, pretrained models have achieved great success in various NLP tasks and have also made significant progress in information retrieval (IR). This article, presented by Dr. Fan Yixing from the Chinese Academy of Sciences, focuses on the application of pretrained models in the first-stage (recall) of IR and provides a systematic review of recent research.
1. Development of Information Retrieval
Three perspectives: (1) relevance measurement between query and document, (2) efficiency of retrieving and ranking from large corpora, (3) system-level issues such as ambiguous intent, noisy input, heterogeneous document structures.
Evolution of relevance modeling: traditional models → Learning-to-Rank (LTR) → Neural IR (NeuIR). Traditional models (e.g., BM25, query-likelihood language models) rely on term overlap and term weighting. LTR introduces feature-based learning (e.g., RankNet, LambdaMART). NeuIR goes further by learning representations with neural networks, via representation-based, interaction-based, and hybrid architectures.
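As a concrete reference point for the traditional term-based models above, here is a minimal BM25 scorer over a toy corpus. The corpus, query, and pre-tokenized inputs are illustrative stand-ins; `k1` and `b` use their common defaults, and real engines precompute document frequencies in an inverted index rather than scanning the corpus per term.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # document frequency of t
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

# Toy pre-tokenized corpus and query (illustrative only)
corpus = [
    ["neural", "retrieval", "models"],
    ["bm25", "term", "weighting", "retrieval"],
    ["cooking", "pasta", "recipes"],
]
query = ["retrieval", "weighting"]
scores = [bm25_score(query, d, corpus) for d in corpus]
best = max(range(len(corpus)), key=lambda i: scores[i])
```

Note how the score is driven entirely by exact term overlap: the third document, sharing no terms with the query, scores zero, which is exactly the vocabulary-mismatch weakness discussed later.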
Retrieval frameworks have moved from single‑stage to multi‑stage pipelines. Early engines (e.g., Indri, Elasticsearch) scored with BM25 or language models; later stages incorporate LTR and neural components, typically as rerankers.
System architectures have shifted from symbolic (sparse inverted indexes) to vector‑based (dense embeddings) and increasingly combine both.
2. Pretrained Models in First‑Stage Retrieval
Term‑based models suffer from vocabulary mismatch and loss of semantic dependencies. Three main remedies are Sparse Retrieval, Dense Retrieval, and Hybrid Retrieval.
Sparse Retrieval: Preserves sparsity so representations remain compatible with inverted indexes. Examples: DeepCT and HDCT (re‑weight term importance with BERT), doc2query/docTTTTTquery (expand documents with generated queries), UED, and SparTerm/SPLADE (learn sparse lexical representations end‑to‑end).
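The core idea shared by these sparse methods can be sketched with a tiny impact-style inverted index: each posting stores a learned term weight, and scoring is just a weight sum over query terms. The hand-set weights below are stand-ins for what a contextual model such as DeepCT would predict; the document IDs and numbers are invented for illustration.

```python
from collections import defaultdict

# Per-document term weights, standing in for model-predicted term importance
# (a real system would produce these with a BERT-style re-weighting model
# and quantize them into integer impacts).
docs = {
    "d1": {"pretrained": 0.9, "retrieval": 0.7, "models": 0.4},
    "d2": {"retrieval": 0.8, "index": 0.6},
    "d3": {"cooking": 0.9, "recipes": 0.8},
}

# Build the inverted index: term -> list of (doc_id, weight) postings.
index = defaultdict(list)
for doc_id, term_weights in docs.items():
    for term, w in term_weights.items():
        index[term].append((doc_id, w))

def search(query_terms):
    """Score documents by summing stored term weights over query terms."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, w in index.get(t, []):
            scores[doc_id] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

results = search(["pretrained", "retrieval"])
```

Because retrieval still traverses posting lists, all the efficiency machinery built for BM25-style indexes carries over unchanged; only the stored weights are learned.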
Dense Retrieval: Maps queries and documents into a shared semantic space and retrieves with approximate nearest neighbor (ANN) search. Examples: DPR and RepBERT (single‑vector bi‑encoders), ColBERT (late interaction over token‑level vectors), TCT‑ColBERT (distills ColBERT into a single‑vector model), and ME‑BERT (multi‑vector document representations).
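A minimal sketch of the bi-encoder pattern follows. The toy encoder is a fixed-vocabulary bag-of-words with L2 normalization, standing in for a BERT-style encoder, and brute-force dot products stand in for an ANN index (e.g., FAISS); the vocabulary and documents are invented for illustration.

```python
import math

# Toy fixed vocabulary; a real system would use a learned subword tokenizer.
VOCAB = {"pretrained": 0, "models": 1, "retrieval": 2, "dense": 3,
         "passage": 4, "pasta": 5, "recipes": 6, "for": 7}

def encode(text):
    """Toy stand-in for a neural bi-encoder: bag-of-words, L2-normalized."""
    vec = [0.0] * len(VOCAB)
    for tok in text.split():
        if tok in VOCAB:
            vec[VOCAB[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

docs = ["pretrained models for retrieval", "dense passage retrieval", "pasta recipes"]
doc_vecs = [encode(d) for d in docs]   # offline: encode and index all documents

query_vec = encode("dense retrieval")  # online: encode the query once
# Brute-force search; production systems replace this with ANN search.
ranked = sorted(range(len(docs)), key=lambda i: -dot(query_vec, doc_vecs[i]))
```

The key property is that documents are encoded once offline, so online cost is one query encoding plus a nearest-neighbor lookup, which is what makes dense models viable for first-stage retrieval.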
Hybrid Retrieval: Combines sparse and dense signals (e.g., CLEAR, COIL).
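One simple fusion strategy is linear interpolation of normalized sparse and dense scores; note this is only a sketch of the general idea, as CLEAR and COIL use more integrated designs (residual training and contextualized exact-match, respectively). All scores and document IDs below are invented.

```python
def minmax(scores):
    """Min-max normalize a {doc_id: score} dict so lists are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid(sparse, dense, alpha=0.5):
    """Fuse sparse (e.g., BM25) and dense scores by linear interpolation;
    alpha is the (assumed, tunable) weight on the sparse signal."""
    s, d = minmax(sparse), minmax(dense)
    docs = set(s) | set(d)
    return {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
            for doc in docs}

sparse = {"d1": 12.0, "d2": 3.0, "d3": 0.5}   # e.g., raw BM25 scores
dense  = {"d1": 0.2,  "d2": 0.9, "d3": 0.4}   # e.g., cosine similarities
fused = hybrid(sparse, dense, alpha=0.5)
top = max(fused, key=fused.get)
```

The example shows why fusion helps: the top sparse document and the top dense document differ, and interpolation lets the semantic signal override a lexical near-miss (or vice versa, depending on `alpha`).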
3. Challenges in First‑Stage Retrieval
Hard-negative mining: methods such as ANCE (mines negatives with the retriever being trained), RocketQA (denoises hard negatives with a cross-encoder), and TAS‑Balanced (topic-aware sampling) improve the quality of training negatives.
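The common core of these methods can be sketched in a few lines: rank the corpus with the current retriever and keep the top-scoring documents that are not labeled positive as hard negatives. The scores and IDs below are invented; ANCE additionally refreshes the index asynchronously during training, and RocketQA would further filter these candidates with a cross-encoder to remove false negatives.

```python
def mine_hard_negatives(scores, positive_ids, k=2):
    """scores: {doc_id: relevance score from the current retriever}.
    Return the k highest-scoring documents not labeled positive;
    these 'near misses' are far more informative for training
    than randomly sampled negatives."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if d not in positive_ids][:k]

# Invented retriever scores for one query; d1 is the labeled positive.
scores = {"d1": 0.91, "d2": 0.88, "d3": 0.40, "d4": 0.87}
negatives = mine_hard_negatives(scores, positive_ids={"d1"}, k=2)
```

The caveat this sketch omits is exactly what RocketQA addresses: a high-scoring "negative" like d2 may actually be relevant but unlabeled, so training on it without denoising injects label noise.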
Joint learning of indexing and representation: recent work from JD makes product quantization differentiable, so the quantized index and the dense representations are optimized jointly rather than quantizing fixed embeddings after training.
Joint generation‑discrimination models: approaches that train both a discriminative relevance model and a generative query‑generation model (e.g., Mixed Attention).
Guest Introduction
Dr. Fan Yixing, associate researcher at the Institute of Computing Technology, Chinese Academy of Sciences, focuses on IR and NLP, has published over 30 papers in top conferences, and developed the MatchZoo toolkit.
Baidu Geek Talk