
A Survey of Entity Linking: Definitions, Methods, and Applications

This article provides a comprehensive overview of entity linking, detailing its definition, the two-stage pipeline of entity recognition and disambiguation, common methodologies such as candidate generation and ranking, advanced approaches, challenges like unlinkable mentions, and various applications in knowledge graphs, text mining, and question answering.

DataFunTalk

1. Task Definition

First, we clarify what Entity Linking (EL) is and what it aims to do. EL links a textual mention to an entity in a knowledge graph: the mention is a span of text in the source document, and the entity is the target node in the graph. Strictly speaking, EL comprises two sub‑tasks, Entity Recognition (ER) and Entity Disambiguation (ED), although many recent works assume the mentions are already given and address only disambiguation.
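To make the setting concrete, here is a minimal sketch of the objects involved; the `Mention` and `Entity` classes and the example sentence are hypothetical illustrations, not an API from any EL system:

```python
from dataclasses import dataclass

# Hypothetical data structures for the EL setting:
# a mention is a span in the source text; an entity is a knowledge-graph node.
@dataclass
class Mention:
    text: str   # surface form, e.g. "Jordan"
    start: int  # character offset in the source document
    end: int

@dataclass
class Entity:
    kb_id: str  # knowledge-graph identifier, e.g. a Wikipedia title
    name: str

# Entity linking maps a recognized mention to one of several candidate
# entities (or to NIL if no entity fits).
doc = "Jordan published a new paper on machine learning."
mention = Mention(text="Jordan", start=0, end=6)
candidates = [Entity("Michael_I._Jordan", "Michael I. Jordan"),
              Entity("Michael_Jordan", "Michael Jordan (basketball player)")]
```

The same surface form "Jordan" maps to different entities depending on context, which is exactly the ambiguity ED must resolve.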

2. General Methodology

Since 2015, neural‑based EL methods have proliferated. Early approaches treated EL as a massive multi‑class classification problem, which does not scale to knowledge graphs with millions or billions of entities. The field shifted to a candidate generation + ranking pipeline, which is now the standard architecture (described below).

The pipeline consists of two basic steps: entity recognition followed by entity disambiguation.

1. Entity Recognition Model: identifies mentions such as “Scott Young”.

2. Entity Disambiguation Model: links each mention to a knowledge‑graph entity. This stage is further divided into Candidate Generation and Entity Ranking. Candidate Generation produces a shortlist of possible entities for each mention; Entity Ranking scores and orders these candidates, similar to recall, coarse‑ranking, and fine‑ranking in recommender systems.
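The two‑stage pipeline above can be sketched end to end. All function names, the toy capitalization heuristic for recognition, and the alias/prior tables are hypothetical stand‑ins for the learned models the pipeline actually uses:

```python
import re

def recognize(text):
    """Entity recognition: return mention spans.

    Toy heuristic: treat capitalized multiword runs as mentions; real
    systems use a trained sequence-labeling model."""
    return [m.group(0) for m in
            re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", text)]

def generate_candidates(mention, alias_table):
    """Candidate generation: look up the mention's surface form."""
    return alias_table.get(mention, [])

def rank(mention, candidates, prior):
    """Entity ranking: here, pick the candidate with the highest prior p(e|m)."""
    if not candidates:
        return None  # NIL: no linkable entity
    return max(candidates, key=lambda e: prior.get((mention, e), 0.0))

def link(text, alias_table, prior):
    """Full pipeline: recognition, then disambiguation per mention."""
    return {m: rank(m, generate_candidates(m, alias_table), prior)
            for m in recognize(text)}

alias_table = {"Scott Young": ["Scott_Young_(writer)",
                               "Scott_Young_(footballer)"]}
prior = {("Scott Young", "Scott_Young_(writer)"): 0.7,
         ("Scott Young", "Scott_Young_(footballer)"): 0.3}
result = link("Scott Young wrote a new book.", alias_table, prior)
```

In a real system, `rank` would combine the prior with contextual features rather than use the prior alone.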

Key point: Not every mention can be linked; some mentions are “unlinkable”, leading to the NIL prediction problem.

3. Candidate Generation

EL resembles Word Sense Disambiguation (WSD) but lacks a fixed sense inventory, making candidate generation crucial. Three main strategies are used:

1) Surface‑form matching (hard matching) using edit distance, BM25, n‑grams, etc., which can handle abbreviations like “BAT” or “TMD”.

2) Name dictionaries (alias tables) built from Wikipedia redirects, manually curated synonyms, or other resources; a strong dictionary yields high recall.

3) Prior probability estimation p(e|m) computed from Wikipedia hyperlink statistics or resources like CrossWikis.
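Strategies 2 and 3 can be combined in one pass over anchor‑text statistics. The hyperlink pairs below are made up for illustration; in practice they would be mined from Wikipedia anchors or a resource like CrossWikis:

```python
from collections import Counter, defaultdict

# Hypothetical (mention surface form, linked entity) pairs, as mined
# from hyperlink anchor text.
hyperlinks = [
    ("Jordan", "Michael_Jordan"), ("Jordan", "Michael_Jordan"),
    ("Jordan", "Jordan_(country)"),
    ("MJ", "Michael_Jordan"),
]

# Alias table: surface form -> counts of entities it has linked to.
alias_table = defaultdict(Counter)
for mention, entity in hyperlinks:
    alias_table[mention][entity] += 1

def prior(entity, mention):
    """p(e|m): fraction of times this surface form linked to this entity."""
    counts = alias_table[mention]
    total = sum(counts.values())
    return counts[entity] / total if total else 0.0

def candidates(mention, top_k=10):
    """Candidate shortlist, ordered by prior probability."""
    return [e for e, _ in alias_table[mention].most_common(top_k)]
```

Here "Jordan" yields p(Michael_Jordan | Jordan) = 2/3, so the basketball player tops the shortlist before any context is considered.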

4. Entity Ranking

After candidates are generated, a ranking model selects the most appropriate entity for each mention. Features include mention representation (contextual embeddings from LSTM, self‑attention, Transformers, BERT, etc.) and entity representation (word2vec, graph embeddings such as DeepWalk or TransE, BERT‑based encodings). The final score may combine similarity measures (dot product, cosine) with additional signals like graph features, distance metrics, and link count.
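A minimal version of this scoring step is cosine similarity between a mention embedding and candidate entity embeddings. The toy 3‑dimensional vectors below stand in for real contextual (e.g. BERT) and graph embeddings:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(mention_vec, entity_vecs):
    """Return candidate ids sorted by similarity (highest first), plus scores."""
    scores = {eid: cosine(mention_vec, v) for eid, v in entity_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical embeddings: the mention vector encodes a sports context.
mention_vec = np.array([0.9, 0.1, 0.2])
entity_vecs = {
    "Michael_Jordan":   np.array([1.0, 0.0, 0.1]),
    "Jordan_(country)": np.array([0.0, 1.0, 0.5]),
}
order, scores = rank_candidates(mention_vec, entity_vecs)
```

Production rankers typically feed such similarities into a model together with the prior, graph features, and other signals rather than using raw cosine alone.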

5. Handling Unlinkable Mentions

Several strategies address mentions that cannot be linked:

1) Thresholding: set a confidence threshold below which a mention is labeled NIL.

2) Introducing a NIL entity into the ranking pool.

3) Training a binary classifier to predict linkability after ranking.
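Strategy 1, thresholding, is the simplest of the three. A sketch with hypothetical scores and an arbitrary cutoff of 0.5:

```python
NIL = "NIL"  # sentinel label for unlinkable mentions

def link_with_threshold(scored_candidates, threshold=0.5):
    """Pick the top-scoring entity, or NIL if confidence is too low.

    scored_candidates: list of (entity_id, score) pairs from the ranker."""
    if not scored_candidates:
        return NIL
    best_entity, best_score = max(scored_candidates, key=lambda p: p[1])
    return best_entity if best_score >= threshold else NIL

confident = link_with_threshold([("Michael_Jordan", 0.82),
                                 ("Jordan_(country)", 0.11)])
uncertain = link_with_threshold([("Michael_Jordan", 0.31)])
```

The threshold trades precision against recall and is usually tuned on held‑out data.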

6. Advanced Methodologies

Beyond the basic pipeline, researchers explore:

1) Joint learning models that combine recognition and disambiguation to avoid error propagation.

2) Global context modeling, where decisions for one mention influence others, enforcing coherence across the document.

3) Domain‑specific adaptations, including semi‑supervised or zero‑shot techniques for low‑resource domains.

4) Cross‑lingual EL, leveraging high‑resource language data to improve performance on low‑resource languages, often with zero‑shot methods.
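Point 2, global context modeling, can be illustrated with a brute‑force sketch: score every joint assignment by its local scores plus pairwise coherence between chosen entities. The function, scores, and coherence table are all hypothetical; real systems approximate this search, since it is exponential in the number of mentions:

```python
from itertools import product

def best_global_assignment(local, coherence):
    """Exhaustively pick the jointly best entity assignment.

    local:     {mention: {entity: local score}}
    coherence: {(entity_a, entity_b): pairwise relatedness score}"""
    mentions = list(local)
    best, best_score = None, float("-inf")
    for combo in product(*(local[m].items() for m in mentions)):
        entities = [e for e, _ in combo]
        score = sum(s for _, s in combo)  # local evidence
        score += sum(coherence.get((a, b), 0.0)  # document-level coherence
                     for i, a in enumerate(entities)
                     for b in entities[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(mentions, entities)), score
    return best

# Locally, "Jordan" slightly prefers the country; coherence with
# "Chicago_Bulls" flips the decision to the basketball player.
local = {
    "Jordan": {"Michael_Jordan": 0.5, "Jordan_(country)": 0.6},
    "Bulls":  {"Chicago_Bulls": 0.7},
}
coherence = {("Michael_Jordan", "Chicago_Bulls"): 0.5}
assignment = best_global_assignment(local, coherence)
```

This is exactly the behavior global models aim for: one mention's decision supplies evidence for another's.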

7. Applications and Outlook

EL is useful in several downstream tasks:

• Knowledge‑graph population – enriching KG nodes with textual mentions.

• Text mining – grounding ambiguous terms to canonical concepts, especially in biomedical texts.

• Information retrieval – augmenting queries with entity semantics.

• Question answering – linking user questions to KG entities to retrieve precise answers.

• Representation learning – integrating entity knowledge into pre‑trained language models (e.g., ERNIE).

8. Recent Conference Papers

The author compiled a list of notable EL papers from top conferences over the past five years; the list is available via the provided link.

For further discussion, the author invites readers to join the “AI Natural Language Processing and Knowledge Graph” WeChat group.

References

Neural Entity Linking: A Survey of Models Based on Deep Learning

Joint Learning of Named Entity Recognition and Entity Linking

Investigating Entity Knowledge in BERT with Simple Neural End-to-End Entity Linking

Autoregressive Entity Retrieval

Zero-Shot Entity Linking by Reading Entity Descriptions

Scalable Zero-Shot Entity Linking with Dense Entity Retrieval

Overview of TAC-KBP2015 Tri-lingual Entity Discovery and Linking

Neural Cross-Lingual Entity Linking

ERNIE: Enhanced Representation through Knowledge Integration

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
