
Chinese Short‑Text Entity Linking: Model Design, Multitask Learning, and Experimental Results on the Qianyan Dataset

This article presents a comprehensive approach to Chinese short‑text entity linking. It describes the Qianyan dataset, pipeline and end‑to‑end task formulations, sample construction, a multitask model that jointly performs entity ranking and NIL classification, and optimization techniques including confidence learning and adversarial training, together with detailed experimental analysis showing state‑of‑the‑art performance.

DataFunSummit

The article introduces entity linking for Chinese short texts, where mentions in queries, posts, or titles are linked to entities in a knowledge base, with two task designs: a pipeline (NER, candidate selection, disambiguation) and an end‑to‑end model.
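The pipeline formulation can be sketched as three composed stages. The implementations below are toy stand‑ins for illustration only: a dictionary match in place of an NER model, a lookup table in place of a knowledge‑base candidate generator, and a character‑overlap scorer in place of the disambiguation ranker; all names are hypothetical, not the authors' code.

```python
def recognize_mentions(query: str, lexicon: list) -> list:
    """Stage 1 (NER stand-in): dictionary match against a mention lexicon."""
    return [m for m in lexicon if m in query]

def generate_candidates(mention: str, kb: dict) -> list:
    """Stage 2: look up candidate entities for the mention in the KB."""
    return kb.get(mention, [])

def disambiguate(query: str, candidates: list) -> str:
    """Stage 3 (ranker stand-in): pick the candidate with the most
    character overlap with the query; return NIL if no candidates."""
    if not candidates:
        return "NIL"
    return max(candidates, key=lambda e: len(set(e) & set(query)))

def link(query: str, lexicon: list, kb: dict) -> dict:
    """Run the full pipeline: mention -> candidates -> disambiguated entity."""
    return {m: disambiguate(query, generate_candidates(m, kb))
            for m in recognize_mentions(query, lexicon)}
```

An end‑to‑end model instead learns these stages jointly, avoiding error propagation between them.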

The Qianyan dataset, released by Baidu PaddlePaddle, contains 70,000 training examples with an average query length of 22 characters and 260,000 mentions, with a high proportion of NIL entities. It poses challenges such as short context, many candidates per mention, and abundant NIL cases.

To address these challenges, the authors construct three types of samples: (1) query samples that mark the mention position with special delimiter tokens, (2) entity description samples that concatenate the mention with the canonical entity name and type, and (3) statistical feature samples that encode entity length, mention length, Jaccard similarity, and similar features, which are concatenated with the model's output representation.
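The three sample types above can be sketched as follows. The delimiter character (`#`), the `[SEP]` joining convention, and the exact feature set are illustrative assumptions, not the authors' exact format.

```python
def jaccard(a: str, b: str) -> float:
    """Character-level Jaccard similarity between two strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_query_sample(query: str, start: int, end: int, delim: str = "#") -> str:
    """(1) Query sample: mark the mention span [start, end) with delimiters."""
    return query[:start] + delim + query[start:end] + delim + query[end:]

def build_entity_sample(mention: str, entity_name: str, entity_type: str) -> str:
    """(2) Entity description sample: mention + canonical name + type."""
    return f"{mention} [SEP] {entity_name} [SEP] {entity_type}"

def build_stat_features(mention: str, entity_name: str) -> list:
    """(3) Statistical features: entity length, mention length, Jaccard."""
    return [len(entity_name), len(mention), jaccard(mention, entity_name)]
```

The statistical features are kept outside the text encoder and appended to the model output, so hand‑crafted signals survive alongside the learned representation.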

The proposed model concatenates the query and entity description, feeds them into a pretrained language model (ERNIE, RoBERTa‑Large, or BERT), and extracts the CLS token together with vectors at the mention start and end positions. A pointwise ranking head outputs a score in [-1, 1] for each candidate, while a classification head predicts the NIL type via a softmax layer. The two heads share parameters in a multitask framework.
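A minimal numpy sketch of the shared representation and the two heads, assuming the encoder output is already available. The encoder is stubbed with random hidden states, the head weights are random, and the number of NIL types (8) and feature values are illustrative assumptions; a real implementation would use a trained PLM and learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_nil_types, n_feats = 768, 8, 3

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(hidden_states, start, end, stat_feats, w_rank, w_cls):
    """hidden_states: (seq_len, hidden) output of the pretrained encoder."""
    # CLS vector plus the vectors at the mention start and end positions.
    pooled = np.concatenate(
        [hidden_states[0], hidden_states[start], hidden_states[end]])
    # Append the hand-crafted statistical features.
    pooled = np.concatenate([pooled, stat_feats])
    score = np.tanh(pooled @ w_rank)      # pointwise ranking score in [-1, 1]
    nil_probs = softmax(pooled @ w_cls)   # NIL-type distribution
    return score, nil_probs

states = rng.normal(size=(32, hidden))                      # stubbed encoder output
w_rank = rng.normal(size=3 * hidden + n_feats) * 0.01
w_cls = rng.normal(size=(3 * hidden + n_feats, n_nil_types)) * 0.01
score, nil_probs = forward(states, 3, 5, np.array([4.0, 2.0, 0.5]), w_rank, w_cls)
```

Both heads read the same pooled vector, which is what makes the ranking and NIL‑classification tasks share parameters during training.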

Optimization techniques include confidence‑learning based data cleaning (using five‑fold model ensembles to relabel noisy samples), three NIL‑handling strategies (threshold‑based ranking, explicit NIL sample construction, and a separate NIL classifier), and adversarial training (FGM and PGD) to improve robustness.
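The FGM step can be sketched in a few lines: perturb the embedding matrix in the direction of the gradient, normalized to a fixed L2 budget. This is a framework‑agnostic numpy sketch; in practice the gradient comes from the training framework's backward pass, and the epsilon value is an illustrative assumption.

```python
import numpy as np

def fgm_perturb(embeddings: np.ndarray, grad: np.ndarray,
                epsilon: float = 1.0) -> np.ndarray:
    """Return embeddings shifted by epsilon * grad / ||grad||_2 (the FGM step)."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return embeddings.copy()  # zero gradient: nothing to perturb
    return embeddings + epsilon * grad / norm
```

A training step then computes the clean loss, backpropagates, applies the perturbation to the embeddings, backpropagates a second adversarial loss, restores the original embeddings, and updates. PGD differs in that it takes several smaller steps and projects the accumulated perturbation back into the epsilon ball.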

Experiments compare different pretrained models and training tricks. RoBERTa‑Large outperforms ERNIE + confidence learning, which in turn beats ERNIE and BERT. Multitask learning consistently surpasses single‑task baselines. Adversarial training yields modest gains across models, with FGM and PGD showing similar performance.

Final ensemble models achieve an F1 of 88.7 on the dev set, 88.63 on leaderboard A, and 91.20 on leaderboard B, ranking second overall. The authors also discuss practical deployment in the OPPO XiaoBu assistant, where the techniques help resolve ambiguous user queries such as “Who is Li Bai?” or “Play Li Bai”.

Tags: adversarial training, Chinese NLP, pretrained language models, multitask learning, entity linking, confidence learning, entity disambiguation
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
