
Advances in Information Extraction: From PLM to LLM Paradigms at Alibaba DAMO Academy

This article reviews Alibaba DAMO Academy's research on information extraction, covering background concepts, PLM-era extraction paradigms, few‑shot extraction techniques, and the emerging LLM‑era approaches, while also sharing practical insights, benchmark results, and future directions.


Background

Information extraction (IE) is a classic NLP task comprising sub‑tasks such as entity extraction, fine‑grained entity classification, entity linking, relation extraction, and event extraction. It is widely applied in consumer‑facing (C‑end), business (B‑end), and government (G‑end) scenarios, from smart courier address forms to medical text processing.
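These sub‑tasks differ mainly in the structure of their outputs. A small illustration of what each might produce on one sentence (the labels, KB identifier, and event schema below are hypothetical, chosen only to show the shapes):

```python
# What each IE sub-task produces for one example sentence.
sentence = "Jack Ma founded Alibaba in Hangzhou."

# Entity extraction: (mention, coarse type) pairs.
entities = [("Jack Ma", "PERSON"), ("Alibaba", "ORG"), ("Hangzhou", "LOC")]

# Fine-grained entity classification refines coarse types.
fine_grained = {"Alibaba": "ORG/company", "Hangzhou": "LOC/city"}

# Entity linking grounds mentions to a knowledge base (hypothetical KB id).
linking = {"Jack Ma": "KB:jack_ma_001"}

# Relation extraction: (head, relation, tail) triples.
relations = [("Jack Ma", "founded", "Alibaba")]

# Event extraction: trigger, event type, and role-labeled arguments.
events = [{"trigger": "founded",
           "type": "Business.Start-Org",
           "args": {"agent": "Jack Ma", "org": "Alibaba", "place": "Hangzhou"}}]
```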

PLM Era Information Extraction Paradigm

The PLM era focused on improving model performance through stronger algorithms and retrieval‑augmented techniques. Major innovations include implicit enhancement, retrieval enhancement for short texts, and multimodal extensions. A typical pipeline models IE as a sequence‑labeling task with a Transformer‑CRF architecture; experiments show significant gains across many benchmarks.
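In a Transformer‑CRF pipeline, the Transformer encoder produces per‑token emission scores and the CRF layer decodes the best label sequence with Viterbi, using transition scores to rule out invalid tag sequences. A minimal sketch of the decoding step, with hard‑coded illustrative scores and a toy BIO label set (not DAMO's implementation):

```python
# Viterbi decoding for a linear-chain CRF over BIO labels.
# Emission scores would come from a Transformer encoder; here they
# are hard-coded for illustration.

LABELS = ["O", "B-PER", "I-PER"]

def viterbi(emissions, transitions):
    """emissions: list of {label: score}; transitions: {(prev, cur): score}."""
    # Initialize each path with the first token's emission score.
    paths = {lab: ([lab], emissions[0][lab]) for lab in LABELS}
    for em in emissions[1:]:
        new_paths = {}
        for cur in LABELS:
            # Pick the best previous label for each current label.
            prev, (path, score) = max(
                paths.items(),
                key=lambda kv: kv[1][1] + transitions[(kv[0], cur)],
            )
            new_paths[cur] = (path + [cur],
                              score + transitions[(prev, cur)] + em[cur])
        paths = new_paths
    best_path, _ = max(paths.values(), key=lambda v: v[1])
    return best_path

# Transitions penalize I-PER after O (an invalid BIO sequence).
trans = {(p, c): 0.0 for p in LABELS for c in LABELS}
trans[("O", "I-PER")] = -10.0

# Illustrative emission scores for a three-token sentence.
ems = [
    {"O": 0.1, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 2.0, "B-PER": 0.1, "I-PER": 0.5},
    {"O": 0.1, "B-PER": 2.0, "I-PER": 0.0},
]
print(viterbi(ems, trans))  # → ['B-PER', 'O', 'B-PER']
```

The CRF's value over plain per‑token argmax is exactly the transition table: it lets the decoder reject label sequences that are locally plausible but globally invalid.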

Embedding selection (e.g., BERT vs. FLAIR) influences task performance, leading to the ACE (Automatic Concatenation of Embeddings) paradigm that automatically chooses suitable embeddings via a controller‑task model framework.
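The core loop can be sketched as: a controller samples a binary mask over candidate embedders, the task model consumes the concatenation of the selected embeddings, and the controller's probabilities are updated from the task reward. A toy sketch under those assumptions (the embedders and selection probabilities below are stand‑ins, not ACE's actual components):

```python
import random

# Candidate embedders (toy: each maps a token to a small vector).
EMBEDDERS = {
    "bert":  lambda tok: [len(tok), 0.0],
    "flair": lambda tok: [0.0, tok.count("a")],
    "char":  lambda tok: [float(tok[0] == tok[0].upper())],
}

def concat_embedding(token, mask):
    """Concatenate the embeddings selected by the controller's mask."""
    vec = []
    for name, on in mask.items():
        if on:
            vec.extend(EMBEDDERS[name](token))
    return vec

def controller_sample(probs, rng):
    """Sample a binary selection mask, one Bernoulli draw per embedder."""
    return {name: rng.random() < p for name, p in probs.items()}

rng = random.Random(0)
probs = {name: 0.5 for name in EMBEDDERS}  # controller's selection probabilities
mask = controller_sample(probs, rng)
print(mask, concat_embedding("Alice", mask))
```

In the full framework the task model's dev‑set score after training on a sampled concatenation serves as the reward that shifts `probs` toward useful embedder subsets.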

Few‑Shot Information Extraction Research

To reduce costly annotation, the team proposes graph propagation for label transfer, a Partial‑CRF method for handling incomplete label distributions, and a "memory" mechanism that stores source‑model entity representations and retrieves them via optimal transport, achieving state‑of‑the‑art results (published at ACL 2023).
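One way to frame the memory retrieval step is as an entropic optimal‑transport matching between target‑entity representations and stored source representations, solvable with Sinkhorn iterations. A minimal sketch on toy vectors (this illustrates the matching idea, not the paper's exact formulation):

```python
import math

def sinkhorn(cost, iters=200, eps=0.1):
    """Entropic OT plan between two uniform distributions, given a cost matrix."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate row/column scaling until marginals match.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def cost_matrix(queries, memory):
    """Squared Euclidean distance between query and memory representations."""
    return [[sum((q - m) ** 2 for q, m in zip(qv, mv)) for mv in memory]
            for qv in queries]

# Toy: two target-entity vectors vs. two stored source-entity vectors.
queries = [[0.0, 0.0], [1.0, 1.0]]
memory  = [[0.1, 0.0], [0.9, 1.0]]
plan = sinkhorn(cost_matrix(queries, memory))

# Each query's transport mass concentrates on its nearest memory entry.
best = [max(range(2), key=lambda j: row[j]) for row in plan]
print(best)  # → [0, 1]
```

Compared with nearest‑neighbor lookup, the transport plan matches the two sets jointly, so no single memory entry can absorb every query.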

LLM Era Information Extraction Paradigm

With large‑scale models (e.g., GPT‑3/4), two directions are explored: (1) prompt engineering and multi‑turn dialogue pipelines (ChatIE) to decompose IE tasks, and (2) training task‑specific LLMs on millions of annotated examples, unifying various IE subtasks and achieving superior performance.
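The ChatIE idea of decomposing IE into a multi‑turn dialogue can be sketched as two stages: first ask the model which types occur in the text, then extract instances for each type found. The prompt wording and the `call_llm`/`mock_llm` functions below are hypothetical stand‑ins, not ChatIE's actual templates:

```python
# ChatIE-style two-stage decomposition of relation extraction.
# `call_llm` is a hypothetical stand-in for any chat-model API.

STAGE1 = ("The candidate relation types are: {types}.\n"
          "Which of these appear in the sentence below? Answer with a list.\n"
          "Sentence: {text}")

STAGE2 = ("For the relation type '{rel}', list all (head, tail) entity pairs "
          "in the sentence below, one pair per line.\n"
          "Sentence: {text}")

def chat_ie(text, candidate_types, call_llm):
    # Turn 1: narrow the schema to the types actually present.
    present = call_llm(STAGE1.format(types=", ".join(candidate_types), text=text))
    # Turn 2+: one focused extraction question per detected type.
    triples = []
    for rel in present:
        for head, tail in call_llm(STAGE2.format(rel=rel, text=text)):
            triples.append((head, rel, tail))
    return triples

# A canned mock model so the pipeline runs end to end.
def mock_llm(prompt):
    if "Which of these appear" in prompt:
        return ["founded_by"]
    return [("Alibaba", "Jack Ma")]

print(chat_ie("Jack Ma founded Alibaba.", ["founded_by", "located_in"], mock_llm))
# → [('Alibaba', 'founded_by', 'Jack Ma')]
```

The decomposition keeps each turn's question small, which is what lets a general chat model handle schemas it was never fine‑tuned on.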

Q&A

Q1: How to filter noise in multimodal image‑text IE? – Use multi‑view learning and KL‑divergence soft‑label alignment.
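Concretely, multi‑view learning trains predictors on the text view and the image view separately; when the two views' soft label distributions diverge, the pair is likely noisy and can be down‑weighted. A sketch using symmetric KL divergence as the disagreement signal (the weighting scheme is an illustrative choice, not the team's exact method):

```python
import math

def kl(p, q, eps=1e-9):
    """KL(p || q) between two discrete label distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def noise_weight(text_probs, image_probs, temperature=1.0):
    """Down-weight samples whose views disagree (symmetric KL as disagreement)."""
    d = 0.5 * (kl(text_probs, image_probs) + kl(image_probs, text_probs))
    return math.exp(-d / temperature)

aligned    = noise_weight([0.9, 0.1], [0.85, 0.15])  # views agree -> weight near 1
mismatched = noise_weight([0.9, 0.1], [0.1, 0.9])    # views disagree -> weight near 0
print(round(aligned, 3), round(mismatched, 3))
```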

Q2: How to handle overly long retrieval contexts? – Encode each retrieved document into vectors and apply cross‑attention between BERT tokens and retrieval vectors.
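The point of this design is that each retrieved document contributes one vector rather than hundreds of tokens, so the input length stays fixed. A minimal sketch of the cross‑attention step in pure Python (toy dimensions; a real system would use the model's attention layers):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(tokens, retrieval_vecs):
    """Each token vector attends over document-level retrieval vectors."""
    out = []
    for q in tokens:
        # Dot-product scores between the token and each document vector.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in retrieval_vecs]
        weights = softmax(scores)
        # Weighted sum of retrieval vectors, fused back into the token.
        ctx = [sum(w * k[d] for w, k in zip(weights, retrieval_vecs))
               for d in range(len(q))]
        out.append([qi + ci for qi, ci in zip(q, ctx)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0]]  # toy BERT token vectors
docs   = [[2.0, 0.0], [0.0, 2.0]]  # one vector per retrieved document
fused = cross_attention(tokens, docs)
print([[round(x, 2) for x in v] for v in fused])
```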

Q3: Can IE and structuring boost general pre‑training? – Yes, by converting text to knowledge graphs or using retrieval‑augmented pre‑training, albeit with higher compute cost.

Conclusion and Outlook

The talk summarizes three themes: (1) PLM‑era algorithmic advances with retrieval‑enhancement, (2) few‑shot IE via data augmentation and model knowledge reuse, and (3) LLM‑era efficient prompting and task‑specific model construction. The speaker emphasizes that IE will remain valuable for end‑to‑end tasks requiring speed, interpretability, and controllability, even as large models evolve.

Resources: https://github.com/Alibaba-NLP/SeqGPT and ModelScope model https://www.modelscope.cn/models/damo/nlp_seqgpt-560m/ .

Tags: large language models, Natural Language Processing, Retrieval-Augmented Generation, few-shot learning, Information Extraction, Alibaba DAMO
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
