TextToKnowledge (解语): Zero‑Shot Chinese Text Knowledge Annotation and Mining Framework
The article introduces TextToKnowledge, an open‑source Baidu platform that provides a unified Chinese term taxonomy (TermTree) and two annotation tools (WordTag and NPTag) to enable zero‑shot text labeling, term linking, and downstream knowledge‑mining applications across a range of NLP tasks.
Overview
TextToKnowledge (解语) is a foundational Chinese text knowledge annotation toolkit released by Baidu. It addresses common industry challenges such as the lack of universal knowledge bases, high annotation costs, and expensive R&D resources.
Platform Introduction
The Baidu Language and Knowledge Technology Open Platform offers a "knowledge middle‑platform" built on knowledge‑graph technology, providing data models, capability engines, and scenario‑customized services. Within it, TextToKnowledge is positioned as a loosely coupled, foundational toolset for Chinese text knowledge annotation that supports knowledge graph construction and the optimization of training samples for NLP models.
Knowledge Tree (TermTree)
TermTree is a comprehensive Chinese lexical taxonomy covering concepts, entities, and grammatical words. It organizes words into over 160 term types and 7,000 sub‑types, with a directed acyclic graph structure that enables stable hierarchical inference. The taxonomy is split into a fixed universal concept set and a pluggable entity set, allowing users to customize domain‑specific entities while retaining universal semantics.
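The directed acyclic graph structure can be pictured with a minimal sketch. The term names and parent links below are hypothetical and are not actual TermTree entries; the sketch only illustrates how following parent links in a DAG supports hierarchical inference, and it does not reflect how TermTree is stored or queried.

```python
from collections import deque

# Hypothetical toy taxonomy: each term maps to its parent (hypernym) terms.
# A term may have more than one parent, so the structure is a DAG, not a tree.
PARENTS = {
    "medical_school": ["school", "medical_institution"],
    "school": ["organization"],
    "medical_institution": ["organization"],
    "organization": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links (breadth-first)."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(parents.get(node, []))
    return seen

def is_a(term, candidate, parents=PARENTS):
    """Hierarchical inference: is `candidate` an ancestor of `term`?"""
    return candidate in ancestors(term, parents)
```

Because the graph has no cycles, the traversal always terminates and every term resolves to the same set of ancestors regardless of the order in which its parents are visited.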
Annotation Tools
Two tools are released:
WordTag: Performs sentence‑level word‑class sequence labeling and links recognized terms to the TermTree. It replaces traditional tokenization, POS tagging, and NER pipelines.
NPTag: Labels noun‑phrase categories using a prompt‑learning model trained on Baidu Baike entries, covering over 2,000 fine‑grained categories.
Both tools are accessible via PaddleNLP’s Taskflow API and can be fine‑tuned with a small amount of domain data.
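A minimal sketch of calling both tools through Taskflow follows. It assumes PaddleNLP is installed and that the task name "knowledge_mining" (with model="nptag" for NPTag) is available, as described in the PaddleNLP Taskflow documentation. The helper term_labels is hypothetical, and the result layout it expects (a dict with an "items" list whose entries carry "item" and "wordtag_label") is an assumption that should be checked against the installed version.

```python
def load_taggers():
    """Create the WordTag and NPTag pipelines (models download on first use)."""
    from paddlenlp import Taskflow  # imported lazily; requires paddlenlp

    wordtag = Taskflow("knowledge_mining")               # WordTag is the default model
    nptag = Taskflow("knowledge_mining", model="nptag")  # noun-phrase classifier
    return wordtag, nptag

def term_labels(wordtag_result):
    """Flatten one WordTag result into (term, label) pairs.

    Assumed layout (verify against your PaddleNLP version):
    {"text": ..., "items": [{"item": ..., "wordtag_label": ...}, ...]}
    """
    return [(it["item"], it["wordtag_label"]) for it in wordtag_result["items"]]

def demo():
    """Annotate one sentence and one noun phrase; not called automatically."""
    wordtag, nptag = load_taggers()
    for result in wordtag("《孤女》是2010年九州出版社出版的小说"):
        print(term_labels(result))
    print(nptag("糖醋排骨"))
```

Calling demo() runs both taggers end to end; term_labels can also be applied on its own to results that were produced earlier and saved.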
Workflow
1. Build a customized knowledge base by selecting relevant term types from TermTree.
2. Apply WordTag and NPTag to annotate Chinese text and perform term linking.
3. Use the annotated results for template generation, statistical aggregation, or more complex SPO knowledge extraction.
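The third step can be sketched in a few lines. The annotated sentences and label names below are hypothetical stand-ins for tagger output, not real WordTag labels; the sketch shows one simple way to turn tagged text into templates and count them, not the method used by the framework itself.

```python
from collections import Counter

# Hypothetical annotated sentences: lists of (term, label) pairs such as a
# WordTag-style tagger might produce. The label names are illustrative only.
ANNOTATED = [
    [("李白", "PERSON"), ("出生于", "VERB"), ("碎叶城", "PLACE")],
    [("杜甫", "PERSON"), ("出生于", "VERB"), ("巩县", "PLACE")],
    [("李白", "PERSON"), ("写了", "VERB"), ("静夜思", "WORK")],
]

KEEP_SURFACE = {"VERB"}  # labels whose surface text stays in the template

def to_template(pairs, keep_surface=KEEP_SURFACE):
    """Replace each term with a label slot, keeping text for selected labels."""
    return " ".join(
        term if label in keep_surface else f"[{label}]" for term, label in pairs
    )

def template_counts(sentences):
    """Statistical aggregation: how often each template occurs."""
    return Counter(to_template(s) for s in sentences)
```

Templates that recur often, such as a person slot followed by a birth verb and a place slot, are natural candidates for extraction rules.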
Application Scenarios
The framework supports semantic consistency checking, template‑based information extraction, and feature generation for downstream models. It emphasizes the complementarity of symbolic knowledge and large‑scale pre‑trained language models, improving interpretability, controllability, and computational efficiency.
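Template-based information extraction can be illustrated with a small sketch. The function extract_spo, the sample sentence, and the label names are all hypothetical; the sketch assumes annotations are available as (term, label) pairs and matches only the simplest adjacent subject, predicate, object pattern.

```python
# Hypothetical annotated sentence as (term, label) pairs; labels are illustrative.
SENTENCE = [("李白", "PERSON"), ("出生于", "VERB"), ("碎叶城", "PLACE")]

def extract_spo(pairs, subject_label, predicate, object_label):
    """Return (subject, predicate, object) triples where a term with
    `subject_label` is immediately followed by the `predicate` text and
    then by a term with `object_label`."""
    triples = []
    # Slide a window of three consecutive annotated terms over the sentence.
    for (s, s_lab), (p, _), (o, o_lab) in zip(pairs, pairs[1:], pairs[2:]):
        if s_lab == subject_label and p == predicate and o_lab == object_label:
            triples.append((s, p, o))
    return triples
```

Because the rule is stated over labels rather than surface strings, the same template applies to any person and place the tagger recognizes, and each extracted triple can be traced back to the rule that produced it.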
Future Directions
Further development will focus on expanding relation‑mining models, enhancing domain customization, and integrating more tightly with pre‑trained models to make Chinese text processing more efficient.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.