How RAG‑Powered AI Boosted Government Data Labeling Efficiency by 5×
This case study details how a government‑focused AI system using retrieval‑augmented generation (RAG) and advanced preprocessing algorithms increased data labeling speed by up to five times, raised accuracy above 95%, and produced high‑quality enterprise, spatial, and economic datasets.
1. Case Overview
Government data‑processing capacity is limited, which leaves data undervalued and hinders deep mining and reuse. To address this, the case focuses on the industrial‑economy sector. Using domain‑specific intelligent agents and a Retrieval‑Augmented Generation (RAG) library, an automatic labeling system was built that improved overall labeling efficiency by 10‑15% and raised accuracy to over 95%, producing high‑quality enterprise, spatial, and economic datasets.
2. Measures and Results
1) Data preprocessing with lightweight algorithms: Initial data cleaning used smoothing, mean imputation, interpolation, generative adversarial networks (GANs), Z‑score tests, and Local Outlier Factor (LOF). For enterprise data, the missing‑value completion rate reached 92%, the conflict detection rate 100%, and the anomaly handling rate about 85%.
2) Large‑model‑supported data relationship construction: Cleaned data were combined with reports, policies, and other documents to build RAG‑based entity recognition and linking capabilities. Entities such as enterprises and spaces were extracted and linked to a knowledge graph, uncovering hidden complex relationships. Relationship extraction initially succeeded in roughly 65% of cases; after human intervention and knowledge training, the success rate rose to 80%.
3) Building an autonomous data‑labeling intelligent agent: The agent automates the labeling workflow. By integrating entity relationships from the RAG library, it automatically labels various entities, relationships, and attributes, increasing processing efficiency fivefold. For enterprise data, the agent reduced a 41‑step labeling process to a single automated run.
4) Automated data quality verification: After labeling, results were checked through cross‑validation and multiple rounds of review feedback, achieving 100% verification coverage.
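To make measures 1 and 4 more concrete, here is a minimal Python sketch of mean imputation, Z‑score outlier flagging, and majority‑vote cross‑validation of labels. The source does not describe its implementation, so the function names, the thresholds, and the majority‑vote rule are all assumptions made for illustration. Smoothing, interpolation, GANs, and Local Outlier Factor are left out to keep the example short.

```python
from collections import Counter
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def zscore_outliers(values, threshold=3.0):
    """Return the indices whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def cross_validate_labels(label_rounds):
    """Majority-vote each record across labeling rounds (hypothetical rule).

    Returns the accepted label per record and the ids of records whose
    rounds did not fully agree, so they can be sent back for review.
    """
    accepted, needs_review = {}, []
    for record_id, labels in label_rounds.items():
        label, votes = Counter(labels).most_common(1)[0]
        accepted[record_id] = label
        if votes < len(labels):
            needs_review.append(record_id)
    return accepted, needs_review
```

A caveat on the Z‑score step: in a small sample, a single extreme value inflates the standard deviation, so its own Z‑score may stay below the usual cutoff of 3. For short columns, a lower threshold or a density‑based method such as Local Outlier Factor may be the better choice.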
3. Highlights
1) New technology reduces labor and time: With RAG used to identify government data entities, relationships, and attributes, a task that previously required 30 person‑months can now be completed by about five assistants within two months.
2) Significant practical impact: The technology has been successfully applied across multiple industrial‑economy departments, with regional autonomous labeling systems built within two months, markedly shortening deployment cycles.
3) Improved labeling accuracy: The intelligent agent’s automatic labeling yields more accurate and consistent results, raising labeling accuracy from 83% under the traditional approach to over 97%.
4) Generation of high‑quality datasets: Enterprise datasets merged over 1,000 tables, labeling nearly 400,000 enterprises and about 2 billion records; spatial datasets resolved inconsistent address descriptions across sources; economic datasets integrated resources from nearly ten departments, creating comprehensive high‑quality economic data collections.
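The labor figures in highlight 1 can be checked with a short calculation. This sketch takes "about five assistants" and "within two months" at face value, so the result is only an approximation; the variable names are illustrative.

```python
# Figures from highlight 1; both automated-side numbers are approximate.
baseline_person_months = 30
assistants = 5
months = 2

# Effort after automation, expressed in the same unit as the baseline.
automated_person_months = assistants * months

# How many times less labor the automated workflow needs.
reduction_factor = baseline_person_months / automated_person_months
```

Under these assumptions the automated workflow needs about 10 person‑months, roughly a threefold reduction in labor. This measures staffing effort, which is a different quantity from the fivefold processing efficiency reported for the labeling agent.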
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.