How Qichacha Leverages Large Language Models for Field‑Level Data Lineage

This article describes how Qichacha uses large language models (LLMs) to extract field‑level data lineage from heterogeneous, non‑standard code and ETL assets. It covers the motivation, the architecture, the practical challenges of cost, accuracy, and hallucination, and the resulting improvements in impact analysis, metric tracing, and sensitive‑data governance.

DataFunTalk

Background: Why Field Lineage Matters

Qichacha's data‑governance environment combines multi‑source, heterogeneous data with massive volumes of unstructured data. When a source field changes, the team needs to know which downstream tables, metrics, or dashboards are affected. Without that knowledge, requirement changes, metric anomalies, and sensitive‑data audits risk disaster‑level omissions.

Why Large Language Models

Traditional lineage tools handle SQL well but struggle with user‑defined functions, custom code, and semantic reasoning, which leaves much of the work manual and error‑prone. The team evaluated several alternatives: Apache Calcite (steep learning curve, limited UDF coverage), the Flink Lineage API (insufficient DataStream and window support), rule‑based extraction (rigid and reactive), and runtime audit logs (available only after the fact). LLMs understand both natural language and code, can parse multiple languages, and can infer relationships from comments and naming conventions, which makes them suitable for field‑level lineage.

Core Architecture

The solution consists of four layers:

Data collection & preprocessing: Gather task metadata, SQL scripts, development code, and ETL logs from the data platform; link tasks to owners for rapid responsibility tracing.

LLM parsing engine: Feed the collected artifacts to a unified LLM via curated prompts (Skills), producing structured lineage outputs.

Post‑processing & graph construction: Validate formats, deduplicate, align with a metadata dictionary, and persist field‑level edges to build a searchable lineage graph supporting point‑to‑edge path queries.

Validation & fallback: Apply rule‑based checks, human spot‑checks, and an MCP metadata‑assisted verifier to catch obvious errors and handle low‑confidence results, mitigating hallucinations.
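As a rough illustration of the post‑processing layer, the sketch below validates the format of a model response, aligns field names with a metadata dictionary, and deduplicates the resulting edges. The JSON shape (a list of objects with "source" and "target" keys), the LineageEdge class, and the parse_edges function are assumptions made for this example; the article does not describe Qichacha's actual output schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One field-level edge, source column -> target column (hypothetical schema)."""
    source: str  # "table.column"
    target: str  # "table.column"


def parse_edges(raw_output: str, known_fields: set[str]) -> list[LineageEdge]:
    """Validate format, align names with the metadata dictionary, and deduplicate."""
    try:
        records = json.loads(raw_output)
    except json.JSONDecodeError:
        return []  # malformed output: nothing is persisted
    if not isinstance(records, list):
        return []

    # Case-insensitive lookup table from the metadata dictionary (assumed convention).
    canonical = {name.lower(): name for name in known_fields}
    edges: list[LineageEdge] = []
    seen: set[LineageEdge] = set()
    for record in records:
        if not isinstance(record, dict):
            continue
        source = canonical.get(str(record.get("source", "")).strip().lower())
        target = canonical.get(str(record.get("target", "")).strip().lower())
        if source is None or target is None:
            continue  # field is not in the metadata dictionary
        edge = LineageEdge(source, target)
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)
    return edges
```

Dropping unknown fields silently is a simplification; a production pipeline would more likely route them to the validation and fallback layer.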

Practice and Challenges

Cost & efficiency: Each LLM call consumes many tokens. To control cost, the team batches parsing by dimension priority, preprocesses code to trim irrelevant fragments, uses graph hints from the Flink UI, runs calls through asynchronous queues, caches results, and re‑parses only code that has changed.
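Caching and change‑only updates can be approximated with a content hash per task. The sketch below is a minimal version under that assumption; the LineageCache class and the parse callable (a stand‑in for the LLM call) are hypothetical names, not part of Qichacha's system.

```python
import hashlib
from typing import Callable


class LineageCache:
    """Caches parse results keyed by a hash of each task's code (hypothetical helper)."""

    def __init__(self, parse: Callable[[str], list]) -> None:
        self._parse = parse  # stand-in for the expensive LLM call
        self._results: dict[str, tuple[str, list]] = {}  # task_id -> (digest, edges)

    def get(self, task_id: str, code: str) -> list:
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        cached = self._results.get(task_id)
        if cached is not None and cached[0] == digest:
            return cached[1]  # code unchanged: reuse the cached result
        edges = self._parse(code)  # new or changed code: parse again
        self._results[task_id] = (digest, edges)
        return edges
```

Hashing the preprocessed code rather than the raw file would also avoid re‑parsing when only comments or whitespace change, at the cost of losing comment‑based hints.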

Accuracy & evaluation: Model outputs vary, so several safeguards are layered. A rule engine automatically filters out impossible field names and type mismatches, core paths are manually sampled, confidence scores quantify reliability, and multi‑model A/B calls with role‑based prompts improve precision.
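A rule check of this kind might look like the sketch below, which rejects edges that reference unknown fields or connect columns of incompatible types. The TYPE_FAMILY table and the check_edge function are invented for illustration; a real rule set would also need to allow for explicit casts and other legitimate type changes.

```python
# Assumed grouping of column types into families; not taken from the article.
TYPE_FAMILY = {
    "int": "numeric", "bigint": "numeric", "double": "numeric",
    "string": "text", "varchar": "text",
    "date": "temporal", "timestamp": "temporal",
}


def check_edge(source: str, target: str, column_types: dict[str, str]) -> list[str]:
    """Return rule violations for one proposed edge; an empty list means it passes."""
    violations = [
        f"unknown field: {name}"
        for name in (source, target)
        if name not in column_types
    ]
    if violations:
        return violations

    source_family = TYPE_FAMILY.get(column_types[source])
    target_family = TYPE_FAMILY.get(column_types[target])
    # Types missing from the table are not judged, to avoid false rejections.
    if source_family and target_family and source_family != target_family:
        violations.append(
            f"type mismatch: {column_types[source]} -> {column_types[target]}"
        )
    return violations
```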

Hallucination: The model may fabricate nonexistent field mappings or incorrectly link semantically similar fields. To counter this, the system requires the model to cite specific code lines as reasoning evidence, marks inconsistent parses as “to be confirmed,” and routes them to human review.
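One simple way to use cited code lines is to check that they exist and actually mention both fields. The sketch below assumes the model returns 1‑based line numbers alongside each mapping; the review_status function and the substring match are illustrative simplifications, not the verifier described in the article.

```python
def review_status(
    code: str, source_col: str, target_col: str, cited_lines: list[int]
) -> str:
    """Return "accepted" if the cited lines mention both columns, else "to be confirmed"."""
    lines = code.splitlines()
    cited_text = []
    for number in cited_lines:
        if not 1 <= number <= len(lines):
            return "to be confirmed"  # citation points outside the code
        cited_text.append(lines[number - 1].lower())

    evidence = "\n".join(cited_text)
    # Crude substring match; a stricter check would tokenize identifiers.
    if cited_text and source_col.lower() in evidence and target_col.lower() in evidence:
        return "accepted"
    return "to be confirmed"
```

Anything that is not "accepted" would then go to the human review queue rather than being dropped.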

Results

Field‑level lineage for thousands of real‑time tasks and tens of thousands of offline tasks went from “invisible” to “partially visible.” The team can now analyze downstream impact, trace metrics, and inspect sensitive‑data chains quickly, without exhaustive code reviews.
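Once field‑level edges are stored, downstream impact analysis reduces to a graph traversal. The sketch below uses a breadth‑first search over (source, target) pairs; the edge representation and the downstream_fields function are assumptions for this example, since the article does not say how the lineage graph is stored or queried.

```python
from collections import defaultdict, deque


def downstream_fields(edges: list[tuple[str, str]], start: str) -> set[str]:
    """Return every field reachable from `start` by following lineage edges."""
    children: dict[str, set[str]] = defaultdict(set)
    for source, target in edges:
        children[source].add(target)

    reached: set[str] = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child != start and child not in reached:
                reached.add(child)  # first visit: record it and keep exploring
                queue.append(child)
    return reached
```

Reversing each edge before the traversal gives the upstream view, which is the direction metric tracing needs.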

Future Outlook

Planned enhancements include proactive inference of anomaly propagation, CI/CD integration that surfaces downstream impact at code‑commit time, and LLM‑powered natural‑language query interfaces that let business users ask questions such as “how is this metric calculated,” currently in prototype validation.

