How Qichacha Leverages Large Language Models for Field‑Level Data Lineage
This article details Qichacha's use of large language models to extract field‑level data lineage from heterogeneous, non‑standard code and ETL assets, describing the motivation, architectural blueprint, practical challenges such as cost, accuracy and hallucination, and the resulting improvements in impact analysis, metric tracing, and sensitive‑data governance.
Background: Why Field Lineage Matters
In a data‑governance environment that spans multiple heterogeneous sources and large volumes of unstructured data, Qichacha needed to know which downstream tables, metrics, or dashboards are affected when a source field changes. Without that visibility, requirement changes, metric anomalies, and sensitive‑data audits risk serious omissions.
Why Large Language Models
Traditional lineage tools handle SQL well but struggle with user‑defined functions, custom code, and semantic reasoning, which leaves the gaps to manual, error‑prone work. The team evaluated several alternatives: Apache Calcite (steep learning curve, limited UDF coverage), the Flink Lineage API (insufficient DataStream and window support), rule‑based extraction (rigid and reactive), and runtime audit logs (available only after the fact). LLMs combine natural‑language and code understanding, parse multiple languages, and can infer relationships from comments and naming conventions, which makes them suitable for field‑level lineage.
Core Architecture
The solution consists of four layers:
Data collection & preprocessing: Gather task metadata, SQL scripts, development code, and ETL logs from the data platform; link tasks to owners for rapid responsibility tracing.
LLM parsing engine: Feed the collected artifacts to a unified LLM via curated prompts (Skills), producing structured lineage outputs.
Post‑processing & graph construction: Validate formats, deduplicate, align with a metadata dictionary, and persist field‑level edges to build a searchable lineage graph supporting point‑to‑edge path queries.
Validation & fallback: Apply rule‑based checks, human spot‑checks, and an MCP metadata‑assisted verifier to catch obvious errors and handle low‑confidence results, mitigating hallucinations.
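The post‑processing and graph layers can be illustrated with a minimal Python sketch. It is not Qichacha's implementation: the JSON shape (`source`/`target` keys), the `METADATA` dictionary, and all table and field names are assumptions made for illustration. The sketch validates the model's output format, drops fields missing from the metadata dictionary, deduplicates edges, and answers a downstream‑impact query with a breadth‑first walk.

```python
import json
from collections import defaultdict, deque

# Hypothetical metadata dictionary: table name -> known column names.
METADATA = {
    "ods.company": {"id", "name", "reg_capital"},
    "dwd.company_profile": {"company_id", "company_name", "capital_cny"},
    "ads.capital_rank": {"company_id", "capital_rank"},
}

def is_known(field: str) -> bool:
    """True if 'db.table.column' exists in the metadata dictionary."""
    table, _, column = field.lower().rpartition(".")
    return column in METADATA.get(table, set())

def parse_edges(raw: str) -> set:
    """Validate the (assumed) JSON output, align with metadata, deduplicate."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return set()  # malformed output: nothing is persisted
    if not isinstance(items, list):
        return set()
    edges = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        src = str(item.get("source", ""))
        dst = str(item.get("target", ""))
        if is_known(src) and is_known(dst):
            edges.add((src.lower(), dst.lower()))  # the set removes duplicates
    return edges

def build_graph(edges) -> dict:
    """Adjacency map: source field -> set of directly derived fields."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    return graph

def downstream(graph, field: str) -> set:
    """All fields reachable from `field` (breadth-first traversal)."""
    seen, queue = set(), deque([field])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Illustrative model output: one duplicate (different case) and one unknown column.
RAW = json.dumps([
    {"source": "ods.company.reg_capital", "target": "dwd.company_profile.capital_cny"},
    {"source": "ODS.company.reg_capital", "target": "dwd.company_profile.capital_cny"},
    {"source": "dwd.company_profile.capital_cny", "target": "ads.capital_rank.capital_rank"},
    {"source": "ods.company.ghost", "target": "ads.capital_rank.capital_rank"},
])
```

Calling `downstream(build_graph(parse_edges(RAW)), "ods.company.reg_capital")` returns both the intermediate and the final field, which is the shape of answer an impact analysis needs. A production system would persist edges in a graph store rather than in memory.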
Practice and Challenges
Cost & efficiency: Each LLM call consumes many tokens. To contain cost, the team parses in batches ordered by dimension priority, preprocesses code to trim irrelevant fragments, uses Flink UI‑assisted graph hints, runs calls through asynchronous queues, caches results, and re‑parses only code that has changed.
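Caching and change‑only updates can be sketched with a content hash per task. This is a hedged illustration, not the team's code: `LineageCache`, `normalize`, and the `parse_fn` callback (a stand‑in for the LLM call) are hypothetical names, and the in‑memory dictionary stands in for whatever store is actually used.

```python
import hashlib

def normalize(code: str) -> str:
    """Drop blank lines and trailing whitespace so cosmetic edits keep the same hash."""
    lines = (line.rstrip() for line in code.splitlines())
    return "\n".join(line for line in lines if line)

class LineageCache:
    """Re-run the expensive parse only when a task's normalized code changes."""

    def __init__(self, parse_fn):
        self._parse_fn = parse_fn  # hypothetical stand-in for the LLM call
        self._store = {}           # task_id -> (digest, cached result)
        self.calls = 0             # number of real parse invocations

    def get(self, task_id: str, code: str):
        cleaned = normalize(code)
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        cached = self._store.get(task_id)
        if cached is not None and cached[0] == digest:
            return cached[1]       # unchanged code: reuse the previous lineage
        self.calls += 1
        result = self._parse_fn(cleaned)
        self._store[task_id] = (digest, result)
        return result
```

With this shape, a whitespace‑only edit to a task costs no tokens, while any substantive change triggers exactly one new call for that task.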
Accuracy & evaluation: Model outputs vary, so several checks are layered: a rule engine automatically filters impossible field names and type mismatches, core paths are sampled manually, confidence scores quantify reliability, and multi‑model A/B calls with role‑based prompts improve precision.
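One way to combine rule filtering with a confidence score is to keep only edges that pass the rules and score each by the share of models that proposed it. The sketch below assumes that approach; the source does not say how confidence is actually computed. `FIELD_RE`, `ALLOWED_CASTS`, and the type names are hypothetical, and the type rule is deliberately crude because real transformations can change types legitimately.

```python
import re
from collections import Counter

# Assumed naming rule: table.column or db.table.column, lowercase identifiers.
FIELD_RE = re.compile(r"[a-z_][a-z0-9_]*(?:\.[a-z_][a-z0-9_]*){1,2}")

# Hypothetical whitelist of type changes considered plausible.
ALLOWED_CASTS = {("int", "bigint"), ("int", "string"), ("bigint", "string")}

def passes_rules(edge, types) -> bool:
    """Reject impossible field names and implausible type changes."""
    src, dst = edge
    if not (FIELD_RE.fullmatch(src) and FIELD_RE.fullmatch(dst)):
        return False
    s, d = types.get(src), types.get(dst)
    if s is None or d is None:
        return True  # unknown type: leave the decision to other checks
    return s == d or (s, d) in ALLOWED_CASTS

def score_edges(model_outputs, types) -> dict:
    """Confidence = fraction of models that proposed a rule-passing edge."""
    votes = Counter(
        edge
        for output in model_outputs
        for edge in set(output)  # one vote per model per edge
        if passes_rules(edge, types)
    )
    return {edge: n / len(model_outputs) for edge, n in votes.items()}
```

Edges with a low score could then be queued for the manual sampling described above, while unanimous edges are persisted directly.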
Hallucination: The model may fabricate nonexistent field mappings or incorrectly link semantically similar fields. To counter this, the system requires the model to cite specific code lines as reasoning evidence, marks inconsistent parses as “to be confirmed,” and routes them to human review.
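The evidence requirement lends itself to a mechanical check: a claimed mapping is accepted only if the cited lines exist and actually mention both column names. The following sketch assumes a claim format with `source`, `target`, and 1‑based `lines` keys; that format, the function name, and the sample SQL are all hypothetical.

```python
import re

def check_evidence(claim: dict, code: str) -> str:
    """Return "accepted" only when the cited lines mention both columns."""
    lines = code.splitlines()
    cited = claim.get("lines") or []
    # Missing, non-integer, or out-of-range citations cannot be verified.
    if not cited or any(
        not isinstance(n, int) or n < 1 or n > len(lines) for n in cited
    ):
        return "to be confirmed"
    text = " ".join(lines[n - 1] for n in cited).lower()
    # Compare bare column names, since SQL rarely repeats the full table path.
    columns = [claim[key].rsplit(".", 1)[-1].lower() for key in ("source", "target")]
    if all(re.search(rf"\b{re.escape(col)}\b", text) for col in columns):
        return "accepted"
    return "to be confirmed"

# Illustrative ETL statement; line 3 derives capital_cny from reg_capital.
SQL = (
    "INSERT INTO dwd.company_profile\n"
    "SELECT id AS company_id,\n"
    "       reg_capital / 10000 AS capital_cny\n"
    "FROM ods.company"
)
```

A check like this cannot prove that a mapping is right, only that the cited evidence is consistent with it, so anything it marks "to be confirmed" still goes to human review.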
Results
Field‑level lineage for thousands of real‑time tasks and tens of thousands of offline tasks went from “invisible” to “partially visible.” This enables quick downstream impact analysis, metric traceability, and sensitive‑data chain inspection without exhaustive code reviews.
Future Outlook
Planned enhancements include proactive anomaly‑propagation inference, CI/CD integration to surface downstream impact at code‑commit time, and LLM‑powered natural‑language query interfaces that let business users ask questions such as “How is this metric calculated?” (currently in prototype validation).
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
