
Design, Optimization, and Use Cases of Data Lineage in ByteDance's DataLeap Platform

This article presents an in‑depth overview of DataLeap's data lineage capabilities, covering the challenges, multi‑layer model design, implementation with Apache Atlas and JanusGraph, performance optimizations, diverse use cases across asset, development, governance and security domains, and future trends for lineage technology.

DataFunTalk

DataLeap, the big‑data R&D governance suite of Volcano Engine, provides end‑to‑end data lineage to help users integrate, develop, operate, govern, and secure data assets, reducing costs and unlocking data value for enterprise decision‑making.

1. Data Lineage Model – Challenges

The lineage model must address scalability for rapidly growing metadata, high performance for insert/update operations, timeliness to avoid stale lineage, and business enablement by balancing technical cost with business benefit.

2. Data Lineage Model – Presentation Layer

Assets from various metadata sources (Hive, ClickHouse, Kafka, ES, Redis) are unified and displayed as nodes, with edges representing production relationships between upstream and downstream assets.

3. Data Lineage Model – Abstract Layer

The abstract model consists of asset nodes (tables, topics) and task nodes (jobs). Examples include FlinkSQL consuming a Kafka topic and writing to a Hive table, schema propagation, and sub‑task linking.
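
The asset-node and task-node split can be illustrated with a minimal sketch. The class names, fields, and the example topic and table names below are hypothetical and are not taken from DataLeap; the sketch only shows how one task node connecting input assets to output assets expands into lineage edges, using the FlinkSQL (Kafka topic to Hive table) example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class AssetNode:
    """A data asset such as a Hive table or a Kafka topic (hypothetical shape)."""
    source: str  # metadata source, e.g. "hive" or "kafka"
    name: str    # name of the asset within that source


@dataclass
class TaskNode:
    """A job that reads input assets and produces output assets."""
    task_id: str
    engine: str
    inputs: List[AssetNode] = field(default_factory=list)
    outputs: List[AssetNode] = field(default_factory=list)

    def edges(self) -> List[Tuple[AssetNode, str, AssetNode]]:
        """Expand the task into (upstream, task_id, downstream) triples."""
        return [(src, self.task_id, dst)
                for src in self.inputs
                for dst in self.outputs]


# A FlinkSQL job that consumes a Kafka topic and writes to a Hive table.
topic = AssetNode("kafka", "orders_topic")
table = AssetNode("hive", "dw.orders")
job = TaskNode("flink_job_1", "FlinkSQL", inputs=[topic], outputs=[table])
lineage_edges = job.edges()
```

Keeping the task as its own node, rather than drawing asset-to-asset edges directly, leaves room to attach task-level metadata and sub-tasks to the relationship.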

4. Data Lineage Model – Implementation Layer

The model is implemented on top of Apache Atlas. Atlas's DataSet and Process types are extended with ByteDance‑specific metadata and sub‑task definitions to store lineage information.
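
As a rough illustration of extending the Atlas base types, the sketch below builds a type-definition payload in the JSON shape that Atlas's v2 type system uses (`entityDefs`, `superTypes`, `attributeDefs`). The type names `example_table` and `example_task` and all attribute names are hypothetical; the article does not list the actual ByteDance extensions.

```python
import json


def attribute(name: str, type_name: str = "string",
              cardinality: str = "SINGLE") -> dict:
    """Build an optional, non-unique Atlas attribute definition."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": True,
        "cardinality": cardinality,
        "isUnique": False,
        "isIndexable": False,
    }


typedefs = {
    "entityDefs": [
        {
            # Hypothetical asset type; inherits from Atlas's DataSet.
            "name": "example_table",
            "superTypes": ["DataSet"],
            "attributeDefs": [attribute("business_owner")],
        },
        {
            # Hypothetical task type; inherits inputs/outputs from Process.
            "name": "example_task",
            "superTypes": ["Process"],
            "attributeDefs": [
                attribute("engine"),
                attribute("sub_tasks", "array<string>", "LIST"),
            ],
        },
    ]
}

# JSON body that could be submitted to an Atlas type-definition endpoint.
payload = json.dumps(typedefs, indent=2)
```

Because the task type inherits from Process, it carries the standard `inputs` and `outputs` relationships, so only the platform-specific attributes need to be declared.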

5. Data Lineage Model – Storage Layer

The storage layer uses JanusGraph, Atlas's native graph database (backed by HBase), and stores edges as properties of asset nodes. Lineage can optionally be migrated to an OLTP database such as MySQL for performance or cost reasons.
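
To give a sense of what a relational layout for lineage might look like, here is a minimal sketch of an edge table with lookups in both directions. It uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name, column names, and sample assets are all assumptions made for illustration.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lineage_edge (
    upstream   TEXT NOT NULL,
    downstream TEXT NOT NULL,
    task_id    TEXT NOT NULL,
    PRIMARY KEY (upstream, downstream, task_id)
);
-- The primary key serves upstream lookups; this index serves downstream ones.
CREATE INDEX idx_edge_downstream ON lineage_edge (downstream);
""")

conn.executemany(
    "INSERT INTO lineage_edge (upstream, downstream, task_id) VALUES (?, ?, ?)",
    [
        ("kafka.orders_topic", "hive.dw.orders", "flink_job_1"),
        ("hive.dw.orders", "hive.dw.orders_daily", "hive_job_2"),
    ],
)


def downstream_of(asset: str) -> List[str]:
    """Return the assets produced directly from `asset`."""
    rows = conn.execute(
        "SELECT downstream FROM lineage_edge WHERE upstream = ? ORDER BY downstream",
        (asset,),
    )
    return [row[0] for row in rows]


def upstream_of(asset: str) -> List[str]:
    """Return the assets that `asset` is produced directly from."""
    rows = conn.execute(
        "SELECT upstream FROM lineage_edge WHERE downstream = ? ORDER BY upstream",
        (asset,),
    )
    return [row[0] for row in rows]
```

A one-hop lookup becomes a single indexed query in this layout, while multi-hop traversal has to be done by repeated queries in the application.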

6. Data Lineage Optimizations

• Real‑time updates: two approaches were considered, an engine‑side hook during task execution or task‑platform notifications via API/MQ. The latter was chosen, reducing lineage latency from days to minutes.

• Query optimization: batch query support added to JanusGraph to improve multi‑node lineage queries, with asynchronous processing for high‑traffic assets.

• Open export: lineage can be exported to Excel, warehouse tables, APIs, or streamed via topics for downstream consumption.
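
The notification-driven update path can be sketched as a small message handler. The message format (a JSON object with `task_id`, `action`, `inputs`, and `outputs`) and the in-memory store are assumptions for illustration; the article does not describe the real payload or the consumer. The point of the sketch is that each notification replaces a task's edges, so re-delivered messages are harmless.

```python
import json
from collections import defaultdict
from typing import Dict, Set, Tuple

# task_id -> set of (upstream, downstream) edges attributed to that task
edges_by_task: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)


def apply_notification(message: str) -> None:
    """Apply one task-change notification to the lineage store.

    Hypothetical message format: {"task_id", "action", "inputs", "outputs"},
    where action is "upsert" or "delete".
    """
    event = json.loads(message)
    task_id = event["task_id"]
    if event["action"] == "delete":
        edges_by_task.pop(task_id, None)
        return
    # Replace instead of append: duplicates are idempotent, and edges that
    # were dropped from the task definition disappear from lineage.
    edges_by_task[task_id] = {
        (src, dst) for src in event["inputs"] for dst in event["outputs"]
    }


apply_notification(json.dumps({
    "task_id": "flink_job_1",
    "action": "upsert",
    "inputs": ["kafka.orders_topic"],
    "outputs": ["hive.dw.orders"],
}))
```

In a real deployment the handler would be driven by an MQ consumer or an API endpoint and would write to the graph store rather than a dictionary.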

7. Use Cases

• Asset domain: lineage drives asset heat‑map calculations (PageRank‑style) and helps users understand data provenance.

• Development domain: supports impact analysis (pre‑change impact) and root‑cause attribution (post‑incident debugging) by tracing upstream/downstream dependencies.

• Governance domain: enables link‑status tracking for SLA assurance and data‑warehouse cleanup by identifying redundant tables.

• Security domain: enforces security‑level propagation rules and automates security‑tag labeling across lineage graphs.
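
Impact analysis and root‑cause attribution are the same traversal run in opposite directions, which the following sketch illustrates with a breadth-first search over a small hypothetical graph. The asset names and the dictionary-based graph are made up for the example and stand in for queries against the lineage store.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage: each key directly produces the assets in its list.
DOWNSTREAM: Dict[str, List[str]] = {
    "kafka.orders_topic": ["hive.dw.orders"],
    "hive.dw.orders": ["hive.dw.orders_daily", "clickhouse.orders_rt"],
    "hive.dw.orders_daily": ["hive.app.revenue_report"],
}


def invert(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Reverse every edge so that traversal walks upstream instead."""
    inverted: Dict[str, List[str]] = {}
    for src, dsts in graph.items():
        for dst in dsts:
            inverted.setdefault(dst, []).append(src)
    return inverted


UPSTREAM = invert(DOWNSTREAM)


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every node reachable from `start`, excluding `start` itself."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Pre-change impact analysis: what is affected if dw.orders changes?
impacted = reachable(DOWNSTREAM, "hive.dw.orders")
# Post-incident root cause: which assets feed the broken report?
suspects = reachable(UPSTREAM, "hive.app.revenue_report")
```

The `seen` set keeps the traversal finite even if the lineage graph contains cycles.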

8. Future Outlook

• Generalized lineage parsing: building a standard SQL parser for universal lineage extraction.

• Non‑intrusive collection for non‑SQL jobs (e.g., JAR tasks) to capture runtime lineage.

• Temporal lineage: storing lineage snapshots over time to support time‑based impact analysis.

• End‑to‑end lineage across front‑end, back‑end, and reporting layers, and extending capabilities to cloud environments with heterogeneous data types.

Overall, DataLeap's lineage solution reduces development effort, improves data quality, and provides a foundation for advanced data governance and analytics.
