
Evolution and Architecture of Data Lineage in Volcano Engine DataLeap

This article outlines the background, development stages, architectural evolution, key features such as incremental updates and quality metrics, and future directions of the data lineage capability within Volcano Engine's DataLeap big‑data governance platform.

DataFunSummit

01 Background Introduction

DataLeap, the big-data governance suite within Volcano Engine's VeDI platform, provides end-to-end data mid-platform capabilities such as integration, development, operations, governance, asset management, and security, reducing costs and unlocking data value for enterprise decision-making. Data lineage is the foundational capability that helps users locate, understand, and exploit that data.

1. Data lineage as a core capability of the data‑asset platform

In DataLeap, the data‑asset platform offers metadata search, display, asset management, and knowledge discovery. Data lineage enables users to find data, comprehend its origins, and realize its value.

2. The data-link landscape at ByteDance

Data originates from two main sources: tracking data (collected from APP and Web, logged, and sent to message queues) and business data (typically stored online in systems like RDS). The middle layer consists of offline warehouses (e.g., Hive) that ingest data from queues or online stores, process it, and forward it to OLAP engines such as ClickHouse. Message queues may also fan‑out data via Flink or other tasks.

Data destinations

Processed data is mainly consumed by metric systems (e.g., daily active users) and reporting systems that visualize these metrics.

Data services

Data-service APIs expose data from message queues and online stores to downstream consumers; these service flows, like the rest of the data link, fall within the scope of data lineage.

Lineage Development Overview

The evolution of lineage at ByteDance is divided into three stages.

First stage (around 2019)

Provides basic lineage capabilities for Hive and ClickHouse, supporting table‑level and column‑level lineage for over 10 metadata types.

Second stage (starting early 2020)

Introduces task lineage and expands supported metadata to more than 15 types.

Third stage (from mid‑2021 to present)

Performs a GMA overhaul of the metadata system and upgrades the lineage architecture, adding richer functions such as near‑real‑time updates (within one minute), change‑notification via message queues, lineage‑quality assessment, and standardized onboarding.

02 Evolution of Data‑Lineage Architecture

1. First version: basic capabilities and initial use cases

Architecture

Lineage data is generated from two sources: the data‑development platform (users write tasks) and third‑party platforms that compute tracking data. A daily offline job creates lineage snapshot files, which are compared day‑by‑day to detect changes and load them into a graph database. Metadata is duplicated in the graph for quick access.
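
The day-over-day snapshot comparison can be sketched as a plain set difference. This is an illustrative sketch, not the actual implementation; the table names and tuple-based edge format are assumptions.

```python
# Hypothetical sketch: diff two daily lineage snapshots to find which
# edges to insert into or delete from the graph database.
# Edges are (upstream, downstream) pairs; all names are illustrative.

def diff_snapshots(yesterday: set, today: set):
    """Return (added, removed) lineage edges between two daily snapshots."""
    added = today - yesterday      # edges that appeared today
    removed = yesterday - today    # edges that vanished since yesterday
    return added, removed

yesterday = {("ods.orders", "dwd.orders"), ("dwd.orders", "ads.daily_gmv")}
today = {("ods.orders", "dwd.orders"), ("dwd.orders", "ads.order_stats")}

added, removed = diff_snapshots(yesterday, today)
# 'added' is loaded into the graph database; 'removed' is deleted from it.
```

The cost of this design is latency: a change made in the morning is not visible in lineage until the next daily run, which motivates the incremental updates of the third version.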

Storage includes a graph database plus auxiliary MySQL and index stores. Consumption is limited to API queries.

Storage model

Two separate graphs: one for table‑level lineage (Hive, ClickHouse tables) and one for column‑level lineage, each containing duplicated metadata.

2. Second version: expanding value and use cases

Architecture

It removes the duplicated metadata and pre-computed statistics, and introduces a new pipeline that imports lineage snapshots into an offline warehouse for batch analysis and monitoring.

Adds task‑type nodes to support three traversal scenarios: pure data lineage, mixed data‑and‑task lineage, and pure task lineage.
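
The three traversal scenarios amount to walking the same graph while filtering on node type. The following sketch assumes a toy in-memory graph with typed nodes; the node names and the breadth-first helper are illustrative, not the production implementation.

```python
# Hypothetical sketch: one graph, three traversal modes.
# Nodes are either data assets ("table") or tasks ("task").

EDGES = [
    ("table_a", "task_1"), ("task_1", "table_b"),
    ("table_b", "task_2"), ("task_2", "table_c"),
]
NODE_TYPE = {"table_a": "table", "table_b": "table", "table_c": "table",
             "task_1": "task", "task_2": "task"}

def downstream(node, mode):
    """BFS downstream of `node`; 'data' keeps only tables, 'task' keeps
    only tasks, 'mixed' returns both in traversal order."""
    out, seen, frontier = [], {node}, [node]
    while frontier:
        nxt = [d for s, d in EDGES for f in frontier if s == f]
        frontier = [n for n in nxt if n not in seen]
        seen.update(frontier)
        out.extend(frontier)
    if mode == "data":
        return [n for n in out if NODE_TYPE[n] == "table"]
    if mode == "task":
        return [n for n in out if NODE_TYPE[n] == "task"]
    return out  # mixed data-and-task lineage
```

For example, `downstream("table_a", "data")` yields only the downstream tables, while `"task"` mode yields only the producing tasks.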

Storage model

Unifies the two previous graphs into a single graph, enabling table‑to‑column traversal, and adds task‑type nodes.

3. Third version: lineage as a core data‑value capability

Architecture

Expands data sources to include reporting and third‑party profiling platforms. Introduces real‑time consumption, plugin‑based parsers for different task types, unified metadata storage (graph + index), and a validation module that leverages ingestion‑point data to assess lineage quality.

Storage model

Task‑centric graph: tasks become central nodes linking source and target tables. Table and column lineage are unified via virtual tasks when necessary.
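
One way to picture the task-centric model is a node type whose inputs, outputs, and column mappings hang off the task itself. The class below is a simplified sketch under that assumption; the field names and the "virtual task" naming are illustrative.

```python
# Hypothetical sketch of the task-centric storage model: a task node
# links its input and output tables, and column lineage attaches to a
# real or "virtual" task. All names are illustrative.

from dataclasses import dataclass, field

@dataclass
class TaskNode:
    name: str
    virtual: bool = False  # True when synthesized to unify table/column lineage
    inputs: list = field(default_factory=list)       # upstream tables
    outputs: list = field(default_factory=list)      # downstream tables
    column_map: dict = field(default_factory=dict)   # output col -> input cols

# A concrete ETL task carrying both table- and column-level lineage:
etl = TaskNode("dwd_orders_etl",
               inputs=["ods.orders"], outputs=["dwd.orders"],
               column_map={"dwd.orders.amount": ["ods.orders.amount"]})

# A virtual task stands in where lineage is known but no real task exists:
bridge = TaskNode("virtual_link", virtual=True,
                  inputs=["dwd.orders"], outputs=["ads.daily_gmv"])
```

Making the task the hub means table-level and column-level edges share one anchor, which is what lets a single graph serve all three traversal scenarios.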

Incremental updates

Supports fine‑grained updates when a task’s logic changes, by creating, deleting, or modifying edges in the graph.
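
With a task-centric graph, a fine-grained update only needs to swap the edges touched by the changed task. A minimal sketch of that idea, with assumed tuple edges and illustrative names:

```python
# Hypothetical sketch: when one task's logic changes, replace only that
# task's edges and leave the rest of the graph untouched.

def incremental_update(graph: set, task: str, new_edges: set) -> set:
    """Drop every edge involving `task`, then add its freshly parsed edges."""
    untouched = {edge for edge in graph if task not in edge}
    return untouched | new_edges

graph = {("t1", "task_x"), ("task_x", "t2"),
         ("t2", "task_y"), ("task_y", "t3")}

# task_x's SQL now writes table t4 instead of t2:
graph = incremental_update(graph, "task_x",
                           {("t1", "task_x"), ("task_x", "t4")})
```

This is what enables the near-real-time (within one minute) updates mentioned above, since no full-snapshot diff is required.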

Lineage standardization

Provides an ETL‑style pipeline where lineage data is parsed, filtered, transformed into events, and written to the asset platform via a sink; a shared SDK abstracts common logic while allowing custom components.
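
The shared-SDK idea can be sketched as a fixed parse, filter, transform, sink skeleton into which each platform plugs its own stages. The class and stage names below are assumptions for illustration, not the SDK's actual API.

```python
# Hypothetical sketch of the standardized onboarding pipeline: a shared
# skeleton runs parse -> filter -> transform -> sink, with each platform
# supplying custom components. All names are illustrative.

class LineagePipeline:
    def __init__(self, parser, filters, transformer, sink):
        self.parser, self.filters = parser, filters
        self.transformer, self.sink = transformer, sink

    def run(self, raw_records):
        events = []
        for record in raw_records:
            parsed = self.parser(record)              # platform-specific parse
            if all(f(parsed) for f in self.filters):  # drop irrelevant records
                events.append(self.transformer(parsed))  # to lineage event
        self.sink(events)                             # write to asset platform
        return events

collected = []  # stand-in for the asset-platform sink
pipeline = LineagePipeline(
    parser=lambda r: dict(r),
    filters=[lambda p: p.get("type") == "hive_sql"],
    transformer=lambda p: {"task": p["task"], "edge": (p["src"], p["dst"])},
    sink=collected.extend,
)
records = [
    {"type": "hive_sql", "task": "job_1", "src": "ods.orders", "dst": "dwd.orders"},
    {"type": "log_dump", "task": "job_2", "src": "a", "dst": "b"},  # filtered out
]
events = pipeline.run(records)
```

Because only the parser, filters, and transformer vary per platform, onboarding a new source reduces to implementing those stages against the shared skeleton.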

Lineage quality – coverage

Coverage = number of assets with lineage / number of assets of interest (assets with production tasks). Example: 8 of 9 relevant tables are covered → ≈89% coverage.

Lineage quality – accuracy

Accuracy = correctly parsed tasks / total tasks of the same type. Example: 2 correct out of 4 tasks → 50% accuracy.
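
Both metrics are simple ratios; the snippet below just restates them in code, reusing the numbers from the article's two examples (function names are illustrative).

```python
# The two lineage-quality metrics from the text, as plain ratios.

def coverage(assets_with_lineage: int, assets_of_interest: int) -> float:
    """Share of assets of interest that have lineage recorded."""
    return assets_with_lineage / assets_of_interest

def accuracy(correctly_parsed: int, total_tasks: int) -> float:
    """Share of tasks of one type whose lineage was parsed correctly."""
    return correctly_parsed / total_tasks

cov = coverage(8, 9)   # 8 of 9 relevant tables covered, ~89%
acc = accuracy(2, 4)   # 2 of 4 tasks parsed correctly, 50%
```
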

Current status at ByteDance

Metadata coverage: Hive 98%, ClickHouse 96%, Kafka topics 70%. Accuracy: DTS integration tasks >99%, Hive SQL ~97%, Flink SQL ~81%.

03 Architecture Comparison

A comparison of the three versions shows progressive improvement: consumption evolves from API-only queries, to offline-warehouse analysis, to incremental updates over message queues; incremental updates arrive in version 3; task lineage appears in version 2 and lineage-quality assessment in version 3; metadata storage is unified in version 3; and onboarding time drops from 7-10 days to 3-4 days thanks to standardization.

04 Future Outlook

1. Continue simplifying the architecture and unifying offline and real-time tasks.
2. Expand ecosystem support to external and open-source metadata, and provide one-stop lineage standardization.
3. Improve lineage quality and enable rapid diagnosis of complex data-link issues.
4. Offer intelligent scenarios by exposing key lineage chains to accelerate troubleshooting.

The described lineage capabilities are already offered to external users via Volcano Engine DataLeap.

Tags: big data, real-time processing, metadata, data lineage, data governance, DataLeap
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
