Data Lineage System Design and Implementation for Big Data Platforms
This article presents Data-Lineage, a comprehensive data lineage system for big data platforms. It addresses the challenges of heterogeneous data sources, multiple execution engines, and complex dependencies through a hook-based architecture and modular design.
The article begins by establishing data lineage as a critical component of data provenance, quality assessment, and metadata management. It describes how data flows through the layers of a typical big data platform, from raw data sources through processing stages to final consumption.
The proposed architecture consists of four main modules: Hook, Collector, Lineage, and Common. The Hook module uses a plugin-based design to intercept execution-engine operations and extract lineage information. The Collector module receives and processes lineage data from the various hooks. The Lineage module provides query interfaces and SQL parsing capabilities. The Common module offers shared utilities and a custom logging framework.
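The division of labor between the Hook and Collector modules can be illustrated with a minimal Python sketch. The `LineageEvent` record, the `Collector` class, and the table names below are all hypothetical; the real system transports events over HTTP rather than in memory.

```python
from dataclasses import dataclass

@dataclass
class LineageEvent:
    """Minimal lineage record handed from a Hook to the Collector."""
    engine: str           # e.g. "hive", "datax", "flink", "impala"
    source_tables: list   # tables the job read from
    target_tables: list   # tables the job wrote to

class Collector:
    """Receives events from hooks and answers simple lineage queries."""
    def __init__(self):
        self.events = []

    def receive(self, event: LineageEvent):
        self.events.append(event)

    def upstream_of(self, table: str):
        """Which tables directly feed `table`?"""
        return sorted({src for e in self.events
                       for src in e.source_tables
                       if table in e.target_tables})

collector = Collector()
collector.receive(LineageEvent("hive", ["ods.orders"], ["dw.orders_daily"]))
collector.receive(LineageEvent("datax", ["dw.orders_daily"], ["ads.report"]))
print(collector.upstream_of("dw.orders_daily"))  # ['ods.orders']
```

Keeping the event schema engine-agnostic is what lets one Collector serve hooks from Hive, DataX, Flink, and Impala alike.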
Technical implementations are detailed for Hive, DataX, Flink, and Impala execution engines. For Hive, the system leverages Post-execution Hooks to capture query plans and extract source/destination table information. For DataX, it implements custom hook functions to parse job configurations. For Flink, it modifies the source code to add hook functionality. For Impala, it parses built-in lineage logs.
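For Impala, extraction reduces to parsing one JSON lineage record per query. The sketch below assumes a simplified log entry (the field names `vertices`/`edges` mirror Impala's format, but real logs carry column-level vertices and more metadata, and `parse_lineage` is a hypothetical helper).

```python
import json

# One simplified Impala-style lineage log line (illustrative content).
log_line = json.dumps({
    "queryText": "insert into dw.t select * from ods.s",
    "vertices": [
        {"id": 0, "vertexType": "TABLE", "vertexId": "ods.s"},
        {"id": 1, "vertexType": "TABLE", "vertexId": "dw.t"},
    ],
    "edges": [{"sources": [0], "targets": [1], "edgeType": "PROJECTION"}],
})

def parse_lineage(line: str):
    """Extract (source, target) pairs from one lineage log line."""
    entry = json.loads(line)
    by_id = {v["id"]: v["vertexId"] for v in entry["vertices"]}
    pairs = []
    for edge in entry["edges"]:
        for s in edge["sources"]:
            for t in edge["targets"]:
                pairs.append((by_id[s], by_id[t]))
    return pairs

print(parse_lineage(log_line))  # [('ods.s', 'dw.t')]
```

The Hive and DataX paths differ only in where the raw information comes from (the post-execution hook context and the job configuration, respectively); the normalization step is the same.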
The system employs a factory pattern for handling different data source types and uses HTTP-based communication between modules. It includes SQL parsing capabilities for permission verification and metadata management. The Common module provides entity classes, exception handling, enums, utilities, and a custom logging framework specifically designed for hook operations.
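The factory pattern described above can be sketched as follows. The handler classes, the regex-based SQL extraction, and the DataX payload shape are all assumptions for illustration; a production system would use a real SQL parser rather than regular expressions.

```python
import re

class LineageHandler:
    """Common interface every data-source handler implements."""
    def extract(self, payload: dict) -> dict:
        raise NotImplementedError

class SqlHandler(LineageHandler):
    """Naive regex extraction of source/target tables from a SQL string."""
    def extract(self, payload):
        sql = payload["sql"]
        targets = re.findall(
            r"insert\s+(?:into|overwrite\s+table)\s+([\w.]+)", sql, re.I)
        sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
        return {"sources": sorted(set(sources)), "targets": targets}

class DataxHandler(LineageHandler):
    """Reads source/target tables from a DataX-style job configuration."""
    def extract(self, payload):
        return {"sources": [payload["reader"]["table"]],
                "targets": [payload["writer"]["table"]]}

HANDLERS = {"sql": SqlHandler, "datax": DataxHandler}

def handler_for(kind: str) -> LineageHandler:
    """Factory: choose a handler by data-source type."""
    return HANDLERS[kind]()

print(handler_for("sql").extract(
    {"sql": "INSERT INTO dw.t SELECT a FROM ods.s JOIN ods.u ON 1=1"}))
# {'sources': ['ods.s', 'ods.u'], 'targets': ['dw.t']}
```

Registering handlers in a dictionary keeps the factory open for extension: supporting a new source type means adding one class and one entry, with no change to callers.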
Future work includes extending support to additional data sources like MySQL, Oracle, and Kafka, as well as implementing data tagging and popularity analysis based on lineage data.
Beijing SF i-TECH City Technology Team