Douyin Group Data Asset Management Platform: Full‑Stack Data Lineage Evolution and Applications
This article introduces Douyin Group’s end‑to‑end data asset management platform, explains the evolution and architecture of its large‑scale data lineage system, presents quality metrics and ecosystem components, and outlines practical applications and future directions for data governance, development, and security.
The Douyin Group has built a one‑stop data‑asset portal that goes beyond traditional metadata collection, focusing on a systematic "manage‑find‑use" approach to serve precise data‑search needs across complex business scenarios.
The platform ingests diverse data sources into a unified metadata lake, enriches assets with active metadata, and evaluates asset completeness through an asset‑assessment framework. It powers search, portal, recommendation, and AI‑driven search capabilities for data‑asset consumption.
Data Lineage Overview
Douyin aims to build real‑time, comprehensive, and accurate big‑data lineage that underpins all downstream applications, treating lineage as the core of its metadata system.
Motivation: visualize massive task graphs, ensure production quality, safeguard data security, and reduce resource costs.
Lineage coverage includes source/ingestion lineage, production (real‑time & offline) lineage, and application‑level lineage.
Lineage Model Abstraction
Two graph models are used: a dense model (fast reads, slower updates) and a lightweight model (fast updates, slower reads). The generalized model abstracts three entity types—DataStore (e.g., Hive tables), Column, and Process (tasks)—and defines six relationship types to capture table‑level, column‑level, and operator‑level lineage.
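The generalized model above can be sketched as a small property graph. This is a minimal illustration, not the platform's actual schema: the entity kinds come from the article, but the class names, relationship labels (`READS`, `WRITES`), and table names are hypothetical.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Entity:
    kind: str   # "DataStore", "Column", or "Process"
    name: str

class LineageGraph:
    """Minimal property-graph sketch: typed entities joined by typed relationships."""
    def __init__(self):
        self.edges = defaultdict(set)  # (source entity, relationship) -> {target entities}

    def relate(self, src: Entity, rel: str, dst: Entity):
        self.edges[(src, rel)].add(dst)

    def neighbors(self, src: Entity, rel: str):
        return self.edges.get((src, rel), set())

# Example: one ETL Process reads a Hive table and writes another.
src_table = Entity("DataStore", "ods.user_events")
dst_table = Entity("DataStore", "dwd.user_events_cleaned")
etl = Entity("Process", "task_clean_events")

g = LineageGraph()
g.relate(etl, "READS", src_table)
g.relate(etl, "WRITES", dst_table)
```

Column‑level and operator‑level lineage would add `Column` entities and further relationship types on the same structure.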
Quality Metrics
Lineage coverage rate – proportion of tasks successfully parsed.
Lineage accuracy rate – correctness of parsed relationships.
Lineage completeness rate – extent to which lineage fully covers data flows.
System Architecture
The architecture addresses challenges of fine‑grained parsing, non‑structured sources (e.g., Redis, Kafka), cross‑region lineage, and large‑scale application‑level lineage. It consists of data source collection, metadata & lineage ingestion, graph storage (JanusGraph/Neo4j/NebulaGraph), and unified analysis services supporting both real‑time and offline scenarios.
Unified Parsing Service
Combines ANTLR (lexical and syntactic parsing) with Apache Calcite (SQL‑centric semantic analysis) to support multiple dialects and complex scripts, converting ANTLR parse trees into Calcite SqlNodes for lineage extraction.
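The real service parses full SQL dialects with ANTLR and Calcite; the toy stdlib sketch below only illustrates the *output shape* of table‑level extraction — a target table plus its source tables — for a simple `INSERT … SELECT`, using a regex in place of a real parser:

```python
import re

def table_lineage(sql: str):
    """Toy table-level lineage extractor for simple INSERT ... SELECT statements.
    Returns (target_table, [source_tables]). A regex stands in for real parsing."""
    target = re.search(r"INSERT\s+(?:INTO|OVERWRITE\s+TABLE)\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return (target.group(1) if target else None, sources)

sql = """
INSERT OVERWRITE TABLE dwd.orders
SELECT o.id, u.name FROM ods.orders o JOIN ods.users u ON o.uid = u.id
"""
print(table_lineage(sql))  # ('dwd.orders', ['ods.orders', 'ods.users'])
```

A production parser must additionally resolve subqueries, CTEs, views, and dialect‑specific syntax, which is precisely why a grammar‑based stack (ANTLR + Calcite) is used instead.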
Lineage Access Services
Production lineage – extracts table‑to‑table and column‑to‑column dependencies from ETL jobs.
Cross‑region lineage – aggregates local lineage and stitches it across regions via a message bus.
Application lineage – captures end‑to‑end dependencies from low‑code platforms, RPC/HTTP calls, and trace logs, enabling impact analysis and security checks.
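Cross‑region stitching, at its simplest, is a union of per‑region edge sets in which duplicate reports of the same edge (e.g. heard twice over the message bus) collapse away. A hypothetical sketch; the region names and table names are invented for illustration:

```python
def stitch(regional_edges: dict[str, set[tuple[str, str]]]) -> set[tuple[str, str]]:
    """Union per-region table-level lineage edges into one global, deduplicated edge set."""
    global_edges: set[tuple[str, str]] = set()
    for region, edges in regional_edges.items():
        global_edges |= edges   # set union also deduplicates repeated reports
    return global_edges

cn = {("cn.ods.events", "cn.dwd.events")}
sg = {("cn.dwd.events", "sg.ads.report"),   # cross-region consumer
      ("cn.ods.events", "cn.dwd.events")}   # duplicate report, deduplicated on union
print(sorted(stitch({"cn": cn, "sg": sg})))
```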
Application Scenarios
Lineage supports data development (impact assessment, field‑level debugging, real‑time task shadowing), data governance (low‑value asset identification, cost accounting, timeliness, accuracy, and security assurance), and broader data‑asset use cases.
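Impact assessment boils down to a graph traversal: starting from a changed asset, walk the lineage edges downstream to enumerate everything that could be affected. A minimal breadth‑first sketch over table‑level edges (the table names are illustrative):

```python
from collections import deque

def downstream(edges: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first walk over lineage edges, returning every asset reachable
    downstream of `start` -- the candidates affected by a change to it."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

edges = {
    "ods.events": ["dwd.events"],
    "dwd.events": ["dws.daily_metrics", "ads.dashboard"],
}
print(sorted(downstream(edges, "ods.events")))
# ['ads.dashboard', 'dwd.events', 'dws.daily_metrics']
```

The same traversal run on column‑level edges gives field‑level debugging; run in reverse, it answers "where did this data come from" for accuracy and security checks.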
Future Outlook
Plans include full‑coverage lineage, standardized APIs for community contribution, finer‑grained (row‑level) lineage, and deeper integration of lineage insights into data quality, efficiency, and security workflows.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.