Douyin Group Data Asset Management Platform: Full‑Stack Data Lineage Evolution and Applications
This article introduces Douyin Group’s end‑to‑end data asset management platform, explains the evolution and architecture of its large‑scale data lineage system, presents quality metrics and ecosystem components, and outlines practical applications and future directions for data governance, development, and security.
The Douyin Group has built a one‑stop data‑asset portal that goes beyond traditional metadata collection, focusing on a systematic "manage‑find‑use" approach to serve precise data‑search needs across complex business scenarios.
The platform ingests diverse data sources into a unified metadata lake, enriches assets with active metadata, and evaluates asset completeness through an asset‑assessment framework. It powers search, portal, recommendation, and AI‑driven search capabilities for data‑asset consumption.
Data Lineage Overview
Douyin aims to build real‑time, comprehensive, and accurate big‑data lineage that underpins all downstream applications, treating lineage as the core of its metadata system.
Motivation: visualize massive task graphs, ensure production quality, safeguard data security, and reduce resource costs.
Lineage coverage includes source/ingestion lineage, production (real‑time & offline) lineage, and application‑level lineage.
Lineage Model Abstraction
Two graph models are used: a dense model (fast reads, slower updates) and a lightweight model (fast updates, slower reads). The generalized model abstracts three entity types—DataStore (e.g., Hive tables), Column, and Process (tasks)—and defines six relationship types to capture table‑level, column‑level, and operator‑level lineage.
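The generalized model above can be sketched as a small property graph. This is a minimal illustration, not the platform's actual schema: the entity kinds come from the article, but the class names, relationship labels (`READS`, `WRITES`), and table names are hypothetical.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Entity:
    kind: str   # "DataStore", "Column", or "Process"
    name: str

class LineageGraph:
    """Minimal property-graph sketch: typed entities joined by typed relationships."""
    def __init__(self):
        self.edges = defaultdict(set)  # (source entity, relationship) -> {target entities}

    def relate(self, src: Entity, rel: str, dst: Entity):
        self.edges[(src, rel)].add(dst)

    def neighbors(self, src: Entity, rel: str):
        return self.edges.get((src, rel), set())

# Example: one ETL Process reads a Hive table and writes another.
src_table = Entity("DataStore", "ods.user_events")
dst_table = Entity("DataStore", "dwd.user_events_cleaned")
etl = Entity("Process", "task_clean_events")

g = LineageGraph()
g.relate(etl, "READS", src_table)
g.relate(etl, "WRITES", dst_table)
```

Column‑level and operator‑level lineage would add `Column` entities and further relationship types on the same structure.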
Quality Metrics
Lineage coverage rate – proportion of tasks successfully parsed.
Lineage accuracy rate – correctness of parsed relationships.
Lineage completeness rate – extent to which lineage fully covers data flows.
System Architecture
The architecture addresses challenges of fine‑grained parsing, non‑structured sources (e.g., Redis, Kafka), cross‑region lineage, and large‑scale application‑level lineage. It consists of data source collection, metadata & lineage ingestion, graph storage (JanusGraph/Neo4j/NebulaGraph), and unified analysis services supporting both real‑time and offline scenarios.
Unified Parsing Service
Combines ANTLR (lexical and syntactic parsing) with Apache Calcite (SQL‑centric semantic analysis) to support multiple dialects and complex scripts, converting ANTLR parse trees into Calcite SqlNodes for lineage extraction.
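The real service parses full SQL dialects with ANTLR and Calcite; the toy stdlib sketch below only illustrates the *output shape* of table‑level extraction — a target table plus its source tables — for a simple `INSERT … SELECT`, using a regex in place of a real parser:

```python
import re

def table_lineage(sql: str):
    """Toy table-level lineage extractor for simple INSERT ... SELECT statements.
    Returns (target_table, [source_tables]). A regex stands in for real parsing."""
    target = re.search(r"INSERT\s+(?:INTO|OVERWRITE\s+TABLE)\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return (target.group(1) if target else None, sources)

sql = """
INSERT OVERWRITE TABLE dwd.orders
SELECT o.id, u.name FROM ods.orders o JOIN ods.users u ON o.uid = u.id
"""
print(table_lineage(sql))  # ('dwd.orders', ['ods.orders', 'ods.users'])
```

A production parser must additionally resolve subqueries, CTEs, views, and dialect‑specific syntax, which is precisely why a grammar‑based stack (ANTLR + Calcite) is used instead.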
Lineage Access Services
Production lineage – extracts table‑to‑table and column‑to‑column dependencies from ETL jobs.
Cross‑region lineage – aggregates local lineage and stitches it across regions via a message bus.
Application lineage – captures end‑to‑end dependencies from low‑code platforms, RPC/HTTP calls, and trace logs, enabling impact analysis and security checks.
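Cross‑region stitching, at its simplest, is a union of per‑region edge sets in which duplicate reports of the same edge (e.g. heard twice over the message bus) collapse away. A hypothetical sketch; the region names and table names are invented for illustration:

```python
def stitch(regional_edges: dict[str, set[tuple[str, str]]]) -> set[tuple[str, str]]:
    """Union per-region table-level lineage edges into one global, deduplicated edge set."""
    global_edges: set[tuple[str, str]] = set()
    for region, edges in regional_edges.items():
        global_edges |= edges   # set union also deduplicates repeated reports
    return global_edges

cn = {("cn.ods.events", "cn.dwd.events")}
sg = {("cn.dwd.events", "sg.ads.report"),   # cross-region consumer
      ("cn.ods.events", "cn.dwd.events")}   # duplicate report, deduplicated on union
print(sorted(stitch({"cn": cn, "sg": sg})))
```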
Application Scenarios
Lineage supports data development (impact assessment, field‑level debugging, real‑time task shadowing), data governance (low‑value asset identification, cost accounting, timeliness, accuracy, and security assurance), and broader data‑asset use cases.
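Impact assessment boils down to a graph traversal: starting from a changed asset, walk the lineage edges downstream to enumerate everything that could be affected. A minimal breadth‑first sketch over table‑level edges (the table names are illustrative):

```python
from collections import deque

def downstream(edges: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first walk over lineage edges, returning every asset reachable
    downstream of `start` -- the candidates affected by a change to it."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

edges = {
    "ods.events": ["dwd.events"],
    "dwd.events": ["dws.daily_metrics", "ads.dashboard"],
}
print(sorted(downstream(edges, "ods.events")))
# ['ads.dashboard', 'dwd.events', 'dws.daily_metrics']
```

The same traversal run on column‑level edges gives field‑level debugging; run in reverse, it answers "where did this data come from" for accuracy and security checks.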
Future Outlook
Plans include full‑coverage lineage, standardized APIs for community contribution, finer‑grained (row‑level) lineage, and deeper integration of lineage insights into data quality, efficiency, and security workflows.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.