Metadata Infrastructure and Governance in Bilibili's Data Platform
The article details how Bilibili built a unified metadata infrastructure—including a URN‑based model, collection pipelines, quality assurance, storage in TiDB/ES/HugeGraph, and query services—to support data discovery, lineage, impact analysis, and governance across its growing data platform.
Metadata is derivative data: information about scheduling tasks, Hive tables, topics, and fields, along with their storage, quality, and popularity. It was initially scattered across various subsystems (e.g., HiveMetaStore, the scheduler DB). Early data-platform development focused on business data needs, so there was little demand for unified metadata collection.
As the platform scaled, the large number of tables and tasks generated metadata that was costly to manage and store. New scenarios (model governance, change impact, anomaly detection, and avoiding duplicate construction) required a data map, a lineage map, impact-analysis tools, asset dashboards, and governance utilities, which made a reliable metadata service essential.
Initially, Bilibili implemented custom solutions for specific needs (e.g., direct HMS pulls, Binlog sync, HTTP query interfaces). While functional short‑term, this approach suffered from poor flexibility, high maintenance overhead, and duplicated effort, prompting a shift toward a unified metadata strategy.
The unified strategy aims to standardize the metadata model, collection methods, storage, and query interfaces, enabling consistent support for downstream applications.
Unified Model : The team adopted a URN‑based identification scheme (protocol:datacenter:resource_type:unique_id) and defined 16 resource types (e.g., tables, topics). Entities and relationships are modeled with aspects to separate attributes from different sources, and a builderURN attribute records the entity that constructed a relationship.
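The four-part URN layout can be illustrated with a small sketch. The class names, the field names, and the sample value "urn:dc1:hive_table:ods.archive" are hypothetical and chosen only for illustration; the article does not show Bilibili's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Urn:
    """protocol:datacenter:resource_type:unique_id, as described in the article."""
    protocol: str
    datacenter: str
    resource_type: str
    unique_id: str

    @classmethod
    def parse(cls, text: str) -> "Urn":
        # Split on the first three colons only, so a unique_id that itself
        # contains colons is kept intact (an assumption of this sketch).
        parts = text.split(":", 3)
        if len(parts) != 4 or not all(parts):
            raise ValueError(f"malformed URN: {text!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return ":".join(
            (self.protocol, self.datacenter, self.resource_type, self.unique_id)
        )


@dataclass(frozen=True)
class Relationship:
    """An edge between two entities; builder_urn records who constructed it."""
    source: Urn
    target: Urn
    relation_type: str
    builder_urn: Urn


if __name__ == "__main__":
    table = Urn.parse("urn:dc1:hive_table:ods.archive")  # hypothetical value
    print(table.resource_type, str(table))
```

Keeping the builder on the relationship rather than on either endpoint means an edge can be traced back to, or cleaned up together with, the task that produced it.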
Collection : Three collection patterns were evaluated: batch pull, batch push, and embedded reporting. The solution combines batch pull (for controllable, high-quality ingestion) with embedded reporting (for non-core data), and makes the data-source owners responsible for the conversion logic.
Quality Assurance : A two‑layer mechanism (batch‑level checks and global fallback checks) automatically detects, locates, and resolves inconsistencies caused by hard deletes, transaction issues, or unstable middleware.
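One way to picture the two layers is as a cheap per-batch check followed by a full comparison of identifiers. The sketch below is an assumption about how such checks could look, not the article's implementation; the function names and the use of URN strings as identifiers are hypothetical.

```python
def batch_check(expected_count: int, received_urns: list[str]) -> list[str]:
    """Batch-level check: flag count mismatches and duplicate URNs in one batch."""
    problems = []
    if len(received_urns) != expected_count:
        problems.append(
            f"count mismatch: expected {expected_count}, got {len(received_urns)}"
        )
    if len(set(received_urns)) != len(received_urns):
        problems.append("duplicate URNs in batch")
    return problems


def global_check(source_urns: set[str], stored_urns: set[str]) -> dict[str, set[str]]:
    """Global fallback check: compare the full source view with what is stored.

    'stale' entries exist in storage but no longer in the source, which is
    what a hard delete looks like when no delete event was ever emitted.
    """
    return {
        "missing": source_urns - stored_urns,  # never ingested, or lost in transit
        "stale": stored_urns - source_urns,    # hard-deleted at the source
    }


if __name__ == "__main__":
    source = {"urn:dc1:hive_table:a", "urn:dc1:hive_table:b"}
    stored = {"urn:dc1:hive_table:b", "urn:dc1:hive_table:c"}
    print(batch_check(2, sorted(source)))
    print(global_check(source, stored))
```

The batch-level check runs on every ingestion and is inexpensive; the global comparison is heavier and acts as the safety net for whatever the per-batch checks cannot see.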
Storage : Collected entities are stored in TiDB, searchable metadata in Elasticsearch, and relationship graphs in HugeGraph, providing entity storage, full‑text search, and deep graph traversal capabilities.
Query Service : Generic entity and relationship queries are offered via a SQL‑like parser that translates WHERE clauses (e.g., {"page":1,"size":20,"where":"entity_type = 1 and sec_type = 3 and properties.tabName like '%r_ai.ods.recindexing.archive.test%'"} ) into engine‑specific DSLs. Additional support for multi‑level association queries is provided through an extraProperties payload.
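As a rough illustration of the translation step, the sketch below converts a request of the shape shown above into an Elasticsearch-style search body. It handles only `and`-joined conditions using `=` and `like`, and it is not the article's parser: the function names are hypothetical, and the mapping of `=` to a `term` query and `like` to a `wildcard` query is an assumption.

```python
import re

# One condition: <field> <op> <literal>, where op is "=" or "like".
_CONDITION = re.compile(r"^\s*([\w.]+)\s+(=|like)\s+(.+?)\s*$", re.IGNORECASE)


def _literal(raw: str):
    """Turn a quoted string or an integer literal into a Python value."""
    if len(raw) >= 2 and raw.startswith("'") and raw.endswith("'"):
        return raw[1:-1]
    return int(raw)


def where_to_query(where: str) -> dict:
    must = []
    # Simplification: splitting on "and" breaks if a string literal itself
    # contains " and "; a real parser would tokenize first.
    for cond in re.split(r"\s+and\s+", where, flags=re.IGNORECASE):
        match = _CONDITION.match(cond)
        if match is None:
            raise ValueError(f"unsupported condition: {cond!r}")
        field, op, value = match.group(1), match.group(2).lower(), _literal(match.group(3))
        if op == "=":
            must.append({"term": {field: value}})
        else:
            # SQL '%' becomes the wildcard '*'; '_' is left literal here (assumption).
            must.append({"wildcard": {field: str(value).replace("%", "*")}})
    return {"bool": {"must": must}}


def request_to_search_body(request: dict) -> dict:
    page, size = request["page"], request["size"]
    return {
        "from": (page - 1) * size,  # pages are 1-based in the request
        "size": size,
        "query": where_to_query(request["where"]),
    }


if __name__ == "__main__":
    print(request_to_search_body({
        "page": 1,
        "size": 20,
        "where": "entity_type = 1 and sec_type = 3 and "
                 "properties.tabName like '%r_ai.ods.recindexing.archive.test%'",
    }))
```

Keeping the caller-facing syntax SQL-like means the same request could be routed to a different backend by swapping only the translation function.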
Lineage Construction : Efforts focus on coverage, granularity, and accuracy. Table‑level lineage is built via static SQL parsing; field‑level lineage uses dynamic Hive logs (post‑execution) to achieve higher accuracy; row‑level lineage is currently limited to special cases.
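To make the idea of static table-level lineage concrete, here is a deliberately simplified sketch that extracts source and target tables from a single INSERT ... SELECT statement using regular expressions. A production system would use a real SQL parser; this toy version ignores CTEs, comments, and quoted identifiers, and the function name and sample table names are hypothetical.

```python
import re

_TARGET = re.compile(
    r"\binsert\s+(?:overwrite|into)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) edges for one INSERT ... SELECT."""
    target = _TARGET.search(sql)
    if target is None:
        return []  # not a statement that writes a table
    target_table = target.group(1)
    # Subqueries such as "FROM (SELECT ...)" are skipped by the pattern itself,
    # because "(" is not a table-name character; inner FROM clauses still match.
    sources = sorted({s for s in _SOURCE.findall(sql) if s != target_table})
    return [(source, target_table) for source in sources]


if __name__ == "__main__":
    sql = (
        "INSERT OVERWRITE TABLE dw.orders_daily "
        "SELECT o.id FROM ods.orders o JOIN ods.users u ON o.uid = u.id"
    )
    print(table_lineage(sql))
```

Static parsing of this kind works before a job ever runs, which helps coverage; the article's point is that field-level lineage benefits from post-execution information instead.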
Current Scale : The platform now integrates over 10 metadata sources, 16 entity types, and 10 relationship types, managing >60,000 Hive tables and >110,000 tasks. Generic metadata queries serve ~25,000 PV daily, powering data‑map, impact analysis, lineage map, and data‑extraction services.
Applications :
Data Map – unified search, classification, and hot‑recommendation for tables, topics, ClickHouse tables, and BI reports.
Lineage Map – interactive visualization of entity relationships with dynamic filtering and highlighting.
Impact Analysis – deep traversal of lineage graphs (including field‑level) to assess upstream/downstream effects, with asynchronous execution and caching for performance.
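The deep traversal behind impact analysis can be sketched as a breadth-first search over downstream edges with a small result cache. This is a minimal illustration under stated assumptions: the lineage graph is held as an in-memory adjacency dict rather than in HugeGraph, asynchronous execution is omitted, and all names are hypothetical.

```python
from collections import deque
from typing import Optional


def downstream(graph: dict, start: str, max_depth: Optional[int] = None) -> dict:
    """Return {entity: distance} for everything reachable downstream of start."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = depth_of[node]
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested depth
        for child in graph.get(node, ()):
            if child not in depth_of:  # also guards against cycles
                depth_of[child] = depth + 1
                queue.append(child)
    del depth_of[start]
    return depth_of


class ImpactAnalyzer:
    """Caches traversal results so repeated questions are answered cheaply."""

    def __init__(self, graph: dict):
        self._graph = graph
        self._cache = {}

    def impacted(self, start: str, max_depth: Optional[int] = None) -> dict:
        key = (start, max_depth)
        if key not in self._cache:
            self._cache[key] = downstream(self._graph, start, max_depth)
        return self._cache[key]


if __name__ == "__main__":
    lineage = {"ods.a": ["dwd.b", "dwd.c"], "dwd.b": ["ads.d"], "dwd.c": ["ads.d"]}
    print(ImpactAnalyzer(lineage).impacted("ods.a"))
```

Upstream analysis follows the same pattern on the reversed graph, and field-level analysis uses the same traversal with field URNs as nodes. A real cache would also need invalidation whenever lineage changes.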
Future Plans include expanding quality‑assurance coverage, building a comprehensive metadata dictionary, establishing data‑operation metrics and governance processes, and scaling governance tooling to reduce duplication and cost.
Author: Shen Wangyang, Senior Development Engineer at Bilibili, responsible for metadata, data‑operation, and data‑management tooling.