Apache Gravitino: Metadata Management Practices and Production Experience at Bilibili
Bilibili adopted Apache Gravitino as a unified metadata platform. Gravitino decouples metadata consumers from underlying systems, consolidates schemas and Fileset‑based unstructured data across heterogeneous sources, cuts metadata and storage costs, resolves metadata inconsistencies, improves Hive Metastore performance, and enables features such as Iceberg branching and, going forward, AI‑centric governance.
Traditional big‑data metadata management systems that rely on the Hive Metastore face challenges such as tight coupling, limited data‑governance capabilities, and poor support for semi‑structured and unstructured data. With the rapid growth of data lakes and AI data, and with increasing security requirements, a unified metadata management solution is needed.
Bilibili introduced Apache Gravitino to provide a centralized metadata platform that unifies metadata views across heterogeneous data sources, manages schema information for cross‑source tasks, and governs unstructured data through the Fileset concept, delivering significant operational benefits.
The presentation is organized around four topics:
Pain‑point analysis of metadata management
Apache Gravitino background
Apache Gravitino production practice
Apache Gravitino planning and outlook
1. Pain‑point analysis
Business‑side coupling is too high: metadata consumers access heterogeneous sources through many different interfaces.
Data‑governance capabilities are limited: there is no centralized auditing, permission control, or scheduling.
There is no effective management for semi‑structured and unstructured data (e.g., Kafka schemas, HDFS files).
Maintaining cross‑source schemas is costly.
2. Gravitino background
Gravitino aims to provide a centralized metadata service with the following goals:
Decouple metadata consumers from underlying components.
Maintain unified schema across sources.
Offer unified auditing, TTL (time‑to‑live), and HDFS erasure‑coding (EC) features for cost reduction.
Govern AI data assets centrally.
Establish a unified permission mechanism to improve data security.
Compared with open‑source alternatives such as Metacat, Waggle Dance, and OpenMetadata, Gravitino supports multiple engines, offers richer features, and has a more active community.
3. What is Gravitino?
Gravitino is a high‑performance metadata platform that handles various metadata types from different sources and provides AI data‑asset management. Its architecture consists of four layers:
Data‑application layer: data platforms, Bilibili services, and engines (e.g., SQL Scan, SDM).
Interface layer: unified REST API, plus Thrift and JDBC interfaces for engine integration.
Core layer: catalog‑based management of catalogs, schemas, tables, and Filesets. Metadata is stored externally; Gravitino maintains only catalog information.
Connection layer: connectors to Hive Metastore, Iceberg REST Catalog, etc.
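The core/connection split above can be sketched as follows: the core layer keeps only catalog registrations, and table metadata is fetched on demand from the underlying source through a connector. All class and field names here are illustrative, not Gravitino's actual API.

```python
class HiveConnectorStub:
    """Stands in for a real Hive Metastore connector (connection layer)."""
    def __init__(self, tables):
        self._tables = tables  # {(schema, table): metadata dict}

    def load_table(self, schema, table):
        return self._tables[(schema, table)]

class CatalogRegistry:
    """Core layer: owns catalog entries only, never the table metadata itself."""
    def __init__(self):
        self._catalogs = {}

    def register(self, name, connector):
        self._catalogs[name] = connector

    def load_table(self, ident):
        # Resolve "catalog.schema.table" by delegating to the right connector.
        catalog, schema, table = ident.split(".")
        return self._catalogs[catalog].load_table(schema, table)

registry = CatalogRegistry()
registry.register("hive_prod", HiveConnectorStub({
    ("ods", "events"): {"columns": ["uid", "ts"], "format": "orc"},
}))
meta = registry.load_table("hive_prod.ods.events")
print(meta["format"])  # orc
```

The point of the split is that adding a new source type means adding a connector, while every consumer keeps calling the same core-layer lookup.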
Key capabilities:
Unified metadata view for multiple data sources.
Support for semi‑structured and unstructured data (HDFS files, AI files, Kafka schemas).
Integration with Trino, Spark, and Flink.
Highly active open‑source community.
4. Fileset concept
A Fileset represents a directory of HDFS files together with its metadata (size, file count, creation time). Bilibili built OneMeta on top of Gravitino to expose a unified metadata service, extending the catalog APIs to provide fine‑grained Fileset information.
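A minimal sketch of the metadata a Fileset carries, as described above: a named HDFS directory plus aggregate statistics. The field names are assumptions for illustration, not Gravitino's actual API.

```python
from dataclasses import dataclass

@dataclass
class Fileset:
    """Hypothetical model of a Fileset: a directory plus its statistics."""
    name: str
    storage_location: str   # e.g. an HDFS directory
    size_bytes: int
    file_count: int
    created_at: str         # ISO-8601 timestamp

fs = Fileset(
    name="training_images",
    storage_location="hdfs://ns1/ai/training_images",
    size_bytes=7 * 1024**3,
    file_count=12_000,
    created_at="2024-05-01T00:00:00Z",
)
print(fs.file_count)  # 12000
```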
5. Production practice (OneMeta)
OneMeta offers a unified metadata service built on Gravitino, extending REST, Thrift, and JDBC interfaces.
Catalog extensions provide custom interfaces for partition filtering, batch queries, and Fileset‑level browsing.
OneMeta code is decoupled from the Gravitino core, following the open‑closed principle, which reduces upgrade cost.
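One of the catalog extensions mentioned above is partition filtering. A toy sketch of the idea: given a table's partitions, return only those matching a predicate, so clients avoid listing and filtering everything themselves. The interface is hypothetical; OneMeta's real extension is richer.

```python
def filter_partitions(partitions, key, value):
    """Return only partitions whose spec maps `key` to `value`."""
    return [p for p in partitions if p["spec"].get(key) == value]

partitions = [
    {"name": "log_date=2024-05-01/hour=00", "spec": {"log_date": "2024-05-01", "hour": "00"}},
    {"name": "log_date=2024-05-01/hour=01", "spec": {"log_date": "2024-05-01", "hour": "01"}},
    {"name": "log_date=2024-05-02/hour=00", "spec": {"log_date": "2024-05-02", "hour": "00"}},
]
matched = filter_partitions(partitions, "log_date", "2024-05-01")
print(len(matched))  # 2
```

Pushing the filter server-side matters most for wide tables with thousands of partitions, where returning the full listing dominates request time.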
Benefits of the architecture evolution:
Reduced metadata usage cost by decoupling consumers from heterogeneous sources.
Eliminated metadata inconsistency between Iceberg SDK and Hive Metastore.
Improved Hive Metastore performance: connection pooling and concurrent handling reduced response time by ~70%.
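Connection pooling, the technique cited for the Hive Metastore speedup, can be sketched minimally: reuse a bounded set of connections instead of opening one per request. `FakeConnection` stands in for a real Thrift client; the pool class is illustrative, not Bilibili's implementation.

```python
import queue

class FakeConnection:
    """Counts how many connections are ever opened."""
    opened = 0
    def __init__(self):
        FakeConnection.opened += 1

class ConnectionPool:
    def __init__(self, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(FakeConnection())  # pre-open a bounded set

    def acquire(self):
        return self._pool.get()   # blocks when all connections are in use

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2)
for _ in range(100):              # 100 requests...
    conn = pool.acquire()
    pool.release(conn)
print(FakeConnection.opened)      # 2: only the pooled connections were created
```

The blocking `Queue` also gives concurrency control for free: at most `size` requests hit the Metastore at once.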
6. Cross‑source schema management
Previously, users manually wrote DDL for memory‑catalog or keeper‑catalog, specifying Kafka cluster details. With OneMeta, users only need to provide a Kafka topic name; Flink retrieves schema information from the catalog automatically.
Kafka schema handling supports JSON, delimiter, and Protobuf (PB). For PB, compiled files are stored in HDFS and loaded via a custom classloader at runtime.
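The workflow above can be sketched as a topic-to-schema lookup: instead of hand-writing DDL with cluster details, a job supplies only the topic name and the catalog returns the stored schema, from which connector DDL can be generated. The registry contents and generated DDL shape here are illustrative assumptions.

```python
# Hypothetical registry: topic name -> stored schema entry.
SCHEMA_CATALOG = {
    "ods_play_events": {
        "format": "json",
        "fields": [("uid", "BIGINT"), ("ts", "TIMESTAMP(3)"), ("action", "STRING")],
    },
}

def resolve_schema(topic):
    """Build table DDL for a topic from catalog-stored schema information."""
    entry = SCHEMA_CATALOG.get(topic)
    if entry is None:
        raise KeyError(f"no schema registered for topic {topic!r}")
    cols = ", ".join(f"`{name}` {typ}" for name, typ in entry["fields"])
    return f"CREATE TABLE `{topic}` ({cols}) WITH ('format' = '{entry['format']}')"

ddl = resolve_schema("ods_play_events")
print(ddl)
```

In the real system Flink performs this resolution itself when the topic is referenced, so users never see the generated definition.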
7. HDFS file governance
Bilibili’s HDFS storage reaches EB scale, with non‑table paths accounting for ~30% of storage. EC and TTL policies applied through OneMeta are expected to save >100 PB (EC) and >300 PB (TTL).
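A TTL policy of the kind applied through OneMeta can be sketched as: a non-table path expires when its modification time falls outside the retention window. The paths and ages below are made up for illustration.

```python
from datetime import datetime, timedelta

def expired_paths(paths, ttl_days, now):
    """Return the paths whose mtime is older than the TTL window."""
    cutoff = now - timedelta(days=ttl_days)
    return [p for p, mtime in paths.items() if mtime < cutoff]

now = datetime(2024, 6, 1)
paths = {
    "hdfs://ns1/tmp/job_a": datetime(2024, 1, 1),   # stale, past 30-day TTL
    "hdfs://ns1/tmp/job_b": datetime(2024, 5, 30),  # still fresh
}
stale = expired_paths(paths, ttl_days=30, now=now)
print(stale)  # ['hdfs://ns1/tmp/job_a']
```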
8. GVFS (Gravitino Virtual File System)
GVFS provides a virtual file‑system layer for Fileset access:
Java implementation based on Hadoop Compatible File System (HCFS).
Python implementation based on fsspec.
Usable from Spark JAR, Spark SQL, and Python clients.
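The core job of such a virtual layer is path translation: a virtual `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub-path>` URI is rewritten to the Fileset's actual storage location before I/O is delegated to the real file system. The mapping table and paths below are illustrative assumptions, not GVFS internals.

```python
# Hypothetical lookup: (catalog, schema, fileset) -> storage location.
LOCATIONS = {
    ("ai", "vision", "training_images"): "hdfs://ns1/ai/vision/training_images",
}

def resolve_gvfs(uri):
    """Rewrite a virtual fileset URI to its backing storage path."""
    prefix = "gvfs://fileset/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a GVFS fileset URI: {uri!r}")
    catalog, schema, fileset, *rest = uri[len(prefix):].split("/")
    base = LOCATIONS[(catalog, schema, fileset)]
    return "/".join([base, *rest]) if rest else base

real = resolve_gvfs("gvfs://fileset/ai/vision/training_images/part-0001.jpg")
print(real)  # hdfs://ns1/ai/vision/training_images/part-0001.jpg
```

Because callers only ever see the virtual URI, the backing location can later move (e.g., to object storage) without changing any job code.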
9. Iceberg branch support
Gravitino enables table‑level branching for Iceberg, allowing multiple schema variants to share the same underlying data, reducing storage duplication and simplifying experiment workflows.
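Why branching avoids storage duplication can be shown with a toy model: branches are named refs pointing at snapshots, and snapshots reference shared data files, so creating a branch copies a pointer rather than data. This models the idea only, not Iceberg's actual metadata format.

```python
# Data files exist once; snapshots reference them; refs name snapshots.
data_files = ["f1.parquet", "f2.parquet"]
snapshots = {1: {"files": data_files}}
refs = {"main": 1}

# Creating a branch copies only the ref, not the data files.
refs["experiment"] = refs["main"]

# Both branches resolve to the very same file list object.
shared = snapshots[refs["main"]]["files"] is snapshots[refs["experiment"]]["files"]
print(shared)  # True
```

Divergence is cheap too: a later commit on `experiment` would add a new snapshot referencing mostly the same files, which is what makes schema-variant experiments inexpensive.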
10. Planning outlook
Unified permission management across data sources.
UDF resource management and versioning for AI workloads.
Integration of additional internal data sources.
OneMeta to provide a uniform access pattern for all engines.
AI model lifecycle governance.
Extend Fileset to object storage and integrate with streaming, batch, and OLAP engines.
Key outcomes
Significant reduction in metadata usage cost.
Resolution of metadata inconsistency issues.
Optimized HDFS file governance, saving hundreds of PB.
Improved task efficiency through cross‑source schema and Fileset management.
Enhanced functionality with Iceberg branch support.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.