Bilibili's Lakehouse Architecture: Building a Unified Data Lake and Data Warehouse
Bilibili replaced its Hive-Spark-Presto ETL pipeline with a lakehouse built on Iceberg. Magnus, Trino, and Alluxio unify a PB-scale data lake and data warehouse, and Z-Order sorting and indexing speed up multi-dimensional queries. Further data-organization and pre-computation optimizations are planned.
This article discusses Bilibili's implementation of a lakehouse architecture to address the challenges of managing PB-level data in their big data platform. The authors explain the limitations of their previous data processing workflow, which involved ETL processing and data warehousing using Hive, Spark, and Presto, and the need for a more efficient and cost-effective solution.
The article defines the concepts of data lakes and data warehouses, highlighting the flexibility of data lakes but also their limitations in data management and query efficiency. It then introduces the concept of lakehouse, which aims to combine the flexibility of data lakes with the efficiency of data warehouses.
Bilibili chose to evolve its data lake architecture towards a lakehouse using Iceberg, an open table format. The article explains why Iceberg is suitable for building a lakehouse architecture, emphasizing that it manages table metadata itself and provides snapshot and transaction support.
The authors describe their lakehouse architecture, which includes data ingestion from various sources, Magnus (an intelligent management service for Iceberg), and Trino as the query engine, with Alluxio caching metadata and indexes. They also discuss enhancements made to Iceberg, such as Z-Order sorting and indexing, to improve query performance in multi-dimensional analysis scenarios.
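To make the Z-Order idea concrete, here is a minimal Python sketch. It is not Bilibili's or Iceberg's implementation, which this summary does not detail. It assumes a toy table with two integer columns, files of 16 rows each, and file pruning based only on per-file min/max statistics. The names `interleave_bits` and `files_to_scan` are hypothetical helpers introduced for illustration.

```python
from itertools import product


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers.

    Bit i of x goes to position 2*i and bit i of y goes to position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def files_to_scan(rows, rows_per_file, column, value):
    """Count the files whose min/max range on `column` may contain `value`.

    `rows` is an already sorted list of tuples; each consecutive slice of
    `rows_per_file` rows stands in for one data file (an assumption of
    this sketch, not a description of a real file layout).
    """
    count = 0
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        values = [row[column] for row in chunk]
        if min(values) <= value <= max(values):
            count += 1
    return count


# Toy table: every (x, y) pair on an 8 x 8 grid, 64 rows in total.
points = list(product(range(8), repeat=2))

# Layout 1: ordinary lexicographic sort by x, then y.
linear = sorted(points)

# Layout 2: sort by the interleaved Z-order value of (x, y).
zorder = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

for name, layout in (("linear", linear), ("z-order", zorder)):
    on_x = files_to_scan(layout, 16, column=0, value=5)
    on_y = files_to_scan(layout, 16, column=1, value=5)
    print(f"{name}: filter on x scans {on_x} files, filter on y scans {on_y} files")
```

In this toy setup the lexicographic layout prunes well only for a filter on the leading sort column, while the Z-order layout prunes about equally for a filter on either column. That balance is the property that makes Z-Order attractive when queries filter on several dimensions.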
The article concludes by sharing Bilibili's experience with the lakehouse architecture, including its successful deployment on PB-level data with thousands of queries per day. The authors also mention future directions for further enhancing Iceberg, such as star schema data organization, pre-computation, and intelligent query pattern analysis.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.