Big Data 14 min read

Bilibili's Lakehouse Architecture: Building a Unified Data Lake and Data Warehouse

Bilibili replaced its Hive, Spark and Presto ETL pipeline with a lakehouse built on Iceberg. Magnus, Trino and Alluxio unify a petabyte-scale data lake and warehouse, while Z-Order sorting and indexing speed up multi-dimensional queries. Further schema and pre-computation optimizations are planned.

Bilibili Tech

This article describes how Bilibili adopted a lakehouse architecture to manage petabyte-scale data on its big data platform. The authors explain the limits of the previous workflow, in which data passed through ETL jobs into a warehouse served by Hive, Spark, and Presto, and why a more efficient and cost-effective solution was needed.

The article defines data lakes and data warehouses, noting that data lakes are flexible but weaker in data management and query efficiency. It then introduces the lakehouse, which aims to combine the flexibility of a data lake with the efficiency of a data warehouse.

Bilibili chose to evolve its data lake towards a lakehouse using Iceberg, an open table format. The article explains why Iceberg suits this role, emphasizing that it manages table metadata itself and supports snapshots and transactions.
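To make the snapshot and transaction idea concrete, here is a minimal toy model. It is not Iceberg's API: the `ToyTable` and `Snapshot` names and the file names are hypothetical, and real Iceberg metadata (manifests, manifest lists, catalogs) is far richer. The sketch only shows the general pattern of immutable snapshots plus an optimistic commit that fails if the table changed in the meantime.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the full list of data files at one point in time."""
    snapshot_id: int
    data_files: Tuple[str, ...]


class ToyTable:
    """Hypothetical in-memory table that tracks its own metadata as a chain of snapshots."""

    def __init__(self) -> None:
        self._snapshots: Dict[int, Snapshot] = {0: Snapshot(0, ())}
        self._current_id = 0

    @property
    def current_id(self) -> int:
        return self._current_id

    def commit_append(self, base_id: int, new_files: Iterable[str]) -> int:
        # Optimistic concurrency: the commit succeeds only if the snapshot
        # the writer started from is still the current one.
        if base_id != self._current_id:
            raise RuntimeError(f"conflict: table changed since snapshot {base_id}")
        base = self._snapshots[base_id]
        new_id = base_id + 1
        # Old snapshots are never modified; a new one is added and made current.
        self._snapshots[new_id] = Snapshot(new_id, base.data_files + tuple(new_files))
        self._current_id = new_id
        return new_id

    def scan(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        # Readers see one consistent snapshot, either the current or an older one.
        sid = self._current_id if snapshot_id is None else snapshot_id
        return self._snapshots[sid].data_files


table = ToyTable()
first = table.commit_append(table.current_id, ["part-a.parquet"])
second = table.commit_append(first, ["part-b.parquet"])
```

A reader that started on `first` keeps seeing only `part-a.parquet` after the second commit, and a writer that tries to commit against the stale `first` snapshot is rejected instead of silently overwriting newer data.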

The authors describe their lakehouse architecture: data is ingested from various sources, Magnus (an intelligent management service for Iceberg tables) maintains the tables, and Trino serves as the query engine, with Alluxio caching metadata and indexes. They also discuss enhancements made to Iceberg, such as Z-Order sorting and indexing, that improve query performance in multi-dimensional analysis scenarios.
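The effect of Z-Order sorting on multi-dimensional filters can be illustrated with a small sketch. It is not Bilibili's or Iceberg's implementation: the 8x8 grid of two integer columns, the 16-row file size, and the helper names are assumptions for illustration, and per-file min/max statistics stand in for whatever index the engine actually uses to skip files.

```python
from typing import List, Tuple

Row = Tuple[int, int]      # (x, y): two hypothetical filter columns
Range = Tuple[int, int]    # (min, max) of one column within one file


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Build a Z-order (Morton) key by interleaving the low `bits` bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key


def split_into_files(rows: List[Row], rows_per_file: int) -> List[Tuple[Range, Range]]:
    """Cut sorted rows into fixed-size 'files' and record min/max per column."""
    stats = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        stats.append(((min(xs), max(xs)), (min(ys), max(ys))))
    return stats


def files_to_scan(stats: List[Tuple[Range, Range]], column: int, value: int) -> int:
    """Count files whose min/max range for `column` could contain `value`."""
    return sum(1 for s in stats if s[column][0] <= value <= s[column][1])


rows = [(x, y) for x in range(8) for y in range(8)]

# Plain sort by (x, y): files are narrow in x but span every y value.
linear = split_into_files(sorted(rows), 16)

# Z-order sort: each file covers a compact square of the (x, y) space.
zorder = split_into_files(sorted(rows, key=lambda r: interleave_bits(*r)), 16)

print("linear:", files_to_scan(linear, 0, 3), files_to_scan(linear, 1, 3))
print("zorder:", files_to_scan(zorder, 0, 3), files_to_scan(zorder, 1, 3))
```

On this toy data the plain sort needs 1 of 4 files for a filter on `x` but all 4 for a filter on `y`, while the Z-order layout needs 2 files for either column. That trade, slightly worse pruning on the leading column in exchange for useful pruning on every sorted column, is why the technique fits queries that filter on varying dimensions.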

The article concludes with Bilibili's production experience: the lakehouse now holds petabyte-scale data and serves thousands of queries per day. The authors also outline future directions for enhancing Iceberg, including star schema data organization, pre-computation, and intelligent query pattern analysis.

Tags: Big Data, Indexing, Query Optimization, Data Warehouse, Data Lake, Iceberg, Lakehouse, Z-Order Sorting