Bilibili's Lakehouse Architecture: Building a Unified Data Lake and Data Warehouse

Bilibili replaced its Hive-Spark-Presto ETL pipeline with a lakehouse built on Iceberg, using Magnus, Trino, and Alluxio to unify a PB-scale data lake and data warehouse. The team added Z-Order sorting and indexing to speed up multi-dimensional queries and plans further schema and pre-computation optimizations.

Bilibili Tech

Feb 17, 2022

This article describes how Bilibili implemented a lakehouse architecture to manage PB-level data on its big data platform. The authors explain the limitations of their previous workflow, which relied on ETL processing and data warehousing with Hive, Spark, and Presto, and the need for a more efficient and cost-effective solution.

The article defines data lakes and data warehouses, noting that data lakes are flexible but limited in data management and query efficiency. It then introduces the lakehouse, which aims to combine the flexibility of a data lake with the efficiency of a data warehouse.

Bilibili chose to evolve their data lake architecture towards a lakehouse using Iceberg, an open table storage format. The article explains why Iceberg is suitable for building a lakehouse architecture, emphasizing its ability to self-organize table metadata and provide snapshot and transaction support.
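
To make the snapshot and transaction idea concrete, here is a minimal toy model in Python. It is not Iceberg's API or its actual commit protocol; the names `Snapshot`, `ToyTable`, and `commit_append` are hypothetical and exist only for illustration. The sketch assumes that each commit produces a new immutable snapshot listing the data files visible at that point, and that a commit succeeds only if the table has not changed since the writer read it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: an id plus the data files visible in it."""
    snapshot_id: int
    data_files: tuple


@dataclass
class ToyTable:
    """Hypothetical table that tracks its own history of snapshots."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, base: Snapshot, new_files) -> Snapshot:
        # Optimistic check: reject the commit if another writer committed
        # after `base` was read, so the caller can re-read and retry.
        if base.snapshot_id != self.current().snapshot_id:
            raise RuntimeError("conflict: table changed since base snapshot")
        snap = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        self.snapshots.append(snap)
        return snap


table = ToyTable()
reader_view = table.current()                       # a reader pins snapshot 0
table.commit_append(table.current(), ["a.parquet"])  # a writer commits snapshot 1
print(reader_view.data_files)      # the pinned reader still sees no files
print(table.current().data_files)  # new readers see the appended file
```

Because a snapshot never changes after it is created, a reader that pinned an older snapshot keeps a consistent view while writers add newer ones.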

The authors describe their lakehouse architecture, which includes data ingestion from various sources, Magnus (an Iceberg intelligent management service), and Trino as the query engine with Alluxio for metadata and index caching. They also discuss enhancements made to Iceberg, such as Z-Order sorting and indexing, to improve query performance in multi-dimensional analysis scenarios.
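
The effect of Z-Order sorting on multi-dimensional queries can be sketched with a small Python example. This is a generic illustration of the technique (Morton bit interleaving plus per-file min/max pruning), not Bilibili's or Iceberg's implementation; the function names, the 8x8 grid of `(x, y)` rows, and the four-file layout are all assumptions chosen to keep the example small.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z


def split_into_files(rows, rows_per_file):
    """Cut a sorted row list into consecutive fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_scan(files, col, value):
    """Count files whose min/max range on column `col` may contain `value`."""
    return sum(
        1 for f in files
        if min(r[col] for r in f) <= value <= max(r[col] for r in f)
    )


rows = [(x, y) for x in range(8) for y in range(8)]

# Layout 1: plain sort on (x, y). Layout 2: sort on the interleaved Z-value.
linear_files = split_into_files(sorted(rows), 16)
zorder_files = split_into_files(
    sorted(rows, key=lambda r: interleave_bits(r[0], r[1])), 16
)

for name, files in (("linear", linear_files), ("z-order", zorder_files)):
    print(name,
          "x == 3 scans", files_to_scan(files, 0, 3), "files;",
          "y == 3 scans", files_to_scan(files, 1, 3), "files")
# linear x == 3 scans 1 files; y == 3 scans 4 files
# z-order x == 3 scans 2 files; y == 3 scans 2 files
```

In this toy layout, the plain sort prunes well only on the leading column, while the Z-order layout prunes half the files for a filter on either column. That balance is what makes it attractive when queries filter on different dimensions.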

The article concludes by sharing Bilibili's experience with the lakehouse architecture, including its successful implementation with PB-level data and thousands of daily queries. The authors also mention future directions for further enhancing Iceberg, such as star schema data organization, pre-computation, and intelligent query pattern analysis.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.