
ByteLake: ByteDance’s Real‑Time Data Lake Platform Built on Apache Hudi

This article presents ByteDance’s ByteLake, a real‑time data lake platform built on Apache Hudi, covering Hudi fundamentals, ByteLake’s use cases, the platform’s architectural optimizations, new features such as a commit‑based metastore and bucket indexing, and future roadmap plans.


This article introduces ByteDance’s real‑time data lake platform, ByteLake, which is built on Apache Hudi.

The content is organized into four parts: an introduction to Hudi, ByteLake’s application scenarios, the optimizations and new features ByteDance has added, and a look at future plans.

Hudi is a streaming data‑lake system that provides ACID guarantees, supports real‑time incremental consumption as well as offline batch updates, and can be queried through engines such as Spark, Flink, and Presto.

A Hudi table consists of a timeline and file groups. The timeline is made up of commits, each representing a write operation.

Compared with traditional warehouses, Hudi requires every record to have a unique primary key, and within a partition the same key can exist in only one file group. Storage is organized into base files and log files; log files record updates and are periodically compacted into new base files, resulting in multiple versions co‑existing.

Hudi tables come in two types: COW (Copy‑on‑Write) for offline batch updates, where updates are merged into new base files, and MOR (Merge‑on‑Read) for high‑frequency real‑time updates, where changes are first written to log files and later compacted.
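The MOR read path described above can be sketched in a few lines. This is an illustrative model, not Hudi’s actual implementation: a reader merges the base file with pending log‑file updates, keeping the record with the highest value of the compare (ordering) column per primary key. The field names `id` and `ts` are placeholders.

```python
# Illustrative sketch of Merge-on-Read semantics (not Hudi's real code):
# merge base-file records with log-file updates; for each primary key,
# the record with the higher ordering ("compare") column wins.

def merge_on_read(base_records, log_records, key_field="id", order_field="ts"):
    """Merge base and log records; later writes (higher ts) win per key."""
    merged = {r[key_field]: r for r in base_records}
    for r in log_records:
        current = merged.get(r[key_field])
        if current is None or r[order_field] >= current[order_field]:
            merged[r[key_field]] = r
    return sorted(merged.values(), key=lambda r: r[key_field])

base = [{"id": 1, "val": "a", "ts": 1}, {"id": 2, "val": "b", "ts": 1}]
logs = [{"id": 2, "val": "b2", "ts": 2}, {"id": 3, "val": "c", "ts": 2}]
snapshot = merge_on_read(base, logs)
```

Compaction performs the same merge once and writes the result out as a new base file, which is why MOR trades a small read‑time cost for much cheaper high‑frequency writes.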

Hudi supports several indexing methods, including Bloom‑filter index, HBase index, and a Bucket index (still under development).

ByteLake extends Hudi to provide second‑level data visibility and real‑time warehousing capabilities, while also adding a set of custom features described later in the article.

A typical data pipeline starts with MySQL binlog being published to Kafka. Real‑time streams are consumed directly by Spark Streaming or Flink and written to the lake for downstream use, while batch pipelines dump the binlog to HDFS and update the lake on an hourly or daily basis.
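The core of such a pipeline is applying ordered binlog events to table state keyed by primary key. The sketch below assumes a simplified event shape (`op`, `pk`, `row`); real binlog formats such as Debezium or Canal differ, and the streaming engine (Spark Streaming or Flink) handles the Kafka consumption.

```python
# Minimal sketch of applying MySQL binlog events to table state keyed by
# primary key. The event schema here is illustrative, not a real binlog format.

def apply_binlog(state, events):
    """Apply insert/update/delete events in arrival order to a pk-keyed dict."""
    for e in events:
        if e["op"] in ("insert", "update"):
            state[e["pk"]] = e["row"]      # upsert by primary key
        elif e["op"] == "delete":
            state.pop(e["pk"], None)       # tolerate deletes of absent keys
    return state

events = [
    {"op": "insert", "pk": 1, "row": {"name": "alice"}},
    {"op": "update", "pk": 1, "row": {"name": "alicia"}},
    {"op": "insert", "pk": 2, "row": {"name": "bob"}},
    {"op": "delete", "pk": 2, "row": None},
]
final_state = apply_binlog({}, events)
```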

In ByteDance’s recommendation scenario, data must be exported from an HBase‑like store to offline storage for efficient OLAP. In addition, serving data must be merged with client‑side event data by primary key to build machine‑learning samples. ByteLake enables low‑cost, batch‑style addition of feature columns.

For warehouse back‑fill, massive tables (hundreds of PB) require partial row/column updates; using the data lake dramatically reduces compute consumption and improves end‑to‑end performance.

Analytical tables built from multiple sources traditionally require each source to be dumped to Hive, then joined on primary key, leading to day‑level latency; the data‑lake approach makes real‑time possible and provides column‑concatenation, greatly boosting downstream analysis speed.
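Conceptually, the column‑concatenation the article describes is a full outer merge on the primary key, where each source contributes its own columns to the final row. A minimal sketch, with illustrative field names:

```python
# Sketch of "column concatenation": rows from several sources are merged on
# the primary key so that each source contributes its own columns, without
# a day-level dump-then-join cycle through Hive.

def concat_columns(sources, key_field="id"):
    """Full outer merge of row lists on key_field; later sources may add columns."""
    rows = {}
    for source in sources:
        for r in source:
            rows.setdefault(r[key_field], {}).update(r)
    return [rows[k] for k in sorted(rows)]

source_a = [{"id": 1, "clicks": 10}]
source_b = [{"id": 1, "spend": 2.5}, {"id": 2, "spend": 0.5}]
wide_rows = concat_columns([source_a, source_b])
```

In the data‑lake version, each source writes its columns into the same table incrementally, so the "join" is amortized into the ingestion path instead of being a daily batch job.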

ByteLake introduces a new metadata management system, ByteLake Metastore, which replaces Hive Metastore’s directory‑based approach with commit‑based metadata. The architecture consists of an engine layer, a metadata layer, and a storage layer. The metadata layer offers a unified view compatible with HMS, routes requests via a Catalog Service, and abstracts heterogeneous underlying metastores.

ByteLake Metastore supports commit‑style metadata using optimistic locking and CAS, persists snapshots, caches frequently accessed metadata and indexes, and provides partition pruning. Storage is pluggable (HDFS, KV stores, MySQL), lightweight, stateless, horizontally scalable, and HMS‑compatible.

Concurrency is handled with optimistic locks and conflict‑checking during instant state transitions (requested → inflight → completed). Row‑level conflicts prevent two instants from writing the same file group; column‑level conflicts allow writes to the same file group only if their schemas are disjoint.
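The state machine and row‑level conflict rule can be modeled compactly. The sketch below is an illustrative model, not ByteLake’s code: each instant advances `requested → inflight → completed` via a compare‑and‑swap, and completion is rejected when a previously completed instant touched any of the same file groups.

```python
# Illustrative model of optimistic timeline concurrency: CAS-driven instant
# state transitions plus a row-level conflict check on overlapping file groups.

class Timeline:
    def __init__(self):
        self.instants = {}  # instant_id -> {"state": str, "file_groups": set}

    def begin(self, instant_id, file_groups):
        self.instants[instant_id] = {"state": "requested",
                                     "file_groups": set(file_groups)}

    def cas(self, instant_id, expected, new):
        """Compare-and-swap the instant's state; fail if it moved concurrently."""
        inst = self.instants[instant_id]
        if inst["state"] != expected:
            return False
        inst["state"] = new
        return True

    def try_complete(self, instant_id):
        """Complete only if no finished instant wrote the same file groups."""
        inst = self.instants[instant_id]
        for other_id, other in self.instants.items():
            if (other_id != instant_id and other["state"] == "completed"
                    and inst["file_groups"] & other["file_groups"]):
                return False  # row-level conflict: same file group touched
        return self.cas(instant_id, "inflight", "completed")

t = Timeline()
t.begin("w1", {"fg-0"})
t.begin("w2", {"fg-0", "fg-1"})
t.cas("w1", "requested", "inflight")
t.cas("w2", "requested", "inflight")
```

Here the second writer loses the race and must retry on fresh state, which is the usual cost profile of optimistic locking: cheap when conflicts are rare. The column‑level rule would additionally compare the two writers’ schemas and allow both when their column sets are disjoint.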

Existing Hudi indexes have drawbacks: Bloom‑filter suffers false positives at large scale, and HBase index adds operational overhead. ByteDance proposes a lightweight Bucket Index that hashes each partition into N buckets, each mapping to a file group, enabling fast placement of updates.
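The routing rule is simple enough to sketch. This is an illustrative hash scheme (Hudi’s own hashing differs); the point is that placement needs no index probe at all, only a modulo over a fixed bucket count per partition.

```python
# Sketch of a bucket index: within a partition, a record's primary key is
# hashed into one of N buckets, and each bucket maps 1:1 to a file group,
# so an update is routed without probing Bloom filters or an external store.

import zlib

N_BUCKETS = 4  # illustrative per-partition bucket count

def bucket_of(key, n_buckets=N_BUCKETS):
    # Any stable hash works; crc32 is used here purely for illustration.
    return zlib.crc32(str(key).encode()) % n_buckets

def file_group_for(partition, key):
    """Deterministic file-group path for a record's primary key."""
    return f"{partition}/bucket-{bucket_of(key)}"
```

Because the mapping is deterministic, two updates to the same key always land in the same file group, which is exactly the invariant Hudi needs for merging.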

Bucket‑based query optimizations include Bucket Pruning (reading only relevant buckets) and Bucket Join (reducing shuffle by leveraging bucket distribution).
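Bucket pruning for a point lookup can be sketched as follows, reusing the same illustrative hash as above: of all buckets in the partition, only the one the key hashes to is scanned.

```python
# Sketch of bucket pruning: an equality predicate on the primary key only
# needs to scan the single bucket that key hashes to, skipping the rest.

import zlib

def bucket_id(key, n_buckets):
    return zlib.crc32(str(key).encode()) % n_buckets

def point_lookup(buckets, key, key_field="id"):
    """buckets: list of record lists, one per bucket. Scans exactly one."""
    b = bucket_id(key, len(buckets))
    return [r for r in buckets[b] if r[key_field] == key]

# Build a tiny bucketed partition, then look up one key.
n = 4
buckets = [[] for _ in range(n)]
for r in [{"id": "a"}, {"id": "b"}, {"id": "c"}]:
    buckets[bucket_id(r["id"], n)].append(r)
```

Bucket Join applies the same idea to joins: when both sides are bucketed the same way on the join key, matching rows are guaranteed to sit in same‑numbered buckets, so the engine can join bucket‑by‑bucket without a full shuffle.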

Hudi requires a unique primary key and a compare column for updates; locating the target file group involves building a primary‑key‑to‑file‑group map via the index and joining with incoming updates.
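That placement step amounts to a join between the incoming batch and the index’s key‑to‑file‑group map, sketched below with illustrative names. Keys absent from the index take the insert path (no existing file group).

```python
# Sketch of index-based placement: build a primary-key -> file-group map
# from the index, then tag each incoming update with its target file group.
# A tag of None means "insert": the key has no existing file group.

def tag_updates(index, updates, key_field="id"):
    """Pair each update with the file group the index maps its key to."""
    return [(index.get(u[key_field]), u) for u in updates]

index = {1: "fg-0", 2: "fg-1"}           # key -> file group (from the index)
updates = [{"id": 1, "v": "x"}, {"id": 3, "v": "y"}]
tagged = tag_updates(index, updates)
```

This join over the full key space is precisely the cost the Bucket Index avoids, since there the file group follows from the key’s hash alone.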

For log‑only scenarios without a primary key, a NonIndex mode is introduced where updates are simply appended, avoiding index construction and joins, thus improving ingestion latency.
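In NonIndex mode the write path degenerates to a pure append, which is the whole point. A minimal sketch:

```python
# Sketch of NonIndex mode for keyless log data: records are appended straight
# to the log with no index lookup, no key-to-file-group join, and no merge,
# trading update support for lower ingestion latency.

def append_records(log_file, records):
    log_file.extend(records)  # pure append; no per-record placement work
    return log_file

log = []
append_records(log, [{"msg": "a"}])
append_records(log, [{"msg": "b"}])
```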

Future work will gradually contribute these new features back to the open‑source community.

big data, real-time analytics, data lake, metadata management, Apache Hudi, Bucket Index, ByteLake
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
