
Building an Exabyte‑Scale Data Lake with Apache Hudi at ByteDance: Architecture, Design Choices, and Performance Optimizations

This article details ByteDance's implementation of an exabyte‑scale data lake using Apache Hudi, covering scenario requirements, engine selection, functional support, schema management, extensive performance tuning, and future directions, while also noting recruitment opportunities within the team.

DataFunTalk

ByteDance engineer Guanzhi Yue shares a comprehensive case study on constructing an exabyte‑level data lake with Apache Hudi for the company's recommendation system.

The discussion is organized into five parts: scenario requirements, design selection, functional support, performance tuning, and future outlook.

Two primary scenarios are described: (1) using a BigTable‑like storage (TBase) for near‑line processing and exporting data to an offline lake for OLAP workloads, and (2) leveraging the lake for feature engineering and model training, requiring primary‑key based merges of instance and label streams at massive scale.
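The second scenario, merging instance and label streams on a primary key, can be sketched as a keyed upsert of the kind a data lake write path performs. The code below is an illustrative analogue, not ByteDance's actual implementation; field names like `pk`, `features`, and `clicked` are invented for the example.

```python
# Hedged sketch: join a feature ("instance") stream with a later-arriving
# "label" stream on a shared primary key, as an upsert would.
def merge_by_key(instances, labels):
    """instances/labels: iterables of dicts, each carrying a 'pk' field."""
    table = {row["pk"]: dict(row) for row in instances}
    for label in labels:
        # Upsert: create the row if the label arrives before the instance,
        # otherwise merge the label columns into the existing row.
        row = table.setdefault(label["pk"], {"pk": label["pk"]})
        row.update({k: v for k, v in label.items() if k != "pk"})
    return table

instances = [{"pk": 1, "features": [0.3, 0.7]}, {"pk": 2, "features": [0.1]}]
labels = [{"pk": 1, "clicked": True}]
merged = merge_by_key(instances, labels)
# merged[1] now carries both the features and the joined label
```

At production scale this join is pushed into the table format itself (via Hudi's primary-key index and merge-on-write/read machinery) rather than held in memory as shown here.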

Key challenges include highly irregular data (WAL entries that do not carry complete rows), extremely high throughput (hundreds of GB/s per table, petabyte‑scale storage), and complex, high‑dimensional schemas with thousands of columns.

After evaluating Hudi, Iceberg, and Delta Lake, Hudi was chosen for its open ecosystem, global indexing support, and customizable storage interfaces.

Functional extensions implemented on top of Hudi include MVCC‑aware payloads with Avro‑based timestamped schemas, HBase‑style append semantics for list columns, and a metadata center providing atomic schema changes, versioned schemas, column‑level encoding, and fast in‑process schema access.
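The combination of timestamped MVCC payloads with HBase‑style append semantics for list columns can be sketched as follows. This is a hypothetical analogue of a record-payload merge function, not Hudi's actual payload API; the column names and the `(timestamp, value)` encoding are assumptions made for the illustration.

```python
# Hedged sketch: merge two record payloads column by column. Scalar columns
# resolve by last-writer-wins on timestamp (MVCC-style); designated list
# columns are concatenated, mimicking HBase-style append semantics.
LIST_COLUMNS = {"events"}  # illustrative: columns merged by appending

def combine(old, new):
    """old/new: dicts mapping column name -> (timestamp, value)."""
    merged = dict(old)
    for col, (ts, val) in new.items():
        if col in LIST_COLUMNS:
            prev = merged.get(col, (0, []))[1]
            merged[col] = (ts, prev + val)       # append, never overwrite
        elif col not in merged or ts >= merged[col][0]:
            merged[col] = (ts, val)              # newest timestamp wins
    return merged

old = {"name": (1, "a"), "events": (1, ["x"])}
new = {"name": (0, "stale"), "events": (2, ["y"])}
out = combine(old, new)
# "name" keeps "a" (newer timestamp); "events" becomes ["x", "y"]
```

In Hudi terms, logic of this shape would live in a custom payload class invoked during merge, with the schema metadata deciding which columns get append semantics.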

Performance optimizations target serialization overhead through JVM‑wide singleton schema objects, a reduced payload serialization frequency, and a compiled Avro implementation. Compaction was refined with independently deployed compaction scripts, rule‑based strategies, heuristic scheduling, and removal of WriteStatus caches.
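The "JVM‑singleton schema" idea amounts to parsing each schema once per process and sharing the parsed object, instead of re‑parsing it for every record or payload. A minimal sketch, using Python's `lru_cache` as a stand‑in for a JVM‑level singleton registry (the real version would cache parsed Avro `Schema` objects on the JVM):

```python
# Hedged sketch: a process-wide cache keyed by the schema's JSON text, so
# each distinct schema is parsed exactly once and then shared.
import json
from functools import lru_cache

@lru_cache(maxsize=None)
def get_schema(schema_json: str):
    """Return one shared, parsed schema object per distinct schema string."""
    return json.loads(schema_json)

s = '{"type": "record", "fields": ["pk", "features"]}'
a = get_schema(s)
b = get_schema(s)
# a is b: repeated lookups return the same object, so parsing cost and
# per-record memory for the schema are paid once per process
```

With thousands of columns per table, the per‑record cost of re‑parsing (and re‑allocating) schema objects is substantial, which is why hoisting them to process scope pays off.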

HDFS tuning involved replacing HSync with HFlush, aggressive pipeline configuration via a new API, and isolating real‑time writes using logfile‑level I/O isolation.

Future work will focus on productizing the solution to lower operational complexity, expanding ecosystem integration (especially with Flink), further cost and performance improvements, and enriching storage semantics beyond table formats.

The article concludes with a recruitment notice for the recommendation architecture team, inviting interested candidates to contact via WeChat or email.

Tags: performance optimization, Big Data, data lake, Apache Hudi, ByteDance, schema management
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
