
Douyin Group’s ByteLake Data Lake Table Optimization and Management Practices

This article presents Douyin Group’s ByteLake, a heavily customized Apache Hudi‑based data lake table framework, detailing its core concepts, metadata services, write and read optimizations, operational challenges, a fully managed table management service, and its integration with the Amoro open‑source platform.


The article shares Douyin Group’s practice of optimizing and managing data lake tables, introducing their proprietary format called ByteLake, which is a deep customization of Apache Hudi.

ByteLake supports ACID transactions, incremental consumption updates, and unified lake‑warehouse metadata management, serving data‑warehouse analytics, interactive analysis, and feature engineering scenarios.

The core concept of ByteLake is the timeline, inherited from Hudi, which records actions (ReplaceCommit, DeltaCommit, Compaction, Clean) with an action type, a start timestamp, and a state managed by a state machine (request → inflight → completed).
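The request → inflight → completed lifecycle can be sketched as a small state machine. This is an illustrative model, not ByteLake's actual implementation; the class and method names (`Timeline`, `begin`, `advance`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

# Hudi-style timeline states: every action moves strictly
# request -> inflight -> completed.
class State(Enum):
    REQUESTED = "request"
    INFLIGHT = "inflight"
    COMPLETED = "completed"

_NEXT = {State.REQUESTED: State.INFLIGHT, State.INFLIGHT: State.COMPLETED}

@dataclass
class TimelineAction:
    action_type: str   # e.g. "deltacommit", "compaction", "clean"
    start_ts: int      # monotonically increasing instant time
    state: State = State.REQUESTED

    def advance(self) -> "TimelineAction":
        # Completed actions are terminal and cannot advance further.
        if self.state not in _NEXT:
            raise ValueError(f"action at {self.start_ts} is already completed")
        self.state = _NEXT[self.state]
        return self

class Timeline:
    def __init__(self):
        self.actions: list[TimelineAction] = []

    def begin(self, action_type: str, start_ts: int) -> TimelineAction:
        action = TimelineAction(action_type, start_ts)
        self.actions.append(action)
        return action

    def completed(self) -> list[TimelineAction]:
        # Readers only observe completed instants, which is what
        # makes in-flight writes invisible to concurrent queries.
        return [a for a in self.actions if a.state is State.COMPLETED]
```

A reader scanning the timeline sees a new commit only after it has passed through both transitions, which is the basis for snapshot isolation on the table.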

At the service layer, ByteLake provides two metadata services. ByteLake Metastore (BMS) offers timeline and snapshot APIs with concurrency controls such as read heartbeats and commit-conflict detection, backed by a pluggable storage layer (HDFS or MySQL, typically distributed MySQL). The Global Catalog Service presents a Hive-compatible view, routes metadata across data centers, and maps Hive partitions to ByteLake tables for seamless format conversion.

ByteLake optimizes Hudi’s file layout by discarding the file‑group/file‑slice model in favor of bucket‑based organization and pure append writes, which eliminates the index‑lookup (tagging) step and dramatically improves write QPS and streaming‑upsert performance.
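The reason bucketing removes the tagging step is that a record's destination is a pure function of its key: the writer hashes the key to a fixed bucket instead of looking up an index to find which file group holds the record. A minimal sketch of that idea, with hypothetical names (`bucket_for`, `route`):

```python
import hashlib

def bucket_for(record_key: str, num_buckets: int) -> int:
    # Stable hash of the record key -> fixed bucket id; the same key
    # always maps to the same bucket, so no index lookup is needed.
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def route(records: list[dict], num_buckets: int = 4) -> dict[int, list[dict]]:
    # Pure append: every record is simply appended to its bucket's
    # write stream; updates for a key land in the same bucket and are
    # reconciled later at read or compaction time.
    buckets: dict[int, list[dict]] = {i: [] for i in range(num_buckets)}
    for rec in records:
        buckets[bucket_for(rec["key"], num_buckets)].append(rec)
    return buckets
```

Because routing is deterministic, all versions of a key are colocated in one bucket, which is exactly what lets the read side deduplicate within a single split.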

For reads, ByteLake moves deduplication to the engine layer, using Spark’s window operator on bucket‑based splits, performing a local sort and a top‑1 selection, which yields better performance than Hudi’s MergeOnRead approach.
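The effect of that window-based deduplication can be shown with a small sketch (not ByteLake's actual code): within one bucket split, sort records locally by key and descending commit timestamp, then keep the top-1 row per key.

```python
from itertools import groupby
from operator import itemgetter

def dedup_latest(rows: list[dict]) -> list[dict]:
    # Local sort by (key asc, ts desc), then top-1 per key --
    # equivalent to a rank-over-window followed by rank == 1.
    ordered = sorted(rows, key=lambda r: (r["key"], -r["ts"]))
    return [next(group) for _, group in groupby(ordered, key=itemgetter("key"))]
```

Because bucketing guarantees every version of a key sits in the same split, this sort and top-1 selection is purely local: no shuffle across splits is required, which is where the win over a generic merge-on-read comes from.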

The article discusses challenges with Hudi’s compaction: synchronous compaction causes back‑pressure in streaming writes, while asynchronous compaction shares resources with write tasks and requires periodic scheduling, leading to stability and performance trade‑offs; independent compaction tasks, though ideal, incur high maintenance costs.
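Regardless of how it is scheduled, the work a compaction performs is the same: fold the accumulated delta-log records into the base file so readers stop paying the merge cost at query time. An illustrative sketch with hypothetical names, modeling files as key → value maps:

```python
def compact(base: dict, log_batches: list[dict]) -> dict:
    # Replay delta-log batches over the base file in commit order;
    # later commits win, yielding a new merged base file.
    merged = dict(base)
    for batch in log_batches:
        merged.update(batch)
    return merged
```

The scheduling question in the article is precisely about when and where this replay runs: inline with the write (back-pressure), alongside it (resource contention), or in a separate job (operational overhead).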

To address these issues, Douyin introduced a fully managed table management service that handles asynchronous scheduling, submission, execution, and monitoring of optimization tasks, and also provides TTL capabilities. The service architecture consists of a stateless, multi‑instance API Server; a master/standby Scheduler responsible for task lifecycle and resource management; and light‑process worker nodes that execute lightweight tasks in Kubernetes or submit heavy tasks to YARN/Presto.

The service currently operates across six data centers, managing over 10,000 ByteLake tables and dispatching roughly 500,000 optimization tasks per day.

Finally, the article outlines the integration with the open‑source Amoro platform to fill gaps such as metadata visibility and standardized operational tools, aiming to extend Amoro’s support beyond Iceberg, abstract generic process interfaces for complex table‑management tasks, and promote distributed deployment for large‑scale production environments.

Tags: Big Data, Data Lake, Apache Hudi, ByteLake, Amoro, Table Management
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
