How ByteDance Optimized Its E‑Commerce Data Lake to Cut Costs and Boost Real‑Time Accuracy
ByteDance revamped its traditional Lambda architecture for e‑commerce traffic data with a new lake ingestion solution that reduces development and operational costs and ensures timely, stable data. This piece covers the business background, the ODS lake design, archiving tags, delayed data handling, real‑time stability assurance, and future plans.
Introduction: The article introduces optimizations to traditional data flow architecture for e‑commerce traffic data, proposing a new lake ingestion solution that lowers development and O&M costs while ensuring data timeliness and stability, and presents future plans.
Business Background
ByteDance initially used a Lambda design for e‑commerce traffic data, but as data volume and granularity grew, the architecture's drawbacks became evident, causing high development and operational costs and slow response to user needs.
ODS Lake Solution
The traditional Lambda architecture suffers from duplicated code maintenance, divergent batch and streaming logic, redundant pipelines, and high latency. The improved solution targets the opposite: low maintenance cost, unified logic across links, and high timeliness, so that data is cheaper to produce, consistent everywhere, and delivered fast.
Data Ingestion Logic
Data is written in real time via FlinkSQL to partitions derived from business time (event_time). Each record carries
Record(col_1, col_2, event_time, date, hour), where date and hour determine the target Hudi partition. Records are written to Hudi partition files, and each Flink checkpoint triggers a Hudi transaction commit; the batch becomes visible downstream only after the commit succeeds.
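A minimal sketch of the event‑time partitioning step described above, assuming UTC and a `date=…/hour=…` Hive‑style partition path; the method name and path layout are illustrative, not ByteDance's actual API:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: derive the Hudi partition path (date/hour) from a
// record's business event_time, mirroring event-time-based partitioning.
public class PartitionPath {
    static String partitionFor(long eventTimeMillis) {
        Instant t = Instant.ofEpochMilli(eventTimeMillis);
        String date = DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC).format(t);
        String hour = DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC).format(t);
        return "date=" + date + "/hour=" + hour;
    }

    public static void main(String[] args) {
        // 2024-01-15T09:30:00Z falls into the 09:00 hourly partition
        System.out.println(partitionFor(1705311000000L)); // date=20240115/hour=09
    }
}
```

Because the partition comes from event_time rather than processing time, a record that arrives late still lands in the partition its business time dictates, which is what makes the archiving tags below meaningful.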
Archiving Tag Generation
Archiving tags are generated based on global minimum event_time and checkpoint information. Pseudocode:
<code>// the global minimum event_time only moves forward, so advance it with max
currentMinEventTime = Math.max(minEventTime, currentMinEventTime);
// tag every partition the watermark has passed by at least tagDuration
while (currentMinEventTime - tagDuration > partitionEventTime) {
    tag_success(partitionEventTime);  // mark the partition as archived (SUCCESS tag)
    partitionEventTime = partitionEventTime + 1day/1hour/10min;  // step by the partition granularity
}</code>
Delayed Data Processing
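The archiving pseudocode can be fleshed out as a runnable sketch. This is an assumption‑laden illustration (hourly partitions, millisecond timestamps, a `tagged` list standing in for `tag_success`), not ByteDance's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the archiving-tag watermark: the global minimum
// event_time only advances (hence Math.max), and every partition that the
// watermark has passed by at least tagDuration gets a SUCCESS tag.
public class ArchiveTagger {
    static final long HOUR = 3_600_000L;          // tag granularity: 1 hour
    long watermark = Long.MIN_VALUE;              // global min event_time seen so far
    long nextPartition;                           // start of the oldest untagged partition
    final long tagDuration;                       // safety lag before tagging
    final List<Long> tagged = new ArrayList<>();  // partitions marked SUCCESS

    ArchiveTagger(long firstPartition, long tagDuration) {
        this.nextPartition = firstPartition;
        this.tagDuration = tagDuration;
    }

    // Called on each Flink checkpoint with that checkpoint's min event_time.
    void onCheckpoint(long minEventTime) {
        watermark = Math.max(watermark, minEventTime);   // watermark never regresses
        while (watermark - tagDuration >= nextPartition + HOUR) {
            tagged.add(nextPartition);                   // stand-in for tag_success(partition)
            nextPartition += HOUR;
        }
    }

    public static void main(String[] args) {
        ArchiveTagger t = new ArchiveTagger(0L, HOUR);
        t.onCheckpoint(30 * 60_000L);          // watermark at 00:30 -> nothing tagged yet
        t.onCheckpoint(2 * HOUR + 1);          // watermark past 02:00 -> [00:00,01:00) archived
        System.out.println(t.tagged);          // [0]
    }
}
```

The key design choice is that the watermark is monotonic: even if a later checkpoint reports an older min event_time, already‑issued SUCCESS tags are never retracted.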
Delayed data causes missing records in event‑time partitions (e.g., a 10% loss of add‑to‑cart clicks). The solution detects delays early and blocks downstream scheduling; when a late record targets a partition that already carries a SUCCESS tag, it is rewritten to the next partition, preserving data completeness.
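The late‑data rule can be sketched as follows, assuming hourly partitions and a set of archived partition keys; the class and method names are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: if a late record targets a partition that already has
// a SUCCESS tag (archived), redirect it to the next open partition so no
// event is dropped.
public class LateDataRouter {
    static final long HOUR = 3_600_000L;
    final Set<Long> archived = new HashSet<>();   // partitions with a SUCCESS tag

    long route(long eventTime) {
        long partition = eventTime / HOUR * HOUR; // target partition by event_time
        while (archived.contains(partition)) {
            partition += HOUR;                    // rewrite into the next open partition
        }
        return partition;
    }

    public static void main(String[] args) {
        LateDataRouter r = new LateDataRouter();
        r.archived.add(0L);                       // [00:00,01:00) already archived
        System.out.println(r.route(30 * 60_000L)); // late 00:30 event -> 3600000
    }
}
```

This trades a small amount of partition purity (a few records land one partition late) for completeness: nothing is silently dropped after archiving.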
Real‑Time Data Stability Assurance
Stability is ensured by recording the readiness of both links (real‑time and offline) in offline Hive tables and using a signal mechanism to trigger downstream tasks. Switching between the links is fully automated with no manual intervention; latency is about 5 minutes, and consistency between the real‑time and offline tables exceeds 99.99%.
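The switching decision described above can be reduced to a tiny sketch; the enum and readiness flags are illustrative assumptions about what the readiness signals encode:

```java
// Hypothetical sketch of the dual-link switch: downstream tasks read the
// real-time table when its readiness signal is present, fall back to the
// offline table when only that link signals ready, and otherwise wait.
public class LinkSwitch {
    enum Source { REALTIME, OFFLINE, WAIT }

    static Source choose(boolean realtimeReady, boolean offlineReady) {
        if (realtimeReady) return Source.REALTIME; // prefer the low-latency link
        if (offlineReady)  return Source.OFFLINE;  // automated fallback, no manual ops
        return Source.WAIT;                        // no signal yet: block scheduling
    }

    public static void main(String[] args) {
        System.out.println(choose(false, true));   // OFFLINE
    }
}
```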
Future Planning
Future work includes implementing a unified stream‑batch solution across DWD and DWM layers, extending ETL logic on top of ODS, and applying the pipeline to high‑traffic scenarios such as large promotions.
Source: ByteDance Data Platform