How ByteDance Optimized Its E‑Commerce Data Lake to Cut Costs and Boost Real‑Time Accuracy
ByteDance revamped its traditional Lambda architecture for e‑commerce traffic data with a new lake ingestion solution that reduces development and operational costs and ensures timely, stable data. This piece covers the business background, the ODS lake design, archiving tags, delayed data handling, real‑time stability assurance, and future plans.
Introduction: The article introduces optimizations to traditional data flow architecture for e‑commerce traffic data, proposing a new lake ingestion solution that lowers development and O&M costs while ensuring data timeliness and stability, and presents future plans.
Business Background
ByteDance initially used a Lambda design for e‑commerce traffic data, but as data volume and granularity grew, the architecture's drawbacks became evident, causing high development and operational costs and slow response to user needs.
ODS Lake Solution
The traditional Lambda architecture suffers from duplicated code maintenance, divergent batch and streaming logic, redundant pipelines, and high latency. The improved solution targets the opposite: low maintenance cost, unified logic across links, and high timeliness, so that data is cheaper to produce, consistent everywhere, and delivered fast.
Data Ingestion Logic
Data is written in real time via FlinkSQL to partitions derived from business time (event_time). Each record carries
Record(col_1, col_2, event_time, date, hour), where date and hour determine the target Hudi partition. Records are written to Hudi partition files, and each Flink checkpoint triggers a Hudi transaction commit; the batch becomes visible downstream only after the commit succeeds.
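A minimal sketch of the event‑time partitioning step described above, assuming UTC and a `date=…/hour=…` Hive‑style partition path; the method name and path layout are illustrative, not ByteDance's actual API:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: derive the Hudi partition path (date/hour) from a
// record's business event_time, mirroring event-time-based partitioning.
public class PartitionPath {
    static String partitionFor(long eventTimeMillis) {
        Instant t = Instant.ofEpochMilli(eventTimeMillis);
        String date = DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC).format(t);
        String hour = DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC).format(t);
        return "date=" + date + "/hour=" + hour;
    }

    public static void main(String[] args) {
        // 2024-01-15T09:30:00Z falls into the 09:00 hourly partition
        System.out.println(partitionFor(1705311000000L)); // date=20240115/hour=09
    }
}
```

Because the partition comes from event_time rather than processing time, a record that arrives late still lands in the partition its business time dictates, which is what makes the archiving tags below meaningful.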
Archiving Tag Generation
Archiving tags are generated based on global minimum event_time and checkpoint information. Pseudocode:
<code>// the global minimum event_time only moves forward, so advance it with max
currentMinEventTime = Math.max(minEventTime, currentMinEventTime);
// tag every partition the watermark has passed by at least tagDuration
while (currentMinEventTime - tagDuration > partitionEventTime) {
    tag_success(partitionEventTime);  // mark the partition as archived (SUCCESS tag)
    partitionEventTime = partitionEventTime + 1day/1hour/10min;  // step by the partition granularity
}</code>
Delayed Data Processing
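The archiving pseudocode can be fleshed out as a runnable sketch. This is an assumption‑laden illustration (hourly partitions, millisecond timestamps, a `tagged` list standing in for `tag_success`), not ByteDance's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the archiving-tag watermark: the global minimum
// event_time only advances (hence Math.max), and every partition that the
// watermark has passed by at least tagDuration gets a SUCCESS tag.
public class ArchiveTagger {
    static final long HOUR = 3_600_000L;          // tag granularity: 1 hour
    long watermark = Long.MIN_VALUE;              // global min event_time seen so far
    long nextPartition;                           // start of the oldest untagged partition
    final long tagDuration;                       // safety lag before tagging
    final List<Long> tagged = new ArrayList<>();  // partitions marked SUCCESS

    ArchiveTagger(long firstPartition, long tagDuration) {
        this.nextPartition = firstPartition;
        this.tagDuration = tagDuration;
    }

    // Called on each Flink checkpoint with that checkpoint's min event_time.
    void onCheckpoint(long minEventTime) {
        watermark = Math.max(watermark, minEventTime);   // watermark never regresses
        while (watermark - tagDuration >= nextPartition + HOUR) {
            tagged.add(nextPartition);                   // stand-in for tag_success(partition)
            nextPartition += HOUR;
        }
    }

    public static void main(String[] args) {
        ArchiveTagger t = new ArchiveTagger(0L, HOUR);
        t.onCheckpoint(30 * 60_000L);          // watermark at 00:30 -> nothing tagged yet
        t.onCheckpoint(2 * HOUR + 1);          // watermark past 02:00 -> [00:00,01:00) archived
        System.out.println(t.tagged);          // [0]
    }
}
```

The key design choice is that the watermark is monotonic: even if a later checkpoint reports an older min event_time, already‑issued SUCCESS tags are never retracted.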
Delayed data causes missing records in event‑time partitions (e.g., a 10% loss of add‑to‑cart clicks). The solution detects delays early and blocks downstream scheduling; when a late record targets a partition that already carries a SUCCESS tag, it is rewritten to the next partition, preserving data completeness.
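The late‑data rule can be sketched as follows, assuming hourly partitions and a set of archived partition keys; the class and method names are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: if a late record targets a partition that already has
// a SUCCESS tag (archived), redirect it to the next open partition so no
// event is dropped.
public class LateDataRouter {
    static final long HOUR = 3_600_000L;
    final Set<Long> archived = new HashSet<>();   // partitions with a SUCCESS tag

    long route(long eventTime) {
        long partition = eventTime / HOUR * HOUR; // target partition by event_time
        while (archived.contains(partition)) {
            partition += HOUR;                    // rewrite into the next open partition
        }
        return partition;
    }

    public static void main(String[] args) {
        LateDataRouter r = new LateDataRouter();
        r.archived.add(0L);                       // [00:00,01:00) already archived
        System.out.println(r.route(30 * 60_000L)); // late 00:30 event -> 3600000
    }
}
```

This trades a small amount of partition purity (a few records land one partition late) for completeness: nothing is silently dropped after archiving.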
Real‑Time Data Stability Assurance
Stability is ensured by recording the readiness of both links (real‑time and offline) in offline Hive tables and using a signal mechanism to trigger downstream tasks. Switching between the links is fully automated with no manual intervention; latency is about 5 minutes, and consistency between the real‑time and offline tables exceeds 99.99%.
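The switching decision described above can be reduced to a tiny sketch; the enum and readiness flags are illustrative assumptions about what the readiness signals encode:

```java
// Hypothetical sketch of the dual-link switch: downstream tasks read the
// real-time table when its readiness signal is present, fall back to the
// offline table when only that link signals ready, and otherwise wait.
public class LinkSwitch {
    enum Source { REALTIME, OFFLINE, WAIT }

    static Source choose(boolean realtimeReady, boolean offlineReady) {
        if (realtimeReady) return Source.REALTIME; // prefer the low-latency link
        if (offlineReady)  return Source.OFFLINE;  // automated fallback, no manual ops
        return Source.WAIT;                        // no signal yet: block scheduling
    }

    public static void main(String[] args) {
        System.out.println(choose(false, true));   // OFFLINE
    }
}
```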
Future Planning
Future work includes implementing a unified stream‑batch solution across DWD and DWM layers, extending ETL logic on top of ODS, and applying the pipeline to high‑traffic scenarios such as large promotions.
Source: ByteDance Data Platform