Bilibili's Billion‑Scale Data Synchronization Using Apache SeaTunnel
This article details Bilibili's implementation of a hundred‑terabyte‑per‑day data synchronization pipeline, covering tool selection between DataX‑based Rider and SeaTunnel‑based AlterEgo, architecture design, performance tuning, logging optimization, rate‑limiting strategies, and comprehensive monitoring for large‑scale offline data ingestion and export.
The article introduces Bilibili's practice of using Apache SeaTunnel for massive data synchronization, handling over 100 TB of daily data across billions of records.
Tool Selection: Two offline data pipelines are in use – the DataX‑based Rider project and the AlterEgo project built on SeaTunnel 1.1.3. Rider supports T+1/H+1 scheduling, reads from HTTP, MySQL, and BOSS, and writes directly to HDFS. SeaTunnel offers distributed execution on Spark, a richer plugin ecosystem, and better performance for large‑scale workloads.
Architecture: Data flows from external sources (HTTP, MySQL, etc.) into the data warehouse, where business processing occurs, and then out to storage systems such as ClickHouse and MySQL. Rider's architecture is shown in a diagram, while AlterEgo follows a simple Input → Filter → Output workflow.
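The Input → Filter → Output workflow maps directly onto SeaTunnel 1.x's configuration format. A rough sketch of what an AlterEgo‑style job might look like – connection details, table names, and tuning values here are illustrative, not from the talk, and plugin option names should be checked against the exact SeaTunnel 1.1.3 documentation:

```
spark {
  spark.app.name = "alterego-mysql-to-hdfs"
  spark.executor.instances = 4
  spark.executor.cores = 2
  spark.executor.memory = "2g"
}

input {
  jdbc {
    driver = "com.mysql.jdbc.Driver"
    url = "jdbc:mysql://example-host:3306/demo"
    table = "orders"
    user = "reader"
    password = "******"
    result_table_name = "orders"
  }
}

filter {
  sql {
    sql = "select id, user_id, amount from orders where dt = '2024-01-01'"
  }
}

output {
  hdfs {
    path = "hdfs:///warehouse/demo/orders"
  }
}
```

Each block corresponds to one stage of the workflow, which is what makes platform‑side generation of these configs from UI selections straightforward.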
Cluster Scale: The offline cluster consists of more than 20 nodes (750+ CPU cores, 1.8 TB of memory). Daily synchronization jobs handle more than a trillion records and exceed 100 TB of data.
Platformization: A UI abstracts task configuration, converting user selections into SeaTunnel or DataX JSON configurations. Users choose source tables, target databases, and field mappings via drag‑and‑drop. Sensitive credentials are hidden, and import modes such as Insert Ignore and Insert Update are supported.
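The two import modes correspond to standard MySQL statements: Insert Ignore maps to `INSERT IGNORE` (skip rows whose key already exists), while Insert Update maps to `INSERT ... ON DUPLICATE KEY UPDATE` (overwrite existing rows). A minimal sketch of how a platform might template these – the function and parameter names are illustrative, not Bilibili's actual code:

```python
def build_insert(table, columns, mode="insert_ignore"):
    """Render a parameterized MySQL statement for the chosen import mode."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    if mode == "insert_ignore":
        # Rows whose primary/unique key already exists are silently skipped.
        return f"INSERT IGNORE INTO {table} ({cols}) VALUES ({placeholders})"
    if mode == "insert_update":
        # Existing rows are overwritten column by column.
        updates = ", ".join(f"{c} = VALUES({c})" for c in columns)
        return (f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
                f"ON DUPLICATE KEY UPDATE {updates}")
    raise ValueError(f"unknown import mode: {mode}")
```

Keeping the mode a one‑word UI choice while generating the full statement server‑side is what lets non‑expert users pick semantics without writing SQL.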
Logging Optimization: To simplify troubleshooting, a LogAgent forwards Spark logs to a custom log service, enriching each entry with jobId, jobHistoryId, and priority fields. This enables alerting (e.g., on zero‑row exports) and lets users locate errors directly within the platform.
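One common way to stamp job‑level fields onto every log line before shipping is a logging filter. A minimal Python sketch – the field names jobId/jobHistoryId/priority come from the article, but the classes and serialization format are illustrative assumptions:

```python
import json
import logging

class JobContextFilter(logging.Filter):
    """Attach job metadata to every log record before it is forwarded."""
    def __init__(self, job_id, job_history_id, priority):
        super().__init__()
        self.job_id = job_id
        self.job_history_id = job_history_id
        self.priority = priority

    def filter(self, record):
        record.jobId = self.job_id
        record.jobHistoryId = self.job_history_id
        record.priority = self.priority
        return True  # never drop records, only enrich them

def to_log_event(record):
    """Serialize an enriched record as JSON for the downstream log service."""
    return json.dumps({
        "jobId": record.jobId,
        "jobHistoryId": record.jobHistoryId,
        "priority": record.priority,
        "level": record.levelname,
        "message": record.getMessage(),
    })
```

Because the extra fields travel with every record, the log service can index by jobId and the platform can link an alert straight to the failing task run.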
Speed and Rate Limiting: ClickHouse export supports three methods – writing to distributed tables, writing to local tables (with an extra repartition step), and BulkLoad (which shifts write pressure onto Spark and achieves the highest throughput). BulkLoad dramatically reduces write latency. For TiDB, bulk writes cause high I/O; a custom KV store (TaiShan) and BulkLoad are being evaluated as alternatives. Rate limiting is handled via sleep‑based throttling, token‑bucket algorithms, and BBR‑style adaptive control.
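Of the three throttling approaches, the token bucket is the usual middle ground between naive sleeping and full BBR‑style adaptation: tokens refill at a fixed rate, and each write spends one, allowing short bursts up to the bucket's capacity. A compact sketch, with arbitrary example numbers:

```python
import time

class TokenBucket:
    """Classic token bucket: refill at a fixed rate, spend one token per record."""
    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec    # tokens added per second
        self.capacity = capacity    # burst ceiling
        self.tokens = capacity      # start full
        self.clock = clock          # injectable for testing
        self.last = clock()

    def try_acquire(self, n=1):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A writer thread would call `try_acquire()` before each batch and sleep briefly when it returns False, which is effectively the sleep‑based and token‑bucket strategies combined; a BBR‑style controller would additionally adjust `rate_per_sec` from observed downstream latency.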
Monitoring: Metrics such as job duration, speed, data volume, retry counts, and TiDB/MySQL‑specific insert/update counts are collected via Spark accumulators, sent to a message queue, and visualized in Grafana. DAG views help identify upstream bottlenecks, and alerts are triggered via a Sensor component.
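Outside Spark, the accumulator‑to‑queue flow can be approximated as a thread‑safe counter set that a reporter periodically flushes as one JSON payload per job. A simplified sketch – the metric names echo the article, but the class and payload shape are assumptions:

```python
import json
import threading
from collections import Counter

class JobMetrics:
    """Accumulator-style counters flushed as one JSON payload per job."""
    def __init__(self, job_id):
        self.job_id = job_id
        self._counts = Counter()
        self._lock = threading.Lock()

    def add(self, name, value=1):
        # Tasks on any thread can increment; the lock keeps totals consistent.
        with self._lock:
            self._counts[name] += value

    def snapshot(self):
        """Payload a reporter thread would publish to the message queue."""
        with self._lock:
            return json.dumps({"jobId": self.job_id, "metrics": dict(self._counts)})
```

In the real pipeline Spark accumulators play the role of `add()` across executors, and the driver performs the `snapshot()`‑and‑publish step for Grafana to pick up.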
Q&A Highlights:
BulkLoad improves performance by writing data to local disks first, then loading into storage, leveraging distributed processing.
Downstream pressure is sensed via response time or failure rates; advanced BBR algorithms can provide finer control.
Byte‑level monitoring is achieved with custom accumulators that report byte counts.
DataX and SeaTunnel can stand in for each other; switching a task between the two frameworks is a configuration swap, which provides a measure of high availability.
The session concludes with thanks and a reminder to like, share, and follow the DataFun community.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.