Rebuilding Shopee's Data Integration Platform with Apache SeaTunnel

Shopee's data-ingestion pipelines were fragmented, supported few source types, and were costly to maintain. After evaluating open-source tools, the team adopted Apache SeaTunnel to unify batch and streaming data transfer, simplify ETL workflows, and provide a scalable, extensible platform for processing several terabytes of data per day.

Big Data Technology Architecture

Oct 25, 2022

Shopee generates several terabytes of data daily and needed a unified big‑data platform to support diverse data‑ingestion jobs, many of which were unmanaged, non‑standard, and difficult to maintain.

The existing ecosystem supported only a few source types (primarily MySQL and TiDB), offered little visibility into job management, had long execution times, and relied on complex custom pipelines to load Hive, ClickHouse, Druid, and other storage systems.

After evaluating alternatives such as Sqoop, DataX, Apache Hop, and AWS Glue, the team selected Apache SeaTunnel for its high‑performance distributed architecture, Java SPI extensibility, and support for both Spark and Flink engines.

SeaTunnel was integrated into Shopee's internal Datahub and DataStudio, enabling users to create, schedule, and preview ETL pipelines via a user‑friendly UI, while handling source, transform, and sink configurations in a concise YAML‑like format.
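To make the source, transform, and sink split concrete, here is a minimal Python sketch of a job definition with those three sections and a small validator. The section keys, plugin names (`jdbc`, `sql`, `hive`), and option names are hypothetical placeholders chosen for illustration; they are not SeaTunnel's or DataStudio's actual schema.

```python
# Hypothetical job definition: three ordered sections, each a list of plugins.
JOB = {
    "source": [{"plugin": "jdbc", "table": "orders", "result_table": "raw_orders"}],
    "transform": [{"plugin": "sql", "sql": "SELECT id, amount FROM raw_orders"}],
    "sink": [{"plugin": "hive", "table": "dw.orders"}],
}

REQUIRED_SECTIONS = ("source", "transform", "sink")


def validate_job(job):
    """Return a list of problems; an empty list means the job looks well formed."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in job:
            problems.append(f"missing section: {section}")
            continue
        for i, step in enumerate(job[section]):
            if "plugin" not in step:
                problems.append(f"{section}[{i}] has no plugin name")
    # A job must read from somewhere and write somewhere; transforms may be empty.
    for section in ("source", "sink"):
        if section in job and not job[section]:
            problems.append(f"{section} must contain at least one plugin")
    return problems


def describe(job):
    """List the plugins in execution order: sources, then transforms, then sinks."""
    return [
        f"{section}:{step['plugin']}"
        for section in REQUIRED_SECTIONS
        for step in job.get(section, [])
    ]
```

Calling `validate_job(JOB)` before submission is one way a UI could reject an incomplete pipeline early, and `describe(JOB)` gives a compact summary of the stages a preview might display.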

The migration refactored legacy batch-fetch components into SeaTunnel source connectors, reduced job orchestration to four steps (fetch, partition, transform, and table creation), and added heartbeat checks to safeguard data quality.
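The four-step flow can be sketched in Python as a chain of small functions guarded by a heartbeat check. This is an illustration under stated assumptions, not Shopee's implementation: the article does not describe how the heartbeat check works, so it is modeled here as a freshness test on an upstream timestamp, and all function names, the `dt` partition key, and the 10-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_fresh(last_beat, now, max_lag=timedelta(minutes=10)):
    """Assumed check: the upstream heartbeat must be recent, else data may be stale."""
    return now - last_beat <= max_lag


def fetch(rows):
    # Step 1: pull rows from the source (here, an in-memory stand-in).
    return list(rows)


def partition(rows, key):
    # Step 2: group rows by a partition key such as the event date.
    parts = {}
    for row in rows:
        parts.setdefault(row[key], []).append(row)
    return parts


def transform(parts):
    # Step 3: apply a per-row rule; here, drop rows that have no amount.
    return {k: [r for r in v if r.get("amount") is not None] for k, v in parts.items()}


def create_table(parts, table):
    # Step 4: stand-in for table creation; returns the partition paths to register.
    return [f"{table}/dt={k}" for k in sorted(parts)]


def run_job(rows, last_beat, now, table="dw.orders"):
    # Refuse to run when the heartbeat is stale, then execute the four steps in order.
    if not heartbeat_is_fresh(last_beat, now):
        raise RuntimeError("stale heartbeat: refusing to ingest possibly incomplete data")
    return create_table(transform(partition(fetch(rows), "dt")), table)
```

Checking the heartbeat before the fetch step means a lagging upstream fails the job loudly instead of silently producing a partial partition.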

Key features delivered include easy configuration editing, built‑in preview that writes intermediate results to S3, support for multiple connectors (File, JDBC, Elasticsearch, HBase, etc.), and a streamlined deployment process using Spark 2.0.5 jars uploaded through DataStudio.

Future work focuses on adding Spark 3.0 support, expanding Flink‑engine connectors, building observability metrics, and contributing back to the open‑source community.

