Rebuilding Shopee's Data Integration Platform with Apache SeaTunnel

Shopee's data‑ingestion pipelines were fragmented, supported few source types, and were costly to maintain. After evaluating alternatives, the team adopted Apache SeaTunnel to unify batch and streaming data transfers, simplify ETL workflows, and provide a scalable, extensible platform for processing several terabytes of data per day.

Big Data Technology Architecture

Shopee generates several terabytes of data daily and needed a unified big‑data platform to support diverse data‑ingestion jobs, many of which were unmanaged, non‑standard, and difficult to maintain.

The existing ecosystem supported few source types (primarily MySQL and TiDB), offered little visibility into job management, had long execution times, and relied on complex custom pipelines for Hive, ClickHouse, Druid, and other storage systems.

After evaluating alternatives such as Sqoop, DataX, Apache Hop, and AWS Glue, the team selected Apache SeaTunnel for its high‑performance distributed architecture, Java SPI extensibility, and support for both Spark and Flink engines.

SeaTunnel was integrated into Shopee's internal Datahub and DataStudio, enabling users to create, schedule, and preview ETL pipelines via a user‑friendly UI, while handling source, transform, and sink configurations in a concise YAML‑like format.
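
The source, transform, and sink layout described above can be illustrated with a small sketch. This is a minimal illustration only, not SeaTunnel's or DataStudio's actual configuration syntax: the plugin names (`jdbc`, `sql`, `hive`), the option keys, and the helpers `validate_job` and `render` are all hypothetical and chosen just to show the three-section shape.

```python
from typing import Any, Dict, List

# The three sections the article names for a pipeline definition.
REQUIRED_SECTIONS = ("source", "transform", "sink")

Job = Dict[str, List[Dict[str, Any]]]

# Hypothetical job: plugin names and option keys are illustrative only.
EXAMPLE_JOB: Job = {
    "source": [{"plugin": "jdbc", "options": {"table": "orders"}}],
    "transform": [{"plugin": "sql", "options": {"query": "select * from orders"}}],
    "sink": [{"plugin": "hive", "options": {"table": "ods.orders"}}],
}


def validate_job(job: Job) -> List[str]:
    """Return a list of problems; an empty list means the job looks well-formed."""
    problems: List[str] = []
    for section in REQUIRED_SECTIONS:
        steps = job.get(section)
        if not isinstance(steps, list):
            problems.append(f"missing section: {section}")
            continue
        # A job may have no transforms, but it needs a source and a sink.
        if section != "transform" and not steps:
            problems.append(f"section '{section}' needs at least one plugin")
        for i, step in enumerate(steps):
            if not isinstance(step, dict) or "plugin" not in step:
                problems.append(f"{section}[{i}] has no 'plugin' name")
    return problems


def render(job: Job) -> str:
    """Render the job as indented, YAML-like text for display."""
    lines: List[str] = []
    for section in REQUIRED_SECTIONS:
        lines.append(f"{section}:")
        for step in job[section]:
            lines.append(f"  - plugin: {step['plugin']}")
            for key, value in step.get("options", {}).items():
                lines.append(f"    {key}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    assert validate_job(EXAMPLE_JOB) == []
    print(render(EXAMPLE_JOB))
```

Validating the structure before submission is one way a UI could reject an incomplete pipeline early; whether Shopee's platform does this is not stated in the article.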

The migration involved refactoring legacy batch‑fetch components into SeaTunnel source connectors, simplifying job orchestration into four steps (fetch, partition, transform, and table creation), and adding heartbeat checks to ensure data quality.
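
As a rough sketch of the four-step flow above, the following runs fetch, partition, transform, and table creation in order, gated by a heartbeat check. It is a simplified in-memory illustration and not Shopee's implementation: `run_job`, `JobReport`, and the callables passed in are hypothetical, and treating the heartbeat as a boolean gate evaluated before fetching is an assumption, since the article does not say how or when the check runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class JobReport:
    """Records what a (hypothetical) ingestion job did."""
    steps_run: List[str] = field(default_factory=list)
    partitions: Dict[str, List[Row]] = field(default_factory=dict)
    table_created: bool = False


def run_job(
    fetch: Callable[[], List[Row]],
    partition_key: str,
    transform: Callable[[Row], Row],
    heartbeat_ok: Callable[[], bool],
) -> JobReport:
    report = JobReport()

    # Assumed data-quality gate: abort before ingesting if the heartbeat fails.
    if not heartbeat_ok():
        raise RuntimeError("heartbeat check failed; job aborted")

    # Step 1: fetch rows from the source.
    rows = fetch()
    report.steps_run.append("fetch")

    # Step 2: group rows by the value of the partition column.
    for row in rows:
        report.partitions.setdefault(str(row[partition_key]), []).append(row)
    report.steps_run.append("partition")

    # Step 3: apply the transform to every row in every partition.
    report.partitions = {
        key: [transform(r) for r in part]
        for key, part in report.partitions.items()
    }
    report.steps_run.append("transform")

    # Step 4: stand-in for creating the target table (a flag in this sketch).
    report.table_created = True
    report.steps_run.append("create_table")
    return report


SAMPLE_ROWS: List[Row] = [
    {"dt": "2022-01-01", "amount": 3},
    {"dt": "2022-01-02", "amount": 5},
    {"dt": "2022-01-01", "amount": 7},
]

if __name__ == "__main__":
    result = run_job(
        fetch=lambda: SAMPLE_ROWS,
        partition_key="dt",
        transform=lambda r: {**r, "amount_x2": r["amount"] * 2},
        heartbeat_ok=lambda: True,
    )
    print(result.steps_run)
```

Keeping each step as a plain function argument makes the sequence easy to test in isolation; a real deployment would hand these stages to the Spark or Flink engine rather than run them in process.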

Key features delivered include easy configuration editing, built‑in preview that writes intermediate results to S3, support for multiple connectors (File, JDBC, Elasticsearch, HBase, etc.), and a streamlined deployment process using Spark 2.0.5 jars uploaded through DataStudio.

Future work focuses on adding Spark 3.0 support, expanding Flink‑engine connectors, building observability metrics, and contributing back to the open‑source community.

Tags: Big Data, ETL, Apache, data integration, Shopee, SeaTunnel