
How Shopee Leverages Paimon for Real‑Time Data Warehousing and Task Diagnosis

This article details Shopee's Data Infra team's use of the Paimon data lake to build near‑real‑time warehouses, accelerate ODS layers, implement a task‑diagnosis system, and create a reconciliation platform, while sharing future plans and a Q&A session.

DataFunSummit

Jun 19, 2025


01 Paimon Usage Overview

Currently Paimon is used at Shopee in three main scenarios: building a near‑real‑time data warehouse with StarRocks, applying Partial Update to replace double‑stream joins, and accelerating ODS layer upgrades using Paimon’s daily‑cut feature to improve timeliness and storage efficiency.

Building a near-real-time data warehouse on Paimon and StarRocks.

Replacing double-stream joins with the Partial Update merge engine, which reduces Flink state and resource consumption.

Accelerating the ODS layer upgrade with Paimon's daily-cut capability.
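To illustrate why Partial Update can stand in for a double-stream join, here is a minimal Python sketch of the merge semantics: rows from two streams share a primary key, and each non-null column value overwrites the stored one. This is not Paimon code. The `orders` and `payments` streams, the column names, and the `partial_update` helper are all hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def partial_update(rows: Iterable[Row], key: str) -> Dict[Any, Row]:
    """Merge rows that share a primary key, column by column.

    A non-null value replaces the stored value; a null (None) leaves it as is.
    """
    table: Dict[Any, Row] = {}
    for row in rows:
        merged = table.setdefault(row[key], {key: row[key]})
        for column, value in row.items():
            if value is not None:
                merged[column] = value
    return table


# Hypothetical streams: each one fills only its own columns of the wide row.
orders = [
    {"order_id": 1, "amount": 30, "status": None},
    {"order_id": 2, "amount": 45, "status": None},
]
payments = [
    {"order_id": 1, "amount": None, "status": "PAID"},
]

result = partial_update(orders + payments, key="order_id")
```

Because each stream writes to the same keyed table independently, neither side has to be buffered in join state while waiting for the other, which is where the state savings come from.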

02 Near‑Real‑Time Warehouse Construction

The task diagnosis system is built on a near-real-time pipeline over Flink task data and exposes back-pressure, resource usage, and latency in the platform UI. The pipeline uses Paimon's native Lookup Join and the Aggregation merge engine, writes dimension tables to HBase, and relies on Paimon's changelog to propagate data downstream.
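As a rough illustration of how an aggregation-style merge and a changelog fit together, the Python sketch below merges task metrics per key and emits a change record for every update. It is a conceptual model, not Paimon's implementation. The metric names (`restarts`, `max_backpressure_pct`, `latency_ms`), the `apply_row` helper, and the choice of aggregate per field are assumptions; the `+I`, `-U`, and `+U` markers follow Flink's row-kind notation.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Change = Tuple[str, Row]

# Per-field aggregate functions, keyed by a short name.
AGGREGATORS: Dict[str, Callable[[Any, Any], Any]] = {
    "sum": lambda old, new: old + new,
    "max": lambda old, new: max(old, new),
    "last_value": lambda old, new: new,
}


def apply_row(table: Dict[Any, Row], row: Row, key: str,
              field_aggs: Dict[str, str]) -> List[Change]:
    """Merge one row into the table and return the changelog it produces."""
    k = row[key]
    if k not in table:
        table[k] = dict(row)
        return [("+I", dict(row))]          # first row for this key: insert
    before = dict(table[k])                  # copy of the row before merging
    for field, agg in field_aggs.items():
        table[k][field] = AGGREGATORS[agg](before[field], row[field])
    return [("-U", before), ("+U", dict(table[k]))]  # retract old, emit new


# Hypothetical aggregate choice per metric column.
FIELD_AGGS = {"restarts": "sum", "max_backpressure_pct": "max",
              "latency_ms": "last_value"}

metrics = [
    {"task_id": "job-a", "restarts": 1, "max_backpressure_pct": 40, "latency_ms": 900},
    {"task_id": "job-a", "restarts": 2, "max_backpressure_pct": 75, "latency_ms": 450},
    {"task_id": "job-b", "restarts": 0, "max_backpressure_pct": 10, "latency_ms": 120},
]

state: Dict[Any, Row] = {}
changelog: List[Change] = []
for metric in metrics:
    changelog.extend(apply_row(state, metric, "task_id", FIELD_AGGS))
```

A downstream consumer that reads `changelog` instead of rescanning `state` sees exactly which rows changed, which is what lets each layer feed the next incrementally.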

Key steps include:

Flink consumes the binlog and writes it to the ODS layer; the table's bucket count matches the number of Kafka partitions.

Flink batch jobs verify that historical data is complete and back-fill anything that is missing.

StarRocks mounts the Paimon catalog to run direct queries and build materialized views.
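The bucket sizing in the first step can be pictured with a small Python sketch of fixed-bucket assignment: each primary key is hashed and taken modulo the bucket count. This only illustrates the general idea. Paimon's real hash function differs from the CRC32 used here, and the partition count of 8, the key format, and the `bucket_for` helper are hypothetical.

```python
import zlib
from collections import Counter

KAFKA_PARTITIONS = 8             # hypothetical topic partition count
NUM_BUCKETS = KAFKA_PARTITIONS   # bucket count chosen to match, as in the steps above


def bucket_for(key: str, num_buckets: int) -> int:
    """Deterministically map a primary key to a bucket index in [0, num_buckets)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


# Hypothetical primary keys, used only to look at how records spread out.
keys = [f"order-{i}" for i in range(1000)]
distribution = Counter(bucket_for(k, NUM_BUCKETS) for k in keys)

for bucket in sorted(distribution):
    print(f"bucket {bucket}: {distribution[bucket]} keys")
```

Keeping the two counts equal gives the writer the same degree of parallelism as the source, so neither side becomes the obvious bottleneck.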

03 ODS Layer Upgrade Acceleration

Challenges: storing daily data slices while also absorbing late-arriving data causes storage bloat (up to 187×) and hurts timeliness. Paimon's file layout (Snapshot, Manifest, and Data files), combined with its LSM-tree organization and the Branch feature, removes the redundancy and makes daily-cut handling efficient.

Branch tables allow independent read/write while sharing underlying files, supporting fast back‑fill of late data without full rewrites.
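The file sharing behind branches can be sketched with a toy metadata model in Python. It is a deliberate simplification: real Paimon tables place manifest files between snapshots and data files, while here a snapshot lists its data files directly. The `Table` and `Snapshot` classes, the branch name, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: Tuple[str, ...]   # names of immutable data files visible in this snapshot


@dataclass
class Table:
    branches: Dict[str, List[Snapshot]] = field(
        default_factory=lambda: {"main": []})

    def commit(self, branch: str, new_files: List[str]) -> Snapshot:
        """Append a snapshot that references all previous files plus the new ones."""
        history = self.branches[branch]
        previous = history[-1].data_files if history else ()
        snapshot = Snapshot(len(history) + 1, previous + tuple(new_files))
        history.append(snapshot)
        return snapshot

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        """Copy snapshot metadata only; no data file is duplicated."""
        self.branches[name] = list(self.branches[from_branch])


table = Table()
table.commit("main", ["day1-f0.parquet", "day1-f1.parquet"])

# Cut a daily branch, then let both branches move on independently.
table.create_branch("daily_cut")
table.commit("daily_cut", ["late-f0.parquet"])   # back-fill late data on the branch
table.commit("main", ["day2-f0.parquet"])        # main keeps ingesting

branch_files = set(table.branches["daily_cut"][-1].data_files)
main_files = set(table.branches["main"][-1].data_files)
shared_files = branch_files & main_files
```

In this model the late record costs one new file on the branch, while the two day-one files remain shared rather than being rewritten into a separate daily copy.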

04 Future Plans

Paimon has already delivered significant gains in timeliness and storage cost. Future work will explore more scenarios, especially new features in Paimon 1.x, and continue expanding its adoption at Shopee.

05 Q&A

Q1: StarRocks external vs internal table performance gap?

A1: 20-30% slower, better than expected.

Q2: Using Partial Update for foreign-key joins?

A2: Currently not supported; need traditional Join then Partial Update.

Q3: Choosing Paimon vs Hudi?

A3: Paimon shows better performance and resource usage for streaming workloads.

Q4: Does Changelog merging increase latency with more layers?

A4: Yes, but can be mitigated with Full Compaction or other Changelog modes at the cost of extra compute.

