How Shopee Leverages Paimon for Real‑Time Data Warehousing and Task Diagnosis
This article describes how Shopee's Data Infra team uses the Paimon data lake to build a near‑real‑time warehouse, accelerate the ODS layer, implement a task‑diagnosis system, and create a reconciliation platform. It closes with future plans and a Q&A session.
01 Paimon Usage Overview
Paimon is currently used at Shopee in three main scenarios:
Building a near‑real‑time data warehouse on Paimon and StarRocks.
Replacing double‑stream joins with the Partial Update merge engine, which reduces Flink state and resource consumption.
Accelerating the ODS layer upgrade with Paimon’s daily‑cut feature, which improves timeliness and storage efficiency.
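To make the second scenario concrete, here is a minimal Python sketch of the idea behind a partial‑update merge: rows that share a primary key are folded into one wide row, and a later non‑null field overwrites an earlier value while a null leaves it untouched. This only simulates the semantics in memory and is not Paimon code; the `orders` and `payments` streams and their column names are hypothetical.

```python
from typing import Dict, Iterable, Optional

Row = Dict[str, Optional[object]]


def partial_update(rows: Iterable[Row], key: str) -> Dict[object, Row]:
    """Merge rows that share a primary key; later non-null fields win."""
    table: Dict[object, Row] = {}
    for row in rows:
        merged = table.setdefault(row[key], {key: row[key]})
        for column, value in row.items():
            if value is not None:
                merged[column] = value          # non-null value overwrites
            else:
                merged.setdefault(column, None)  # null never clobbers data
    return table


# Two hypothetical streams, each filling different columns of the same row.
orders = [{"order_id": 1, "amount": 30, "status": None}]
payments = [{"order_id": 1, "amount": None, "status": "PAID"}]

result = partial_update(orders + payments, key="order_id")
print(result[1])
```

Because each stream writes only its own columns, neither side has to hold the other side's rows in join state, which is where the state and resource savings mentioned above come from.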
02 Near‑Real‑Time Warehouse Construction
The task diagnosis system builds a near‑real‑time pipeline for Flink tasks and exposes back‑pressure, resource usage, and latency on the platform UI. The pipeline uses Paimon’s native Lookup Join and the Aggregation merge engine, writes dimension tables to HBase, and relies on Changelog to propagate data downstream.
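As an illustration of what an aggregation‑style merge does for diagnosis metrics, the sketch below folds successive metric records for the same job into a single row, applying one aggregate function per column. It is an in‑memory simulation under assumed names: the metric columns and the choice of max, sum, and last‑value functions are hypothetical and are not taken from Shopee's pipeline.

```python
from typing import Callable, Dict, Iterable, Tuple

Metrics = Dict[str, int]

# One aggregate function per column (column names are hypothetical).
AGGREGATORS: Dict[str, Callable[[int, int], int]] = {
    "max_backpressure_pct": max,                  # keep the worst value seen
    "records_in": lambda old, new: old + new,     # running sum
    "last_checkpoint_ms": lambda old, new: new,   # keep the latest value
}


def aggregate_merge(records: Iterable[Tuple[str, Metrics]]) -> Dict[str, Metrics]:
    """Fold records with the same key into one row, column by column."""
    table: Dict[str, Metrics] = {}
    for key, metrics in records:
        current = table.setdefault(key, {})
        for column, value in metrics.items():
            if column in current:
                current[column] = AGGREGATORS[column](current[column], value)
            else:
                current[column] = value
    return table


records = [
    ("job-a", {"max_backpressure_pct": 10, "records_in": 100, "last_checkpoint_ms": 800}),
    ("job-a", {"max_backpressure_pct": 65, "records_in": 250, "last_checkpoint_ms": 1200}),
    ("job-a", {"max_backpressure_pct": 40, "records_in": 50, "last_checkpoint_ms": 900}),
]

summary = aggregate_merge(records)
print(summary["job-a"])
```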
Key steps include:
Flink consumes the binlog and writes it to the ODS layer; the bucket count matches the number of Kafka partitions.
Flink batch jobs verify that historical data is complete and back‑fill any missing data.
StarRocks mounts the Paimon catalog for direct queries and materialized views.
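The completeness check in the second step can be pictured as a per‑partition count comparison. The article does not say how Shopee's batch jobs perform the verification, so the following is only a sketch under the assumption that row counts per day are available from both the upstream source and the ODS table; the function name and the sample counts are hypothetical.

```python
from typing import Dict, List


def partitions_to_backfill(source_counts: Dict[str, int],
                           lake_counts: Dict[str, int]) -> List[str]:
    """Return partitions whose row count in the lake is below the source count."""
    return sorted(
        partition
        for partition, expected in source_counts.items()
        if lake_counts.get(partition, 0) < expected
    )


# Hypothetical per-day row counts from the upstream database and the ODS table.
source_counts = {"2024-05-01": 1000, "2024-05-02": 1200, "2024-05-03": 900}
lake_counts = {"2024-05-01": 1000, "2024-05-02": 1150}

missing = partitions_to_backfill(source_counts, lake_counts)
print(missing)
```

A batch job would then re‑read only the listed partitions instead of reloading the full history.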
03 ODS Layer Upgrade Acceleration
The challenges are that daily data slices and late‑arriving data cause storage bloat (up to 187×) and timeliness issues. Paimon’s file layout (Snapshot, Manifest, and Data files), combined with its LSM‑Tree organization and the Branch feature, removes the redundancy and makes daily‑cut handling efficient.
Branch tables allow independent read/write while sharing underlying files, supporting fast back‑fill of late data without full rewrites.
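The storage effect of sharing files can be shown with a toy model. In the sketch below, each daily slice is described by the set of data files it references; storing every slice as a full copy counts each file once per slice, while shared storage counts each distinct file once. The file names and sizes are invented for illustration, the model ignores compaction and metadata files, and it does not attempt to reproduce the 187× figure.

```python
from typing import Dict, Set


def stored_size(slices: Dict[str, Set[str]], sizes: Dict[str, int], shared: bool) -> int:
    """Total size of all slices, with or without sharing files between them."""
    if shared:
        # Each distinct file is stored once, however many slices reference it.
        distinct = set().union(*slices.values())
        return sum(sizes[name] for name in distinct)
    # Full copies: every slice stores its own copy of every file it references.
    return sum(sizes[name] for files in slices.values() for name in files)


# Hypothetical file sizes in GB: one large base file plus small daily deltas.
sizes = {"base": 100, "delta-1": 2, "delta-2": 3}
slices = {
    "2024-05-01": {"base"},
    "2024-05-02": {"base", "delta-1"},
    "2024-05-03": {"base", "delta-1", "delta-2"},
}

full_copies = stored_size(slices, sizes, shared=False)
shared_files = stored_size(slices, sizes, shared=True)
print(full_copies, shared_files)
```

In this model the gap between the two totals widens with every additional slice, since the large base file is counted again for each full copy but only once when files are shared.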
04 Future Plans
Paimon has already delivered significant gains in timeliness and storage cost. Future work will explore more scenarios, especially new features in Paimon 1.x, and continue expanding its adoption at Shopee.
05 Q&A
Q1: StarRocks external vs internal table performance gap?
A1: 20‑30% slower, better than expected.
Q2: Using Partial Update for foreign‑key joins?
A2: Currently not supported; need traditional Join then Partial Update.
Q3: Choosing Paimon vs Hudi?
A3: Paimon shows better performance and resource usage for streaming workloads.
Q4: Does Changelog merging increase latency with more layers?
A4: Yes, but can be mitigated with Full Compaction or other Changelog modes at the cost of extra compute.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.