
Evolution of Cainiao's Real-Time Data Warehouse Architecture: Model, Compute Engine, and Data Service Upgrades

The talk details Cainiao’s evolution of its real‑time data warehouse architecture, covering the original 2016 model, compute and service challenges, the 2017 multi‑layer data model redesign, migration to Flink, practical cases of state retraction, timeout statistics, smart optimizations, and the unified data service platform.

Big Data Technology Architecture

At the Flink Forward Asia conference, Jia Yuanqiao, a senior data technology expert from Cainiao Data & Planning, presented the evolution of Cainiao's real-time data warehouse architecture, focusing on upgrades to the data model, real-time computation, and data services.

The original 2016 architecture had problems in three areas. The data model followed a vertically siloed development approach with no layering, which meant high computation costs, little reuse, inconsistent data, and high usage costs. Real-time computation relied on JStorm and Spark Streaming, which struggled with complex logistics and supply-chain scenarios. Data services were scattered across HBase, MySQL, and ADB, causing high operational costs and uncontrolled data usage.

In 2017, Cainiao performed a major upgrade:

Data Model Upgrade: introduced a four-layer model consisting of a data collection layer, a fact detail layer, a light aggregation layer (stored in ADB for high-dimensional analytics), and a heavy aggregation layer (stored in HBase for real-time dashboards). The layering gave all business lines a common intermediate data layer to reuse horizontally, which reduced duplicated computation and wasted resources.
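To make the layering concrete, here is a minimal sketch in plain Python rather than Flink. It assumes the heavy aggregation layer can be derived from the light one, which the talk summary does not state, and every field name (`warehouse`, `carrier`, `hour`) is hypothetical.

```python
from collections import defaultdict

# Hypothetical fact-detail rows; field names are illustrative only.
DETAIL = [
    {"order_id": "o1", "warehouse": "HZ", "carrier": "C1", "hour": 9},
    {"order_id": "o2", "warehouse": "HZ", "carrier": "C2", "hour": 9},
    {"order_id": "o3", "warehouse": "SH", "carrier": "C1", "hour": 10},
    {"order_id": "o4", "warehouse": "HZ", "carrier": "C1", "hour": 10},
]


def light_aggregate(rows):
    """Light aggregation: order counts that still keep many dimensions."""
    out = defaultdict(int)
    for row in rows:
        out[(row["warehouse"], row["carrier"], row["hour"])] += 1
    return dict(out)


def heavy_aggregate(light):
    """Heavy aggregation: collapse to the single dimension a dashboard shows."""
    out = defaultdict(int)
    for (warehouse, _carrier, _hour), count in light.items():
        out[warehouse] += count
    return dict(out)


light = light_aggregate(DETAIL)   # shared intermediate layer
heavy = heavy_aggregate(light)    # dashboard-ready totals per warehouse
```

The point of the sketch is reuse: several consumers can read `light` without each one recomputing from the detail rows.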

Compute Engine Upgrade: migrated from JStorm and Spark Streaming to Flink, using Flink SQL, state-based retraction, CEP for timeout statistics, AutoScaling, and semi-intelligent batch-stream hybrid features to improve development efficiency, fault tolerance, and resource utilization.

Data Service Upgrade: built a unified middleware layer, "TianGong Data Service", that abstracts HBase, MySQL, and OpenSearch behind a single access point, adds permission management and monitoring, and exposes SQL as a first-class DSL through HSF.
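The idea of one access point in front of several stores can be sketched as a small facade. This is not TianGong's actual API: the class, its methods, and the table names are hypothetical, and plain dictionaries stand in for HBase and MySQL.

```python
class DataService:
    """Hypothetical facade: one entry point, permission check, call log."""

    def __init__(self):
        self.backends = {}   # logical table -> lookup callable
        self.grants = set()  # (caller, table) pairs allowed to read
        self.calls = []      # minimal monitoring log: (caller, table, allowed)

    def register(self, table, lookup):
        self.backends[table] = lookup

    def grant(self, caller, table):
        self.grants.add((caller, table))

    def query(self, caller, table, key):
        allowed = (caller, table) in self.grants
        self.calls.append((caller, table, allowed))  # log every attempt
        if not allowed:
            raise PermissionError(f"{caller} may not read {table}")
        return self.backends[table](key)


# Dictionaries standing in for two different storage engines.
hbase_like = {"o1": {"status": "signed"}}
mysql_like = {"HZ": {"capacity": 1200}}

svc = DataService()
svc.register("order_status", hbase_like.get)
svc.register("warehouse_dim", mysql_like.get)
svc.grant("dashboard", "order_status")

row = svc.query("dashboard", "order_status", "o1")
try:
    svc.query("dashboard", "warehouse_dim", "HZ")
    denied = False
except PermissionError:
    denied = True
```

Callers name a logical table and never see which engine serves it, so the service layer can enforce permissions and record usage in one place.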

Three concrete cases illustrate the upgrades:

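State-based Retraction: Flink's built-in state and last_value are used to handle order cancellations and dynamic routing changes correctly, so that an order's earlier contribution to a result is withdrawn when a later message supersedes it.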
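The retraction idea can be illustrated without Flink. The sketch below is a plain-Python simulation, not Flink's implementation: it keeps the last seen route per order (the role last_value plays) and subtracts the old contribution before adding the new one. The class name, route codes, and event shape are hypothetical.

```python
from collections import defaultdict


class RetractingCounter:
    """Counts active orders per route, retracting stale contributions."""

    def __init__(self):
        self.last_route = {}            # keyed state: order_id -> last seen route
        self.counts = defaultdict(int)  # aggregate: route -> active order count

    def process(self, order_id, route, cancelled=False):
        # Retract this order's previous contribution, if it had one.
        old = self.last_route.pop(order_id, None)
        if old is not None:
            self.counts[old] -= 1
        # Accumulate the new contribution unless the order was cancelled.
        if not cancelled:
            self.last_route[order_id] = route
            self.counts[route] += 1

    def snapshot(self):
        return {route: n for route, n in self.counts.items() if n > 0}


counter = RetractingCounter()
events = [
    ("o1", "HZ-BJ", False),
    ("o2", "HZ-BJ", False),
    ("o1", "HZ-SH", False),  # routing change: o1 leaves HZ-BJ
    ("o2", "HZ-BJ", True),   # cancellation: o2 is withdrawn entirely
]
for event in events:
    counter.process(*event)
result = counter.snapshot()
```

Without the retraction step, `HZ-BJ` would still report two orders after both had left it.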

Timeout Statistics: Flink's Timer Service and a custom ProcessFunction emit timeout events for logistics orders whose expected follow-up message has not arrived.
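The difficulty with timeouts is that the alert must fire even though no message arrives to trigger it. The following plain-Python sketch imitates the timer pattern rather than calling Flink: a deadline is registered per order, cleared when the follow-up arrives, and fired when time advances past it. The event kinds (`outbound`, `collected`) and the explicit `advance` call, which stands in for watermark progress, are assumptions made for illustration.

```python
import heapq


class TimeoutDetector:
    """Emits an alert when an order gets no follow-up before its deadline."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.deadline = {}  # keyed state: order_id -> pending deadline
        self.timers = []    # min-heap of (deadline, order_id)
        self.alerts = []    # emitted (order_id, deadline) timeout events

    def on_event(self, order_id, kind, ts):
        self.advance(ts)  # fire any timers that are already due
        if kind == "outbound":
            # Register a timer for the expected follow-up message.
            self.deadline[order_id] = ts + self.timeout
            heapq.heappush(self.timers, (ts + self.timeout, order_id))
        elif kind == "collected":
            # Follow-up arrived in time: clear state so the timer is ignored.
            self.deadline.pop(order_id, None)

    def advance(self, now):
        while self.timers and self.timers[0][0] <= now:
            due, order_id = heapq.heappop(self.timers)
            # Only alert if the timer still matches the pending deadline.
            if self.deadline.get(order_id) == due:
                del self.deadline[order_id]
                self.alerts.append((order_id, due))


detector = TimeoutDetector(timeout=100)
detector.on_event("o1", "outbound", 0)
detector.on_event("o2", "outbound", 10)
detector.on_event("o1", "collected", 50)
detector.advance(200)  # time moves on with no further messages for o2
```

Here `o1` is collected in time and produces nothing, while `o2` produces a timeout alert purely because time advanced.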

Smart Optimizations: MiniBatch, LocalGlobal, and PartialFinal mitigate data skew, and AutoScaling provisions resources dynamically in both peak (big-promotion) and daily scenarios.
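A rough way to see why MiniBatch and LocalGlobal help with a hot key is to count how many records reach the global aggregator. The sketch below is a plain-Python simulation of two-phase counting, not Flink's optimizer. The partition contents and batch size are made up, and PartialFinal, which targets distinct aggregates, is not modeled.

```python
from collections import Counter


def local_global_count(partitions, batch_size):
    """Two-phase count: pre-aggregate each mini-batch locally, then merge."""
    shuffled = []  # partial results that would cross the network
    for part in partitions:
        for i in range(0, len(part), batch_size):
            # Local phase: one partial count per key per mini-batch.
            local = Counter(part[i:i + batch_size])
            shuffled.extend(local.items())
    # Global phase: merge the partial counts per key.
    totals = Counter()
    for key, partial in shuffled:
        totals[key] += partial
    return dict(totals), len(shuffled)


# Two upstream subtasks, both dominated by the same hot key.
partitions = [["hot"] * 6 + ["a"], ["hot"] * 5 + ["b", "b"]]
raw_records = sum(len(p) for p in partitions)
totals, shuffled = local_global_count(partitions, batch_size=4)
```

In this toy input, 14 raw records shrink to 6 partial results before the shuffle, so the task that owns the hot key receives far fewer messages.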

Additional innovations include cross‑source SQL queries (TgSQL) for NoSQL to relational conversion, unified data service guarantees (master‑slave switching, dual‑active, hotspot blocking, whitelist throttling), real‑time load testing tools, end‑to‑end job monitoring, and future directions toward batch‑stream convergence and AI‑driven analytics.

Overall, Cainiao’s journey demonstrates how a layered data model, Flink‑based streaming engine, and unified data service platform can address scalability, reliability, and operational challenges in large‑scale supply‑chain real‑time analytics.

