
Building a Real-Time Data Warehouse at Cainiao: Architecture, Model Upgrades, Engine Enhancements, and Service Innovations

This article shares Cainiao's practical experience in constructing a real-time data warehouse, covering the shortcomings of the previous architecture, the evolution of data models, the migration to Flink with advanced features like retraction and timer services, and the modernization of data services and tooling to support high‑throughput logistics scenarios.

DataFunTalk

01. Previous Real-Time Data Architecture

The earlier architecture suffered from chaotic internal data model layers, high data usage cost, siloed development with no reuse, and significant data consistency gaps across business lines, making BI queries difficult.

Real-Time Computation

Cainiao initially used Alibaba Cloud JStorm and Spark Streaming. These handled many cases but struggled with logistics supply-chain requirements, where neither engine could balance functionality, performance, stability, and rapid fault recovery.

Data Service

Sinking real-time data into MySQL, HBase, and similar stores was inflexible, and BI permission control and end-to-end guarantees were unreliable.

02. Data Model Upgrade

1. Model Layering

Inspired by offline warehouses, real‑time data is layered: the first layer ingests data from MySQL into TT messaging middleware, joins with HBase dimension tables to produce wide fact tables, and writes back to TT. Two downstream layers—light aggregation and heavy aggregation—are built by subscribing to the TT stream.
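The layering can be pictured with a small in-memory sketch. It is illustrative only: the dictionary `WAREHOUSE_DIM` stands in for an HBase dimension table, plain lists stand in for TT streams, and all field names are hypothetical rather than Cainiao's schema.

```python
from collections import defaultdict

# Hypothetical dimension table, standing in for an HBase lookup.
WAREHOUSE_DIM = {
    "W1": {"city": "Hangzhou"},
    "W2": {"city": "Shanghai"},
}


def to_wide_row(order, dim=WAREHOUSE_DIM):
    """Detail layer: enrich one order event with dimension attributes."""
    attrs = dim.get(order["warehouse_id"], {"city": None})
    return {**order, **attrs}


def light_aggregate(wide_rows):
    """Light aggregation layer: order count and amount per city."""
    totals = defaultdict(lambda: {"orders": 0, "amount": 0})
    for row in wide_rows:
        bucket = totals[row["city"]]
        bucket["orders"] += 1
        bucket["amount"] += row["amount"]
    return dict(totals)


# Events as they might arrive from the source database (made-up data).
events = [
    {"order_id": "o1", "warehouse_id": "W1", "amount": 30},
    {"order_id": "o2", "warehouse_id": "W2", "amount": 20},
    {"order_id": "o3", "warehouse_id": "W1", "amount": 50},
]
wide = [to_wide_row(e) for e in events]
summary = light_aggregate(wide)
```

Because the wide rows are produced once and written back to the stream, any number of aggregation jobs can subscribe to them without repeating the dimension join.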

2. Pre‑Split

A public data middle layer aggregates data from all business lines. Each line then performs a horizontal split to create its own business-specific middle layer, so shared processing runs once upstream and resources are saved.

3. Cainiao Real‑Time Data Model

The public middle layer contains global order and logistics details; a split task separates data for domestic, import, and export supply chains, improving data usability and distinguishing tables for dashboards versus analytical queries.
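A horizontal split of this kind amounts to routing each row of the public layer to the stream of its business line. The sketch below assumes a hypothetical `business_line` field on every row and uses lists in place of real streams.

```python
from collections import defaultdict

# Business lines named in the text; the routing field is an assumption.
BUSINESS_LINES = ("domestic", "import", "export")


def split_by_business_line(public_rows):
    """Route public-layer rows into one list per business line."""
    out = defaultdict(list)
    for row in public_rows:
        line = row.get("business_line")
        if line in BUSINESS_LINES:
            out[line].append(row)
        else:
            # Keep unknown rows visible instead of silently dropping them.
            out["unrouted"].append(row)
    return dict(out)


public_rows = [
    {"order_id": "o1", "business_line": "domestic"},
    {"order_id": "o2", "business_line": "import"},
    {"order_id": "o3", "business_line": "domestic"},
    {"order_id": "o4", "business_line": None},
]
streams = split_by_business_line(public_rows)
```

Each downstream team then reads only its own, smaller stream rather than filtering the full public layer again.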

03. Compute Engine Enhancement

In 2017, Cainiao switched from JStorm and Spark Streaming to Flink, drawn by its full SQL support, state-based retraction for order cancellations, CEP for timeout statistics, and auto-scaling for resource optimization.

1. Retraction

Using Flink's last_value function, the engine captures the latest non‑null message for each order, automatically retracting outdated values to ensure correct aggregation.
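The effect of retraction can be imitated in a few lines of plain Python. This is a conceptual sketch, not Flink's implementation: the class `RetractingCount` and the status values are hypothetical, and the `("-", ...)` / `("+", ...)` tuples only mimic the retract and add messages a downstream aggregate would receive.

```python
from collections import Counter


class RetractingCount:
    """Counts orders per status, keeping only each order's latest status."""

    def __init__(self):
        self.latest = {}         # order_id -> last non-null status seen
        self.counts = Counter()  # status -> number of orders in that status

    def process(self, order_id, status):
        """Apply one message and return the change records it produces."""
        changes = []
        if status is None:
            # A null field never overwrites the last known value.
            return changes
        old = self.latest.get(order_id)
        if old == status:
            return changes
        if old is not None:
            self.counts[old] -= 1
            changes.append(("-", old))   # retract the outdated value
        self.counts[status] += 1
        changes.append(("+", status))    # add the new value
        self.latest[order_id] = status
        return changes


agg = RetractingCount()
first = agg.process("o1", "created")
agg.process("o2", "created")
ignored = agg.process("o1", None)
cancel = agg.process("o1", "cancelled")
```

Without the retract step, order `o1` would be counted under both `created` and `cancelled`, which is exactly the double counting that retraction prevents.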

2. Real‑Time Timeout Statistics

Flink's Timer Service is customized (by overriding processElement and onTimer) to generate synthetic timeout events for orders that have not been collected within a defined window, enabling accurate timeout counting.
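The idea behind the timer approach is that an order which never receives a "collected" message produces no event to count, so one has to be generated when a deadline passes. The following is a simplified simulation under stated assumptions: `process_element` and `on_timer` only mirror the roles of Flink's `processElement` and `onTimer`, while `advance_time`, the event types, and the timeout value are hypothetical.

```python
import heapq


class TimeoutDetector:
    """Emits a synthetic timeout event for orders not collected in time."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = {}    # order_id -> deadline still being waited on
        self.timers = []     # min-heap of (deadline, order_id)
        self.timeouts = []   # synthetic timeout events emitted so far

    def process_element(self, order_id, event_type, ts):
        if event_type == "created":
            # Register a timer when the order enters the window.
            deadline = ts + self.timeout
            self.pending[order_id] = deadline
            heapq.heappush(self.timers, (deadline, order_id))
        elif event_type == "collected":
            # Collected in time: the registered timer becomes a no-op.
            self.pending.pop(order_id, None)

    def advance_time(self, now):
        """Fire every timer whose deadline is at or before `now`."""
        while self.timers and self.timers[0][0] <= now:
            deadline, order_id = heapq.heappop(self.timers)
            self.on_timer(order_id, deadline)

    def on_timer(self, order_id, deadline):
        # Emit only if the order is still waiting on this exact deadline.
        if self.pending.get(order_id) == deadline:
            del self.pending[order_id]
            self.timeouts.append(
                {"order_id": order_id, "timed_out_at": deadline}
            )


detector = TimeoutDetector(timeout=10)
detector.process_element("o1", "created", 0)
detector.process_element("o2", "created", 2)
detector.process_element("o1", "collected", 5)
detector.advance_time(20)
```

In this run only `o2` times out; `o1` was collected before its deadline, so its timer fires without emitting anything.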

3. From Manual to Intelligent Optimization

Flink's built-in optimizations reduce the need for manual tuning: MiniBatch buffers records to cut state access, while LocalGlobal and PartialFinal aggregation mitigate data skew. AutoScaling predicts required resources from upstream QPS, simplifying configuration for both peak and regular workloads.
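The intuition behind local-global aggregation can be shown with a toy count over skewed keys. This is a sketch of the general two-phase technique, not Flink's internals; the batches and key names are made up.

```python
from collections import Counter


def local_aggregate(batch):
    """Local phase: pre-aggregate one mini-batch into partial counts."""
    return Counter(batch)


def global_aggregate(partials):
    """Global phase: merge the partial counts from all upstream tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Three upstream mini-batches in which the key "hot" dominates.
batches = [
    ["hot", "hot", "hot", "hot", "cold"],
    ["hot", "hot", "hot"],
    ["hot", "cold", "cold"],
]
partials = [local_aggregate(b) for b in batches]

records_without_local = sum(len(b) for b in batches)   # one record per event
records_with_local = sum(len(p) for p in partials)     # one record per key per batch
result = global_aggregate(partials)
```

Here the task that owns the key `hot` receives three partial counts instead of eight raw records, which is how the local phase takes pressure off a skewed key.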

04. Data Service Upgrade

Cainiao introduced the "TianGong" middleware to provide a unified database access standard, centralized permission control, and end-to-end guarantees. It exposes NoSQL stores such as HBase through a SQL interface, supports cross-source data joins, and adds service safeguards such as automatic failover, read-write splitting, slow-query detection, and rate limiting.
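Of the safeguards listed, rate limiting is the easiest to show in isolation. The source does not say which algorithm TianGong uses, so the token bucket below is a swapped-in, commonly used technique, and the class name and parameters are hypothetical. Time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate                # tokens refilled per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now                 # time of the previous refill

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=1, capacity=2)
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)]
```

The third request at time 0 is rejected because the burst allowance is used up, and one second later a single token has been refilled.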

05. Other Tooling Exploration and Innovation

A real-time load-testing tool was built to simulate traffic spikes and generate reports, while Flink-based monitoring tracks latency and checkpoint status and raises TPS alerts.

06. Future Development and Thoughts

Cainiao plans to evolve toward batch-stream hybrid processing and AI integration. Planned work includes using Flink's batch capabilities to read offline dimension tables from MaxCompute, handling state loss during restarts, and exploring intelligent features such as de-duplication and full-link real-time guarantees.

