Building a Real-Time Data Warehouse at Cainiao: Architecture, Model Upgrades, Engine Enhancements, and Service Innovations
This article shares Cainiao's practical experience in constructing a real-time data warehouse, covering the shortcomings of the previous architecture, the evolution of data models, the migration to Flink with advanced features like retraction and timer services, and the modernization of data services and tooling to support high‑throughput logistics scenarios.
01. Previous Real-Time Data Architecture
The earlier architecture suffered from chaotic data model layering, a high cost of using data, siloed development with little reuse, and significant consistency gaps across business lines, all of which made BI queries difficult.
Real-Time Computation
Cainiao initially used Alibaba Cloud JStorm and Spark Streaming, which could handle many cases but struggled with logistics supply‑chain requirements, failing to balance functionality, performance, stability, and rapid fault recovery.
Data Service
Sinking real‑time results into MySQL, HBase, and other stores was inflexible, BI permission control was weak, and end‑to‑end delivery guarantees were unreliable.
02. Data Model Upgrade
1. Model Layering
Inspired by offline warehouses, real‑time data is layered: the first layer ingests data from MySQL into TT messaging middleware, joins with HBase dimension tables to produce wide fact tables, and writes back to TT. Two downstream layers—light aggregation and heavy aggregation—are built by subscribing to the TT stream.
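The layering described above can be sketched in plain Python. This is only an illustration of the data flow (dimension join in the detail layer, then a light aggregation layer); the table, field, and function names are hypothetical, and a dict stands in for the HBase dimension table.

```python
# Sketch of the first (detail) layer: enrich raw order events with
# dimension attributes to produce a wide fact record, then feed the
# records to a downstream "light aggregation" layer. Names are illustrative.

DIM_WAREHOUSE = {  # stands in for an HBase dimension table
    "wh-01": {"city": "Hangzhou", "region": "East"},
}

def to_wide_fact(event, dim_table):
    """Join one raw event with its dimension row (the row may be missing)."""
    dim = dim_table.get(event["warehouse_id"], {})
    return {**event, **dim}

def light_aggregate(facts):
    """Downstream layer: count orders per region from the wide facts."""
    counts = {}
    for f in facts:
        region = f.get("region", "unknown")
        counts[region] = counts.get(region, 0) + 1
    return counts

events = [{"order_id": "o1", "warehouse_id": "wh-01"},
          {"order_id": "o2", "warehouse_id": "wh-99"}]
facts = [to_wide_fact(e, DIM_WAREHOUSE) for e in events]
stats = light_aggregate(facts)
```

In the real pipeline the wide facts are written back to TT, and the aggregation layers subscribe to that stream rather than consuming an in-memory list.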
2. Pre‑Split
A public data middle layer aggregates all business lines, then each line performs a horizontal split to create its own business‑specific middle layer, enabling resource‑saving upstream processing.
3. Cainiao Real‑Time Data Model
The public middle layer contains global order and logistics details; a split task separates data for domestic, import, and export supply chains, improving data usability and distinguishing tables for dashboards versus analytical queries.
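The pre‑split step amounts to routing each record from the public middle layer to the stream of its business line. A minimal sketch, with hypothetical field names:

```python
def split_by_line(records, line_of):
    """Route each public-middle-layer record to its business line's stream.
    `line_of` extracts the business line (domestic / import / export)."""
    streams = {}
    for r in records:
        streams.setdefault(line_of(r), []).append(r)
    return streams

records = [{"order_id": "o1", "line": "domestic"},
           {"order_id": "o2", "line": "import"},
           {"order_id": "o3", "line": "domestic"}]
streams = split_by_line(records, lambda r: r["line"])
```

Because the split happens once in a shared upstream task, each business line's middle layer only processes its own slice, which is where the resource saving comes from.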
03. Compute Engine Enhancement
In 2017, Cainiao switched from JStorm/Spark to Flink, leveraging Flink's full SQL support, state‑based retraction for order cancellations, CEP for timeout statistics, and auto‑scaling for resource optimization.
1. Retraction
Using Flink's last_value function, the engine captures the latest non‑null message for each order, automatically retracting outdated values to ensure correct aggregation.
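The retraction semantics can be illustrated without Flink: keep the latest non‑null status per order as state, and when it changes, withdraw the old value's contribution from the aggregate before adding the new one. This is a toy analogue of `last_value` over a retract stream, not Flink's actual implementation:

```python
def apply_update(state, totals, order_id, status):
    """Keep the latest non-null status per order; when it changes,
    retract the old status's contribution and add the new one."""
    if status is None:               # last non-null wins: ignore null updates
        return
    old = state.get(order_id)
    if old is not None:
        totals[old] -= 1             # retraction: withdraw the old contribution
        if totals[old] == 0:
            del totals[old]
    state[order_id] = status
    totals[status] = totals.get(status, 0) + 1

state, totals = {}, {}
updates = [("o1", "created"), ("o1", "shipped"),
           ("o2", "created"), ("o2", None)]
for oid, st in updates:
    apply_update(state, totals, oid, st)
# totals now counts each order's latest status exactly once
```

Without the retraction step, "o1" would be counted under both `created` and `shipped`, which is exactly the double-counting the mechanism prevents.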
2. Real‑Time Timeout Statistics
Flink's Timer Service is customized (by overriding processElement and onTimer) to generate synthetic timeout events for orders that have not been collected within a defined window, enabling accurate timeout counting.
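The pattern can be sketched as a toy analogue of a Flink `KeyedProcessFunction`: `process_element` registers a per‑order deadline (the timer) and cancels it when the pickup event arrives; `on_timer` emits a synthetic timeout event for deadlines that fire with no pickup. The class, event names, and the six‑hour window are illustrative, not Cainiao's actual values:

```python
class TimeoutDetector:
    """Toy analogue of a keyed process function with a timer service:
    register a deadline per order; if no 'collected' event arrives
    before it fires, emit a synthetic timeout event."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.deadlines = {}   # order_id -> deadline (per-key timer state)
        self.emitted = []     # synthetic timeout events

    def process_element(self, order_id, event_type, ts):
        if event_type == "ordered":
            self.deadlines[order_id] = ts + self.timeout_ms
        elif event_type == "collected":
            self.deadlines.pop(order_id, None)   # cancel the timer

    def on_timer(self, now):
        for oid, deadline in list(self.deadlines.items()):
            if now >= deadline:
                self.emitted.append((oid, "pickup_timeout"))
                del self.deadlines[oid]

d = TimeoutDetector(timeout_ms=6 * 60 * 60 * 1000)   # hypothetical 6h window
d.process_element("o1", "ordered", 0)
d.process_element("o2", "ordered", 0)
d.process_element("o2", "collected", 100)            # o2 collected in time
d.on_timer(6 * 60 * 60 * 1000)                       # window elapses
```

Downstream aggregations can then count the synthetic `pickup_timeout` events like any other message, which is what makes the timeout statistics accurate without polling.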
3. From Manual to Intelligent Optimization
Flink's built‑in mechanisms (MiniBatch, LocalGlobal, PartialFinal) mitigate data skew, while AutoScaling predicts required resources based on upstream QPS, simplifying configuration for both peak and regular workloads.
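The idea behind LocalGlobal can be shown with a two‑phase count: each parallel subtask pre‑aggregates its own partition so that a hot key contributes one small accumulator per subtask instead of one record per event, and a second phase merges the accumulators. A simplified sketch, not Flink's internals:

```python
def local_aggregate(partition):
    """Phase 1: pre-aggregate within one parallel subtask, shrinking the
    volume of records shuffled for a skewed (hot) key."""
    acc = {}
    for key, value in partition:
        acc[key] = acc.get(key, 0) + value
    return acc

def global_aggregate(local_accs):
    """Phase 2: merge the small per-subtask accumulators into the final result."""
    total = {}
    for acc in local_accs:
        for key, value in acc.items():
            total[key] = total.get(key, 0) + value
    return total

# "hot" is skewed: most records in each partition carry the same key.
partitions = [[("hot", 1)] * 5 + [("cold", 1)],
              [("hot", 1)] * 3]
result = global_aggregate([local_aggregate(p) for p in partitions])
```

MiniBatch batches inputs before firing the aggregation, and PartialFinal applies the same split‑then‑merge idea to distinct aggregates; all three trade a little latency for far less pressure on the hot key.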
04. Data Service Upgrade
Cainiao introduced the "TianGong" middleware to provide a unified database access standard, centralized permission control, and end‑to‑end guarantees. It translates NoSQL queries (e.g., HBase) into SQL, supports cross‑source data joins, and adds service safeguards such as automatic failover, read‑write splitting, slow‑query detection, and rate limiting.
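Two of TianGong's roles, translating a key‑value lookup into a SQL‑style query and enforcing service safeguards, can be sketched as a small facade. Everything here is illustrative (class name, rate‑limit shape, SQL template); it only shows the unified‑access idea, not the real middleware:

```python
class UnifiedQueryFacade:
    """Toy sketch of a unified data-access layer: one query API, translated
    per backend (here, an HBase-style rowkey get becomes SQL), with a simple
    per-caller request cap standing in for rate limiting."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.calls = {}   # caller -> request count

    def to_sql(self, table, rowkey_col, rowkey):
        """Translate a rowkey lookup into a SQL-style statement."""
        return f"SELECT * FROM {table} WHERE {rowkey_col} = '{rowkey}'"

    def query(self, caller, table, rowkey_col, rowkey):
        count = self.calls.get(caller, 0)
        if count >= self.max_requests:
            raise RuntimeError("rate limited")   # service safeguard
        self.calls[caller] = count + 1
        return self.to_sql(table, rowkey_col, rowkey)

facade = UnifiedQueryFacade(max_requests=2)
sql = facade.query("bi_dashboard", "orders", "order_id", "o1")
```

In the real system the same facade layer would also hide failover, read‑write splitting, and slow‑query detection behind this one entry point, so BI callers never touch backend‑specific clients directly.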
05. Other Tooling Exploration and Innovation
A real‑time load‑testing tool was built to simulate traffic spikes and generate reports, while Flink‑based monitoring tracks latency, checkpoints, and TPS alerts.
06. Future Development and Thoughts
Cainiao plans to evolve toward batch‑stream hybrid processing and AI integration, leveraging Flink's batch capabilities to read offline dimension tables from MaxCompute, handling state loss during restarts, and exploring intelligent features such as de‑duplication and full‑link real‑time guarantees.
The article concludes with thanks and invites readers to join the DataFunTalk community for further big‑data and AI discussions.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.