Xiaomi Sales Data Warehouse: Architecture, Construction Theory, and Capability Evolution
This article introduces Xiaomi's sales data warehouse practice: its development history, positioning, architecture, dimensional modeling, layering theory, and capability building. It also covers real‑time and batch processing under a Lambda architecture built on Iceberg, Flink, and Hologres, and closes with future trends and a Q&A.
Introduction – The article presents the practice of Xiaomi's data‑middle‑platform department in building a sales data warehouse, outlining its evolution, positioning, content, role, and scale.
1. Sales Data Warehouse Overview – Describes the warehouse’s development from siloed warehouses before 2019 to a unified platform guided by the ABC (AI, Big data, Cloud) strategy, detailing its data sources (orders, products, stores, after‑sales, logistics, logs) and the dimensions modeled for orders, logistics, and user behavior.
2. Warehouse Construction Theory – Explains business analysis, theme domain definition, fact and dimension table design, dimensional modeling, layer separation (ODS, DWD, DWM, DIM, DM, ADS, TMP), and key modeling principles such as high cohesion, low coupling, public logic sinking, cost‑performance balance, consistency, and data rollback.
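The layer separation described above can be illustrated with a small lineage check. This is only a sketch under stated assumptions: the source lists the layers but does not spell out their dependency rules, so the ordering ODS → DWD → DWM → DM → ADS, the rule that DIM is readable from any layer, the `layer_tablename` naming convention, and all table names below are hypothetical.

```python
# Assumed layer order, lowest to highest. The source names these layers but
# does not state the dependency rules used at Xiaomi.
LAYER_RANK = {"ods": 0, "dwd": 1, "dwm": 2, "dm": 3, "ads": 4}
SHARED_LAYERS = {"dim"}  # assumption: dimension tables are readable anywhere


def layer_of(table: str) -> str:
    """Return the layer prefix of a name such as 'dwd_order_detail'."""
    return table.split("_", 1)[0].lower()


def violations(lineage: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (table, upstream) pairs where a table reads from a higher layer."""
    bad = []
    for table, upstreams in lineage.items():
        target = layer_of(table)
        for upstream in upstreams:
            source = layer_of(upstream)
            if source in SHARED_LAYERS:
                continue
            if target not in LAYER_RANK or source not in LAYER_RANK:
                continue  # TMP, DIM targets, or unknown prefixes: not checked here
            if LAYER_RANK[source] > LAYER_RANK[target]:
                bad.append((table, upstream))
    return bad


# Hypothetical lineage: the last entry reads from ADS into DWD.
lineage = {
    "dwd_order_detail": ["ods_order", "dim_product"],
    "dwm_order_daily": ["dwd_order_detail"],
    "dwd_refund_detail": ["ads_sales_report"],
}
print(violations(lineage))
```

A check of this kind is one way to enforce "low coupling" mechanically: shared logic sinks into lower layers, and higher layers never feed back into them.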
3. Architecture – Shows that Xiaomi adopts a Lambda architecture: batch processing with Spark + Hive, stream processing with Flink + Talos, warehouse (DW) layers accelerated by Hologres, and integration of offline and real‑time data. Discusses challenges such as state expiration in Flink and solutions based on offline streams.
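The integration of offline and real‑time data in a Lambda architecture can be sketched as a merge of a batch view and a speed view. The code below is a generic illustration, not Xiaomi's implementation: the function name `merge_views`, the per-day sales figures, and the cutoff rule (batch results win up to the last completed batch day, streaming results fill in afterwards) are all assumptions.

```python
from datetime import date


def merge_views(batch_view, speed_view, batch_cutoff):
    """Serve batch results up to batch_cutoff and streaming results after it."""
    # Batch output is treated as authoritative for every day it has completed.
    merged = {d: v for d, v in batch_view.items() if d <= batch_cutoff}
    # The speed layer only contributes days the batch job has not covered yet.
    for d, v in speed_view.items():
        if d > batch_cutoff:
            merged[d] = v
    return merged


# Hypothetical daily sales totals.
batch_view = {date(2024, 1, 1): 100.0, date(2024, 1, 2): 120.0}
speed_view = {date(2024, 1, 2): 118.5, date(2024, 1, 3): 40.0}

merged = merge_views(batch_view, speed_view, batch_cutoff=date(2024, 1, 2))
for day in sorted(merged):
    print(day, merged[day])
```

Letting the batch result overwrite the streaming result for completed days is also one common way to correct streaming figures that drifted because of expired state, since the offline recomputation does not depend on state held by the streaming job.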
4. Capability Layer – Highlights unified data architecture, real‑time minute‑level processing on Iceberg, Flink + Talos for second‑level streaming, strict development and quality standards, data security compliance (GDPR, privacy), and the use of a data encyclopedia for metric definitions and governance.
5. Summary and Outlook – Summarizes the achievements of the offline sales warehouse, its extensive usage across the company, and future directions focusing on data value‑creation and real‑time metrics.
Q&A – Provides answers to six questions covering refund handling, permission layers, replacement of Kudu with Hologres, DWD/DWM distinctions, access to lower layers, and storage of dimension metrics.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.