Design and Practice of Xiaomi’s One‑Stop Data Production Platform
This article presents a comprehensive overview of Xiaomi’s data production platform: the full data lifecycle, the technology‑driven product design methodology, the platform’s architecture and core capabilities, and real‑world case studies and a Q&A session that illustrate how the system improves data collection, storage, processing, and usage across the organization.
The article introduces Xiaomi’s one‑stop data production platform, which was built to unify fragmented data development tools and to address problems with permission management, performance, and data consistency.
It explains the complete data lifecycle in five stages—generation, collection, storage, processing, and application—using a water‑cycle analogy, and emphasizes that the platform focuses on the first four stages, collectively called the data production chain.
Detailed discussions cover data generation (online and offline sources), collection methods (client‑side, server‑side, IoT devices, questionnaires, and web crawling), and storage choices (relational databases, NoSQL, message queues, file systems, and big‑data stores, with Xiaomi adopting Iceberg and Hologres).
The processing phase is broken down into ETL, data cleaning, and offline/real‑time development, highlighting how these steps transform raw data into valuable assets for downstream applications.
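The cleaning step mentioned above can be sketched as follows. This is a minimal illustration, not Xiaomi's implementation: the record fields (`event_id`, `user_id`, `ts`) and the function name `clean_events` are hypothetical, chosen only to show the usual pattern of dropping malformed records, removing duplicates, and normalizing fields.

```python
from datetime import datetime, timezone


def clean_events(raw_events):
    """Drop malformed records, de-duplicate by event_id, and normalize fields.

    The field names here are hypothetical; a real pipeline would follow
    its own schema.
    """
    seen_ids = set()
    cleaned = []
    for event in raw_events:
        event_id = event.get("event_id")
        ts = event.get("ts")
        # Skip records missing the fields that downstream jobs depend on.
        if event_id is None or ts is None:
            continue
        # Keep only the first occurrence of each event.
        if event_id in seen_ids:
            continue
        seen_ids.add(event_id)
        cleaned.append({
            "event_id": event_id,
            # Normalize the user id to a trimmed string.
            "user_id": str(event.get("user_id", "")).strip(),
            # Convert epoch seconds to an ISO 8601 UTC timestamp.
            "ts": datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(),
        })
    return cleaned


raw = [
    {"event_id": 1, "user_id": " u1 ", "ts": 0},
    {"event_id": 1, "user_id": "u1", "ts": 0},   # duplicate
    {"user_id": "u2", "ts": 5},                  # missing event_id
]
cleaned = clean_events(raw)
```

In an offline job the same logic would typically run as a batch over a partition, while a real‑time job would apply it per record with a bounded de‑duplication window.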
The article then presents a technology‑driven product perspective, describing core characteristics such as heavy reliance on underlying technology, a focus on performance, and a user base of engineers; it also discusses the role of product managers in translating technical capabilities into user‑friendly solutions.
Three practical case studies illustrate platform evolution: (1) a technology‑led architecture upgrade introducing Iceberg for a data lake, (2) a product‑driven efficiency improvement using drag‑and‑drop lineage and SQL parsing, and (3) performance enhancements by integrating Presto and Spark 3.x, enabling multi‑engine queries.
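To make the idea of deriving lineage from SQL concrete, here is a rough sketch that extracts source‑to‑target table edges from a single INSERT statement. It is an assumption‑laden illustration: the summary does not describe the platform's actual parser, and a production system would use a full SQL parser rather than regular expressions. The function name `extract_lineage` and the table names are hypothetical.

```python
import re

# Target table of an INSERT INTO / INSERT OVERWRITE [TABLE] statement.
_TARGET = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
# Tables that appear directly after FROM or JOIN.
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql):
    """Return (source_table, target_table) edges found in one statement.

    Regex-based sketch only: it ignores CTEs, quoted identifiers, and
    comments, which a real parser would have to handle.
    """
    targets = _TARGET.findall(sql)
    sources = sorted(set(_SOURCE.findall(sql)) - set(targets))
    return [(src, tgt) for tgt in targets for src in sources]


sql = (
    "INSERT OVERWRITE TABLE dw.orders_daily "
    "SELECT o.id, u.name FROM ods.orders o "
    "JOIN ods.users u ON o.uid = u.id"
)
edges = extract_lineage(sql)
```

Edges of this kind are what a lineage graph is built from; a visual editor can then render them as nodes and connections.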
The article then outlines the platform’s construction roadmap, from initial analysis and MVP development (0→1) to scaling and standardization (1→10), and finally to continuous innovation while maintaining focus on technical users.
A Q&A section answers common questions about monitoring mechanisms, engine division of labor between Presto and Doris, the roles of data lake versus data warehouse, and product manager involvement in technology selection.
Overall, the piece demonstrates how a well‑designed, technology‑driven data production platform can streamline data workflows, improve efficiency, and support a wide range of analytical and operational use cases within a large organization.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.