Ant Group's Real-Time Data Warehouse Architecture, Solutions, and Data Lake Outlook
This article presents Ant Group's recent explorations and practices in real-time data warehousing, covering the system architecture, streaming data quality assurance, flow‑batch integrated applications, and future data lake integration, while sharing technical details and operational insights for large‑scale data processing.
Introduction – The article shares Ant Group's exploration and practice in the real‑time data warehouse field over the past two to three years.
1. Real‑time Data Warehouse Architecture – The architecture consists of six modules: compute engine, development platform, compute resources, real‑time assets, development tools, and data quality. It handles data sources such as online logs, database logs, and real‑time messages, using Flink and ODPS for computation and various storage engines (SLS, Explorer, HBase). Real‑time meta tables define and manage assets, covering schema, production, consumption, and quality management.
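The "real-time meta table" idea can be sketched as a single logical definition that bundles schema with production, consumption, and quality configuration. This is a minimal illustrative sketch, not Ant Group's actual implementation; all class, field, and job names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MetaTable:
    """Hypothetical real-time meta table: one definition carrying the
    schema plus production, consumption, and quality configuration."""
    name: str
    schema: dict                                        # column name -> type
    producers: list = field(default_factory=list)       # jobs writing the table
    consumers: list = field(default_factory=list)       # jobs/dashboards reading it
    quality_rules: list = field(default_factory=list)   # e.g. "uv >= 0"

    def register_producer(self, job: str) -> None:
        self.producers.append(job)

# Illustrative usage: one asset definition shared by all jobs that touch it.
pay_events = MetaTable(
    name="dwd_pay_event_rt",
    schema={"user_id": "BIGINT", "amount": "DECIMAL", "ts": "TIMESTAMP"},
)
pay_events.register_producer("flink_job_pay_etl")
```

The point of such a structure is that schema changes and quality rules live in one place, so every producing and consuming job sees a consistent asset definition.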
2. Real‑time Data Solution – Describes conversion attribution using real‑time graph databases, handling high traffic with Flink joins or HBase dimension tables, and generating real‑time conversion attribution tables. It also discusses deduplication strategies (minute‑level aggregation, Flink Cumulate Window, dimension‑table join, HyperLogLog, Bitmap) for PV/UV counting during high‑traffic events.
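Of the deduplication strategies listed, the bitmap approach is the easiest to sketch: each user ID flips one bit, so duplicates are absorbed for free and UV is a population count. This toy version uses a plain Python integer as the bitmap and assumes dense integer user IDs; production pipelines would typically use RoaringBitmap or HyperLogLog state inside the Flink job instead.

```python
# Minimal sketch of bitmap-based UV deduplication (assumption: user IDs
# are small dense integers; a real job would use RoaringBitmap/HLL state).
class BitmapUV:
    def __init__(self) -> None:
        self.bits = 0  # arbitrary-length Python int used as a bitmap

    def add(self, user_id: int) -> None:
        self.bits |= 1 << user_id  # duplicate adds are idempotent

    def uv(self) -> int:
        return bin(self.bits).count("1")  # population count = distinct users

uv = BitmapUV()
for uid in [3, 7, 3, 42, 7]:
    uv.add(uid)
print(uv.uv())  # 3 distinct users
```

The trade-off mirrors the one in the talk: bitmaps give exact counts and support merging across windows, while HyperLogLog trades a small error for much lower memory on very high-cardinality keys.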
3. Real‑time Data Quality Assurance – Quality is ensured in pre‑ and mid‑process stages. Pre‑process includes code debugging, pressure testing, and setting rate limits. Mid‑process monitors task exceptions (delay, failover, checkpoint failures) and data anomalies (zero values, variance, threshold breaches). End‑to‑end monitoring covers source, runtime, sink, and consumption layers, with metrics collected into dashboards.
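The mid-process data-anomaly checks above (zero values, variance, threshold breaches) can be sketched as a simple rule evaluator over a metric window. The function name, rule names, and thresholds below are illustrative assumptions, not the monitoring system described in the talk.

```python
# Hedged sketch of mid-process metric checks: flag a metric window for
# zero values, threshold breaches, or an unusually large variance.
from statistics import pstdev

def check_metric(window: list, lo: float, hi: float, max_std: float) -> list:
    alerts = []
    if any(v == 0 for v in window):
        alerts.append("zero_value")               # data suddenly went dark
    if any(not (lo <= v <= hi) for v in window):
        alerts.append("threshold_breach")         # value outside expected band
    if len(window) > 1 and pstdev(window) > max_std:
        alerts.append("variance_spike")           # metric is oscillating wildly
    return alerts

# A healthy window raises nothing; a broken one raises all three.
print(check_metric([120.0, 0.0, 5000.0], lo=10, hi=1000, max_std=500))
```

In practice such checks would run per metric per time window, with the resulting alerts feeding the dashboards mentioned above.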
4. Flow‑Batch Integrated Application – Explains the need for aligning streaming and batch data schemas, using virtual columns to handle mismatched fields, and a hybrid meta‑table approach to compute cumulative metrics in a single job. It details Flink batch engine architecture (scheduler on K8s, engine layer optimizations, connector tuning, Shuffle mechanisms) and the integration of batch tasks into the offline scheduler.
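The virtual-column idea for mismatched fields can be sketched as padding whichever side (stream or batch) lacks a column with a null, so both feeds present one unified schema to a single job. Column names and records below are hypothetical, chosen only to illustrate the alignment step.

```python
# Sketch of virtual-column schema alignment between stream and batch feeds.
# Assumption: both sides should converge on this unified column list.
UNIFIED_SCHEMA = ["user_id", "amount", "event_ts", "partition_dt"]

def align(record: dict, schema=UNIFIED_SCHEMA) -> dict:
    # Columns the record lacks become virtual columns holding None.
    return {col: record.get(col) for col in schema}

stream_rec = {"user_id": 1, "amount": 9.9, "event_ts": "2024-01-01 10:00"}
batch_rec = {"user_id": 1, "amount": 9.9, "partition_dt": "20240101"}

print(align(stream_rec)["partition_dt"])  # None: virtual column for the stream side
```

With both sides aligned to one schema, a single job can union the feeds and maintain cumulative metrics, which is the hybrid meta-table pattern the section describes.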
5. Data Lake Outlook – Discusses extending the flow‑batch integration to storage, adopting Paimon as the data‑lake component for near‑real‑time computation from ODS to ADM layers. It highlights three scheduling granularities (daily, hourly, real‑time) and how Paimon simplifies the data pipeline by providing a unified compute engine, storage, and asset management.
Conclusion – The presentation summarizes the architecture, solutions, quality measures, flow‑batch integration, and data‑lake plans for Ant Group's real‑time data warehouse.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.