
Ant Group's Real-Time Data Warehouse Architecture, Solutions, and Data Lake Outlook

This article presents Ant Group's recent exploration of real-time data warehouse architecture, covering its six-module design, data quality assurance mechanisms, stream‑batch unified processing with Flink and ODPS, and a forward‑looking data lake solution built on Paimon, offering practical insights for large‑scale streaming analytics.

DataFunTalk

Ant Group shares its exploration and practice in the real‑time data warehouse field over the past two to three years, outlining the overall architecture, data quality guarantees, stream‑batch integration, and future data lake plans.

The real‑time data warehouse consists of six core modules: compute engine, development platform, compute resources, real‑time assets, development tools, and data quality. Together they cover the entire real‑time data development lifecycle and address challenges such as asset management, resource control, and platform capability.

Data quality is ensured through both pre‑deployment measures (debugging, stress testing, rate limiting) and in‑process monitoring (task exception metrics, DQC rules, baseline monitoring), providing end‑to‑end assurance of timeliness and accuracy.
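As a rough illustration of the in‑process side, the sketch below evaluates a few DQC‑style rules against a snapshot of task metrics. It is not Ant Group's implementation: the rule names, metric names, and thresholds (`event_delay_seconds`, `null_ratio`, `restart_count`) are hypothetical and chosen only to show one timeliness check, one accuracy check, and one task exception check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DqcRule:
    """One data quality rule: a metric name plus a pass/fail predicate."""
    name: str
    metric: str
    check: Callable[[float], bool]


def evaluate(rules: List[DqcRule], metrics: Dict[str, float]) -> List[str]:
    """Return the names of rules that are violated.

    A metric that is missing from the snapshot counts as a violation,
    since a silent gap is itself a quality problem.
    """
    violations = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None or not rule.check(value):
            violations.append(rule.name)
    return violations


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    DqcRule("latency_within_baseline", "event_delay_seconds", lambda v: v <= 60),
    DqcRule("null_ratio_low", "null_ratio", lambda v: v < 0.01),
    DqcRule("no_task_restart", "restart_count", lambda v: v == 0),
]

sample = {"event_delay_seconds": 95.0, "null_ratio": 0.002, "restart_count": 0}
violated = evaluate(RULES, sample)
print(violated)  # ['latency_within_baseline']
```

In this sketch only the timeliness rule fires, because the sample delay of 95 seconds exceeds the assumed 60‑second baseline.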

To bridge streaming and batch workloads, Ant Group adopts a unified stream‑batch approach using Flink and ODPS, low‑code development, virtual columns for schema alignment, and a mixed meta‑table that enables a single task to compute cumulative metrics across real‑time and offline data.
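The idea of virtual columns and a mixed meta‑table can be sketched in plain Python, without Flink or ODPS. The example assumes that the offline source lacks an `event_time` field and the streaming source lacks a `dt` partition field. These column names, the `None` defaults, and the helper functions are all hypothetical. Virtual columns pad each source to a common schema, after which a single aggregation runs over the union.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical virtual columns: fields a source lacks, filled with a default.
OFFLINE_VIRTUAL = {"event_time": None}  # assumed: offline table has no event_time
STREAM_VIRTUAL = {"dt": None}           # assumed: stream has no partition column


def align(rows: List[Row], virtual_columns: Row) -> List[Row]:
    """Pad each row with virtual columns; real values override the defaults."""
    return [{**virtual_columns, **row} for row in rows]


def mixed_table(offline_rows: List[Row], stream_rows: List[Row]) -> List[Row]:
    """Union both sources after aligning them to one schema."""
    return align(offline_rows, OFFLINE_VIRTUAL) + align(stream_rows, STREAM_VIRTUAL)


def cumulative_amount(rows: List[Row]) -> Dict[str, float]:
    """Compute a cumulative metric (total amount per user) in a single pass."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["user_id"]] += row["amount"]
    return dict(totals)


offline = [
    {"user_id": "u1", "amount": 100.0, "dt": "20240101"},
    {"user_id": "u2", "amount": 40.0, "dt": "20240101"},
]
stream = [{"user_id": "u1", "amount": 5.0, "event_time": "2024-01-02T09:00:00"}]

rows = mixed_table(offline, stream)
totals = cumulative_amount(rows)
print(totals)  # {'u1': 105.0, 'u2': 40.0}
```

A real job would also need to ensure that the offline and real‑time sources do not overlap in time, so that no record is counted twice. That step is left out here.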

Looking ahead, the team plans to integrate a data lake using Paimon, consolidating storage and compute across ODS, ADM, and other layers, simplifying the pipeline with three scheduling granularities (daily, hourly, real‑time) and a unified asset model.
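One way to picture the three scheduling granularities is as a choice driven by how stale a table is allowed to be. The sketch below is an assumption made for illustration, not a rule stated in the talk: the `pick_granularity` helper and its one‑hour and one‑day cutoffs are hypothetical.

```python
from enum import Enum


class Granularity(Enum):
    """The three scheduling granularities named in the outlook."""
    DAILY = "daily"
    HOURLY = "hourly"
    REALTIME = "real-time"


def pick_granularity(max_staleness_seconds: int) -> Granularity:
    """Pick the coarsest granularity that still meets a freshness target.

    The cutoffs (one day, one hour) are assumptions for this sketch.
    """
    if max_staleness_seconds >= 86_400:
        return Granularity.DAILY
    if max_staleness_seconds >= 3_600:
        return Granularity.HOURLY
    return Granularity.REALTIME


for target in (172_800, 7_200, 30):
    print(target, pick_granularity(target).value)
```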

The presentation concludes with a summary of these innovations and an invitation for further discussion.

Tags: Big Data, Flink, stream processing, data quality, real-time data warehouse, Data Lake