Big Data · 15 min read

Ant Group's Real-Time Data Warehouse Architecture, Solutions, and Data Lake Outlook

This article presents Ant Group's recent explorations and practices in real-time data warehousing, covering its modular architecture, data quality assurance mechanisms, stream‑batch integration techniques, graph‑based conversion attribution, and future data‑lake implementation using Paimon.

DataFunSummit
Ant Group's real‑time data warehouse consists of six core modules—computation engine, development platform, compute resources, real‑time assets, development tools, and data quality—forming an end‑to‑end pipeline that addresses challenges such as asset management, resource control, and platform robustness.

The solution emphasizes continuous data quality monitoring both before and during execution, employing task‑level diagnostics, stress testing, rate limiting, and real‑time anomaly detection for tasks (e.g., latency, failover, checkpoint failures) and for data (e.g., zero values, trend deviations, threshold breaches).
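The data-side checks named above can be sketched as a small rule set over a metric series. This is an illustrative Python sketch under assumed thresholds, not Ant's monitoring API; the function name and alert labels are hypothetical.

```python
from statistics import mean, stdev

def detect_anomalies(series, lower, upper, z_limit=3.0):
    """Flag data-quality issues in a metric series: zero values,
    threshold breaches, and trend deviations (simple z-score rule).
    Names and thresholds are illustrative, not Ant Group's API."""
    alerts = []
    mu, sigma = mean(series), stdev(series)
    for i, v in enumerate(series):
        if v == 0:
            alerts.append((i, "zero_value"))          # metric dropped to zero
        if not (lower <= v <= upper):
            alerts.append((i, "threshold_breach"))    # outside static bounds
        if sigma > 0 and abs(v - mu) / sigma > z_limit:
            alerts.append((i, "trend_deviation"))     # far from recent trend
    return alerts
```

In production such rules would run against windowed aggregates emitted by the streaming job, with alerts routed to the task-level diagnostics described above.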

For conversion attribution, a real‑time graph model links user, traffic, and conversion events; paths are generated, cycles are removed, and the results are flattened into attribution tables for downstream analysis, to be supplemented in the future by edge‑side logging and near‑real‑time data‑lake construction.
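The path-generation step above can be illustrated with a minimal sketch: collapse a user's ordered touchpoint sequence into a cycle-free path (a repeated touchpoint closes a cycle), then flatten it into attribution rows. The function and row layout here are hypothetical, not Ant's implementation.

```python
def flatten_attribution_path(touchpoints):
    """Remove cycles from an ordered touchpoint path (keep the first
    visit of each repeated node) and flatten it into attribution rows
    of (touchpoint, position, path_length). Illustrative sketch only."""
    seen, path = set(), []
    for tp in touchpoints:
        if tp not in seen:      # a revisit would close a cycle; drop it
            seen.add(tp)
            path.append(tp)
    return [(tp, pos, len(path)) for pos, tp in enumerate(path)]
```

A row like `("feed", 1, 3)` then lets downstream jobs apply first-touch, last-touch, or positional attribution without re-walking the graph.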

Stream‑batch integration is achieved by aligning streaming and batch schemas, using virtual columns for mismatched fields, and leveraging hybrid meta‑tables to compute daily aggregates, with Flink handling both real‑time and batch workloads, optimized through K8s scheduling, BlinkSQL extensions, adaptive batch scheduling, and remote shuffle mechanisms.
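The schema-alignment idea can be sketched in plain Python: project every record, streaming or batch, onto one unified schema, filling fields the record lacks as null "virtual columns" so downstream aggregation sees a single shape. The schema and field names below are assumed for illustration; in practice this mapping would be expressed in Flink SQL over the hybrid meta-table.

```python
def align_to_schema(record, target_schema):
    """Project a record onto the unified stream/batch schema.
    Missing fields become virtual columns filled with None,
    so streaming and batch rows union cleanly. Hypothetical sketch."""
    return {col: record.get(col) for col in target_schema}

# Assumed unified schema: batch rows carry batch_date, stream rows do not.
UNIFIED_SCHEMA = ["user_id", "event_time", "amount", "batch_date"]

stream_row = {"user_id": "u1", "event_time": "2023-06-01T10:00:00", "amount": 9.9}
batch_row = {"user_id": "u1", "amount": 5.0, "batch_date": "2023-05-31"}

unioned = [align_to_schema(r, UNIFIED_SCHEMA) for r in (stream_row, batch_row)]
```

Once aligned, a daily aggregate is a single group-by over the unioned rows, regardless of which engine mode produced each one.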

Looking ahead, Ant adopts Paimon as the data‑lake component, enabling unified storage and near‑real‑time processing across ODS, ADM, and other layers, simplifying the overall data‑engineering workflow while supporting multi‑granularity scheduling (daily, hourly, real‑time).

Tags: big data, Flink, stream processing, data quality, real-time data warehouse, data lake
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
