Evolution of Real‑Time Data Warehouses: From 1.0 to 3.0 and the Road to Batch‑Stream Unified Architecture
The article reviews the current state of offline Hive‑based data warehouses, explains the emergence of real‑time data warehouses (1.0) built on Kafka and Flink, discusses their limitations, and outlines the progression toward batch‑stream unified architectures (2.0 and 3.0) leveraging data‑lake technologies such as Iceberg.
Data processing today is dominated by mature offline Hive‑based data warehouses, but growing business demand for real‑time reports has shifted industry focus to building real‑time data warehouses and to batch‑stream unified architectures.
Real‑time data processing can be divided into two scenarios: ultra‑low‑latency (seconds or milliseconds) for monitoring and dashboards, and minute‑level latency (10‑30 minutes) for most real‑time reports. The former often writes Flink‑processed results directly to MySQL, Elasticsearch, HBase, Druid, Kudu, etc.
The latter follows a traditional warehouse layering approach and is referred to as a "real‑time warehouse". The prevailing architecture (real‑time warehouse 1.0) combines Kafka and Flink, as illustrated below:
This architecture still respects the classic warehouse layers: data enters the ODS layer, is cleaned and merged into the DWD detail layer, lightly aggregated into the DWS layer, and finally organized into the ADS application layer for user‑profile and reporting use cases.
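As a rough, hypothetical sketch of the layering described above, plain Python stands in for what would really be Flink or SQL jobs; the table names and records here are invented for illustration:

```python
# Hypothetical sketch of the ODS -> DWD -> DWS -> ADS flow described above.
# Real pipelines would be Flink/Spark jobs; plain Python is used for clarity.

ods_orders = [  # raw events landed in the ODS layer
    {"order_id": 1, "user": "a", "amount": 30.0, "valid": True},
    {"order_id": 1, "user": "a", "amount": 30.0, "valid": True},   # duplicate
    {"order_id": 2, "user": "b", "amount": -5.0, "valid": False},  # bad record
    {"order_id": 3, "user": "a", "amount": 20.0, "valid": True},
]

def to_dwd(rows):
    """Clean and deduplicate ODS data into the DWD detail layer."""
    seen, out = set(), []
    for r in rows:
        if r["valid"] and r["amount"] > 0 and r["order_id"] not in seen:
            seen.add(r["order_id"])
            out.append(r)
    return out

def to_dws(rows):
    """Lightly aggregate DWD rows into per-user summaries (DWS layer)."""
    agg = {}
    for r in rows:
        agg[r["user"]] = agg.get(r["user"], 0.0) + r["amount"]
    return agg

def to_ads(dws):
    """Shape DWS aggregates into an application-facing report (ADS layer)."""
    return sorted(dws.items(), key=lambda kv: kv[1], reverse=True)

dwd = to_dwd(ods_orders)
ads = to_ads(to_dws(dwd))
print(ads)
```

The point is only the shape of the flow: each layer consumes the previous one, and each step narrows raw events toward report-ready output.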
The upper part of the diagram shows the offline data flow, while the lower part depicts the real‑time flow; many companies deviate from this strict layering in practice.
However, the Kafka+Flink real‑time warehouse suffers from several clear drawbacks:
Kafka cannot store massive amounts of data; it typically retains only a few days or weeks.
Kafka does not support efficient OLAP queries, making ad‑hoc analysis difficult.
Existing offline lineage, quality, and governance tools cannot be reused, requiring a separate implementation.
The Lambda architecture incurs high maintenance cost due to duplicated data, schemas, and processing logic.
Kafka supports only append‑only writes and lacks the update/upsert capability needed to correct late‑arriving data.
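The last drawback can be made concrete with a small illustration (pure Python, not real Kafka or lake APIs): when a delayed correction arrives, an append‑only log accumulates both versions, while an upsert‑capable table keyed by primary key simply replaces the row.

```python
# Illustration (invented data structures, not real Kafka/Iceberg APIs) of
# append-only vs upsert semantics for a late-arriving correction to order 2.

append_only_log = []   # Kafka-style topic: records can only be appended
upsert_table = {}      # lake-style table keyed by primary key

def write(record):
    append_only_log.append(record)             # correction becomes a new event
    upsert_table[record["order_id"]] = record  # correction replaces the old row

write({"order_id": 2, "amount": 10.0})
write({"order_id": 2, "amount": 12.5})  # delayed correction arrives

# The log holds both versions; every consumer must deduplicate downstream.
assert len(append_only_log) == 2
# The table holds one corrected row; readers see the fixed value directly.
assert upsert_table[2]["amount"] == 12.5
```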
To address these issues, the industry is moving toward a batch‑stream unified approach (real‑time warehouse 2.0) that relies on a unified storage layer—commonly a data lake built on Iceberg, Hudi, or Delta. The following diagram shows a typical Iceberg‑based architecture:
Unifying storage on a data lake addresses the first four of these problems: large‑scale storage on HDFS, OLAP query support via compatible engines, reuse of existing lineage and quality frameworks, and a single schema and pipeline that eliminates the duplicated Lambda paths. The fifth gap, upsert support for delayed data, is closed by capabilities of the lake format itself.
Iceberg also provides essential capabilities for a true real‑time warehouse:
Streaming writes with incremental pull, allowing downstream Flink jobs to consume exactly the newly written files.
Mechanisms to compact small files, mitigating the small‑file problem.
Batch and streaming upsert/delete support, enabling corrections of delayed data.
A rich OLAP ecosystem (Hive, Spark, Presto, Impala) for high‑performance multidimensional queries.
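The first capability, incremental pull, can be sketched as follows. This is a conceptual model of snapshot‑based incremental reads in the spirit of Iceberg's design; the classes and fields are invented for illustration and are not the real Iceberg API:

```python
# Sketch of snapshot-based incremental pull: each commit produces a snapshot
# recording the data files it added, so a consumer that remembers the last
# snapshot it processed can read exactly the newly written files.

class Table:
    def __init__(self):
        self.snapshots = []  # each snapshot lists the data files it added

    def commit(self, new_files):
        self.snapshots.append({"id": len(self.snapshots) + 1,
                               "added": list(new_files)})

def incremental_files(table, last_seen_id):
    """Return only the files added by snapshots after last_seen_id."""
    files = []
    for snap in table.snapshots:
        if snap["id"] > last_seen_id:
            files.extend(snap["added"])
    return files

t = Table()
t.commit(["f1.parquet", "f2.parquet"])  # streaming write, snapshot 1
t.commit(["f3.parquet"])                # streaming write, snapshot 2

# A downstream Flink job that already consumed snapshot 1 pulls only the delta.
assert incremental_files(t, last_seen_id=1) == ["f3.parquet"]
```

This is what lets a downstream streaming job treat the lake table itself as an incremental source instead of re‑scanning the full dataset.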
Some Iceberg features are still under development, and future articles will dive deeper into its implementation.
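The small‑file compaction mentioned in the list above amounts to grouping many small data files and rewriting each group as one file near a target size. A minimal sketch of such a compaction planner (pure Python; file names, sizes, and the 128 MB target are hypothetical):

```python
# Rough sketch of small-file compaction planning: greedily bin-pack files
# into groups whose combined size stays near a target, so each group can be
# rewritten as a single larger file. Sizes are in MB and are hypothetical.

TARGET_MB = 128

def plan_compaction(files, target=TARGET_MB):
    """Group files so each rewritten output is close to the target size."""
    groups, current, size = [], [], 0
    for name, mb in sorted(files, key=lambda f: f[1]):
        if size + mb > target and current:
            groups.append(current)
            current, size = [], 0
        current.append(name)
        size += mb
    if current:
        groups.append(current)
    return groups

small_files = [("a.parquet", 8), ("b.parquet", 16), ("c.parquet", 120),
               ("d.parquet", 4), ("e.parquet", 100)]
print(plan_compaction(small_files))
```

A real table format schedules these rewrites as background maintenance, trading rewrite I/O for fewer files per query scan.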
When the compute engine also achieves batch‑stream unification, we enter real‑time warehouse 3.0. For Spark‑centric companies, adopting 2.0 effectively yields 3.0, since Spark already unifies batch and streaming. The Spark‑based 3.0 architecture looks like this:
If Flink matures in batch processing, a Flink‑based 3.0 architecture would appear as follows:
In the author's view, most companies are still on real‑time warehouse 1.0 today; over the next one to two years, 2.0 will become the mainstream as data‑lake technologies mature, and 3.0 will follow as compute engines achieve full batch‑stream unification.
Author Bio
Zi He, a big‑data development engineer at NetEase, has long been engaged in distributed KV databases, time‑series databases, and core big‑data components.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies