How to Build a Real-Time Data Warehouse: Architectures, Challenges, and Industry Practices
This article examines the growing demand for real‑time data warehouses, compares mature streaming frameworks, evaluates Lambda, Kappa and hybrid architectures, reviews industry implementations from Didi and OPPO, and proposes a standard‑layer + stream + data‑lake solution with Apache Paimon, Hudi, and Iceberg.
Real-Time Data Warehouse Construction Background
Companies increasingly need real‑time data for product decisions and internal governance, but traditional offline warehouses operate on a T+1 schedule with daily batch jobs, which cannot meet low‑latency requirements.
1. Urgent Real‑Time Demand
Business scenarios now require sub‑hour or second‑level data freshness, making the classic offline approach insufficient.
2. Maturing Real‑Time Technologies
Streaming frameworks have evolved through three generations—Storm, Spark Streaming, and Flink—allowing SQL‑based development and tighter integration with offline warehouse designs. Development platforms also provide better support for debugging and operations, reducing costs.
Purpose of Building a Real‑Time Warehouse
1. Solve Traditional Warehouse Issues
The goal is to combine classic warehouse theory with streaming techniques to overcome the low timeliness of offline data.
Business decisions increasingly depend on real‑time data.
Lack of standards for real‑time data leads to poor usability and resource waste.
Platform tools now support real‑time development, lowering costs.
2. Real‑Time Warehouse Use Cases
Real‑time OLAP analysis / interactive queries
Real‑time dashboards
Real‑time business monitoring
Real‑time metric aggregation
Real‑time data service APIs
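Several of these use cases (dashboards, monitoring, metric aggregation) reduce to windowed aggregation over an event stream. A minimal sketch of the tumbling-window counting a stream engine such as Flink performs internally; the function name and event shape are illustrative, not any engine's API:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Assign each (timestamp, key) event to a fixed-size tumbling window
    and count occurrences per (window_start, key)."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "click"), (30, "click"), (65, "click"), (70, "view")]
print(tumbling_window_counts(events))
# {(0, 'click'): 2, (60, 'click'): 1, (60, 'view'): 1}
```

A real engine additionally handles watermarks, late data, and incremental state, but the window-assignment logic is the same.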
Real‑Time Warehouse Architecture Design
The data‑warehouse concept was popularized by Bill Inmon in the early 1990s. As data volumes exploded, big‑data tools replaced classic warehouse components, forming the offline big‑data architecture.
As real‑time requirements grew, an acceleration layer was added on top of the offline architecture, creating the Lambda architecture.
Later, with more event‑driven sources, the architecture shifted to a Kappa model that treats streaming as the core.
1. Lambda Architecture
To meet real‑time metric needs, a streaming pipeline is added to the offline warehouse, ingesting data via message queues and performing incremental calculations before merging with batch results.
Maintains two codebases for batch and stream processing.
Uses stream engines (e.g., Flink) for real‑time data and batch engines (e.g., Spark) for offline data.
Duplicate logic increases resource consumption.
Requires many components (Hadoop, Hive, Spark, Oozie, Flink, Kafka, Kudu, etc.), raising operational complexity.
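The defining step of Lambda is merging the batch view with the speed-layer view at query time. A minimal sketch of that merge, assuming views keyed by (time bucket, metric) and a watermark marking how far the nightly batch job has processed (all names are illustrative):

```python
def merge_views(batch_view, speed_view, batch_watermark):
    """Lambda-style serving: batch results are authoritative up to the
    batch watermark; the speed layer fills in buckets the batch job
    has not covered yet."""
    merged = dict(batch_view)
    for (bucket, key), value in speed_view.items():
        if bucket > batch_watermark:          # only data beyond the batch view
            merged[(bucket, key)] = merged.get((bucket, key), 0) + value
    return merged

batch = {(1, "orders"): 100, (2, "orders"): 120}   # nightly job, through day 2
speed = {(2, "orders"): 5, (3, "orders"): 40}      # streaming job, partly overlapping
print(merge_views(batch, speed, batch_watermark=2))
# {(1, 'orders'): 100, (2, 'orders'): 120, (3, 'orders'): 40}
```

The overlapping day-2 speed value is discarded because the batch result supersedes it; this is exactly the duplicated logic the architecture is criticized for.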
2. Kappa Architecture
Kappa simplifies Lambda by converting all sources to streams and using a single streaming engine for both batch and real‑time processing, reducing operational overhead.
Kappa is essentially Lambda without the batch part.
Historical reprocessing throughput is lower than batch but can be mitigated by adding resources.
Challenges include data loss, out‑of‑order data, and schema synchronization.
Migrating legacy offline data is also a concern.
3. Hybrid Architecture
Completely replacing offline ETL with streaming is risky; many organizations adopt a hybrid approach, using both Lambda and Kappa where appropriate.
(Diagram: Lambda, Kappa, and hybrid architectures compared.)
4. Deep Dive into Real‑Time Warehouse Architecture
Real‑Time Query Requirements
Understanding industry demands helps evaluate design trade‑offs and maximize value under existing constraints.
Real‑time scenarios are split into two categories: sub‑second/millisecond monitoring and alerting, and minute‑level reporting (e.g., 10‑30 minutes).
Common solutions include:
Lambda architecture
Kappa architecture
Standard layer + stream + batch
Standard layer + stream + data lake
Full‑scene MPP databases (e.g., ClickHouse, Doris)
Solution 1: Kappa
Data from multiple sources is sent to Kafka, processed by Flink, and written to MySQL/Elasticsearch/HBase/Druid for downstream queries.
Advantages: simple design, real‑time data.
Disadvantages: each new report requires a new Flink job; large data volumes demand sizable Flink clusters and high memory usage.
Solution 2: Standard Layer + Stream
To reduce maintenance cost, data is organized into ODS, DWD, DWS, ADS layers. Raw data lands in ODS, Flink performs real‑time cleaning and transformation to produce DWD, which is then streamed to Kafka. DWS aggregates lightly, and ADS serves business‑specific applications.
Pros: clear data responsibilities per layer.
Cons: multiple Flink jobs increase complexity; heavy Kafka usage raises load; schema management is cumbersome.
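The layered flow described above can be sketched end to end: ODS holds raw records, DWD cleans and normalizes them, DWS aggregates lightly. A minimal sketch assuming JSON event strings; the record shape and function names are illustrative:

```python
import json

def to_dwd(ods_record):
    """DWD: parse and clean one raw ODS record; drop malformed rows."""
    try:
        r = json.loads(ods_record)
        return {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
    except (ValueError, KeyError):
        return None   # bad record filtered out of the detail layer

def to_dws(dwd_records):
    """DWS: light aggregation over the cleaned detail records."""
    totals = {}
    for r in dwd_records:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

ods = ['{"user": " Alice ", "amount": "9.5"}',
       'not json',
       '{"user": "bob", "amount": "3"}',
       '{"user": "alice", "amount": "0.5"}']
dwd = [r for r in (to_dwd(x) for x in ods) if r is not None]
print(to_dws(dwd))   # {'alice': 10.0, 'bob': 3.0}
```

In the architecture discussed, each arrow between layers is a Flink job and each intermediate result lands in Kafka, which is precisely where the job-count and schema-management overhead comes from.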
Solution 3: Standard Layer + Stream + Batch
Combines real‑time and offline processing by adding Spark‑based batch jobs on HDFS to the streaming pipeline.
Pros: supports both real‑time OLAP and large‑scale offline analytics.
Cons: data quality management across two pipelines is complex; unifying schemas between the stream and batch paths is difficult; HDFS‑based storage does not support row‑level upserts.
Solution 4: Standard Layer + Stream + Data Lake
To address data‑quality and upsert issues, a unified stream‑batch data‑lake architecture based on Delta Lake / Hudi / Iceberg is adopted.
Iceberg, for example, is an open table format that decouples storage from compute engines, supports both batch and streaming reads and writes, and integrates with a broad OLAP ecosystem.
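The upsert capability these lake formats add is essentially a keyed merge of a CDC changelog into a table snapshot. A minimal sketch using Flink-style changelog markers (+I insert, +U update, -D delete); the row shape is illustrative:

```python
def upsert_merge(base_rows, changelog):
    """Merge a CDC changelog into a keyed table snapshot:
    +I/+U insert or overwrite the row for that key, -D removes it
    (last write wins, in changelog order)."""
    table = {row["id"]: row for row in base_rows}
    for op, row in changelog:
        if op == "-D":
            table.pop(row["id"], None)
        else:                       # "+I" insert or "+U" update
            table[row["id"]] = row
    return sorted(table.values(), key=lambda r: r["id"])

base = [{"id": 1, "city": "SZ"}, {"id": 2, "city": "BJ"}]
log = [("+U", {"id": 2, "city": "SH"}),
       ("-D", {"id": 1}),
       ("+I", {"id": 3, "city": "GZ"})]
print(upsert_merge(base, log))
# [{'id': 2, 'city': 'SH'}, {'id': 3, 'city': 'GZ'}]
```

Plain HDFS files cannot express this merge without rewriting whole partitions, which is why Solution 3 lists upserts as a weakness.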
Industry Real‑Time Warehouse Cases
1. Didi Ride‑Sharing Real‑Time Warehouse
Didi built a real‑time warehouse for its ride‑sharing business, achieving layered data (ODS, DWD, ADS), reduced resource consumption, and enriched data services.
2. OPPO Real‑Time Computing Platform
OPPO’s solution resembles the standard layer + stream model.
3. Didi Big Data Platform Architecture
Also follows the standard layer + stream approach.
Proposed Real‑Time Warehouse for "Micro‑Carp" Project
Based on the analysis, the recommended architecture is a standard‑layer system combined with stream processing and a data lake.
Current Warehouse Issues
Real‑time and offline warehouses are isolated, creating data islands.
Intermediate data is hard to query and debug.
Complex pipelines cause rollback difficulties.
Kudu uses its own storage engine and integrates poorly with HDFS and cloud object storage.
Planned New Architecture
Adopt Apache Paimon as the core lake format, supported by Flink for CDC, and integrate with OSS/S3/COS storage. Complement with Trino/Presto for OLAP and consider Doris/StarRocks for serving.
Technology Options
Apache Paimon
Provides fast ingestion, CDC support, and efficient real‑time analytics using LSM storage; compatible with Flink, Spark, Hive, Trino.
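The LSM storage mentioned here keeps data in sorted runs and periodically compacts them, with newer runs overriding older ones for the same key. A toy sketch of that deduplicating merge (Paimon's actual merge engines are configurable and far more involved):

```python
def compact(runs):
    """Merge LSM sorted runs into one: runs are ordered oldest-to-newest,
    so a key appearing in a newer run overwrites the older value."""
    merged = {}
    for run in runs:                # oldest first
        for key, value in run:
            merged[key] = value     # newer run wins on key collision
    return sorted(merged.items())

runs = [
    [("a", 1), ("c", 3)],           # oldest run
    [("b", 2), ("c", 30)],          # newer run: c is overwritten
]
print(compact(runs))   # [('a', 1), ('b', 2), ('c', 30)]
```

This structure is what lets Paimon absorb high-frequency keyed updates (e.g., from CDC) while still serving reasonably fast scans.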
Apache Hudi
Offers indexed updates, incremental queries, ACID transactions, and CDC ingestion.
Apache Iceberg
Standardized table format with schema evolution, partitioning, snapshotting, and broad engine support.
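Schema evolution in table formats like Iceberg is resolved at read time: old data files are projected onto the current schema, with later-added columns filled from defaults. A simplified sketch of that projection (column names and defaults are illustrative; Iceberg actually tracks columns by ID, not name):

```python
def read_with_schema(rows, schema):
    """Schema evolution on read: project each stored row onto the current
    schema, filling columns added after the row was written with defaults."""
    return [{col: row.get(col, default) for col, default in schema}
            for row in rows]

# Rows written before the "channel" column was added:
old_rows = [{"id": 1, "amount": 9.5}]
new_rows = [{"id": 2, "amount": 3.0, "channel": "app"}]
schema = [("id", None), ("amount", 0.0), ("channel", "unknown")]
print(read_with_schema(old_rows + new_rows, schema))
# [{'id': 1, 'amount': 9.5, 'channel': 'unknown'},
#  {'id': 2, 'amount': 3.0, 'channel': 'app'}]
```

Because no data files are rewritten when the schema changes, adding or renaming columns is a metadata-only operation.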
Migration Plan
Phase 1: Introduce Paimon, test ingestion performance for event data and CDC.
Phase 2: Migrate selected jobs, validate stability in production.
Phase 3: Migrate all workloads and retire legacy components (Kudu, HBase, Druid, Impala).
Summary
The article surveys mainstream real‑time warehouse designs, compares their trade‑offs, and concludes that a standard‑layer + stream + data‑lake architecture best fits the company’s needs, with a phased migration to Apache Paimon and related ecosystem components.
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.