Real‑Time Data Warehouse Practices with Apache Kudu: Architecture, Partitioning, and Platformization
This article reviews the challenges of building a real‑time data warehouse, compares the Lambda and Kappa architectures, introduces Apache Kudu’s master/tablet‑server architecture, storage model, and partition strategies, and shares practical experience and future directions for a Kudu‑based streaming analytics platform.
Data only creates value when it is acted on, and fresh, short‑lived data is often the most valuable. The article begins with the background of real‑time data warehousing at Tongcheng Travel, emphasizing the need for CDC (change data capture) support and the exploration of candidate solutions.
1.1 Accelerated T+1 – Raising batch task frequency from daily (T+1) to hourly and even 5‑minute intervals improves data freshness, but introduces bottlenecks: task delays cascade, scheduling requirements rise, resources must be reserved for peak runs, and tasks demand continuous optimization.
1.2 Lambda Architecture – Splits the warehouse into offline (batch) and real‑time (stream) layers, processing the same data twice; however, it suffers from high development cost, consistency issues, and extra storage for stream processing.
1.3 Kappa Architecture – Uses a message queue as storage and stream processing to avoid the dual‑framework cost of Lambda, yet it has weak replay capability, limited OLAP support, and potential data inconsistency.
1.4 Other Approaches – Briefly mentions Elasticsearch (good for search, poor for large scans and joins), HBase+Phoenix (OLTP‑oriented, weak for OLAP), and Kylin streaming cubes (poor CDC support).
1.5 Real‑time Warehouse Requirements – Lists five requirements: ACID & schema changes, upsert support, batch‑stream read/write, unified storage, and OLAP query capability.
2 Kudu Introduction – Defines Apache Kudu as an open‑source distributed storage engine for fast analytics on fast‑changing data.
2.2 Kudu Architecture – Master/slave design in which Master nodes manage metadata (catalog and tablet locations) while Tablet Servers store table data; each tablet’s replicas use the Raft consensus protocol to elect a leader, with followers replicating the leader’s writes.
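The Raft election mentioned above boils down to a majority vote across a tablet’s replicas. A toy sketch of that quorum rule (hypothetical helper, not Kudu code):

```python
def elect_leader(votes):
    """Toy Raft-style election: a candidate becomes leader only if it
    receives votes from a strict majority of the replica set."""
    quorum = len(votes) // 2 + 1
    granted = sum(1 for v in votes if v)
    return granted >= quorum

# With the typical 3 tablet replicas, 2 votes form a quorum.
assert elect_leader([True, True, False])
assert not elect_leader([True, False, False])
```

This majority requirement is why Kudu tables are usually deployed with an odd replication factor: a 3‑replica tablet stays available through one failure.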
2.3 Underlying Data Model – Describes Table/Tablet/Replica hierarchy, RowSets (MemRowSets in memory using B‑trees, flushed to DiskRowSets), and column‑oriented storage similar to Parquet, enabling OLAP.
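The MemRowSet‑to‑DiskRowSet flush can be illustrated with a toy sketch: rows accumulate sorted by primary key in memory (Kudu uses a concurrent B‑tree; a sorted list stands in here), then are rewritten into column‑oriented arrays on flush. All names below are hypothetical, not Kudu’s actual classes.

```python
import bisect

class MemRowSet:
    """Toy in-memory row set, kept sorted by primary key."""
    def __init__(self, limit=4):
        self.keys, self.rows, self.limit = [], {}, limit

    def upsert(self, key, row):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = row

    def full(self):
        return len(self.keys) >= self.limit

def flush_to_diskrowset(mrs):
    """Flush: rewrite the sorted rows into column-oriented arrays,
    mimicking the Parquet-like DiskRowSet layout."""
    cols = {}
    for key in mrs.keys:
        for col, val in mrs.rows[key].items():
            cols.setdefault(col, []).append(val)
    return {"keys": list(mrs.keys), "columns": cols}

mrs = MemRowSet()
for k, v in [(3, {"city": "SH"}), (1, {"city": "BJ"}), (2, {"city": "SZ"})]:
    mrs.upsert(k, v)
drs = flush_to_diskrowset(mrs)
# Rows come out primary-key-sorted and column-oriented.
assert drs["keys"] == [1, 2, 3]
assert drs["columns"]["city"] == ["BJ", "SZ", "SH"]
```

The columnar layout after flush is what lets analytical scans read only the columns they need, which is the basis for the OLAP capability mentioned above.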
2.4 Partition Strategies – Supports hash, range, and hybrid partitions; includes illustrative diagrams for each strategy.
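The hybrid strategy combines both dimensions: a row is hashed into a bucket by one column and placed in a range partition by another. A toy router sketch (Kudu hashes encoded keys with a seeded hash function; Python’s built‑in `hash` is only a stand‑in, and the column names are illustrative):

```python
def route(row, hash_buckets, range_bounds):
    """Toy hybrid partitioner: hash the id column into N buckets and
    pick a range partition by the date column. range_bounds are sorted
    exclusive upper bounds; rows past the last bound fall in a final,
    unbounded range. Use integer ids so hash() stays deterministic."""
    bucket = hash(row["id"]) % hash_buckets      # hash dimension
    for i, upper in enumerate(range_bounds):     # range dimension
        if row["date"] < upper:
            return (bucket, i)
    return (bucket, len(range_bounds))

# id 42 -> bucket 2 of 4; mid-June lands in the Apr-Jul range (index 1).
part = route({"id": 42, "date": "2021-06-15"}, 4, ["2021-04-01", "2021-07-01"])
assert part == (2, 1)
```

Hashing spreads write load evenly across tablets, while the range dimension lets old time ranges be dropped or scanned in isolation, which is why the hybrid scheme is the common choice for time‑series fact tables.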
3 Kudu‑Based Real‑time Warehouse – Shows the end‑to‑end solution: data collected via Kafka, ingested by Flink into Kudu, transformed by Spark, and queried ad‑hoc via Trino/Presto.
3.1 Platformization – Provides a unified SQL‑based platform: data ingestion layer (Flink SQL), ETL layer (Spark SQL with up to 5‑minute frequency), and ad‑hoc layer (federated queries across Kudu and HDFS).
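The three‑layer split above amounts to one SQL front end that dispatches each job to the engine owning its layer. A hypothetical sketch of that routing idea (the mapping is from the article; the function and names are illustrative, not the platform’s real API):

```python
# Engine ownership per layer, as described for the platform.
ENGINE_BY_LAYER = {
    "ingestion": "Flink SQL",   # Kafka -> Kudu streaming ingestion
    "etl": "Spark SQL",         # micro-batch, down to 5-minute frequency
    "ad_hoc": "Trino/Presto",   # federated queries over Kudu + HDFS
}

def dispatch(layer, sql):
    """Return the engine a SQL job for the given layer is handed to."""
    engine = ENGINE_BY_LAYER.get(layer)
    if engine is None:
        raise ValueError(f"unknown layer: {layer}")
    return engine, sql

assert dispatch("etl", "INSERT INTO dw.orders_agg SELECT ...")[0] == "Spark SQL"
```

Keeping SQL as the single user‑facing interface is what lets the platform swap or upgrade engines per layer without rewriting user jobs.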
3.2 Practical Experience – Highlights importance of table design (range + hash partitioning), warns about data bloat from frequent full‑refresh ETL, and advises reading from leader replicas to avoid follower‑read errors.
3.3 Summary – Kudu combines OLTP‑style random reads/writes with OLAP scan performance, but it demands SSDs, ample memory, and platform support; it lacks native HDFS integration, so historical data stays in the data lake and is reached through federated queries.
4 Future Outlook – Plans include migrating Kudu to Kubernetes, unifying SQL dialects across Flink, Spark, and Presto, and integrating Kudu into a lake‑warehouse architecture.
Tongcheng Travel Technology Center