Real‑Time Data Warehouse Practices with Apache Kudu: Architecture, Partitioning, and Platformization
This article reviews the challenges of building a real‑time data warehouse, compares the Lambda and Kappa architectures, introduces Apache Kudu’s master/tablet‑server architecture, storage model, and partition strategies, and shares practical experience and future directions for a Kudu‑based streaming analytics platform.
Data only creates value when it is acted on, and fresh, short‑lived data is often the most valuable. The article begins with the background of real‑time data warehousing at Tongcheng Travel, emphasizing the need for CDC (change data capture) support and the exploration of candidate solutions.
1.1 Accelerated T+1 – Raising batch task frequency from daily (T+1) to hourly and even 5‑minute intervals improves data freshness, but introduces bottlenecks: task delays cascade, scheduling requirements rise, resources must be reserved for peak runs, and tasks demand continuous optimization.
1.2 Lambda Architecture – Splits the warehouse into offline (batch) and real‑time (stream) layers, processing the same data twice; however, it suffers from high development cost, consistency issues, and extra storage for stream processing.
1.3 Kappa Architecture – Uses a message queue as storage and stream processing to avoid the dual‑framework cost of Lambda, yet it has weak replay capability, limited OLAP support, and potential data inconsistency.
1.4 Other Approaches – Briefly mentions Elasticsearch (good for search, poor for large scans and joins), HBase+Phoenix (OLTP‑oriented, weak for OLAP), and Kylin streaming cubes (poor CDC support).
1.5 Real‑time Warehouse Requirements – Lists five requirements: ACID & schema changes, upsert support, batch‑stream read/write, unified storage, and OLAP query capability.
2 Kudu Introduction – Defines Apache Kudu as an open‑source distributed storage engine for fast analytics on fast‑changing data.
2.2 Kudu Architecture – Master/slave design in which Master nodes manage metadata (catalog and tablet locations) while Tablet Servers store table data; each tablet’s replicas use the Raft consensus protocol to elect a leader, with followers replicating the leader’s writes.
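The Raft election mentioned above boils down to a majority vote across a tablet’s replicas. A toy sketch of that quorum rule (hypothetical helper, not Kudu code):

```python
def elect_leader(votes):
    """Toy Raft-style election: a candidate becomes leader only if it
    receives votes from a strict majority of the replica set."""
    quorum = len(votes) // 2 + 1
    granted = sum(1 for v in votes if v)
    return granted >= quorum

# With the typical 3 tablet replicas, 2 votes form a quorum.
assert elect_leader([True, True, False])
assert not elect_leader([True, False, False])
```

This majority requirement is why Kudu tables are usually deployed with an odd replication factor: a 3‑replica tablet stays available through one failure.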
2.3 Underlying Data Model – Describes Table/Tablet/Replica hierarchy, RowSets (MemRowSets in memory using B‑trees, flushed to DiskRowSets), and column‑oriented storage similar to Parquet, enabling OLAP.
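The MemRowSet‑to‑DiskRowSet flush can be illustrated with a toy sketch: rows accumulate sorted by primary key in memory (Kudu uses a concurrent B‑tree; a sorted list stands in here), then are rewritten into column‑oriented arrays on flush. All names below are hypothetical, not Kudu’s actual classes.

```python
import bisect

class MemRowSet:
    """Toy in-memory row set, kept sorted by primary key."""
    def __init__(self, limit=4):
        self.keys, self.rows, self.limit = [], {}, limit

    def upsert(self, key, row):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = row

    def full(self):
        return len(self.keys) >= self.limit

def flush_to_diskrowset(mrs):
    """Flush: rewrite the sorted rows into column-oriented arrays,
    mimicking the Parquet-like DiskRowSet layout."""
    cols = {}
    for key in mrs.keys:
        for col, val in mrs.rows[key].items():
            cols.setdefault(col, []).append(val)
    return {"keys": list(mrs.keys), "columns": cols}

mrs = MemRowSet()
for k, v in [(3, {"city": "SH"}), (1, {"city": "BJ"}), (2, {"city": "SZ"})]:
    mrs.upsert(k, v)
drs = flush_to_diskrowset(mrs)
# Rows come out primary-key-sorted and column-oriented.
assert drs["keys"] == [1, 2, 3]
assert drs["columns"]["city"] == ["BJ", "SZ", "SH"]
```

The columnar layout after flush is what lets analytical scans read only the columns they need, which is the basis for the OLAP capability mentioned above.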
2.4 Partition Strategies – Supports hash, range, and hybrid partitions; includes illustrative diagrams for each strategy.
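The hybrid strategy combines both dimensions: a row is hashed into a bucket by one column and placed in a range partition by another. A toy router sketch (Kudu hashes encoded keys with a seeded hash function; Python’s built‑in `hash` is only a stand‑in, and the column names are illustrative):

```python
def route(row, hash_buckets, range_bounds):
    """Toy hybrid partitioner: hash the id column into N buckets and
    pick a range partition by the date column. range_bounds are sorted
    exclusive upper bounds; rows past the last bound fall in a final,
    unbounded range. Use integer ids so hash() stays deterministic."""
    bucket = hash(row["id"]) % hash_buckets      # hash dimension
    for i, upper in enumerate(range_bounds):     # range dimension
        if row["date"] < upper:
            return (bucket, i)
    return (bucket, len(range_bounds))

# id 42 -> bucket 2 of 4; mid-June lands in the Apr-Jul range (index 1).
part = route({"id": 42, "date": "2021-06-15"}, 4, ["2021-04-01", "2021-07-01"])
assert part == (2, 1)
```

Hashing spreads write load evenly across tablets, while the range dimension lets old time ranges be dropped or scanned in isolation, which is why the hybrid scheme is the common choice for time‑series fact tables.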
3 Kudu‑Based Real‑time Warehouse – Shows the end‑to‑end solution: data collected via Kafka, ingested by Flink into Kudu, transformed by Spark, and queried ad‑hoc via Trino/Presto.
3.1 Platformization – Provides a unified SQL‑based platform: data ingestion layer (Flink SQL), ETL layer (Spark SQL with up to 5‑minute frequency), and ad‑hoc layer (federated queries across Kudu and HDFS).
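The three‑layer split above amounts to one SQL front end that dispatches each job to the engine owning its layer. A hypothetical sketch of that routing idea (the mapping is from the article; the function and names are illustrative, not the platform’s real API):

```python
# Engine ownership per layer, as described for the platform.
ENGINE_BY_LAYER = {
    "ingestion": "Flink SQL",   # Kafka -> Kudu streaming ingestion
    "etl": "Spark SQL",         # micro-batch, down to 5-minute frequency
    "ad_hoc": "Trino/Presto",   # federated queries over Kudu + HDFS
}

def dispatch(layer, sql):
    """Return the engine a SQL job for the given layer is handed to."""
    engine = ENGINE_BY_LAYER.get(layer)
    if engine is None:
        raise ValueError(f"unknown layer: {layer}")
    return engine, sql

assert dispatch("etl", "INSERT INTO dw.orders_agg SELECT ...")[0] == "Spark SQL"
```

Keeping SQL as the single user‑facing interface is what lets the platform swap or upgrade engines per layer without rewriting user jobs.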
3.2 Practical Experience – Highlights importance of table design (range + hash partitioning), warns about data bloat from frequent full‑refresh ETL, and advises reading from leader replicas to avoid follower‑read errors.
3.3 Summary – Kudu combines OLTP‑style random reads/writes with OLAP scan performance, but it demands SSDs, ample memory, and platform support; it lacks native HDFS integration, so historical data stays in the data lake and is reached through federated queries.
4 Future Outlook – Plans include migrating Kudu to Kubernetes, unifying SQL dialects across Flink, Spark, and Presto, and integrating Kudu into a lake‑warehouse architecture.
Tongcheng Travel Technology Center