
Exploring Real-Time Data Lake Practices at Kangaroo Cloud

This article shares Kangaroo Cloud's exploration and practice of a real-time data lake, covering background, data lake concepts, challenges, solution architecture using the Shuzhan platform with Iceberg/Hudi, CDC ingestion, small file handling, cross-cluster ingestion, materialized view acceleration, and future development plans.


Kangaroo Cloud (袋鼠云) introduces its real‑time data lake initiative, beginning with a background overview of the company and the motivations for moving from a traditional lambda architecture to a unified lake architecture.

The Shuzhan (数栈) platform is presented as a self‑developed, one‑stop big‑data foundation that integrates with mainstream Hadoop distributions (Apache Hadoop, CDH, HDP) and provides both offline and real‑time development capabilities, along with data asset management and API services.

Key pain points of the previous lambda‑based solution are identified: duplicated storage/computation stacks, inefficient Kafka‑based streaming, and inconsistent data semantics between Spark and Flink jobs.

The real‑time data lake is described with four core capabilities—diverse analytics (batch, stream, interactive, ML), ACID transactions, comprehensive data management (formats, schemas), and scalable storage (HDFS, object storage)—which together reduce cost and improve efficiency.

Four open‑source lake table formats are compared (Iceberg, Hudi, Delta, Paimon), highlighting Hudi’s strong small‑file handling and transactional features, and noting Paimon’s emerging advantages.

Shuzhan’s lake solution combines the ChunJun CDC component for real‑time ingestion from RDBMS, Flink for unified batch‑stream processing, and the EasyLake management console for lake governance, achieving a true batch‑stream integrated storage layer.
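The core of CDC ingestion into a lake table is keyed upsert/delete semantics: each change event carries an operation type and a primary key, and the table converges to the latest row image per key. ChunJun's actual implementation emits Flink records and relies on Hudi/Iceberg commit machinery; the stdlib-only sketch below (with hypothetical names `ChangeEvent` and `apply_cdc`) only illustrates those semantics on an in-memory snapshot.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    op: str              # "insert" | "update" | "delete"
    key: int             # primary key in the source table
    row: Optional[dict]  # full row image; None for deletes

def apply_cdc(table: dict, events: list) -> dict:
    """Fold a batch of CDC change events into an in-memory table snapshot."""
    for ev in events:
        if ev.op in ("insert", "update"):
            table[ev.key] = ev.row      # upsert: last write per key wins
        elif ev.op == "delete":
            table.pop(ev.key, None)     # tolerate deletes of absent keys
    return table
```

Because the fold is idempotent per key, replaying a batch after a failure (as Flink does between checkpoints) leaves the snapshot in the same final state.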

Operational challenges such as small‑file explosion are mitigated by tuning checkpoint intervals (1‑5 minutes) and implementing EasyLake‑driven small‑file governance, while cross‑cluster ingestion is enabled through multi‑cluster Hadoop support.
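A small-file governance pass typically scans a table's data files and bin-packs the undersized ones into compaction tasks. The sketch below is a minimal planner in that spirit; the 128 MB target and the function names are illustrative assumptions, not EasyLake's actual configuration.

```python
TARGET_BYTES = 128 * 1024 * 1024  # assumed target output file size

def plan_compaction(files: dict, target: int = TARGET_BYTES) -> list:
    """Group files smaller than `target` into compaction tasks whose
    combined size stays at or below the target output size."""
    small = sorted((n for n, s in files.items() if s < target), key=files.get)
    tasks, current, current_size = [], [], 0
    for name in small:
        if current and current_size + files[name] > target:
            tasks.append(current)           # close the full bin
            current, current_size = [], 0
        current.append(name)
        current_size += files[name]
    if current:
        tasks.append(current)
    # rewriting a single file gains nothing, so drop one-file tasks
    return [t for t in tasks if len(t) > 1]
```

Longer checkpoint intervals reduce how often such small files are committed in the first place, which is why the 1–5 minute tuning and the compaction pass complement each other.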

Materialized view acceleration is explored: views are managed as special tables, with automatic matching and rewrite using inverted‑index techniques, and future support is planned for Spark, Trino, and Flink to create and refresh these views.
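The inverted-index idea can be pictured as a map from each base table to the materialized views that read it: given an incoming query's table set, only views indexed under those tables need to be checked for coverage. The sketch below uses hypothetical names (`MVCatalog`, `register`, `candidates`) and checks only table-set containment; a real rewrite must also verify column and predicate coverage.

```python
from collections import defaultdict

class MVCatalog:
    def __init__(self):
        self._by_table = defaultdict(set)  # base table -> names of MVs reading it
        self._tables = {}                  # MV name -> its set of base tables

    def register(self, mv_name: str, base_tables: set):
        """Index a materialized view under each of its base tables."""
        self._tables[mv_name] = set(base_tables)
        for t in base_tables:
            self._by_table[t].add(mv_name)

    def candidates(self, query_tables: set):
        """Return MVs whose base tables are a subset of the query's tables."""
        hits = set()
        for t in query_tables:
            hits |= self._by_table.get(t, set())
        return sorted(m for m in hits if self._tables[m] <= set(query_tables))
```

The lookup touches only views sharing at least one table with the query, so the candidate set stays small even with many registered views.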

The roadmap includes improving platform usability (visual snapshot management), integrating Paimon for better streaming performance, enhancing lake ingestion performance, and strengthening data security for multi‑engine query scenarios.

A Q&A section addresses practical concerns such as Oracle read‑only support, latency expectations, schema evolution handling in Hudi, differences between Iceberg and Paimon, CDC capabilities, and the status of materialized view research.

Tags: Iceberg, CDC, Hudi, Materialized View, Cross-Cluster Ingestion, Real-Time Data Lake, Shuzhan Platform
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
