
Exploring Real-Time Data Lake Practices at Kangaroo Cloud

This article shares Kangaroo Cloud's exploration and practice of a real-time data lake, covering background, data lake concepts, challenges, solution architecture using the Shuzhan platform with Iceberg/Hudi, CDC ingestion, small file handling, cross-cluster ingestion, materialized view acceleration, and future development plans.


Kangaroo Cloud (袋鼠云) introduces its real‑time data lake initiative, beginning with a background overview of the company and the motivations for moving from a traditional lambda architecture to a unified lake architecture.

The Shuzhan (数栈) platform is presented as a self‑developed, one‑stop big‑data foundation that integrates with mainstream Hadoop distributions (Apache Hadoop, CDH, HDP) and provides both offline and real‑time development capabilities, along with data asset management and API services.

Key pain points of the previous lambda‑based solution are identified: duplicated storage/computation stacks, inefficient Kafka‑based streaming, and inconsistent data semantics between Spark and Flink jobs.

The real‑time data lake is described with four core capabilities—diverse analytics (batch, stream, interactive, ML), ACID transactions, comprehensive data management (formats, schemas), and scalable storage (HDFS, object storage)—which together reduce cost and improve efficiency.

Four open‑source lake table formats are compared (Iceberg, Hudi, Delta, Paimon), highlighting Hudi’s strong small‑file handling and transactional features, and noting Paimon’s emerging advantages.

Shuzhan’s lake solution combines the ChunJun CDC component for real‑time ingestion from RDBMS, Flink for unified batch‑stream processing, and the EasyLake management console for lake governance, achieving a true batch‑stream integrated storage layer.
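The core of CDC ingestion into a lake table is keyed upsert/delete semantics: each change event carries an operation type and a primary key, and the table converges to the latest row image per key. ChunJun's actual implementation emits Flink records and relies on Hudi/Iceberg commit machinery; the stdlib-only sketch below (with hypothetical names `ChangeEvent` and `apply_cdc`) only illustrates those semantics on an in-memory snapshot.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    op: str              # "insert" | "update" | "delete"
    key: int             # primary key in the source table
    row: Optional[dict]  # full row image; None for deletes

def apply_cdc(table: dict, events: list) -> dict:
    """Fold a batch of CDC change events into an in-memory table snapshot."""
    for ev in events:
        if ev.op in ("insert", "update"):
            table[ev.key] = ev.row      # upsert: last write per key wins
        elif ev.op == "delete":
            table.pop(ev.key, None)     # tolerate deletes of absent keys
    return table
```

Because the fold is idempotent per key, replaying a batch after a failure (as Flink does between checkpoints) leaves the snapshot in the same final state.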

Operational challenges such as small‑file explosion are mitigated by tuning checkpoint intervals (1‑5 minutes) and implementing EasyLake‑driven small‑file governance, while cross‑cluster ingestion is enabled through multi‑cluster Hadoop support.
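A small-file governance pass typically scans a table's data files and bin-packs the undersized ones into compaction tasks. The sketch below is a minimal planner in that spirit; the 128 MB target and the function names are illustrative assumptions, not EasyLake's actual configuration.

```python
TARGET_BYTES = 128 * 1024 * 1024  # assumed target output file size

def plan_compaction(files: dict, target: int = TARGET_BYTES) -> list:
    """Group files smaller than `target` into compaction tasks whose
    combined size stays at or below the target output size."""
    small = sorted((n for n, s in files.items() if s < target), key=files.get)
    tasks, current, current_size = [], [], 0
    for name in small:
        if current and current_size + files[name] > target:
            tasks.append(current)           # close the full bin
            current, current_size = [], 0
        current.append(name)
        current_size += files[name]
    if current:
        tasks.append(current)
    # rewriting a single file gains nothing, so drop one-file tasks
    return [t for t in tasks if len(t) > 1]
```

Longer checkpoint intervals reduce how often such small files are committed in the first place, which is why the 1–5 minute tuning and the compaction pass complement each other.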

Materialized view acceleration is explored: views are managed as special tables, with automatic matching and rewrite using inverted‑index techniques, and future support is planned for Spark, Trino, and Flink to create and refresh these views.
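The inverted-index idea can be pictured as a map from each base table to the materialized views that read it: given an incoming query's table set, only views indexed under those tables need to be checked for coverage. The sketch below uses hypothetical names (`MVCatalog`, `register`, `candidates`) and checks only table-set containment; a real rewrite must also verify column and predicate coverage.

```python
from collections import defaultdict

class MVCatalog:
    def __init__(self):
        self._by_table = defaultdict(set)  # base table -> names of MVs reading it
        self._tables = {}                  # MV name -> its set of base tables

    def register(self, mv_name: str, base_tables: set):
        """Index a materialized view under each of its base tables."""
        self._tables[mv_name] = set(base_tables)
        for t in base_tables:
            self._by_table[t].add(mv_name)

    def candidates(self, query_tables: set):
        """Return MVs whose base tables are a subset of the query's tables."""
        hits = set()
        for t in query_tables:
            hits |= self._by_table.get(t, set())
        return sorted(m for m in hits if self._tables[m] <= set(query_tables))
```

The lookup touches only views sharing at least one table with the query, so the candidate set stays small even with many registered views.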

The roadmap includes improving platform usability (visual snapshot management), integrating Paimon for better streaming performance, enhancing lake ingestion performance, and strengthening data security for multi‑engine query scenarios.

A Q&A section addresses practical concerns such as Oracle read‑only support, latency expectations, schema evolution handling in Hudi, differences between Iceberg and Paimon, CDC capabilities, and the status of materialized view research.

Tags: Iceberg, CDC, Hudi, Materialized View, Cross-Cluster Ingestion, Real-Time Data Lake, Shuzhan Platform
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
