
Exploring Real‑Time Data Lake Practices at Xiaohongshu Using Apache Iceberg

This article details Xiaohongshu's data platform architecture and three real‑time lake initiatives—log ingestion, CDC ingestion, and lake analysis—showcasing how Apache Iceberg, Flink, and custom shuffling algorithms solve small‑file and cross‑cloud challenges while enabling schema evolution and future multi‑cloud optimizations.

DataFunSummit

The presentation introduces Xiaohongshu's data platform, a multi‑cloud native stack: logs and RDBMS sources are collected, most data is stored in AWS S3, streaming is handled by Kafka and Flink, batch processing by Spark, Hive, and Presto, and serving by OLAP engines such as ClickHouse, StarRocks, and TiDB.

Three real‑time lake directions are covered:

Log data lake ingestion using Iceberg to replace the previous OSS‑to‑S3 pipeline, eliminating small‑file explosion by leveraging Iceberg's transactional writes and an EvenPartitionShuffle algorithm that balances partition load based on Fanout and Residual metrics.
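The talk names Fanout and Residual as the balancing metrics but does not spell out the algorithm, so the following is only a minimal sketch of the idea: fan a hot partition out across several writer subtasks when its weight exceeds the per‑writer average, and hand the residual (the share left over after an even split) to the last writer. All names here are illustrative, not Iceberg or Flink APIs.

```python
import math
from heapq import heapify, heappop, heappush

def even_partition_shuffle(partition_weights, num_writers):
    """Assign partitions to writer subtasks so load stays balanced.

    fanout  = how many writers share one partition (hot partitions get >1)
    residual = the remainder after splitting a partition's weight evenly,
               given to the last writer in the split.
    Illustrative sketch only -- not the talk's exact algorithm.
    """
    total = sum(partition_weights.values())
    avg = total / num_writers
    heap = [(0.0, w) for w in range(num_writers)]  # (current load, writer id)
    heapify(heap)
    assignment = {}  # partition -> [(writer_id, weight_share), ...]
    # Place the heaviest partitions first so small ones fill the gaps.
    for part, weight in sorted(partition_weights.items(), key=lambda kv: -kv[1]):
        fanout = min(max(1, math.ceil(weight / avg)), num_writers)
        share, residual = divmod(weight, fanout)
        picks = [heappop(heap) for _ in range(fanout)]  # least-loaded writers
        shares = []
        for i, (load, wid) in enumerate(picks):
            w_share = share + (residual if i == fanout - 1 else 0)
            shares.append((wid, w_share))
            heappush(heap, (load + w_share, wid))
        assignment[part] = shares
    return assignment
```

Because each partition maps to as few writers as its weight allows, each writer opens files for only a handful of partitions per checkpoint, which is what keeps the small‑file count down.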

CDC real‑time lake ingestion, where MySQL binlog is streamed via Kafka, Flink upserts into Iceberg with exactly‑once semantics, handling schema evolution by detecting column additions and dynamically restarting writers, while using hidden partitions and merge‑on‑read to reduce DeleteFile overhead.
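The column‑addition handling described above can be sketched in a few lines: diff the incoming CDC record's fields against the table schema, evolve the schema additively, and restart the writers so they pick up the new layout. This is a shape sketch with hypothetical names, not the Flink/Iceberg implementation.

```python
def detect_added_columns(table_columns, record):
    """Return fields present in the CDC record but absent from the table."""
    return sorted(set(record) - set(table_columns))

def apply_cdc_record(table_columns, record, restart_writer):
    """Additive schema evolution: on new columns, evolve and restart writers.

    `restart_writer` stands in for dynamically restarting the Flink
    Iceberg writers mentioned in the talk; names are illustrative.
    """
    added = detect_added_columns(table_columns, record)
    if added:
        table_columns.extend(added)  # evolve schema (column additions only)
        restart_writer()             # reload writers against the new schema
    return added
```

Records without new columns pass straight through, so the (expensive) restart only happens on an actual schema change.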

Real‑time lake analysis, exploring stream‑batch unified storage by dual‑writing Kafka data to Iceberg columnar tables, integrating Iceberg external tables with ClickHouse, and planning future query acceleration via data skipping and secondary indexes.
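The data‑skipping acceleration planned here boils down to file pruning on per‑file min/max statistics, which Iceberg keeps in its manifests. A minimal sketch of the pruning decision, with an illustrative in‑memory stats map rather than real manifest reads:

```python
def skip_files(file_stats, column, lo, hi):
    """Prune data files whose [min, max] range for `column` cannot
    overlap the predicate lo <= column <= hi.

    `file_stats` maps file path -> {column: (min, max)}; a stand-in for
    the column bounds Iceberg stores per data file.
    """
    kept = []
    for path, stats in file_stats.items():
        fmin, fmax = stats[column]
        if fmax >= lo and fmin <= hi:  # ranges overlap -> file must be scanned
            kept.append(path)
    return sorted(kept)
```

A secondary index would refine this further, skipping row groups inside the files that survive the min/max check.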

Multi‑cloud read/write challenges are addressed by selecting HiveCatalog for atomic metadata updates and by implementing a custom S3FileIO (with HTTPS client, API timeout, and credential provider tweaks) to achieve stable cross‑cloud reads and writes.
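The actual fix lives in a Java S3FileIO built on the AWS SDK, so the details are out of reach here; as a language‑neutral sketch of the resilience pattern it relies on, bounded retries with exponential backoff around a flaky cross‑cloud read look like this (`fetch` is a hypothetical stand‑in for an S3 GET):

```python
import time

def read_with_retry(fetch, attempts=3, backoff_s=0.0):
    """Retry a flaky cross-cloud read a bounded number of times.

    Shape sketch only: the real mitigation in the talk is a custom
    S3FileIO (HTTPS client, API timeout, credential provider tweaks).
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return fetch()
        except IOError as err:
            last_err = err
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise last_err
```

The bounded attempt count matters: an unbounded retry loop would stall Flink checkpoints instead of failing fast.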

Additional optimizations include lowering multipart upload (MPU) thresholds and Parquet row‑group sizes to reduce Flink checkpoint latency, handling ResetException by extending BufferedInputStream mark limits, and adopting progressive compaction strategies.
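The talk doesn't detail its progressive compaction strategy, but the core planning step is usually greedy bin‑packing: leave files already at target size alone and batch the small ones into rewrite tasks of roughly the target size. A minimal sketch under that assumption:

```python
def plan_compaction(file_sizes, target_bytes):
    """Group small files into compaction tasks of roughly target_bytes.

    Files at or above the target are skipped (no rewrite needed); the
    rest are binned greedily from smallest to largest. Illustrative
    sketch, not Iceberg's RewriteDataFiles planner.
    """
    groups, current, current_size = [], [], 0
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if size >= target_bytes:
            continue  # already big enough
        current.append(path)
        current_size += size
        if current_size >= target_bytes:
            groups.append(current)
            current, current_size = [], 0
    if len(current) > 1:  # rewriting a single leftover file gains nothing
        groups.append(current)
    return groups
```

Running such a planner incrementally after each commit, rather than in one big sweep, is what makes the compaction "progressive": each pass rewrites a bounded amount of data.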

The roadmap focuses on three pillars: storage (further Cloud‑Native FileIO improvements), compute (integrating more engines like Spark, ClickHouse, StarRocks, and adding indexing), and management (service‑ifying Iceberg maintenance jobs and intelligent scheduling).

Overall, the work demonstrates how Iceberg can power a high‑throughput, low‑latency real‑time data lake in a multi‑cloud environment, achieving GB/s ingest rates, million‑level RPS, and 30‑50% query latency reductions.

Tags: cloud-native, Big Data, Flink, Apache Iceberg, CDC, real-time data lake
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
