Building a Unified Streaming‑Batch Storage Architecture at Xiaohongshu
This article presents Xiaohongshu's design and implementation of a unified streaming-batch storage system that combines Kafka, Flink, Iceberg, and modern OLAP engines to address the pain points of the Lambda architecture and enable consistent, exactly-once analytics across streaming and batch workloads.
The talk focuses on building a unified streaming-batch storage solution for Xiaohongshu on top of stream storage and a data lake.
Speaker: Zhang Yihao, Message Queue Lead at Xiaohongshu (edited by Xu Jiangfeng, Shunwang Technology).
01. Lambda Architecture and Real‑time Data Warehouse Pain Points
Data platform overview: At the bottom are cloud resources (compute, storage). Above that is the data ingestion layer collecting raw logs, RDBMS change logs, and files. The storage and processing layer consists of two parts: offline (Hive, Spark on AWS S3) and near‑real‑time (Flink, Kafka). A data sharing layer stores aggregated, joined, and wide‑table data in engines such as ClickHouse, StarRocks, TiDB, and HBase. The top application layer provides reporting, ad‑hoc queries, and unified data products.
Typical Lambda practice at Xiaohongshu: Real‑time pipeline uses Flink and Kafka; offline pipeline uses S3, Spark, and Hive. The two pipelines are independent.
Real‑time warehouse pain points:
Inconsistent data between streaming and batch due to different engines, duplicated code, and differing TTLs.
Kafka lacks native data retrieval; users must rely on black-box consumer APIs or KSQL, neither of which is suitable for large-scale offline queries.
Stream storage has limited retention and high cost, making historical back‑tracking inefficient.
02. Unified Streaming‑Batch Storage Architecture Introduction
To address Lambda drawbacks, Xiaohongshu built a product called Morphing Server that provides a Kafka‑compatible API while asynchronously syncing data to an Iceberg data lake.
Key capabilities (6 points):
Unified stream and batch read/write.
Seamless, no‑change write interface (Kafka API) with automatic lake sync.
Schema parsing before the lake write, exposing data as tables.
High‑speed analysis via StarRocks (vectorized, CBO).
Exactly‑once semantics for both stream and batch.
Rollback support for batch data when schema changes.
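The capabilities above can be sketched as a single write/read path: records appended through a Kafka-style producer API become queryable rows once an asynchronous lake sync has parsed them against a schema. This is a minimal in-memory illustration; the class and method names (`UnifiedTopic`, `sync_to_lake`, `scan`) are hypothetical, not the actual Morphing Server API.

```python
import json

class UnifiedTopic:
    """Illustrative mock of a topic with a unified stream/batch view."""
    def __init__(self, name):
        self.name = name
        self.log = []           # stream side: raw Kafka-style records
        self.table = []         # batch side: schema-parsed rows in the "lake"
        self.synced_offset = 0  # high-water mark of the async lake sync

    def produce(self, value: bytes) -> int:
        """Kafka-compatible append; returns the record's offset."""
        self.log.append(value)
        return len(self.log) - 1

    def sync_to_lake(self):
        """Async syncer: parse pending records and expose them as table rows."""
        for raw in self.log[self.synced_offset:]:
            self.table.append(json.loads(raw))  # schema parsing before lake write
        self.synced_offset = len(self.log)

    def scan(self, predicate):
        """Batch-style scan over the table representation."""
        return [row for row in self.table if predicate(row)]

topic = UnifiedTopic("user_events")
topic.produce(b'{"user": "a", "action": "click"}')
topic.produce(b'{"user": "b", "action": "view"}')
topic.sync_to_lake()
clicks = topic.scan(lambda r: r["action"] == "click")
```

The point of the sketch is the separation of concerns: the write path stays byte-oriented and Kafka-compatible, while schema parsing happens once, on the sync path, so every downstream engine sees the same table.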
Design choices – Built‑in vs Extension:
Built‑in: A single process with an Iceberg Syncer thread that pulls logs and writes to Iceberg. Advantages: unified entry, no extra components, high resource utilization. Disadvantages: difficult cluster upgrades, weaker isolation.
Extension: Separate process outside Kafka, offering flexible integration, replaceable stream storage, and better isolation. Disadvantages: higher operational cost, more components, weaker product experience.
Query & analysis engine selection: Requirements – MPP architecture, vectorization, CBO, multi‑scenario support, scalability. Compared StarRocks (lake‑analysis, strong SQL compatibility) with Presto (better full‑scan but weaker aggregation). Chose StarRocks for its superior GroupBy performance.
RedCK (custom ClickHouse variant) architecture: Consists of Service (gateway, discovery), Query Processing (SQL parsing, plan generation), and Storage (HDFS, JuiceFS, OBS, COS). It reads/writes MergeTree format, enabling Spark/Flink to interact directly and providing OLAP capabilities.
Detailed component design:
Commit module: Manages Iceberg metadata, coordinates broker checkpoints, updates watermarks, and handles rollback.
Broker module: Fetches leader data from Kafka, parses the schema, and writes to Iceberg per partition using dedicated threads (ReplicaRemoteFetcherThread, DefaultSchemaTransform, IcebergRemoteLogStorageManager, Schema Server).
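The schema-transform step inside the broker can be illustrated as follows: each raw record is parsed and its fields cast to the column types declared for the topic. The schema format and function name here are assumptions for illustration, not the real Schema Server contract.

```python
import json

# Hypothetical topic schema: column name -> declared type.
SCHEMA = {"user_id": int, "event": str, "ts": int}

def transform(raw: bytes) -> dict:
    """Parse one record against the schema, casting each field to its column type."""
    obj = json.loads(raw)
    row = {}
    for field, ftype in SCHEMA.items():
        row[field] = ftype(obj[field])  # e.g. "42" -> 42 for an int column
    return row

# A record whose user_id arrived as a string is normalized on the way in.
row = transform(b'{"user_id": "42", "event": "click", "ts": 1700000000}')
```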
Exactly-once via two-phase commit:
1. The committer sends a checkpoint RPC to all brokers.
2. Brokers flush pending data to Iceberg and record the checkpoint.
3. Brokers ACK the commit.
4. The committer records the first-phase commit and checkpoint ID.
5. The committer issues the second-phase commit to make the data visible.
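The two-phase commit can be simulated in a few lines: brokers flush pending data on the checkpoint RPC and ACK, and only after all ACKs does the committer record phase one and then publish the data in phase two. Class and method names are illustrative, not the actual implementation.

```python
class Broker:
    """Simulated broker holding pending (unflushed) records."""
    def __init__(self):
        self.pending = []
        self.flushed = []  # written to Iceberg files but not yet visible

    def checkpoint(self, checkpoint_id):
        """Phase 1 RPC: flush pending data, record the checkpoint, and ACK."""
        self.flushed.extend(self.pending)
        self.pending.clear()
        return ("ack", checkpoint_id)

class Committer:
    """Simulated committer coordinating the two phases."""
    def __init__(self, brokers):
        self.brokers = brokers
        self.committed_ids = []  # durable record of first-phase commits
        self.visible = []        # data exposed to queries after phase 2

    def run_checkpoint(self, checkpoint_id):
        acks = [b.checkpoint(checkpoint_id) for b in self.brokers]  # phase 1
        if all(a == ("ack", checkpoint_id) for a in acks):
            self.committed_ids.append(checkpoint_id)  # record phase-1 commit
            for b in self.brokers:                    # phase 2: make visible
                self.visible.extend(b.flushed)
                b.flushed.clear()

brokers = [Broker(), Broker()]
brokers[0].pending = ["r1", "r2"]
brokers[1].pending = ["r3"]
committer = Committer(brokers)
committer.run_checkpoint(1)
```

Because data only becomes visible after every broker has ACKed, a failure at any earlier step leaves the committed snapshot untouched, which is what yields exactly-once semantics on the query side.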
Failure handling:
Broker failure – Kafka leader election continues; new broker resumes async write using last checkpoint.
Controller failure – New controller re‑executes pending phases using checkpoint storage.
Object store/HMS failure – Unlimited retries with alerting.
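The broker-failure path above amounts to resuming the async lake write from the last durable checkpoint offset, so no record is lost or duplicated. A minimal sketch, with all names hypothetical:

```python
# The Kafka partition log and the lake-side state.
log = [f"record-{i}" for i in range(10)]
lake = []             # rows already synced to Iceberg
last_checkpoint = 0   # durable offset, updated atomically with each commit

def sync(upto):
    """Sync records [last_checkpoint, upto) to the lake, then checkpoint."""
    global last_checkpoint
    lake.extend(log[last_checkpoint:upto])
    last_checkpoint = upto  # persisted with the commit in the real system

sync(4)          # first checkpoint covers offsets 0..3
# -- broker fails here; the new broker reads last_checkpoint and resumes --
sync(len(log))   # recovery continues from offset 4: no gaps, no duplicates
```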
03. Unified Streaming‑Batch Storage Application Practice
1. Kafka data retrieval: Previously required topic creation, DBA‑approved OLAP table, and Flink job. Now developers create a topic and can query the unified storage directly.
2. Strongly consistent ODS layer: Exactly‑once guarantees real‑time and batch data alignment, eliminating dual‑pipeline inconsistencies.
3. Batch partition backfill: Combines Flink's unified compute with batch storage to dramatically improve backfill performance and simplify data lifecycle management.
4. Multi‑dimensional analysis: Leveraging StarRocks’ lake‑analysis, vectorization, and CBO to support user behavior analysis, profiling, self‑service reports, and cross‑domain analytics on a single platform.
Thank you for listening.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.