Arctic: NetEase’s Real-Time Lakehouse System Built on Apache Iceberg
This article introduces Arctic, NetEase's real-time lakehouse system built on Apache Iceberg to unify streaming and batch processing. It explains the problems of the Lambda architecture, describes Arctic's core features (the Change and Base stores, the hidden queue, and transaction handling), and shares internal use cases and the future roadmap.
Arctic is NetEase’s real‑time lakehouse solution built on Apache Iceberg, designed to overcome the data isolation and low efficiency caused by the traditional Lambda architecture where streaming and batch processes are separated.
Under the Lambda architecture, the business faces several problems: data islands created by a separate Kudu-based streaming path, limited reuse of offline data, and fragmented development workflows that make it hard to keep metrics and semantics unified.
Arctic provides a TableService layer above Hive and Iceberg, offering table schema optimization and encapsulating message-queue and KV systems such as Kafka, Redis, and HBase. It routes stream writes into a Change store and batch writes into a Base store, delivering upsert semantics, primary-key uniqueness, small-file governance, and merge-on-read capabilities.
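The Change/Base split and merge-on-read described above can be illustrated with a minimal in-memory sketch. This is not Arctic's implementation: the `ChangeRecord` type, its `txn_id`, `op`, and `value` fields, and the `merge_on_read` function are hypothetical names introduced only for illustration, and the sketch assumes change records carry a monotonically increasing transaction ID that defines apply order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One row-level change from the (hypothetical) Change store."""
    txn_id: int                   # assumed: monotonically increasing transaction id
    key: str                      # primary key of the affected row
    op: str                       # "upsert" or "delete"
    value: Optional[dict] = None  # new row content for an upsert


def merge_on_read(base: Dict[str, dict],
                  changes: Iterable[ChangeRecord]) -> Dict[str, dict]:
    """Return the current view: base rows with changes applied in txn order."""
    view = dict(base)  # copy so the base store itself is left untouched
    for rec in sorted(changes, key=lambda r: r.txn_id):
        if rec.op == "delete":
            view.pop(rec.key, None)
        elif rec.op == "upsert":
            # Keyed overwrite keeps exactly one row per primary key.
            view[rec.key] = rec.value
        else:
            raise ValueError(f"unknown op: {rec.op}")
    return view


base = {"u1": {"plays": 10}, "u2": {"plays": 3}}
changes = [
    ChangeRecord(3, "u2", "delete"),
    ChangeRecord(1, "u1", "upsert", {"plays": 11}),
    ChangeRecord(2, "u3", "upsert", {"plays": 1}),
]
current = merge_on_read(base, changes)
```

Because the view is keyed by primary key, an upsert replaces any earlier row for the same key, which is one simple way to obtain primary-key uniqueness at read time.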
Two optimization cycles are employed: a short‑interval Minor Optimize (5‑10 minutes) for small‑file cleanup and a daily Major Optimize that merges change files into base files, making the base store fully Hive‑compatible. A hidden queue wraps Kafka for millisecond‑level CDC, and consistency is ensured via checkpoint‑based retract messages in the Arctic‑Flink connector.
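The talk summary does not say how Minor Optimize chooses which small files to merge. As one possible approach, and purely as an assumption, the sketch below groups small files into merge tasks with a first-fit-decreasing heuristic. The function name `plan_minor_optimize` and the `target_mb` and `small_mb` thresholds are hypothetical and not taken from Arctic.

```python
from typing import List


def plan_minor_optimize(file_sizes_mb: List[int],
                        target_mb: int = 128,
                        small_mb: int = 16) -> List[List[int]]:
    """Group small files into merge tasks of at most target_mb each.

    Uses first-fit decreasing: sort small files from largest to smallest and
    place each one into the first group that still has room.
    """
    small = sorted((s for s in file_sizes_mb if s < small_mb), reverse=True)
    groups: List[List[int]] = []
    for size in small:
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    # A group with a single file gains nothing from being rewritten.
    return [g for g in groups if len(g) > 1]


# Files of 1, 2, 3, and 15 MB count as small; the 200 MB file is left alone.
tasks = plan_minor_optimize([1, 2, 200, 3, 15], target_mb=16, small_mb=16)
```

Each returned group represents one rewrite task whose output is a single larger file; files at or above the `small_mb` threshold are never touched.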
The Arctic Meta Service (AMS) is intended to take over the role of the Hive Metastore (HMS) in the future. It manages table metadata and transaction IDs, triggers optimization tasks based on elapsed time or file size, and offers a user-friendly dashboard for operations.
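A time-or-size trigger of the kind AMS is described as using could look roughly like the sketch below. The intervals follow the figures given in the text (a Minor Optimize every 5 to 10 minutes, a daily Major Optimize), but the `TableStats` fields, the `choose_optimize` function, and the file-count and byte thresholds are hypothetical values chosen for illustration, not AMS settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableStats:
    """Per-table statistics a scheduler might track (hypothetical fields)."""
    seconds_since_minor: float
    seconds_since_major: float
    small_file_count: int
    change_store_bytes: int


def choose_optimize(stats: TableStats,
                    minor_interval_s: float = 600,         # 10 minutes
                    major_interval_s: float = 86_400,      # one day
                    small_file_limit: int = 100,           # assumed threshold
                    change_bytes_limit: int = 1 << 30      # assumed: 1 GiB
                    ) -> Optional[str]:
    """Return "major", "minor", or None depending on time and size triggers."""
    # Major takes priority because merging change files into base files also
    # removes the small files a minor run would otherwise clean up.
    if (stats.seconds_since_major >= major_interval_s
            or stats.change_store_bytes >= change_bytes_limit):
        return "major"
    if (stats.seconds_since_minor >= minor_interval_s
            or stats.small_file_count >= small_file_limit):
        return "minor"
    return None


decision = choose_optimize(TableStats(700, 1_000, 0, 0))
```

A scheduler would evaluate this periodically for each table and submit the corresponding optimization task when the result is not `None`.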
Compared with Hudi and Kudu, Arctic inherits Iceberg’s MVCC and ACID guarantees, offers better Hive compatibility, supports real‑time subscriptions and joins through the hidden queue, and provides a more extensible foundation for future lakehouse evolution.
In practice, NetEase Cloud Music uses Arctic to unify its push‑notification analytics pipeline, allowing analysts to run batch reports and instantly switch to streaming joins without architectural changes, achieving a single source of truth for both batch and real‑time workloads.
Future plans include deeper stream‑batch integration with roll‑up views, sort‑key and Z‑order support, temporal joins, lineage tracking, open permission plugins (e.g., Ranger), and expanding storage back‑ends to S3 and OSS.
The presentation concludes with thanks to the audience.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.