NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap
This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.
NetEase recently launched an internal data lake project named Arctic to address the limitations of traditional data lakes and meet its own data processing needs.
What is a data lake? A data lake supports data ingestion (both batch and streaming), stores structured, semi‑structured, and unstructured data, and provides a unified schema for various compute engines. Compared with data warehouses, it offers greater flexibility but often lacks strong governance.
Existing open‑source data lake solutions such as Delta Lake, Apache Hudi, and Apache Iceberg each have strengths (transaction isolation, real‑time updates, simple codebase) but none fully satisfy NetEase's requirements.
NetEase's requirements include:
Streaming‑batch integration: a single platform that supports real‑time and batch writes/reads, compatible with both Flink (real‑time) and Spark (batch).
Compatibility: seamless use of existing Hive tables without migration.
High performance, low latency, ACID guarantees, schema evolution, and file management.
Arctic core principles:
Architecture: Arctic sits between compute engines (Flink, Spark, Impala, Presto) and storage, exposing a unified table interface.
Table layout: each table has a base space for immutable data and a change space for incremental updates. Queries merge the two (Merge‑On‑Read) to provide up‑to‑date results.
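The base/change split above can be illustrated with a minimal sketch. This is not Arctic's actual implementation; the function, record layout, and operation names (`upsert`/`delete`) are illustrative assumptions, chosen only to show how a Merge‑On‑Read query overlays incremental changes on immutable base data at read time.

```python
# Hypothetical Merge-On-Read sketch; not Arctic internals.
# The base space holds immutable rows keyed by primary key; the
# change space is an ordered log of upsert/delete records.

def merge_on_read(base_rows, change_log):
    """Merge base rows with change records at query time.

    base_rows:  dict mapping primary key -> row
    change_log: ordered list of (op, key, row) tuples,
                where op is "upsert" or "delete"
    """
    merged = dict(base_rows)  # copy: the base space stays immutable
    for op, key, row in change_log:
        if op == "upsert":
            merged[key] = row        # insert or overwrite
        elif op == "delete":
            merged.pop(key, None)    # drop if present
    return merged

base = {1: {"id": 1, "name": "alice"}, 2: {"id": 2, "name": "bob"}}
changes = [("upsert", 2, {"id": 2, "name": "bob-v2"}),
           ("delete", 1, None),
           ("upsert", 3, {"id": 3, "name": "carol"})]
print(merge_on_read(base, changes))
```

The key property the sketch shows is that writers only append to the change log and never rewrite base files; the merge cost is paid by readers, which is what the compaction mechanisms below exist to bound.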
Compaction: Minor compaction merges small change files; Major compaction periodically folds change data into the base space to avoid read amplification.
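The two compaction levels can be sketched as follows. Again this is a hedged illustration, not Arctic's code: the file representation (a list of change records), the size threshold, and the function names are assumptions made for the example.

```python
# Hypothetical compaction sketch; thresholds and layout are illustrative.
# A "change file" is modeled as an ordered list of (op, key, row) records.

def minor_compact(change_files, small_size=4):
    """Merge small change files into one larger file to cut file count."""
    small = [f for f in change_files if len(f) < small_size]
    large = [f for f in change_files if len(f) >= small_size]
    if small:
        merged = [rec for f in small for rec in f]  # keep record order
        large.append(merged)
    return large

def major_compact(base_rows, change_files):
    """Fold all change data into the base space and clear it, so later
    reads no longer pay the Merge-On-Read cost (read amplification)."""
    for f in change_files:
        for op, key, row in f:
            if op == "upsert":
                base_rows[key] = row
            else:  # delete
                base_rows.pop(key, None)
    return base_rows, []  # new base, empty change space

files = [
    [("upsert", 1, {"v": 1})],
    [("upsert", 2, {"v": 2})],
    [("delete", 1, None), ("upsert", 3, {"v": 3})],
    [("upsert", k, {"v": k}) for k in range(10, 15)],  # already large
]
files = minor_compact(files)            # 4 files -> 2 files
base, files = major_compact({}, files)  # change space drained into base
```

The design trade-off the sketch captures: minor compaction is cheap and frequent (it only rewrites small change files), while major compaction is heavier and periodic (it rewrites base data), which matches the article's split between the two.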
Heterogeneous storage: the base space remains managed by Hive for legacy compatibility, while the change space uses Apache Iceberg for versioning and flexible file handling; Kafka is employed for millisecond‑level change data distribution.
Current integration includes Flink for real‑time ingestion, Spark for batch processing, and Impala/Presto for analytical queries.
Results and future plans:
Arctic has been deployed in NetEase Cloud Music, unifying real‑time and offline table objects and enabling simultaneous Flink and Spark development, which resolves schema inconsistencies inherent in Lambda architectures.
Future work focuses on improving query performance for Presto/Impala, extending Merge‑On‑Read to include Kafka‑streamed data, further exploring streaming‑batch integration at the development level, and integrating data quality and lineage services from the data middle platform.
Open‑source release is planned for the first half of 2022.
Overall, Arctic aims to provide a high‑performance, compatible, and governance‑friendly data lake solution that bridges the gap between real‑time and batch processing for NetEase's diverse data workloads.
DataFunTalk