
NetEase’s Data Lake Iceberg: Challenges, Core Principles, and Practical Implementation

This article examines the pain points of traditional data warehouse platforms, explains the core concepts and advantages of the Iceberg data lake table format, compares it with the Hive Metastore, reviews the current Iceberg community ecosystem, and details NetEase’s practical integration with Hive, Impala, and Flink to improve ETL efficiency and support unified batch‑stream processing.

DataFunTalk

Introduction

NetEase data‑lake expert Fan Xinxin shares the motivations behind adopting Iceberg, starting from the limitations of their existing data‑warehouse platform and the need for a more efficient, reliable, and scalable solution.

Data‑Warehouse Platform Pain Points

Large offline jobs finish at unpredictable times: massive data volumes and heavy NameNode request loads slow down ETL, and failed jobs must be retried at significant cost.

Update operations are unreliable: overwriting a partition while it is being read can make downstream queries fail or observe partial data.

Schema changes are costly because they require full data rewrites.

Lambda architecture incurs high maintenance cost, duplicate pipelines, and NameNode pressure.

Iceberg Core Principles

Iceberg is an open‑source table format that provides a high‑level abstraction independent of any execution engine. Its key features include:

Schema definition supporting primitive and complex types.

Hidden partitioning: partition values are derived from table columns and tracked in table metadata, eliminating extra NameNode list calls at query time.

File‑level metadata (statistics per data file) enabling more effective predicate push‑down.

ACID‑compliant read/write APIs with snapshot commits.
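To make the hidden‑partitioning idea concrete, here is a minimal, illustrative sketch (not Iceberg’s actual API; all names are hypothetical) of how a partition value can be derived from a timestamp column and tracked in metadata, so query planning never lists directories:

```python
from datetime import datetime, timezone

def day_transform(ts: float) -> str:
    """Iceberg-style partition transform: derive a day string from a timestamp column."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

# Table metadata records, for each data file, the partition value it belongs to.
# The physical directory layout is never consulted during planning.
data_files = [
    {"path": "f1.parquet", "partition": day_transform(1700000000)},  # 2023-11-14
    {"path": "f2.parquet", "partition": day_transform(1700100000)},  # 2023-11-16
]

def plan_files(predicate_day: str) -> list[str]:
    """Select matching files purely from metadata -- no filesystem list calls."""
    return [f["path"] for f in data_files if f["partition"] == predicate_day]
```

Because the transform is part of the table definition, users filter on the raw timestamp column and the engine maps the predicate to partition values automatically.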

Comparison with the Hive Metastore

Schema support is broadly equivalent: both can describe tables with primitive and complex column types.

Iceberg stores partition values in the table itself, while Metastore treats partitions as directory structures, leading to extra HDFS list operations.

Iceberg’s statistics are at file granularity, offering finer‑grained pruning than Metastore’s table/partition level stats.

Iceberg writes use snapshot commits, providing atomicity and enabling incremental reads; Metastore relies on add‑partition calls.
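The file‑granularity statistics point above is the key to finer pruning. A minimal sketch (hypothetical structures, not Iceberg’s real manifest format) of how per‑file min/max statistics let a planner skip files that a partition‑level statistic could not:

```python
# Each data file carries per-column min/max statistics in the table metadata,
# analogous to the column bounds Iceberg stores in its manifest files.
files = [
    {"path": "a.parquet", "stats": {"user_id": (1, 1000)}},
    {"path": "b.parquet", "stats": {"user_id": (1001, 2000)}},
    {"path": "c.parquet", "stats": {"user_id": (2001, 3000)}},
]

def prune(column: str, value: int) -> list[str]:
    """Keep only files whose [min, max] range for the column can contain the value."""
    kept = []
    for f in files:
        lo, hi = f["stats"][column]
        if lo <= value <= hi:
            kept.append(f["path"])
    return kept
```

A point lookup on `user_id = 1500` reads one file out of three here; with partition‑level statistics alone, all files in the matching partition would be scanned.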

Community Status

Iceberg currently supports Spark 2.4.5, Spark 3.x, and Presto. Ongoing work includes Hive and Flink integrations and adding update/delete capabilities.

NetEase Practical Implementation

Integrated Iceberg with Hive for table creation, deletion, and SQL queries.

Contributed Iceberg support to Impala, allowing both internal and external Iceberg tables.

Implemented a Flink sink for Iceberg, enabling streaming writes from Kafka and asynchronous small‑file merging via snapshot commits.

These integrations dramatically improve ETL job performance by reducing NameNode pressure, leveraging file‑level statistics for pruning, and providing a unified batch‑stream storage model.
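The snapshot mechanics behind streaming writes and asynchronous small‑file merging can be sketched as follows (a simplified model with hypothetical class and method names, not the Iceberg or Flink APIs): each snapshot is an immutable file list, and a commit atomically swaps the current‑snapshot pointer, which is what makes both incremental reads and background compaction safe.

```python
class Table:
    """Minimal model of snapshot-based commits: every snapshot is an immutable
    file list; a commit atomically advances the current-snapshot pointer."""

    def __init__(self):
        self.snapshots = [[]]  # snapshot 0: empty table
        self.current = 0

    def append(self, new_files):
        """Streaming sink commit: add files on top of the current snapshot."""
        base = self.snapshots[self.current]
        self.snapshots.append(base + list(new_files))
        self.current = len(self.snapshots) - 1  # atomic pointer swap

    def compact(self, small_files, merged_file):
        """Asynchronous small-file merge: replace many small files with one,
        committed as a new snapshot; readers of older snapshots are unaffected."""
        base = [f for f in self.snapshots[self.current] if f not in small_files]
        self.snapshots.append(base + [merged_file])
        self.current = len(self.snapshots) - 1

    def incremental(self, since):
        """Files added between snapshot `since` and the current snapshot,
        i.e. the basis of incremental (streaming) reads."""
        old = set(self.snapshots[since])
        return [f for f in self.snapshots[self.current] if f not in old]
```

Because compaction is just another snapshot commit, it can run concurrently with streaming writes and readers without coordination beyond the atomic pointer swap.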

Conclusion

Iceberg’s new partition model, metadata granularity, and API design address the four major pain points of traditional data warehouses, offering higher query performance, lower operational overhead, and seamless batch‑stream processing.

Tags: big data, Flink, Hive, ETL, data lake, Iceberg, table format
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.