
Arctic: Efficient Management of Apache Iceberg Lakehouse Tables – Concepts, Practices, and Roadmap

This article introduces the Arctic lakehouse management system built on Apache Iceberg, explains Iceberg’s core principles, format versions, and real‑world implementations at NetEase, and details Arctic’s automated table optimization, governance workflows, and future development plans.

DataFunSummit

Data infrastructure development never stops, and the lakehouse is the current hot trend. This article introduces Arctic, an open-source lakehouse management system from NetEase built on Apache Iceberg and other table formats, along with practical guidance for using it.

1. Apache Iceberg Overview and Principles

Iceberg is a table format designed for massive analytical workloads, offering schema evolution, hidden partitioning, partition evolution, time travel, rollback, unified streaming/batch reads and writes, serializable isolation, and support for concurrent writers.

Key advantages include easier management, hidden partitioning that removes the need for explicit partition columns, and the ability to evolve partitions without rewriting old data.
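To make hidden partitioning concrete, here is a minimal stdlib-only sketch of how a "day" transform derives a partition value from a timestamp column. This is illustrative, not Iceberg's actual implementation: the point is that writers compute the partition value automatically, so users never maintain a separate date column as they would in Hive.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts: datetime) -> int:
    """Iceberg-style 'day' transform: days since the Unix epoch."""
    return (ts - EPOCH).days

rows = [
    {"event_time": datetime(2023, 1, 1, 8, 30, tzinfo=timezone.utc), "id": 1},
    {"event_time": datetime(2023, 1, 1, 23, 59, tzinfo=timezone.utc), "id": 2},
    {"event_time": datetime(2023, 1, 2, 0, 1, tzinfo=timezone.utc), "id": 3},
]

# The writer derives partition values from the source column; readers
# filtering on event_time get partition pruning for free.
partitions = {day_transform(r["event_time"]) for r in rows}
print(len(partitions))  # 2 distinct day partitions
```

Because queries filter on `event_time` directly, the table's partition layout can later evolve (say, from daily to hourly) without rewriting old data: new data uses the new spec, and old files keep their original one.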

2. Core Architecture of Iceberg

Iceberg uses a catalog to manage tables (HadoopCatalog, HiveCatalog, JdbcCatalog) and stores metadata such as snapshots, manifest lists, manifest files, schemas, and partition specs.
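The metadata hierarchy above can be sketched as a small data structure. File names and field shapes here are made up for illustration; the real layout is defined by the Iceberg spec. The key idea is that the catalog points at the current metadata file, and each commit walks snapshot → manifest list → manifests → data files.

```python
# Illustrative sketch of Iceberg's metadata tree (file names are invented).
table_metadata = {
    "current-snapshot-id": 2,
    "schemas": [{"schema-id": 0, "fields": ["id", "event_time"]}],
    "partition-specs": [{"spec-id": 0, "fields": ["day(event_time)"]}],
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "snap-2.avro"},
    ],
}

# A manifest list points to manifest files; each manifest lists data files
# (with per-file stats that engines use for pruning).
manifest_lists = {"snap-1.avro": [], "snap-2.avro": ["manifest-a.avro"]}
manifests = {
    "manifest-a.avro": [{"file": "data-0001.parquet", "record-count": 100}],
}

def data_files(meta: dict, snapshot_id: int) -> list[str]:
    """Walk snapshot -> manifest list -> manifests -> data files."""
    snap = next(s for s in meta["snapshots"] if s["snapshot-id"] == snapshot_id)
    files: list[str] = []
    for m in manifest_lists[snap["manifest-list"]]:
        files.extend(entry["file"] for entry in manifests[m])
    return files

print(data_files(table_metadata, 2))  # ['data-0001.parquet']
```

Keeping every snapshot in the metadata file is what makes time travel and rollback cheap: reading an old snapshot is just walking a different branch of this tree.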

Format version 2 adds row-level deletes, implemented through position delete files and equality delete files, along with sequence numbers that order deletes relative to data files, plus rewrite (compaction) capabilities.
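The merge-on-read semantics of V2 deletes can be sketched as follows (a simplified model, not Iceberg's API): a position delete marks a (data file, row position) pair, an equality delete marks any row matching a key, and readers apply both at scan time instead of rewriting data files.

```python
# Sketch of format-v2 merge-on-read (simplified; not Iceberg's actual API).
data_rows = ["row-a", "row-b", "row-c", "row-d"]

# Position delete file: drop the row at position 1 of this data file.
position_deletes = {("data-0001.parquet", 1)}

# Equality delete file: drop any row whose value matches, wherever it is.
equality_deletes = {"row-d"}

def scan(path: str, rows: list[str]) -> list[str]:
    """Apply position deletes, then equality deletes, at read time."""
    survivors = [r for i, r in enumerate(rows)
                 if (path, i) not in position_deletes]
    return [r for r in survivors if r not in equality_deletes]

print(scan("data-0001.parquet", data_rows))  # ['row-a', 'row-c']
```

Deletes accumulate until a compaction rewrites the data files and folds the delete files in, which is exactly the kind of maintenance work Arctic automates below.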

3. Iceberg Roadmap

Since the V1.0.0 release in November 2022, Iceberg has issued major releases every 2‑3 months, adding features like Spark merge‑on‑read updates, Z‑order sorting, Puffin file format, multi‑branch management, query statistics, CDC support, and secondary indexes.

4. Arctic – Efficient Iceberg Governance

Arctic provides a lakehouse‑level service that adds a mixed streaming format, automatic table optimization, and self‑optimizing capabilities to address file fragmentation, data redundancy, and snapshot cleanup.

Governance types include expiring snapshots, deleting orphan files, compacting data files, rewriting manifests, and full table rewrite.
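Of these, compacting small data files is the most common action. A minimal bin-packing sketch of the planning step is below; the threshold, sizes, and function names are illustrative, not Arctic's actual implementation.

```python
# Sketch of small-file compaction planning via bin packing (illustrative).
TARGET_FILE_SIZE_MB = 128

def plan_compaction(file_sizes_mb: list[int],
                    target: int = TARGET_FILE_SIZE_MB) -> list[list[int]]:
    """Group small files into bins no larger than the target size."""
    bins: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    # Only bins holding more than one file are worth rewriting.
    return [b for b in bins if len(b) > 1]

small_files = [4, 8, 16, 30, 60, 100]  # file sizes in MB
print(plan_compaction(small_files))  # [[4, 8, 16, 30, 60]]
```

Each planned bin becomes one rewrite task: the files in the bin are merged into a single file near the target size, cutting the number of files a scan must open.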

Optimization tasks are triggered either on a schedule or on‑demand, with monitoring, planning, execution (via Flink, Spark, Yarn, or K8s), and commit phases managed by the Arctic service.
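The monitor → plan → execute → commit loop can be sketched as follows. All names here are illustrative stand-ins, not Arctic's actual API; in Arctic, the execute step is dispatched to an optimizer running on Flink, Spark, Yarn, or Kubernetes.

```python
# Sketch of the self-optimizing loop (names are illustrative, not Arctic's API).
from dataclasses import dataclass

@dataclass
class Table:
    name: str
    small_files: int
    optimized: bool = False

def monitor(tables: list[Table], threshold: int = 10) -> list[Table]:
    """Monitoring phase: pick tables whose small-file count is too high."""
    return [t for t in tables if t.small_files > threshold]

def plan_and_execute(table: Table) -> None:
    """Planning + execution phase: compact, then commit the result."""
    # Real execution runs as distributed rewrite tasks; here we just
    # record the post-compaction state.
    table.small_files = 1
    table.optimized = True

tables = [Table("orders", small_files=42), Table("users", small_files=3)]
for t in monitor(tables):
    plan_and_execute(t)

print([(t.name, t.optimized) for t in tables])
# [('orders', True), ('users', False)]
```

Because the service owns the whole loop, users get continuous optimization without scheduling their own Spark rewrite jobs; the commit phase uses Iceberg's optimistic concurrency so optimization never blocks writers.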

5. Arctic Deployment Process

Install AMS, register the Iceberg catalog, start the optimizer, and monitor results on the AMS dashboard.

6. Future Plans

Upcoming features include an overview dashboard, asynchronous global sorting, asynchronous secondary index construction, balanced optimization cost‑benefit analysis, multi‑region metadata management, Ranger integration for security, and watermark support.

7. Q&A Highlights

Answers cover Hudi support, resource usage comparisons, optimizer deployment on K8s, minor optimizer behavior, Iceberg vs. Hudi trade‑offs, row‑level updates with format V2, CDC capabilities, primary‑key query performance, AMS replacing HMS, multi‑region catalog management, secondary index timeline, S3 FileIO usage, and security integration.

Tags: Big Data, Data Governance, Apache Iceberg, Lakehouse, Arctic
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
