
Performance Optimization of Iceberg Real‑time Data Warehouse and Arctic Enhancements

This article presents a comprehensive overview of Iceberg MOR principles, Arctic‑based performance optimizations, benchmark evaluations using CH‑benchmark, and future roadmap items, highlighting how various file‑type strategies, self‑optimizing mechanisms, and task balancing improve real‑time data lake query efficiency.

DataFunSummit

Introduction

This article shares performance optimization techniques for Iceberg real‑time data warehouses, covering four main aspects: Iceberg MOR principle, Arctic‑based optimizations, benchmark evaluation, and future plans.

01 Iceberg MOR Principle Introduction

1. MOR Overview – Merge On Read (MOR) is an out‑of‑place update technique that records changes separately and merges them at read time, trading low write cost for higher read cost, which makes it well suited to real‑time ingestion scenarios.

2. Three Iceberg File Types

Iceberg uses data‑files for normal inserts, equality‑delete‑files for row‑level deletions that match rows by key values, and position‑delete‑files that delete rows by their ordinal position within a specific data file.

3. Equality‑delete Mechanism

During reads, equality‑delete data is loaded into memory, a hash table is built on the specified columns, and matching rows are filtered out.
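The mechanism above can be sketched in a few lines of Python. This is an illustrative model, not Iceberg's actual reader API: rows and delete records are plain dicts, and the equality columns are assumed to be a single `id` key.

```python
# Sketch of equality-delete filtering at read time (illustrative, not Iceberg's API).
def apply_equality_deletes(data_rows, equality_deletes, key_columns=("id",)):
    # Build an in-memory hash set over the equality columns of the delete records.
    deleted_keys = {tuple(d[c] for c in key_columns) for d in equality_deletes}
    # Emit only rows whose key does not appear in the delete set.
    return [r for r in data_rows
            if tuple(r[c] for c in key_columns) not in deleted_keys]

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
deletes = [{"id": 2}]
print(apply_equality_deletes(rows, deletes))  # rows with id 1 and 3 survive
```

Note that the whole delete set must fit in memory and be probed for every data row, which is why equality‑deletes become expensive as they accumulate.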

4. Position‑delete Mechanism

Two approaches are used:

Bitmap construction – position‑delete‑files are read into memory, a bitmap of row numbers to discard is built, and matching rows are omitted during data‑file reads.

Sort‑merge – both data‑files and position‑delete‑files are sorted by row number, allowing a merge‑join style elimination.
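Both approaches can be sketched as follows (a simplified model: a Python set stands in for the roaring bitmap, and "rows" are just list elements identified by their ordinal position):

```python
def apply_position_deletes_bitmap(data_rows, delete_positions):
    # Bitmap construction: collect the row ordinals to discard up front,
    # then skip them in a single pass over the data file.
    discard = set(delete_positions)  # stand-in for a roaring bitmap
    return [row for pos, row in enumerate(data_rows) if pos not in discard]

def apply_position_deletes_sort_merge(data_rows, sorted_delete_positions):
    # Sort-merge: both sides are ordered by row position, so a single
    # cursor over the delete positions suffices (merge-join style).
    out, i = [], 0
    for pos, row in enumerate(data_rows):
        while i < len(sorted_delete_positions) and sorted_delete_positions[i] < pos:
            i += 1
        if i < len(sorted_delete_positions) and sorted_delete_positions[i] == pos:
            i += 1  # this row is deleted; skip it
            continue
        out.append(row)
    return out
```

The bitmap variant needs all delete positions in memory at once; the sort‑merge variant only streams them, at the cost of requiring both inputs sorted by position.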

5. Iceberg File Organization and Task Structure

Each data‑file can be split into one or more tasks, which serve as the smallest read units.
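Splitting a file into tasks is essentially chunking its byte range by a target split size. A minimal sketch (the 128 MB default is an assumption for illustration, not a quoted Iceberg setting):

```python
def plan_tasks(file_length, target_split_size=128 * 1024 * 1024):
    # Split one data file into (offset, length) read tasks,
    # which serve as the smallest units of read parallelism.
    tasks, offset = [], 0
    while offset < file_length:
        length = min(target_split_size, file_length - offset)
        tasks.append((offset, length))
        offset += length
    return tasks
```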

02 Arctic Based on Iceberg Performance Optimization

1. Arctic Overview

Arctic is NetEase’s open‑architecture lake‑warehouse system built on Iceberg, offering stream‑and‑update‑oriented optimizations and a self‑optimizing mechanism.

2. Why Optimize

Challenges include small files from frequent checkpoints, excessive delete files increasing storage and read cost, inefficient data organization, and lingering stale files.

3. Self‑optimizing Features

Provides automatic execution, resource isolation via optimizer groups and quotas, and flexible deployment of optimizing jobs on Flink (on YARN or Kubernetes, managed by AMS, the Arctic Management Service).

4. Small File Merging

Merges many small files into larger ones, reducing NameNode pressure and Iceberg metadata.

5. Delete File Elimination

Combines delete files with data files to reduce file count and read overhead.

6. Equality‑delete to Position‑delete Conversion

Transforms high‑memory‑cost equality‑deletes into low‑memory position‑deletes, improving query performance when delete ratios are low.
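The conversion can be sketched as a scan that resolves each equality‑delete key to concrete (file, position) pairs. A simplified model, with data files represented as in‑memory row lists and a single `id` equality column assumed:

```python
def convert_equality_to_position_deletes(data_files, equality_deletes,
                                         key_columns=("id",)):
    # data_files: {file_path: [row dicts]} -- a stand-in for scanning real files.
    # Returns position deletes as (file_path, row_position) pairs.
    deleted_keys = {tuple(d[c] for c in key_columns) for d in equality_deletes}
    position_deletes = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if tuple(row[c] for c in key_columns) in deleted_keys:
                position_deletes.append((path, pos))
    return position_deletes
```

The scan cost is paid once, asynchronously; subsequent reads then use the cheap bitmap/sort‑merge path instead of rebuilding a hash table per query.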

7. Self‑optimizing Performance Impact

Benchmark results show that without self‑optimizing, Iceberg MOR query performance degrades sharply after 60–90 minutes of continuous ingestion, while with self‑optimizing it remains stable.

8. Delete File Reuse and Task Balancing

Repeated reads of the same delete files across tasks are mitigated by the mixed Iceberg format's strategies: grouping files by hash, reusing delete files across the tasks that share them, and balancing task assignment with a greedy partitioning algorithm.

Performance tests demonstrate significant gains in real‑time workloads.
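The balanced assignment step can be sketched as classic greedy (longest‑processing‑time‑first) partitioning: sort tasks by descending cost, then always hand the next task to the currently lightest slot. A sketch under that assumption; the exact algorithm Arctic uses may differ in detail:

```python
import heapq

def balance_tasks(task_costs, num_slots):
    # Greedy LPT partitioning: largest tasks first, each assigned
    # to the slot with the smallest accumulated load.
    slots = [(0, i, []) for i in range(num_slots)]  # (load, slot_id, tasks)
    heapq.heapify(slots)
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, i, assigned = heapq.heappop(slots)
        assigned.append(task)
        heapq.heappush(slots, (load + cost, i, assigned))
    return {i: assigned for _, i, assigned in slots}
```

For example, tasks with costs {a: 5, b: 4, c: 3, d: 3, e: 2} split across two slots end up with loads of 8 and 9 rather than a worst case of 5 versus 12.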

9. Impactful Parameters

File format (Parquet vs. Avro) and compression codec heavily influence query performance and resource consumption.

03 Optimization Effect Evaluation

1. TPC‑C & TPC‑H

Traditional benchmarks are not ideal for evaluating row‑level updates in data lakes.

2. CH‑benchmark

Combines TPC‑C transaction workload with adjusted TPC‑H queries to form a complex mixed workload, simulating CDC data ingestion and subsequent analytical queries.

04 Future Plans

Asynchronous global data sorting (including Z‑order).

Asynchronous secondary index construction.
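The core of Z‑order sorting is interleaving the bits of several column values into a single Morton key, so that sorting by that key clusters rows that are close in every dimension at once. A minimal two‑column sketch:

```python
def z_value(x, y, bits=16):
    # Interleave the bits of two non-negative integer column values
    # into a Z-order (Morton) key; sorting by this key co-locates rows
    # that are near each other in both dimensions.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x occupies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y occupies the odd bit positions
    return z
```

Sorting data files by such a key lets min/max statistics prune effectively on either column, not just the leading sort column.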

Tags: performance optimization, big data, data lake, Iceberg, Arctic, MOR, self‑optimizing
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.