
Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation details Iceberg's core capabilities (transactional writes, schema evolution, implicit partitioning, and row-level updates), showcases Xiaomi's real-world applications (log ingestion redesign, near-real-time warehousing, offline optimizations, column-level encryption, and Hive migration strategies), and outlines upcoming enhancements such as materialized views and cloud migration.

DataFunTalk

This talk introduces Iceberg, an open-source table format that brings SQL-table semantics to structured data in a data lake, supports multiple file formats, and integrates with engines such as Flink, Spark, Hive, and Trino.

Core features include transactional atomic writes, full schema evolution (add, drop, rename, reorder columns), implicit partitioning that automatically manages partition directories, row‑level updates via position‑delete and equality‑delete files, and support for multiple storage backends (distributed or cloud).
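These features map directly onto DDL and DML. A minimal Spark SQL sketch, following the syntax in the Iceberg documentation (the `db.events` table and its columns are hypothetical):

```sql
-- Implicit (hidden) partitioning: partition by a transform of a column;
-- readers and writers never reference partition directories directly
CREATE TABLE db.events (
    ts    TIMESTAMP,
    level STRING,
    msg   STRING)
USING iceberg
PARTITIONED BY (days(ts));

-- Full schema evolution: add, drop, rename, reorder
ALTER TABLE db.events ADD COLUMN host STRING;
ALTER TABLE db.events RENAME COLUMN msg TO message;

-- Row-level delete: in format v2 this writes delete files
-- instead of rewriting whole data files
DELETE FROM db.events WHERE level = 'DEBUG';
```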

Iceberg in Xiaomi is illustrated through several scenarios:

Log ingestion redesign: replacing a Spark‑Streaming pipeline with Flink SQL and Iceberg to achieve exactly‑once semantics, implicit partition correctness, and schema‑on‑write evolution.
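The rebuilt pipeline can be sketched in Flink SQL roughly as follows (catalog, metastore URI, and table names are hypothetical, and the Kafka source DDL is omitted):

```sql
-- Register an Iceberg catalog backed by the Hive Metastore
CREATE CATALOG iceberg_cat WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore-host:9083');

-- Continuously ingest logs; Flink checkpoints plus Iceberg's
-- atomic snapshot commits give exactly-once delivery
INSERT INTO iceberg_cat.db.app_logs
SELECT ts, level, message
FROM kafka_log_source;
```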

Near‑real‑time data warehouse: using Flink + Iceberg with two‑level (date + event_name) partitioning to reduce scan volume and spread compute load throughout the day.
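The two-level layout described above amounts to a DDL along these lines (table and column names are illustrative):

```sql
-- Date transform plus event_name identity partitioning:
-- queries filtering on event time and event name prune
-- down to a small set of partitions
CREATE TABLE db.dwd_events (
    ts         TIMESTAMP,
    event_name STRING,
    payload    STRING)
USING iceberg
PARTITIONED BY (days(ts), event_name);
```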

Offline challenges: ensuring partition completeness via watermark-based completion detection, weighing Z-order against local sort for data clustering, and implementing page-level column indexes for Parquet.
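Clustering is applied during compaction. With the Spark procedures shipped in recent Iceberg releases, the two rewrite strategies compared above look roughly like this (table and column names are hypothetical):

```sql
-- Z-order rewrite: clusters data files on multiple columns at once,
-- useful when queries filter on either column independently
CALL catalog_name.system.rewrite_data_files(
  table      => 'db.events',
  strategy   => 'sort',
  sort_order => 'zorder(event_name, user_id)');

-- Local sort alternative: cheaper, effective when queries
-- mostly filter on the leading sort column
CALL catalog_name.system.rewrite_data_files(
  table      => 'db.events',
  strategy   => 'sort',
  sort_order => 'event_name, user_id');
```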

Column‑level encryption: leveraging Parquet 1.12.2 modular encryption with a single‑layer data encryption key (DEK) stored in Iceberg metadata, reducing calls to KeyCenter.

Hive‑to‑Iceberg migration: three approaches—CALL migrate procedure, reusing Hive locations, and creating new Iceberg tables—each with trade‑offs regarding file formats, Spark version compatibility, and snapshot management.
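The first approach uses Iceberg's built-in Spark procedures; a related non-destructive variant is `snapshot`, which creates a new Iceberg table over the existing Hive data (table names are illustrative):

```sql
-- In-place migration: replaces the Hive table with an Iceberg table
-- that takes over its existing data files
CALL catalog_name.system.migrate('spark_catalog.db.sample');

-- Trial run: create a separate Iceberg table reading the Hive files,
-- leaving the original Hive table untouched
CALL catalog_name.system.snapshot('spark_catalog.db.sample',
                                  'db.sample_iceberg');
```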

Current deployment includes over 14,000 Iceberg tables storing more than 30 PB of data, with the number of new tables created per day now surpassing that of Hive.

Future plans involve adding materialized view support for OLAP workloads, enabling Iceberg changelog view on Spark 3.3 for incremental reads, and exploring data lake migration to public cloud storage to reduce EBS costs.
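On Spark, the changelog view mentioned above is exposed through a procedure along these lines (identifiers are hypothetical; the snapshot-id placeholders must be filled in with the range to read between):

```sql
-- Materialize inserts/deletes between two snapshots as a queryable view
CALL catalog_name.system.create_changelog_view(
  table   => 'db.events',
  options => map('start-snapshot-id', '<from-id>',
                 'end-snapshot-id',   '<to-id>'));

-- The default view name is <table>_changes
SELECT * FROM db.events_changes;
```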

Q&A highlights cover reasons for switching from Spark Streaming to Flink SQL, watermark configuration, lack of Hudi usage, challenges of zero‑downtime migration, latency expectations for append and upsert modes, and the use of local sort for multi‑column queries.

Example commands mentioned:

CALL catalog_name.system.migrate('spark_catalog.db.sample', map('foo', 'bar'))
SET spark.sql.sources.partitionOverwriteMode=static;
SET spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables=true;
SET table.exec.source.cdc-events-duplicate=true;
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
