
Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation details Iceberg's core capabilities (transactional writes, schema evolution, implicit partitioning, and row-level updates), showcases Xiaomi's real-world applications (log ingestion redesign, near-real-time warehousing, offline optimizations, column-level encryption, and Hive migration strategies), and outlines upcoming enhancements such as materialized views and cloud migration.

DataFunTalk

This talk introduces Iceberg, an open-source table format that brings SQL-table semantics to structured data in a data lake, supports multiple file formats, and integrates with engines such as Flink, Spark, Hive, and Trino.

Core features include transactional atomic writes, full schema evolution (add, drop, rename, reorder columns), implicit partitioning that automatically manages partition directories, row‑level updates via position‑delete and equality‑delete files, and support for multiple storage backends (distributed or cloud).
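These features map directly onto DDL and DML. A minimal Spark SQL sketch, following the syntax in the Iceberg documentation (the `db.events` table and its columns are hypothetical):

```sql
-- Implicit (hidden) partitioning: partition by a transform of a column;
-- readers and writers never reference partition directories directly
CREATE TABLE db.events (
    ts    TIMESTAMP,
    level STRING,
    msg   STRING)
USING iceberg
PARTITIONED BY (days(ts));

-- Full schema evolution: add, drop, rename, reorder
ALTER TABLE db.events ADD COLUMN host STRING;
ALTER TABLE db.events RENAME COLUMN msg TO message;

-- Row-level delete: in format v2 this writes delete files
-- instead of rewriting whole data files
DELETE FROM db.events WHERE level = 'DEBUG';
```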

Iceberg in Xiaomi is illustrated through several scenarios:

Log ingestion redesign: replacing a Spark‑Streaming pipeline with Flink SQL and Iceberg to achieve exactly‑once semantics, implicit partition correctness, and schema‑on‑write evolution.
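The rebuilt pipeline can be sketched in Flink SQL roughly as follows (catalog, metastore URI, and table names are hypothetical, and the Kafka source DDL is omitted):

```sql
-- Register an Iceberg catalog backed by the Hive Metastore
CREATE CATALOG iceberg_cat WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore-host:9083');

-- Continuously ingest logs; Flink checkpoints plus Iceberg's
-- atomic snapshot commits give exactly-once delivery
INSERT INTO iceberg_cat.db.app_logs
SELECT ts, level, message
FROM kafka_log_source;
```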

Near‑real‑time data warehouse: using Flink + Iceberg with two‑level (date + event_name) partitioning to reduce scan volume and spread compute load throughout the day.
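The two-level layout described above amounts to a DDL along these lines (table and column names are illustrative):

```sql
-- Date transform plus event_name identity partitioning:
-- queries filtering on event time and event name prune
-- down to a small set of partitions
CREATE TABLE db.dwd_events (
    ts         TIMESTAMP,
    event_name STRING,
    payload    STRING)
USING iceberg
PARTITIONED BY (days(ts), event_name);
```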

Offline challenges: ensuring partition completeness via watermark-based completion detection, weighing Z-order against local sort for data clustering, and implementing page-level column indexes for Parquet.
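Clustering is applied during compaction. With the Spark procedures shipped in recent Iceberg releases, the two rewrite strategies compared above look roughly like this (table and column names are hypothetical):

```sql
-- Z-order rewrite: clusters data files on multiple columns at once,
-- useful when queries filter on either column independently
CALL catalog_name.system.rewrite_data_files(
  table      => 'db.events',
  strategy   => 'sort',
  sort_order => 'zorder(event_name, user_id)');

-- Local sort alternative: cheaper, effective when queries
-- mostly filter on the leading sort column
CALL catalog_name.system.rewrite_data_files(
  table      => 'db.events',
  strategy   => 'sort',
  sort_order => 'event_name, user_id');
```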

Column‑level encryption: leveraging Parquet 1.12.2 modular encryption with a single‑layer data encryption key (DEK) stored in Iceberg metadata, reducing calls to KeyCenter.

Hive‑to‑Iceberg migration: three approaches—CALL migrate procedure, reusing Hive locations, and creating new Iceberg tables—each with trade‑offs regarding file formats, Spark version compatibility, and snapshot management.
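The first approach uses Iceberg's built-in Spark procedures; a related non-destructive variant is `snapshot`, which creates a new Iceberg table over the existing Hive data (table names are illustrative):

```sql
-- In-place migration: replaces the Hive table with an Iceberg table
-- that takes over its existing data files
CALL catalog_name.system.migrate('spark_catalog.db.sample');

-- Trial run: create a separate Iceberg table reading the Hive files,
-- leaving the original Hive table untouched
CALL catalog_name.system.snapshot('spark_catalog.db.sample',
                                  'db.sample_iceberg');
```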

Current deployment includes over 14,000 Iceberg tables storing more than 30 PB of data, with the number of new tables created per day now surpassing that of Hive.

Future plans involve adding materialized view support for OLAP workloads, enabling Iceberg changelog view on Spark 3.3 for incremental reads, and exploring data lake migration to public cloud storage to reduce EBS costs.
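On Spark, the changelog view mentioned above is exposed through a procedure along these lines (identifiers are hypothetical; the snapshot-id placeholders must be filled in with the range to read between):

```sql
-- Materialize inserts/deletes between two snapshots as a queryable view
CALL catalog_name.system.create_changelog_view(
  table   => 'db.events',
  options => map('start-snapshot-id', '<from-id>',
                 'end-snapshot-id',   '<to-id>'));

-- The default view name is <table>_changes
SELECT * FROM db.events_changes;
```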

Q&A highlights cover reasons for switching from Spark Streaming to Flink SQL, watermark configuration, lack of Hudi usage, challenges of zero‑downtime migration, latency expectations for append and upsert modes, and the use of local sort for multi‑column queries.

Example commands mentioned:

CALL catalog_name.system.migrate('spark_catalog.db.sample', map('foo', 'bar'))
SET spark.sql.sources.partitionOverwriteMode=static;
SET spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables=true;
SET table.exec.source.cdc-events-duplicate=true;
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
