Analysis of Lakehouse Storage Systems: Design, Metadata, Merge‑On‑Read, and Performance Optimizations for Delta Lake, Apache Hudi, and Apache Iceberg
This article examines the architecture and core design of lakehouse storage systems, compares the metadata handling and Merge‑On‑Read mechanisms of Delta Lake, Apache Hudi, and Apache Iceberg, and presents practical performance‑optimization techniques and real‑world case studies on Alibaba Cloud EMR.
Guide
This article analyzes lakehouse storage systems, focusing on metadata, Merge‑On‑Read (MOR) design, and performance optimization, and compares three popular formats: Delta Lake, Apache Hudi, and Apache Iceberg.
Lakehouse System (Alibaba Cloud EMR)
A lakehouse combines the high performance of data warehouses with the low‑cost openness of data lakes. It builds on open‑source table formats (Delta Lake, Iceberg, Hudi) that store Parquet/ORC files on inexpensive object storage such as OSS or S3, providing ACID transactions, batch‑stream unification, and upserts.
Operating such a system involves configuring file sizes, handling small files, choosing cleaning strategies, and tuning performance. The discussion uses Spark as the compute engine to illustrate the read and write paths.
Core Design
1. Metadata
All three formats store schema, configuration, and the list of valid data files as metadata in the table directory. Delta Lake writes a new JSON delta‑log file for each commit and periodically creates a checkpoint Parquet file that aggregates all previous versions. Iceberg uses a three‑layer architecture: a metadata file holds snapshot information, a manifest‑list points to manifest files, and the manifest files carry fine‑grained file statistics. Hudi organizes files into file groups, encodes identifiers into file names, and requires a primary key to map records to groups.
Delta Lake metadata loading flow:
1. Locate the latest checkpoint metadata file.
2. List the delta‑log JSON files committed after the checkpoint.
3. Parse them in commit order to reconstruct the schema, configuration, and list of valid data files.
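The replay step above can be sketched in plain Python. This is a simplified stand‑in, not Delta Lake's actual reader: real delta‑log files are newline‑delimited JSON actions on storage, and the function name here is illustrative.

```python
import json

def reconstruct_file_list(checkpoint_files, commit_jsons):
    """Replay Delta-style commit actions on top of a checkpoint snapshot.

    checkpoint_files: set of data-file paths aggregated by the checkpoint.
    commit_jsons: JSON strings of subsequent commits, each a list of
    actions like {"add": {"path": ...}} or {"remove": {"path": ...}}.
    """
    valid = set(checkpoint_files)
    for commit in commit_jsons:
        for action in json.loads(commit):
            if "add" in action:
                valid.add(action["add"]["path"])      # new data file becomes valid
            elif "remove" in action:
                valid.discard(action["remove"]["path"])  # file marked obsolete

    return valid

commits = [
    json.dumps([{"add": {"path": "part-2.parquet"}}]),
    json.dumps([{"remove": {"path": "part-1.parquet"}},
                {"add": {"path": "part-3.parquet"}}]),
]
print(sorted(reconstruct_file_list({"part-1.parquet"}, commits)))
# → ['part-2.parquet', 'part-3.parquet']
```

Because each commit is replayed in order, the result is the exact set of files valid at the latest version, which is why checkpoints matter: they bound how many JSON files must be replayed.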
Iceberg metadata loading flow:
1. Locate the current metadata file to obtain the schema, configuration, and the current snapshot's manifest‑list.
2. Parse the manifest‑list to get the set of manifest files.
3. Parse each manifest file to retrieve the list of valid data files.
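The three‑layer traversal can be sketched with in‑memory dictionaries. The dict layout and status strings are illustrative stand‑ins; Iceberg actually stores these layers as JSON and Avro files on object storage.

```python
def collect_data_files(metadata, manifest_lists, manifests):
    """Walk Iceberg's three metadata layers for the current snapshot:
    metadata file -> manifest-list -> manifests -> data files."""
    snapshot = metadata["current-snapshot"]
    files = []
    for manifest_key in manifest_lists[snapshot["manifest-list"]]:
        for path, status in manifests[manifest_key]:
            # Only live entries contribute to the valid-file list.
            if status in ("ADDED", "EXISTING"):
                files.append(path)
    return files

# Tiny in-memory example of the three layers.
metadata = {"current-snapshot": {"manifest-list": "ml-1"}}
manifest_lists = {"ml-1": ["m-1", "m-2"]}
manifests = {
    "m-1": [("f1.parquet", "ADDED"), ("f0.parquet", "DELETED")],
    "m-2": [("f2.parquet", "EXISTING")],
}
print(collect_data_files(metadata, manifest_lists, manifests))
# → ['f1.parquet', 'f2.parquet']
```

The indirection through the manifest‑list is what lets Iceberg prune whole manifests by their partition ranges before reading any file‑level entries.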
Hudi metadata loading flow:
1. Parse hoodie.properties to obtain the schema and configuration.
2. Obtain the list of valid files: list the filesystem, group files by file group, select the latest file per group, and filter out deleted groups via the timeline.
3. If the metadata table is enabled, read the file list from it directly instead of listing the filesystem and consulting the timeline.
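The group‑and‑select step can be sketched as follows. The tuple layout and function name are illustrative; Hudi actually encodes the file‑group ID and commit timestamp into each file name.

```python
def latest_per_file_group(files, deleted_groups):
    """files: iterable of (file_group_id, commit_ts, path) tuples.
    Keep only the newest file per group, skipping groups that the
    timeline marks as deleted."""
    best = {}
    for group, ts, path in files:
        if group in deleted_groups:
            continue  # timeline says this group was dropped
        if group not in best or ts > best[group][0]:
            best[group] = (ts, path)  # newer commit wins within a group
    return {group: path for group, (ts, path) in best.items()}

listing = [
    ("g1", 1, "g1_v1.parquet"),
    ("g1", 2, "g1_v2.parquet"),  # newer version of the same group
    ("g2", 1, "g2_v1.parquet"),
]
print(latest_per_file_group(listing, deleted_groups={"g2"}))
# → {'g1': 'g1_v2.parquet'}
```

This is also why Hudi requires a primary key: the key determines which file group a record lands in, so an update can be routed to the right group without scanning the table.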
A comparative diagram illustrates the metadata‑handling differences among the three formats.
2. Merge‑On‑Read (MOR)
MOR avoids the write amplification of Copy‑On‑Write by persisting only new data and marking old data as obsolete. Hudi was the first to implement MOR, using file groups and log files. Delta Lake uses Deletion Vectors (DVs) to record the offsets of rows to delete or update, and Iceberg V2 offers a similar position‑based MOR implementation.
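The position‑based approach can be sketched in a few lines. This is a conceptual illustration, not the actual DV encoding (Delta Lake stores deletion vectors as compressed bitmaps alongside the base file).

```python
def read_with_deletion_vector(base_rows, deleted_positions):
    """Merge-on-read with a position-based deletion vector: instead of
    rewriting the base file on delete/update, the reader skips the row
    offsets recorded as deleted."""
    dv = set(deleted_positions)
    return [row for pos, row in enumerate(base_rows) if pos not in dv]

base = ["alice", "bob", "carol"]
print(read_with_deletion_vector(base, deleted_positions=[1]))
# → ['alice', 'carol']
```

An update is then a delete (one bit in the vector) plus an insert of the new row into a fresh file, which is exactly how write amplification is avoided: the large base file is never rewritten until compaction.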
Figures illustrate Hudi’s file‑group layout, Delta Lake’s DV mechanism, and a performance comparison of MOR implementations.
Performance Optimization
1. Query
The Spark query path includes three lake‑related stages: metadata loading, plan optimization, and table scan.
(1) Metadata loading
Loading can run on a single node (Iceberg, Delta Lake Standalone) for small tables or be distributed across executors (Hudi, Delta Lake) for large tables. Benchmarks on the LHBench store_sales table (derived from TPC‑DS) show that distributed loading reduces startup latency for very large tables.
EMR Optimization Cases
Case 1 – EMR Manifest: Pre‑materialize partition‑level manifests to skip full metadata loading, achieving sub‑second query latency on a 300 TB table.
Case 2 – EMR DataLake Metastore: Centralized service that stores file‑level snapshots and supports both Hudi and Iceberg, enabling metadata dual‑write and future extensions such as data profiling and row‑level indexes.
(2) Plan optimization
Spark uses statistics (table‑level row counts, column min/max values) for Adaptive Query Execution (AQE) and Cost‑Based Optimization (CBO). With proper statistics collection, a full‑table count can be answered as a local relation, without scanning any data files.
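The "count becomes a local relation" idea can be sketched as follows. This is a hypothetical helper illustrating the planner's decision, not Spark's actual implementation.

```python
def answer_count_star(table_stats, scan_rows):
    """If table-level row-count statistics are available, answer
    COUNT(*) from metadata alone; otherwise fall back to a full scan.
    Returns (count, strategy)."""
    if table_stats.get("row_count") is not None:
        # Planner rewrites the aggregate into a constant (local relation).
        return table_stats["row_count"], "metadata-only"
    # No statistics: every row must be read and counted.
    return sum(1 for _ in scan_rows()), "full-scan"

print(answer_count_star({"row_count": 1_000_000}, lambda: iter([])))
# → (1000000, 'metadata-only')
print(answer_count_star({}, lambda: iter(["r1", "r2", "r3"])))
# → (3, 'full-scan')
```

The same statistics feed CBO join reordering and AQE's runtime plan adjustments, which is why statistics collection pays off well beyond simple counts.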
(3) Table scan
Optimizations include partition pruning, manifest pruning (Iceberg), file‑level pruning, Z‑order clustering, and data skipping. Both Delta Lake and Hudi support Bloom filters or multi‑modal indexes for point lookups.
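File‑level data skipping with min/max statistics can be sketched as follows; the stats layout here is a simplified stand‑in for the per‑file column statistics the formats actually store in their metadata.

```python
def prune_files(file_stats, column, lower, upper):
    """Data skipping: keep only files whose [min, max] range for `column`
    can overlap the predicate lower <= column <= upper; all other files
    are skipped without being opened."""
    kept = []
    for f in file_stats:
        col_min, col_max = f["stats"][column]
        # Ranges overlap iff neither lies entirely outside the other.
        if col_max >= lower and col_min <= upper:
            kept.append(f["path"])
    return kept

files = [
    {"path": "f1.parquet", "stats": {"x": (0, 10)}},
    {"path": "f2.parquet", "stats": {"x": (20, 30)}},
]
print(prune_files(files, "x", lower=5, upper=15))
# → ['f1.parquet']
```

Z‑order clustering makes this pruning effective on multiple columns at once, by laying out data so that each file covers a narrow min/max range on every clustered column.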
Images illustrate file‑size tuning (e.g., 1 GB files for 10 TB tables) and the impact of small‑file merging.
2. Write
Updating data involves locating the affected files, applying the update expressions, and writing new files. Post‑write table services such as clean, checkpoint, and compaction are critical for stability. A common strategy is to separate the write path from a background job that performs these table‑service tasks.
EMR provides automated table‑management policies that trigger small‑file merging or Z‑order after data ingestion, improving query performance.
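A small‑file merging policy can be sketched as a simple bin‑packing plan. This is an illustrative sketch of the idea, not EMR's actual policy engine; the function name and target size are assumptions.

```python
def plan_compaction(files, target_bytes):
    """Greedy plan for small-file merging: pack files into compaction
    groups whose total size stays near target_bytes, so each group can
    be rewritten as one larger file.

    files: iterable of (path, size_in_bytes) tuples.
    Returns a list of groups (each a list of paths)."""
    groups, current, current_size = [], [], 0
    for path, size in sorted(files, key=lambda f: f[1]):  # smallest first
        if current and current_size + size > target_bytes:
            groups.append(current)          # flush the full group
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups

small_files = [("a", 40), ("b", 50), ("c", 60), ("d", 100)]
print(plan_compaction(small_files, target_bytes=100))
# → [['a', 'b'], ['c'], ['d']]
```

A background table‑management job would run such a plan after ingestion, rewriting each group into one file of the configured target size so that later scans open far fewer files.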
Thank you for attending the session.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.