Delta Lake Overview, File Structure, Metadata, and Its Integration with Alibaba Cloud EMR, DLF, G‑SCD and CDC Solutions
This article introduces Delta Lake, an open‑source storage layer for lake‑house architectures; explains its key features and its file and metadata structures; and details how Alibaba Cloud EMR and Data Lake Formation integrate and extend Delta Lake with capabilities such as G‑SCD, CDC, and performance optimizations, along with the future roadmap.
Delta Lake, an open‑source storage framework from Databricks, enables lake‑house architectures by providing ACID transactions, data versioning, Parquet‑based storage, unified batch and streaming reads and writes, schema evolution, and rich DML operations such as upsert, delete, and merge.
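The upsert (MERGE) semantics mentioned above can be sketched in plain Python: rows from a source update matching target keys and insert new ones. This is a hedged illustration of the semantics only; the function name `merge_upsert` and the sample rows are invented here, not Delta's actual API.

```python
# Sketch of MERGE (upsert) semantics on a keyed table:
# WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT.
# merge_upsert and the sample rows are illustrative, not Delta's API.

def merge_upsert(target, source, key):
    """Apply source rows to target, updating on key match, inserting otherwise."""
    merged = {row[key]: row for row in target}
    for row in source:
        # Overlay the source row on any existing row with the same key.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
source = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
print(merge_upsert(target, source, "id"))
# → [{'id': 1, 'qty': 5}, {'id': 2, 'qty': 9}, {'id': 3, 'qty': 1}]
```

In Delta Lake itself the same operation is expressed declaratively (e.g. `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...`) and executed transactionally.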
File structure: A Delta table stores its data as Parquet files in the table directory, alongside a _delta_log subdirectory that holds one JSON log file per commit plus periodic Parquet checkpoint files. Each log records actions such as added or removed data files, and table snapshots are reconstructed by replaying these logs.
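The log replay described above can be sketched with nothing but the standard library: each commit is a zero-padded, numbered JSON file of actions, and the live file set is the result of applying them in order. The helper names and the minimal `{"add": ...}` / `{"remove": ...}` shapes are simplified assumptions; real Delta actions carry many more fields.

```python
import json
import os
import tempfile

# Minimal sketch of a Delta table's on-disk layout: each commit is a
# numbered JSON file in _delta_log/ holding "add" and "remove" actions;
# the active file set is the replay of all commits in order.

def write_commit(log_dir, version, actions):
    """Write one commit file, e.g. _delta_log/00000000000000000001.json."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def active_files(log_dir):
    """Replay all commit logs in version order to compute the live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live

table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_delta_log")
os.makedirs(log_dir)
write_commit(log_dir, 0, [{"add": {"path": "part-000.parquet"}}])
write_commit(log_dir, 1, [{"add": {"path": "part-001.parquet"}},
                          {"remove": {"path": "part-000.parquet"}}])
print(active_files(log_dir))  # → {'part-001.parquet'}
```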
Metadata structure: Each snapshot comprises a protocol version, table metadata (schema and configuration), and the list of active data files derived from AddFile and RemoveFile actions. To load a snapshot, Delta first locates the nearest checkpoint at or before the target version and then replays the subsequent log files.
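The checkpoint-then-replay lookup can be expressed as a small planning function. This is a sketch of the selection logic only, under the assumption that checkpoints and commits are identified by integer versions; `plan_snapshot_load` is an invented name, not a Delta API.

```python
# Sketch of snapshot loading: pick the newest checkpoint at or below
# the requested version, then replay the later JSON commit logs.

def plan_snapshot_load(checkpoints, commits, version):
    """Return (base_checkpoint, logs_to_replay) for a target table version.

    checkpoints: versions at which a checkpoint parquet file exists.
    commits: all committed versions (each has a JSON log file).
    """
    base = max((c for c in checkpoints if c <= version), default=None)
    start = 0 if base is None else base + 1
    replay = [v for v in commits if start <= v <= version]
    return base, replay

# With checkpoints at versions 10 and 20, loading version 23 reads the
# version-20 checkpoint plus the log files for versions 21..23.
print(plan_snapshot_load([10, 20], list(range(24)), 23))  # → (20, [21, 22, 23])
```

When no checkpoint exists yet, the planner falls back to replaying every log from version 0, which is exactly why periodic checkpoints matter for long-lived tables.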
EMR DeltaLake: Since 2019, Alibaba Cloud EMR has incorporated Delta Lake, adding features such as enhanced DML syntax, time‑travel SQL, partition management, auto‑optimize, auto‑vacuum, savepoints, rollback, and manifest customization to improve performance and integration with Hive, Presto, Trino, and other engines.
Deep integration with DLF: Delta Lake tables created in EMR automatically sync their metadata to the Data Lake Formation (DLF) metastore, eliminating the need to create Hive external tables manually. DLF also supports ingesting data from MySQL, RDS, and Kafka directly into Delta tables.
G‑SCD solution: Leveraging Delta Lake's versioning and time‑travel, EMR implements a granularity‑based slowly changing dimension (G‑SCD) that avoids storing a full snapshot per period and enables efficient incremental updates, with automatic savepoints and queryable historical snapshots.
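The core G‑SCD idea sketched above is to record one savepoint (a table version) per business period and resolve "as of" queries through time travel, rather than materializing a full copy per period. The mapping below is a hedged illustration; the `savepoints` dictionary, `version_as_of` helper, and sample versions are all invented for this sketch.

```python
from datetime import date

# Sketch of G-SCD savepoint resolution: one saved Delta table version
# per business date; an "as of" query maps a business date to the
# latest savepoint at or before it. All names and values are illustrative.

savepoints = {                 # business date -> assumed Delta table version
    date(2022, 6, 1): 120,
    date(2022, 6, 2): 135,
    date(2022, 6, 3): 151,
}

def version_as_of(business_date):
    """Resolve the latest savepoint at or before the requested business date."""
    eligible = [d for d in savepoints if d <= business_date]
    if not eligible:
        raise ValueError("no savepoint at or before this date")
    return savepoints[max(eligible)]

print(version_as_of(date(2022, 6, 2)))  # → 135
```

The resolved version would then feed a time‑travel read (conceptually, `SELECT ... VERSION AS OF <version>`), so only incremental changes are stored between periods.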
CDC solution: EMR DeltaLake can act as a streaming source, capturing change data (CDC) for every write operation and persisting it for downstream streaming queries. This enables end‑to‑end incremental pipelines from the ODS to the DWS layer without a custom binlog mechanism.
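Conceptually, the change data a downstream job consumes is a keyed diff between two table states: inserts, updates, and deletes. The sketch below illustrates that shape; the `_change_type` field mimics the style of Delta's change feed output but the function and sample rows are assumptions of this example, not the actual EMR implementation.

```python
# Sketch of CDC output: diff two keyed snapshots into change rows
# (insert / update_postimage / delete) for a downstream streaming job.
# change_data, the sample rows, and the exact labels are illustrative.

def change_data(old, new, key):
    """Return change rows describing how `old` became `new`."""
    old_by_key = {r[key]: r for r in old}
    new_by_key = {r[key]: r for r in new}
    changes = []
    for k, row in new_by_key.items():
        if k not in old_by_key:
            changes.append({**row, "_change_type": "insert"})
        elif row != old_by_key[k]:
            changes.append({**row, "_change_type": "update_postimage"})
    for k, row in old_by_key.items():
        if k not in new_by_key:
            changes.append({**row, "_change_type": "delete"})
    return changes

old = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
new = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(change_data(old, new, "id"))
```

Persisting such change rows alongside each commit is what lets a Delta table serve as a streaming source for the next layer of the pipeline, instead of re-reading full snapshots.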
Future plans: Continue investing in Delta Lake on EMR, deepen the DLF integration, enhance table‑operation tooling, reduce lake‑ingestion costs, and further optimize read/write performance within Alibaba Cloud's big‑data ecosystem.