
Delta Lake Principles, eBay Migration, and Practical Enhancements

This talk by eBay software engineer Zhu Feng explains the fundamentals of Delta Lake and Lakehouse architecture, outlines eBay’s migration from Teradata to a Spark‑based platform, and details the custom enhancements, performance optimizations, and operational improvements implemented to support large‑scale update and delete workloads.


Speaker Zhu Feng, a Ph.D. and eBay software engineer, introduces the basic principles of Delta Lake and shares its application and transformation at eBay.

Delta Lake Principles – Lakehouse Architecture

Delta Lake was proposed at VLDB and CIDR conferences to combine the storage flexibility of data lakes with the governance capabilities of traditional data warehouses, forming the Lakehouse architecture. Traditional warehouses lack flexibility and are costly to scale, while pure data lakes suffer from consistency, quality, sharing, and update/delete limitations.

Delta Lake solves these issues by adding a transaction log (delta log) that provides ACID guarantees, version control, and time‑travel capabilities. The log records JSON commit files that describe added, removed, or metadata‑changed files, enabling snapshot reconstruction for reads and writes.
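The snapshot-reconstruction idea can be sketched in a few lines: replay the ordered commit files, adding and removing data-file paths as the log actions dictate. This is a minimal illustration of the mechanism, not Delta Lake's actual implementation; the log entries below are simplified stand-ins for the real action schema.

```python
import json

def replay_log(commits):
    """Replay ordered commit entries; each is a JSON-lines string describing
    'add', 'remove', or 'metaData' actions, loosely modeled on Delta's log format."""
    files, metadata = set(), None
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])       # file became part of the table
            elif "remove" in action:
                files.discard(action["remove"]["path"])  # file logically deleted
            elif "metaData" in action:
                metadata = action["metaData"]          # schema/metadata change
    return files, metadata

# Three versions: create two files, then an update replaces one of them.
log = [
    '{"metaData": {"id": "t1", "format": "parquet"}}\n{"add": {"path": "part-0.parquet"}}',
    '{"add": {"path": "part-1.parquet"}}',
    '{"remove": {"path": "part-0.parquet"}}\n{"add": {"path": "part-2.parquet"}}',
]
files, meta = replay_log(log)
print(sorted(files))  # ['part-1.parquet', 'part-2.parquet']
```

Time travel falls out naturally: replaying only a prefix of the log (e.g. `replay_log(log[:2])`) reconstructs the table as of that earlier version.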

Project Background at eBay

Before 2018, eBay's data warehouse ran on Teradata, which was expensive. The goal was a seamless migration to a Spark SQL‑based platform (Carmel) that retained Teradata functionality and MPP‑class performance while handling massive update/delete workloads. Carmel, deployed on SSDs with a shared Hive Metastore, achieved interactive query latencies (80% of queries under 27 s).

eBay needed a lake solution that supported frequent updates and deletes, leading to the evaluation of Hudi, Iceberg, and Delta Lake. Delta Lake was chosen because of its tight integration with Spark and its richer feature set.

Key challenges in 2019‑2020 included:

Delta Lake 0.4 lacked SQL support.

Spark 2.3 did not support Delta Lake.

Spark 3.x initially missed row‑level update/delete.

Performance gaps and missing operational tooling.

After iterative releases (initial functionality in Feb 2020, full rollout in May 2020, Spark upgrade to 3.0 in Dec 2020, and Delta 0.8 in Nov 2021), eBay’s platform incorporated Delta Lake with custom enhancements.

Transformation and Practice

Enhancements focused on three areas: functionality, performance, and usability.

Functionality: Added cross‑table UPDATE/DELETE support by extending the Spark SQL grammar (SqlBase.g4) and injecting custom analysis rules. Implemented permission control, small‑file compaction (table‑level and partition‑level), and enforced constraints (CHECK, NOT NULL, UNIQUE, PRIMARY KEY) via additional plan nodes.
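Constraint enforcement via an extra plan node amounts to a validation pass over rows before they are written. The sketch below models that idea in plain Python under assumed semantics; the `enforce` function and its parameters are illustrative, not eBay's actual API.

```python
def enforce(rows, not_null=(), checks=()):
    """Raise on the first row that violates a NOT NULL column or a CHECK predicate,
    mimicking a validation plan node placed before the write."""
    for row in rows:
        for col in not_null:
            if row.get(col) is None:
                raise ValueError(f"NOT NULL violated on column '{col}': {row}")
        for name, pred in checks:
            if not pred(row):
                raise ValueError(f"CHECK constraint '{name}' violated: {row}")
    return rows

rows = [{"id": 1, "price": 10.0}, {"id": 2, "price": 3.5}]
# Passes: every id is non-null and every price is positive.
enforce(rows, not_null=("id",), checks=[("positive_price", lambda r: r["price"] > 0)])
```

A row such as `{"id": 3, "price": -1.0}` would abort the write with a constraint error, which is the behavior the added plan nodes provide inside Spark's query plan.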

Performance: Optimized MERGE INTO to use right‑outer joins only when necessary, pushed filters early, and reduced unnecessary joins, achieving 5‑10× speedups in many cases. Added metrics for row‑level updates/deletes and improved plan generation.
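The join-choice intuition behind the MERGE INTO optimization can be shown with a toy merge over dictionaries: when the statement has no "WHEN NOT MATCHED" clause, only rows that actually join need to be touched (inner-join semantics), and the more expensive outer join over unmatched source rows can be skipped. This is an assumed simplification of the real Spark plan rewrite.

```python
def merge(target, source, insert_not_matched):
    """target/source: dicts keyed by the join key. Returns the merged target."""
    merged = dict(target)
    for key, row in source.items():
        if key in merged:
            merged[key] = row            # WHEN MATCHED THEN UPDATE
        elif insert_not_matched:
            merged[key] = row            # WHEN NOT MATCHED THEN INSERT
        # With no NOT MATCHED clause, unmatched source rows are irrelevant,
        # so an inner join suffices and the outer join is avoided.
    return merged

target = {1: "a", 2: "b"}
source = {2: "B", 3: "C"}
print(merge(target, source, insert_not_matched=False))  # {1: 'a', 2: 'B'}
print(merge(target, source, insert_not_matched=True))   # {1: 'a', 2: 'B', 3: 'C'}
```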

Usability: Built an asynchronous metadata service that records Delta table events in MySQL, enabling vacuum scheduling without overloading the Hive Metastore. Developed a transparent table‑compaction service that respects operation flags to avoid interfering with user jobs. Added SQL commands for vacuum configuration and Delta‑to‑plain‑table conversion. Improved DESCRIBE and ANALYZE output for V2 tables and exposed time‑travel rollback via SQL.

Additional details include handling of snapshot growth, periodic vacuum based on MySQL‑tracked Delta tables, and a flag‑file mechanism to ensure compact operations run only when tables are idle.
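The flag‑file idea can be sketched as a simple guard: the compaction service proceeds only when no user operation has registered a flag file for the table. The file name and layout below are illustrative assumptions, not eBay's actual convention.

```python
import os
import tempfile

def try_compact(table_dir, compact_fn):
    """Run compact_fn on the table only if no active-operation flag file exists."""
    flag = os.path.join(table_dir, "_operation.flag")  # hypothetical flag name
    if os.path.exists(flag):
        return False          # a user job is active on the table; skip this cycle
    compact_fn(table_dir)     # table is idle; safe to rewrite small files
    return True

table = tempfile.mkdtemp()
print(try_compact(table, lambda d: None))  # True: no flag, compaction runs

# A user job drops a flag file; the next compaction cycle backs off.
open(os.path.join(table, "_operation.flag"), "w").close()
print(try_compact(table, lambda d: None))  # False: flag present, skipped
```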

Overall, the presentation covered the background, core concepts, eBay’s migration journey, and the detailed engineering work required to make Delta Lake production‑ready for large‑scale update and delete workloads.

Tags: Data Engineering, Big Data, Data Lake, Spark, Lakehouse, Delta Lake, eBay
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
