
Hudi Overview: Design, Architecture, and Use Cases from Uber

This article presents a comprehensive overview of Apache Hudi, covering its background, design motivations, architecture, view types, performance trade‑offs, compaction mechanisms, concurrency guarantees, and real‑world usage at Uber for managing petabyte‑scale data lakes.

Big Data Technology Architecture

This article is based on a 2018 presentation by three Hudi PMC members introducing Hudi's background and design; the content remains relevant today.

The talk is divided into modules: background, motivation, design, use cases, and demo.

In 2018, Uber's ride business covered 700 cities and 70 countries, with over 2 million drivers.

At Uber, data work splits into ingestion and query. Ingestion consumes data from Kafka and HDFS; queries include Spark notebooks for data scientists, Hive/Presto for ad-hoc queries and dashboards, and Spark/Hive for building pipelines and ETL jobs. Introducing Hudi makes raw datasets manageable with upsert, incremental-processing semantics, and snapshot isolation.
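To make the upsert semantics concrete, the sketch below shows a typical option set for writing a Hudi table from Spark, assuming Hudi's Spark datasource; the table and field names (trips, trip_id, event_ts, dt) are illustrative and not from the talk.

```python
# Sketch: typical Hudi upsert options as passed to a Spark DataFrame writer.
# The table/field names here (trips, trip_id, event_ts, dt) are made up for
# illustration; only the option keys follow Hudi's Spark datasource.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",    # unique key for upserts
    "hoodie.datasource.write.precombine.field": "event_ts",  # latest record wins
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

# With a running Spark session this would be applied roughly as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```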

This is a typical streaming-batch analytics architecture: both the streaming and batch layers consume messages from middleware such as Kafka. Streaming delivers sub-minute latency while batch runs at roughly one-hour latency, and batch results are used to correct streaming output. This is the classic Lambda architecture, and maintaining two parallel systems is costly.

Using a data lake offers several advantages: 1) ad-hoc queries can run against the latest data; 2) near-real-time (micro-batch) processing is sufficient for many scenarios; 3) data is handled more carefully, e.g., controlling file sizes, which matters on HDFS, and applying updates without rewriting whole partitions; 4) maintenance cost is lower, since data need not be duplicated and no second system must be maintained.

Hudi, an open-source data-lake framework from Uber, abstracts the storage layer (supporting dataset changes and incremental processing), ships as a Spark library (horizontally scalable, storing data on HDFS), and is now an Apache incubating project.

Based on Hudi’s architecture, it supports upsert, incremental processing, and various views, allowing a Hudi‑based analytics stack to replace a typical Lambda stack by relying solely on Hudi to satisfy upper‑layer application needs.

Hudi manages datasets on HDFS, comprising indexes, data files, and metadata, and supports queries via Hive, Presto, and Spark.

Hudi provides three view types: read‑optimized view, real‑time view, and incremental view; the community is redefining them as read‑optimized view, snapshot view, and incremental view.

Originally, COW tables supported only the read-optimized view, while MOR tables supported both read-optimized and real-time views; the latest release extends this to read-optimized and incremental views for COW, and read-optimized, real-time, and incremental views for MOR.

In COW mode, the read‑optimized view reads only Parquet files; after the first upsert batch, it reads File1 and File2; after the second upsert batch, it reads File1' and File2.
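The copy-on-write behavior can be sketched in plain Python (this is a conceptual illustration, not Hudi's actual code): an upsert that touches any record in a base file rewrites the whole file as a new version, which is why the view reads File1' after the second batch.

```python
# Conceptual sketch of copy-on-write (not Hudi's real code): an upsert that
# touches any record in a base file rewrites the whole file as a new version.
def cow_upsert(base_file, updates):
    """base_file: dict key -> row; updates: dict key -> row.
    Returns a NEW file (a full copy) with the updates applied."""
    new_file = dict(base_file)   # copy every record, even untouched ones
    new_file.update(updates)     # apply the changed records
    return new_file

file1 = {"a": 1, "b": 2}
file1_v2 = cow_upsert(file1, {"b": 20})   # File1 -> File1'
# The read-optimized view now reads File1' instead of File1; File1 is unchanged.
```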

COW solves many problems but has drawbacks, notably high write latency, because applying an update requires copying and rewriting the entire file that contains it.

The following workflow shows how update latency accumulates: an update first lands in the source table, then propagates to ETL table A, and finally to ETL table B, with each hop adding delay.

From this analysis, COW's issues are high ingestion latency, write amplification, limited data freshness, and small-file problems.

Unlike COW, where an update copies whole files, MOR writes updates to an incremental (log) file, reducing ingestion latency and write amplification.

MOR mode provides read‑optimized and real‑time views.

After batch 1 upsert, the read‑optimized view reads Parquet files; after batch 2 upsert, the real‑time view reads a merged result of Parquet and log files.
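The read-time merge can be sketched in plain Python (a conceptual illustration, not Hudi's actual code): the real-time view combines rows from the Parquet base file with the newer rows in the log file, with log entries taking precedence.

```python
# Conceptual sketch of merge-on-read (not Hudi's real code): updates land in a
# log file; the real-time view merges base Parquet rows with log rows at query time.
def realtime_view(base_rows, log_rows):
    """Both inputs map record key -> row; log entries override base entries."""
    merged = dict(base_rows)
    merged.update(log_rows)      # later log records win over the base file
    return merged

base = {"a": 1, "b": 2}          # rows in the Parquet base file (batch 1)
log = {"b": 20, "c": 3}          # rows appended to the log file (batch 2)
# The read-optimized view sees only `base`; the real-time view sees the merge.
```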

Comparing the trade-offs of the different Hudi views: the COW read-optimized view delivers native Parquet read performance but slower ingestion; the MOR read-optimized view also reads native Parquet but may see stale data; the MOR real-time view has high ingestion performance, with merging deferred to read time. Compaction converts log files to Parquet, effectively turning the real-time view into a read-optimized view.

For compaction, Hudi offers an MVCC‑based lock‑free asynchronous compaction, decoupling ingestion from compaction.

Asynchronous compaction merges log and data files into new data files, after which the read‑optimized view reflects the latest data.
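Compaction itself can be sketched the same way (conceptual only, not Hudi's actual code): fold the accumulated log records into the base file to produce a new base-file version, then start with an empty log.

```python
# Conceptual sketch of compaction (not Hudi's real code): merge the accumulated
# log file into the base file, producing a new base-file version, and reset the
# log. After this, the read-optimized view reflects the latest data.
def compact(base_rows, log_rows):
    new_base = dict(base_rows)
    new_base.update(log_rows)    # fold log records into the base file
    return new_base, {}          # (new base file, emptied log)

base, log = {"a": 1}, {"a": 10, "b": 2}
base, log = compact(base, log)   # base now holds the merged records, log is empty
```

Because this merge produces new files under MVCC rather than locking existing ones, ingestion can keep appending to a fresh log while compaction runs.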

Hudi also provides concurrency guarantees such as snapshot isolation and atomic batch writes.

Hudi use cases at Uber

At Uber, the in-house ingestion framework Marmaray consumes data from Kafka and writes it to the Hudi data lake, handling over 1,000 datasets and 100 TB of data daily; the total managed dataset size has reached 10 PB.

For the typical small-file problem on HDFS, Hudi automatically handles small files during ingestion, relieving NameNode pressure; it supports writing large files and incrementally updating existing files.
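The idea behind this small-file handling can be sketched in plain Python (a conceptual illustration, not Hudi's actual file-sizing logic, and using a record count as a stand-in for Hudi's byte-based target size): route incoming records into existing under-sized files before opening new ones.

```python
# Conceptual sketch of small-file handling (not Hudi's real code): instead of
# always creating new files, fill existing files that are below a target size,
# so HDFS does not accumulate tiny files that pressure the NameNode.
TARGET_FILE_SIZE = 4  # records per file; a stand-in for Hudi's byte-based target

def assign_to_files(files, new_records):
    """files: list of lists of records. Fill under-sized files first."""
    for record in new_records:
        small = next((f for f in files if len(f) < TARGET_FILE_SIZE), None)
        if small is not None:
            small.append(record)       # grow an existing small file
        else:
            files.append([record])     # only then open a new file
    return files

files = assign_to_files([[1, 2], [3]], [4, 5, 6, 7])
# The two existing small files are filled before any new file is created.
```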

Hudi also addresses data privacy: it offers soft delete (removes content but keeps the key) and hard delete (removes both key and content).
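The difference between the two delete modes can be sketched as follows (a conceptual illustration, not Hudi's actual delete API):

```python
# Conceptual sketch of Hudi's two delete modes (not the real API): soft delete
# keeps the record key but removes its payload; hard delete removes the record.
def soft_delete(table, key):
    table[key] = None            # key survives, content is gone
    return table

def hard_delete(table, key):
    table.pop(key, None)         # key and content both removed
    return table

t = {"u1": {"name": "alice"}, "u2": {"name": "bob"}}
soft_delete(t, "u1")   # u1 remains visible as a key with no content
hard_delete(t, "u2")   # u2 disappears entirely
```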

Hudi's incremental processing can also be used to build incremental pipelines and dashboards.
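An incremental pipeline can be sketched in plain Python (a conceptual illustration, not Hudi's actual incremental-query API): each write is stamped with a commit time, and a downstream job pulls only records committed after its last checkpoint instead of rescanning the whole table.

```python
# Conceptual sketch of an incremental pull (not Hudi's real API): a downstream
# pipeline reads only the records committed after the point it last processed.
def incremental_pull(records, begin_commit_time):
    """records: list of (commit_time, row); return rows newer than the checkpoint."""
    return [row for ts, row in records if ts > begin_commit_time]

commits = [(100, "r1"), (200, "r2"), (300, "r3")]
new_rows = incremental_pull(commits, begin_commit_time=100)  # only "r2" and "r3"
```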

Tags: big data, data lake, Lambda architecture, Hudi, incremental processing, upsert
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
