
An Overview of Apache Hudi: Architecture, Concepts, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark and Hadoop‑compatible storage to provide efficient ingestion, incremental processing, and multiple query modes such as snapshot, incremental, and read‑optimized for large analytical datasets.


Apache Hudi is an open‑source data‑lake solution originally developed by Uber, designed to ingest, manage, and incrementally process large analytical datasets on Hadoop‑compatible file systems like HDFS and S3.

Key capabilities include:

Ingesting and managing massive datasets on HDFS with reduced write latency.

Using Spark for updates, inserts, and deletes on HDFS data.

Providing stream primitives such as upserts and incremental pulls.

Supporting insert/update operations on Parquet files.

Integrating with the Hadoop ecosystem (Spark, Hive, Parquet) via custom InputFormats.

Enabling data recovery through savepoints.

Supporting Spark 2.x (recommended 2.4.4+).

Basic Architecture

Unlike systems such as Kudu, which ship their own storage engine with mutable tables, Hudi builds on existing Hadoop‑compatible storage and relies heavily on Spark for incremental processing and rich query capabilities, supporting both batch and streaming pipelines in which Hudi can act as a source or a sink.

Hudi tables integrate with engines such as Hive, Spark, and Presto, exposing snapshot, incremental, and read‑optimized query abilities.

Timeline

Each Hudi table maintains a Timeline composed of Instants, which record actions (e.g., COMMIT, CLEAN, DELTA_COMMIT, COMPACTION, ROLLBACK, SAVEPOINT), timestamps, and states (REQUESTED, INFLIGHT, COMPLETED).

The Timeline enables efficient incremental pulls by locating only the files changed after a given time, avoiding full scans of older data.
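To make the incremental‑pull idea concrete, here is a minimal Python sketch of a timeline of instants. The class and function names are illustrative only, not Hudi's actual Java API; the point is that an incremental reader filters completed instants by timestamp instead of scanning the whole table.

```python
from dataclasses import dataclass

# Toy model of a Hudi-style timeline (names are illustrative, not the real API).
# Each instant records an action, a timestamp, and a state, as described above.
@dataclass(frozen=True)
class Instant:
    timestamp: str   # e.g. "20240101100000"
    action: str      # COMMIT, CLEAN, DELTA_COMMIT, COMPACTION, ROLLBACK, SAVEPOINT
    state: str       # REQUESTED, INFLIGHT, COMPLETED

timeline = [
    Instant("20240101100000", "COMMIT", "COMPLETED"),
    Instant("20240101110000", "DELTA_COMMIT", "COMPLETED"),
    Instant("20240101120000", "COMPACTION", "INFLIGHT"),
]

def completed_after(timeline, begin_ts):
    """Incremental pull: only completed instants strictly newer than begin_ts."""
    return [i for i in timeline
            if i.state == "COMPLETED" and i.timestamp > begin_ts]

print([i.action for i in completed_after(timeline, "20240101100000")])
# -> ['DELTA_COMMIT']  (the in-flight compaction is not yet visible)
```

Note that the in‑flight compaction is excluded: readers only ever see data from completed instants, which is what makes queries consistent while writes are in progress.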

Files and Indexing

Hudi stores tables under a base path and partitions them by directory. Within each partition, file groups (each identified by a unique file ID) contain file slices; a file slice consists of a base Parquet file and zero or more log files that capture subsequent inserts and updates. Hudi uses MVCC: compaction merges log files into new base files, while the cleaner removes obsolete file slices.

Records are identified by a HoodieKey (record key + partition path) which maps to a file group.
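The mapping from key to file group can be sketched in a few lines. This is a deliberately simplified bucket‑style index for illustration; real Hudi offers several index implementations (e.g., bloom‑filter and HBase indexes), and the hashing scheme below is an assumption, not Hudi's actual algorithm.

```python
from collections import namedtuple

# HoodieKey = record key + partition path, as described in the text.
HoodieKey = namedtuple("HoodieKey", ["record_key", "partition_path"])

NUM_FILE_GROUPS = 4  # illustrative; real tables size file groups dynamically

def file_group_for(key: HoodieKey) -> str:
    # Toy bucket index: hash the record key into a stable file-group id
    # within the record's partition. (Not Hudi's real index logic.)
    bucket = sum(key.record_key.encode()) % NUM_FILE_GROUPS
    return f"{key.partition_path}/fg-{bucket}"

k = HoodieKey("order-1001", "2024/01/01")
assert file_group_for(k) == file_group_for(k)  # mapping is stable
```

The stability of this mapping is the important property: because a given key always lands in the same file group, an upsert knows exactly which file slice to merge against without scanning the table.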

Table Types

Hudi supports two table layouts:

Copy‑On‑Write (COW) : Stores data exclusively in columnar Parquet files; each update rewrites the affected base files into new file slices, which causes write amplification but keeps reads simple and fast.

Merge‑On‑Read (MOR) : Stores columnar base files plus row‑oriented delta log files (e.g., Avro); updates are appended to logs and merged lazily during compaction, balancing read and write performance.
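The trade‑off between the two layouts can be modeled in a few lines of Python. This is a toy model, not Hudi's implementation: dicts stand in for Parquet base files and a list of dicts stands in for the row‑oriented delta log.

```python
# Toy model of the two write paths (illustrative only).

def cow_update(base, updates):
    """Copy-On-Write: every update produces a whole new base file slice."""
    merged = dict(base)
    merged.update(updates)
    return merged  # the old base file becomes obsolete (cleaned later)

def mor_update(log, updates):
    """Merge-On-Read: cheap append of a row-oriented delta block to the log."""
    log.append(updates)
    return log

def mor_compact(base, log):
    """Compaction merges accumulated log blocks into a fresh base file."""
    merged = dict(base)
    for block in log:
        merged.update(block)
    return merged, []  # new base slice, empty log

base = {"k1": 1}
new_base = cow_update(base, {"k1": 2, "k2": 3})   # full rewrite on write
log = mor_update([], {"k1": 2})                   # append-only on write
compacted, log = mor_compact(base, log)           # merge deferred to compaction
```

COW pays the merge cost at write time (write amplification), while MOR defers it to compaction or query time, which is exactly the read/write balance described above.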

Query Types

Snapshot Query : Returns the latest view of the table as of the most recent commit or compaction; for MOR tables this merges base files with their delta logs on the fly.

Incremental Query : Returns only data written after a specified commit/compaction.

Read Optimized Query : Reads only the latest base Parquet files, ignoring un‑compacted delta logs.
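The three query types can be contrasted with a toy model of a single file group, where a dict stands in for the compacted base file and a list of timestamped dicts stands in for the un‑compacted delta log (illustrative only, not Hudi's reader code):

```python
# One file group: base = compacted Parquet contents as of commit "t1";
# log = un-compacted delta records tagged with the commit that wrote them.
base = {"a": 1, "b": 2}
log = [("t2", {"b": 20}), ("t3", {"c": 3})]

def snapshot_query():
    """Latest view: base merged with every delta log record."""
    view = dict(base)
    for _, delta in log:
        view.update(delta)
    return view

def incremental_query(begin_ts):
    """Only records committed strictly after begin_ts."""
    view = {}
    for ts, delta in log:
        if ts > begin_ts:
            view.update(delta)
    return view

def read_optimized_query():
    """Base files only; ignores un-compacted deltas (possibly stale, but fast)."""
    return dict(base)
```

Running these against the sample data, the snapshot view sees `b` updated to 20 and the new key `c`, an incremental pull from `t2` returns only `c`, and the read‑optimized view still reports the pre‑update values from the last compaction.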

Engine Support Matrix

Both COW and MOR tables can be queried from engines such as Hive, Spark SQL, and Presto, though not every engine supports every mode: incremental queries are typically issued through Hive or Spark, while Presto primarily serves snapshot and read‑optimized reads.


Tags: big data, data lake, Spark, Apache Hudi, incremental processing, query types
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.