An Overview of Apache Hudi: Architecture, Concepts, and Query Types
Apache Hudi is an open‑source data‑lake framework, originally developed at Uber, designed to ingest, manage, and incrementally process large analytical datasets on Hadoop‑compatible storage such as HDFS and S3. It leverages Spark for efficient ingestion and incremental processing, and exposes multiple query modes: snapshot, incremental, and read‑optimized.
Key capabilities include:
Ingesting and managing massive datasets on HDFS with reduced write latency.
Using Spark for updates, inserts, and deletes on HDFS data.
Providing stream primitives such as upserts and incremental pulls.
Supporting insert/update operations on Parquet files.
Integrating with the Hadoop ecosystem (Spark, Hive, Parquet) via custom InputFormats.
Enabling data recovery through savepoints.
Supporting Spark 2.x (recommended 2.4.4+).
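The upsert primitive listed above can be illustrated with a minimal, plain‑Python sketch (this is not the Hudi API; the field name `record_key` is purely illustrative): records whose key already exists are updated in place, and new keys are inserted.

```python
# Conceptual sketch of upsert semantics (plain Python, not the Hudi API).
# Existing record keys are updated in place; new keys are inserted.

def upsert(table, records, key_field="record_key"):
    """Apply Hudi-style upsert semantics to an in-memory 'table' (a dict)."""
    for rec in records:
        table[rec[key_field]] = rec  # update if key exists, else insert
    return table

table = {}
upsert(table, [{"record_key": "a", "value": 1}, {"record_key": "b", "value": 2}])
upsert(table, [{"record_key": "a", "value": 10}])  # "a" is updated, not duplicated
assert table["a"]["value"] == 10 and len(table) == 2
```

An incremental pull is the read-side counterpart: rather than rescanning the whole table, a consumer asks only for records changed since its last pull, as shown in the Timeline section below.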
Basic Architecture
Unlike OLTP‑oriented systems such as Kudu, Hudi targets Hadoop‑compatible storage and relies heavily on Spark for incremental processing and rich query capabilities. It supports both batch and streaming pipelines, in which a Hudi table can act as either a source or a sink.
Hudi tables integrate with engines such as Hive, Spark, and Presto, exposing snapshot, incremental, and read‑optimized query abilities.
Timeline
Each Hudi table maintains a Timeline composed of Instants, which record actions (e.g., COMMIT, CLEAN, DELTA_COMMIT, COMPACTION, ROLLBACK, SAVEPOINT), timestamps, and states (REQUESTED, INFLIGHT, COMPLETED).
The Timeline enables efficient incremental pulls by locating only the files changed after a given time, avoiding full scans of older data.
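The mechanics of an incremental pull over the Timeline can be sketched in a few lines of plain Python (class and field names here are illustrative, not Hudi internals): each Instant carries a timestamp, an action, and a state, and the pull simply selects completed instants newer than a given time.

```python
# Conceptual sketch of a Hudi Timeline: each Instant records a timestamp,
# an action, and a state. An incremental pull only looks at instants
# completed after a given time, never rescanning older data.
from dataclasses import dataclass

@dataclass
class Instant:
    timestamp: str   # e.g. "20240101"
    action: str      # COMMIT, CLEAN, DELTA_COMMIT, COMPACTION, ...
    state: str       # REQUESTED, INFLIGHT, COMPLETED

def instants_after(timeline, since):
    """Return completed instants newer than `since` (the incremental pull)."""
    return [i for i in timeline if i.state == "COMPLETED" and i.timestamp > since]

timeline = [
    Instant("20240101", "COMMIT", "COMPLETED"),
    Instant("20240102", "DELTA_COMMIT", "COMPLETED"),
    Instant("20240103", "COMPACTION", "INFLIGHT"),  # not yet visible to readers
]
assert [i.timestamp for i in instants_after(timeline, "20240101")] == ["20240102"]
```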
Files and Indexing
Hudi stores tables under a base path, partitioned by directory. Within each partition, file groups (each identified by a unique file ID) contain file slices, and each slice consists of a base Parquet file plus zero or more log files that capture subsequent inserts and updates. Hudi uses MVCC: compaction merges log files into new base files, while cleaning removes obsolete file slices.
Records are identified by a HoodieKey (record key + partition path) which maps to a file group.
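A hedged sketch of that key-to-file-group mapping, in plain Python: real Hudi uses a pluggable index (for example, bloom-filter based) to locate the file group holding a record, so the hash function below is only an illustration of the stable mapping, not Hudi's actual indexing.

```python
# Conceptual sketch: a HoodieKey (record key + partition path) maps
# deterministically to a file group within its partition. The md5-based
# routing here is illustrative; Hudi's real index is pluggable.
import hashlib

def file_group_for(record_key, partition_path, num_file_groups=4):
    digest = hashlib.md5(record_key.encode()).hexdigest()
    group = int(digest, 16) % num_file_groups
    return f"{partition_path}/filegroup-{group}"

# The same key always lands in the same file group, so an update can be
# routed to the file slice that already holds the record.
assert file_group_for("uuid-1", "2024/01/01") == file_group_for("uuid-1", "2024/01/01")
```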
Table Types
Hudi supports two table layouts:
Copy‑On‑Write (COW): Stores data only in columnar Parquet files; each update rewrites the affected files in their entirety, which causes write amplification but keeps reads simple.
Merge‑On‑Read (MOR): Stores columnar base files plus row‑oriented delta log files (e.g., Avro); updates are appended to logs and merged lazily during compaction, balancing read and write performance.
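The write-cost trade-off between the two layouts can be modeled with a minimal sketch, assuming a single file group (the `FileSlice` class and function names are illustrative, not Hudi internals): COW rewrites the base file on every update, while MOR appends to a log and pays the merge cost only at compaction.

```python
# Conceptual sketch of COW vs. MOR write paths for one file group.
# COW rewrites the whole base file per update (write amplification);
# MOR appends to a delta log and merges lazily during compaction.

class FileSlice:
    def __init__(self, base):
        self.base = dict(base)   # columnar base file (key -> record)
        self.log = []            # row-oriented delta log entries

def cow_update(slice_, key, record):
    merged = dict(slice_.base)
    merged[key] = record
    return FileSlice(merged)     # entire base file rewritten

def mor_update(slice_, key, record):
    slice_.log.append((key, record))   # cheap append; reads pay the merge cost
    return slice_

def compact(slice_):
    merged = dict(slice_.base)
    for key, record in slice_.log:
        merged[key] = record
    return FileSlice(merged)     # new base file, empty log

s = mor_update(FileSlice({"a": 1}), "a", 2)
assert s.base["a"] == 1            # base is stale until compaction...
assert compact(s).base["a"] == 2   # ...which folds the log into a new base
```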
Query Types
Snapshot Query: Returns the latest view of the table as of the most recent commit or compaction.
Incremental Query: Returns only the data written after a specified commit/compaction.
Read‑Optimized Query: Reads only the latest base Parquet files, ignoring un‑compacted delta logs (for COW tables this is equivalent to a snapshot query).
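The three query modes can be contrasted with a minimal sketch over one MOR file slice, assuming a base file plus delta log entries tagged with commit times (all names and structures below are illustrative, not Hudi internals):

```python
# Conceptual sketch of the three query modes over one MOR file slice.
base = {"a": ("v1", "t1")}                     # key -> (value, commit_time)
log = [("b", "v2", "t2"), ("a", "v3", "t3")]   # (key, value, commit_time)

def snapshot():
    """Latest view: base file merged with all delta log entries."""
    view = {k: v for k, (v, _) in base.items()}
    for k, v, _ in log:
        view[k] = v
    return view

def read_optimized():
    """Base files only; un-compacted delta logs are ignored."""
    return {k: v for k, (v, _) in base.items()}

def incremental(since):
    """Only records written after the given commit time."""
    return {k: v for k, v, t in log if t > since}

assert snapshot() == {"a": "v3", "b": "v2"}
assert read_optimized() == {"a": "v1"}   # stale but fast: no log merge
assert incremental("t2") == {"a": "v3"}
```

Read-optimized queries trade freshness for speed (no log merge at read time), while incremental queries let downstream jobs consume only the changes since their last run.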
Engine Support Matrix
Both COW and MOR tables can be queried by external engines such as Hive, Spark SQL, and Presto, with each engine supporting the three query modes.