
Apache Hudi from Zero to One – Part 2: Reading Process and Query Types (Spark Example)

This article explains how Apache Hudi integrates with Spark to read data, detailing the Spark‑SQL planning stages, the Spark‑Hudi read workflow, and the four main Hudi query types—snapshot, read‑optimized, time‑travel, and incremental—along with example SQL commands and code snippets.

DataFunSummit

This article, translated from the original English blog, continues the "Apache Hudi from Zero to One" series by focusing on how reading operations are implemented in Hudi, using Spark as the example engine.

Spark SQL is a distributed SQL engine that parses user queries, generates a logical plan, optimizes it, and then creates a physical plan for execution. The three stages—analysis, logical optimization, and physical planning—are performed by Spark's Catalyst optimizer.
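These Catalyst stages can be inspected directly from SQL. As a sketch (assuming a Hudi table such as the `hudi_mor_example` table created later in this article), `EXPLAIN EXTENDED` prints the parsed, analyzed, and optimized logical plans followed by the physical plan:

```sql
-- EXPLAIN EXTENDED shows all Catalyst stages for a query:
-- parsed logical plan, analyzed logical plan, optimized logical plan,
-- and the final physical plan chosen for execution.
EXPLAIN EXTENDED
SELECT id, name FROM hudi_mor_example WHERE price > 15;
```

Comparing the optimized logical plan with the physical plan is a quick way to see which filters Spark pushes down to the Hudi data source.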

When Spark reads a Hudi table, it goes through the Spark data source API. The integration entry point is DefaultSource, which registers the data source format as org.apache.hudi (or the short name hudi). The read workflow proceeds through the following steps:

Build scan: buildScan() passes filters to the data source for query optimization.

Collect file splits: collectFileSplits() gathers relevant files via a FileIndex.

File index lookup: FileIndex finds all FileSlice objects to be processed.

Compose RDD: after the FileSlice objects are identified, composeRDD() creates an RDD.

Load and read: FileSlice objects are loaded as RDDs, with column‑pruning for Parquet files.

Return RDD: the RDD is returned for further query planning and code generation.
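This whole pipeline is triggered whenever a query names hudi as the format. A minimal way to exercise it from SQL is to register a path-based view using Spark's generic CREATE TEMPORARY VIEW ... USING syntax; this is a sketch, and the path is a placeholder matching the example table used later in this article:

```sql
-- Register an existing Hudi table on disk through the hudi data source
-- (resolved via DefaultSource); the path below is a placeholder.
CREATE TEMPORARY VIEW hudi_path_view
USING hudi
OPTIONS (path '/tmp/hudi_mor_example');

-- Querying the view drives the buildScan -> FileIndex -> composeRDD flow.
SELECT * FROM hudi_path_view;
```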

The article then describes four Hudi query types:

1. Snapshot Query

The default query type that returns the latest records (a table snapshot). For Merge‑on‑Read tables, it may need to merge log files with base files.

create table hudi_mor_example (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
) location '/tmp/hudi_mor_example';

set hoodie.spark.sql.insert.into.operation=UPSERT;
insert into hudi_mor_example select 1, 'foo', 10, 1000;
insert into hudi_mor_example select 1, 'foo', 20, 2000;
insert into hudi_mor_example select 1, 'foo', 30, 3000;

Running SELECT id, name, price, ts FROM hudi_mor_example; returns a single row (1, 'foo', 30, 3000): with the UPSERT operation, later writes for the same primary key replace earlier ones, and the record with the greatest preCombineField value (ts) wins.

2. Read‑Optimized (RO) Query

Trades data freshness for lower query latency; it applies only to MoR tables. The query reads only the base files of each FileSlice and skips the log files, so records that exist only in log files (not yet compacted) are not visible.

select id, name, price, ts from hudi_mor_example_ro;

3. Time‑Travel Query

Allows users to query the table as of a specific timestamp, retrieving historical snapshots.

select id, name, price, ts from hudi_mor_example timestamp as of '20230905221619987';
select id, name, price, ts from hudi_mor_example timestamp as of '20230905221619986';

4. Incremental Query

Enables retrieval of records changed within a given time range, supporting CDC mode for full change‑data capture.
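The article gives no SQL for this mode. One way to run it in recent Hudi releases (0.14+) is the hudi_table_changes table-valued function; the snippet below is a sketch, assuming that function is available and reusing a commit time from the time-travel example as the begin instant:

```sql
-- Records changed after the given commit time (latest state per key).
SELECT id, name, price, ts
FROM hudi_table_changes('hudi_mor_example', 'latest_state', '20230905221619986');

-- CDC mode: per-record change images, including operation type and
-- before/after values (requires hoodie.table.cdc.enabled=true to be set
-- when the table is created).
SELECT *
FROM hudi_table_changes('hudi_mor_example', 'cdc', '20230905221619986');
```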

The article concludes by summarizing the covered Spark Catalyst optimizer concepts, the Spark‑Hudi read integration, and the four query types, and hints at a forthcoming article on the Hudi write process.
