Apache Hudi Incremental Processing and Change Data Capture (CDC): Overview, Incremental Query, and CDC

This article explains Apache Hudi's incremental processing capabilities, covering an overview of the medallion architecture, detailed configuration for incremental queries, the introduction of Change Data Capture (CDC) with required table properties, and a review of how these features enable richer data insights in modern data lake environments.

DataFunSummit

Apache Hudi provides native support for incremental processing, which is essential for keeping ELT pipelines up to date in lakehouse architectures such as the Medallion model. That architecture has three layers (bronze, silver, and gold) that take data from raw ingestion, through cleaned and conformed tables, to business-ready data.

Incremental Query

Incremental queries are enabled by setting the following configuration properties:

hoodie.datasource.query.type=incremental
hoodie.datasource.read.begin.instanttime=202305150000
hoodie.datasource.read.end.instanttime=202305160000  # optional

Key points to note:

Setting hoodie.datasource.read.begin.instanttime=0 starts the query from the beginning of the table's history.

Omitting hoodie.datasource.read.end.instanttime retrieves all commits up to the current instant.

The result contains only the records that changed within the specified time window. If a record changed several times in that window, it appears once, at its latest committed version.

If the start time is 0 and the end time is omitted, the incremental query behaves like a snapshot query, returning the latest state of all records.
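
A minimal sketch of how these properties might be passed to a Spark reader from Python. The helper names `incremental_read_options` and `read_incremental` and the `table_path` argument are hypothetical; only the option keys come from the configuration above. The read function assumes an existing Spark session with the Hudi bundle available.

```python
from typing import Dict, Optional


def incremental_read_options(begin: str, end: Optional[str] = None) -> Dict[str, str]:
    """Build reader options for a Hudi incremental query.

    `begin` and `end` are commit instant times given as strings.
    Leaving `end` as None reads up to the latest commit.
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin,
    }
    if end is not None:
        options["hoodie.datasource.read.end.instanttime"] = end
    return options


def read_incremental(spark, table_path: str, begin: str, end: Optional[str] = None):
    """Load the changed records as a DataFrame.

    Assumption: `spark` is an existing SparkSession and `table_path`
    points at a Hudi table; both are supplied by the caller.
    """
    return (
        spark.read.format("hudi")
        .options(**incremental_read_options(begin, end))
        .load(table_path)
    )
```

Calling `incremental_read_options("0")` gives the options for the "full history" case described above, since no end instant is set.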

The workflow for an incremental query has three steps: collectFileSplits() identifies the relevant files, the commits are filtered by timestamp, and composeRDD() reads only the data that is needed.
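
The commit-filtering step can be illustrated in isolation. The sketch below is not Hudi's code; it is a simplified model in plain Python, and the function `commits_in_range` and the sample timeline are hypothetical. It assumes that instant times are strings whose lexicographic order matches chronological order, that the begin instant is exclusive, and that the end instant is inclusive.

```python
from typing import List, Optional


def commits_in_range(timeline: List[str], begin: str, end: Optional[str] = None) -> List[str]:
    """Return the commits with begin < instant <= end, oldest first.

    Assumption: instants compare correctly as plain strings.
    When `end` is None there is no upper bound.
    """
    return [
        instant
        for instant in sorted(timeline)
        if instant > begin and (end is None or instant <= end)
    ]


# Hypothetical commit timeline for illustration only.
timeline = ["20230514120000", "20230515080000", "20230515200000", "20230516090000"]

print(commits_in_range(timeline, "20230515000000", "20230516000000"))
# ['20230515080000', '20230515200000']
```

Only the files touched by the selected commits would then need to be read, which is what keeps an incremental query cheaper than a full scan.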

Change Data Capture (CDC)

CDC, introduced in Hudi 0.13.0, extends incremental processing by providing before/after images of each record, allowing detection of inserts, updates, and deletes. To enable CDC, set the table property:

hoodie.table.cdc.enabled=true

CDC reads are performed by configuring the incremental query format:

hoodie.datasource.query.type=incremental
hoodie.datasource.query.incremental.format=cdc
hoodie.datasource.read.begin.instanttime=202305150000
hoodie.datasource.read.end.instanttime=202305160000  # optional
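
A minimal sketch of how the write-side and read-side settings might fit together in Python. The helper names `with_cdc_enabled` and `cdc_read_options` are hypothetical, and the sketch assumes the table property can be supplied together with the other writer options; only the option keys are taken from the text above.

```python
from typing import Dict, Optional


def with_cdc_enabled(write_options: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the caller's writer options with CDC switched on."""
    merged = dict(write_options)
    merged["hoodie.table.cdc.enabled"] = "true"
    return merged


def cdc_read_options(begin: str, end: Optional[str] = None) -> Dict[str, str]:
    """Build reader options for an incremental query in CDC format."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.query.incremental.format": "cdc",
        "hoodie.datasource.read.begin.instanttime": begin,
    }
    if end is not None:
        options["hoodie.datasource.read.end.instanttime"] = end
    return options
```

With an existing Spark session, the read options could be passed as `spark.read.format("hudi").options(**cdc_read_options("202305150000")).load(path)`, where `path` is the caller's table location.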

CDC logs are written alongside the data files. Depending on the logging mode, the read path either takes the before and after images directly from the CDC log or reconstructs the missing images from the table's data files.

By providing richer change information, CDC enables use cases such as detailed fraud detection, where understanding every modification to account balances is critical.
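
To make the fraud-detection idea concrete, here is a small sketch in plain Python that scans change records for large balance movements. It assumes each CDC row carries an `op` field (`i`, `u`, or `d`) and `before` and `after` images encoded as JSON strings; that row shape, the `account_id` and `balance` fields, and the function `large_balance_changes` are all assumptions made for illustration, not something stated in the article.

```python
import json
from typing import Any, Dict, Iterable, List


def large_balance_changes(cdc_rows: Iterable[Dict[str, Any]], threshold: float) -> List[Dict[str, Any]]:
    """Flag updates whose balance moved by more than `threshold`.

    Inserts and deletes are skipped because they lack one of the two images.
    """
    flagged = []
    for row in cdc_rows:
        if row["op"] != "u" or not row.get("before") or not row.get("after"):
            continue
        before = json.loads(row["before"])
        after = json.loads(row["after"])
        delta = after["balance"] - before["balance"]
        if abs(delta) > threshold:
            flagged.append({"account_id": after["account_id"], "delta": delta})
    return flagged


# Hypothetical change records for illustration only.
rows = [
    {"op": "i", "before": None, "after": '{"account_id": "a1", "balance": 100.0}'},
    {"op": "u", "before": '{"account_id": "a1", "balance": 100.0}',
     "after": '{"account_id": "a1", "balance": 95.0}'},
    {"op": "u", "before": '{"account_id": "a2", "balance": 5000.0}',
     "after": '{"account_id": "a2", "balance": 200.0}'},
    {"op": "d", "before": '{"account_id": "a3", "balance": 10.0}', "after": None},
]

print(large_balance_changes(rows, threshold=1000.0))
# [{'account_id': 'a2', 'delta': -4800.0}]
```

A plain incremental query would only show the final balance of account `a2`; the before image is what makes the size of the change visible.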

Review

The article summarized the concepts of incremental processing and the Medallion architecture, detailed how Hudi implements incremental queries and CDC, and highlighted the business value of CDC for gaining deeper insights from data changes.
