
Apache Hudi Incremental Processing and Change Data Capture (CDC): Overview, Incremental Query, and CDC

This article explains Apache Hudi's incremental processing capabilities, covering an overview of the medallion architecture, detailed configuration for incremental queries, the introduction of Change Data Capture (CDC) with required table properties, and a review of how these features enable richer data insights in modern data lake environments.

DataFunSummit

Sep 26, 2024

Apache Hudi provides native support for incremental processing, which is essential for building up-to-date ELT pipelines in lakehouse architectures such as the Medallion model. The Medallion architecture consists of three layers (bronze, silver, and gold) that take data from raw ingestion, through cleaned and validated records, to business‑ready data.

Incremental Query

Incremental queries are enabled by setting the following configuration properties:

hoodie.datasource.query.type=incremental

hoodie.datasource.read.begin.instanttime=202305150000

hoodie.datasource.read.end.instanttime=202305160000  # optional
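In Spark, these properties are typically passed as read options. The following is a minimal PySpark-style sketch, assuming a Spark session with the Hudi bundle available; the helper names `incremental_read_options` and `read_incremental` and the `base_path` argument are hypothetical and exist only for illustration.

```python
from typing import Dict, Optional


def incremental_read_options(begin: str, end: Optional[str] = None) -> Dict[str, str]:
    """Build read options for a Hudi incremental query (illustrative helper)."""
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin,
    }
    if end is not None:
        # Optional upper bound; when omitted, the query reads up to the latest commit.
        opts["hoodie.datasource.read.end.instanttime"] = end
    return opts


def read_incremental(spark, base_path: str, begin: str, end: Optional[str] = None):
    """Load changed records as a DataFrame.

    `spark` is assumed to be a SparkSession with Hudi on its classpath;
    `base_path` is a placeholder for the table location.
    """
    options = incremental_read_options(begin, end)
    return spark.read.format("hudi").options(**options).load(base_path)
```

Keeping the option construction separate from the read call makes the configuration easy to inspect or unit test without a running cluster.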

Key points to note:

Setting hoodie.datasource.read.begin.instanttime=0 starts the query from the beginning of the table's history.

Omitting hoodie.datasource.read.end.instanttime retrieves all commits up to the current instant.

The result contains the records that were written within the specified time window. Each record appears once, in its latest committed version inside that window; intermediate versions are not returned.

If the start time is 0 and the end time is omitted, the incremental query behaves like a snapshot query, returning the latest state of all records.
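The key points above can be illustrated with a toy model in plain Python. This is not Hudi code: the timeline is a hypothetical list of `(instant_time, writes)` pairs, deletes are ignored, and the window is assumed to be exclusive at the begin instant and inclusive at the end instant.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-in for a table timeline: (instant_time, {record_key: value}).
Commit = Tuple[str, Dict[str, int]]


def incremental_view(
    commits: List[Commit], begin: str, end: Optional[str] = None
) -> Dict[str, int]:
    """Return the latest version of every record written in the window (begin, end]."""
    result: Dict[str, int] = {}
    for instant, writes in sorted(commits, key=lambda c: c[0]):
        if instant <= begin:
            continue  # assumed: the begin instant is exclusive
        if end is not None and instant > end:
            continue  # assumed: the end instant is inclusive
        result.update(writes)  # later commits overwrite earlier versions
    return result


timeline = [
    ("20230514120000", {"a": 1, "b": 1}),
    ("20230515120000", {"a": 2}),
    ("20230516120000", {"c": 3}),
]

# Only the commit inside the window contributes.
windowed = incremental_view(timeline, "20230515000000", "20230516000000")

# Begin at 0 with no end: every commit contributes, like a snapshot.
full = incremental_view(timeline, "0")
```

Here `windowed` holds only record `a` in its version from the middle commit, while `full` holds the latest version of `a`, `b`, and `c`.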

Internally, an incremental query first filters the timeline to the commits that fall inside the requested window, then identifies the files those commits touched via collectFileSplits(), and finally uses composeRDD() to read only the necessary data.
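The idea behind this workflow can be sketched conceptually in plain Python. The functions below are loosely named after the Hudi methods but are hypothetical toys, not Hudi's implementation; the file names and the commit-to-files mapping are invented for illustration. The sketch assumes a copy-on-write style layout in which a rewritten file still carries older rows, so rows are filtered again by their `_hoodie_commit_time` value.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]


def in_window(instant: str, begin: str, end: Optional[str]) -> bool:
    """Assumed window semantics: begin exclusive, end inclusive."""
    return instant > begin and (end is None or instant <= end)


def collect_file_splits(
    commit_to_files: Dict[str, List[str]], begin: str, end: Optional[str] = None
) -> List[str]:
    """Pick only the files written by commits inside the window."""
    files = set()
    for instant, written in commit_to_files.items():
        if in_window(instant, begin, end):
            files.update(written)
    return sorted(files)


def compose_rows(
    file_rows: Dict[str, List[Row]], files: List[str], begin: str, end: Optional[str] = None
) -> List[Row]:
    """Read the selected files and keep rows committed inside the window."""
    return [
        row
        for name in files
        for row in file_rows[name]
        if in_window(row["_hoodie_commit_time"], begin, end)
    ]


commit_to_files = {
    "20230514120000": ["f1_v1.parquet"],
    "20230515120000": ["f1_v2.parquet"],
    "20230516120000": ["f2_v1.parquet"],
}
file_rows = {
    "f1_v1.parquet": [
        {"key": "a", "_hoodie_commit_time": "20230514120000"},
        {"key": "b", "_hoodie_commit_time": "20230514120000"},
    ],
    "f1_v2.parquet": [
        {"key": "a", "_hoodie_commit_time": "20230515120000"},
        {"key": "b", "_hoodie_commit_time": "20230514120000"},  # carried over unchanged
    ],
    "f2_v1.parquet": [{"key": "c", "_hoodie_commit_time": "20230516120000"}],
}

splits = collect_file_splits(commit_to_files, "20230515000000", "20230516000000")
rows = compose_rows(file_rows, splits, "20230515000000", "20230516000000")
```

Only one of the three files is opened, and the carried-over row for `b` is dropped by the commit-time filter.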

Change Data Capture (CDC)

CDC, introduced in Hudi 0.13.0, extends incremental processing by providing before/after images of each record, allowing detection of inserts, updates, and deletes. To enable CDC, set the table property:

hoodie.table.cdc.enabled=true

CDC reads are performed by configuring the incremental query format:

hoodie.datasource.query.type=incremental

hoodie.datasource.query.incremental.format=cdc

hoodie.datasource.read.begin.instanttime=202305150000

hoodie.datasource.read.end.instanttime=202305160000  # optional
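As a minimal sketch, the CDC settings can be assembled the same way as any other Spark read options. The helper names `cdc_read_options` and `read_cdc`, the constant `CDC_TABLE_PROPERTIES`, and the `base_path` argument are hypothetical; the sketch assumes a Spark session with the Hudi bundle available and a table that was written with CDC enabled.

```python
from typing import Dict, Optional

# Table property that must be set on the table for CDC data to be recorded.
CDC_TABLE_PROPERTIES: Dict[str, str] = {"hoodie.table.cdc.enabled": "true"}


def cdc_read_options(begin: str, end: Optional[str] = None) -> Dict[str, str]:
    """Build read options for a CDC-format incremental query (illustrative helper)."""
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.query.incremental.format": "cdc",
        "hoodie.datasource.read.begin.instanttime": begin,
    }
    if end is not None:
        # Optional upper bound of the change window.
        opts["hoodie.datasource.read.end.instanttime"] = end
    return opts


def read_cdc(spark, base_path: str, begin: str, end: Optional[str] = None):
    """Load change records as a DataFrame.

    `spark` is assumed to be a SparkSession with Hudi on its classpath;
    `base_path` is a placeholder for the table location.
    """
    return spark.read.format("hudi").options(**cdc_read_options(begin, end)).load(base_path)
```

The only difference from a plain incremental read is the extra `hoodie.datasource.query.incremental.format` option.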

CDC logs are written alongside the data files. Depending on the logging mode, the read path either pulls the before/after fields directly from the CDC log or, when the log stores less information, reconstructs them by looking up the record in the table's data.
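To make the reconstruction path concrete, here is a toy sketch in plain Python that derives change events by comparing a record's state before and after a commit. It is not Hudi's read path: the `change_events` function, the event layout, and the operation codes `i`, `u`, and `d` are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional


def change_events(before: Dict[str, dict], after: Dict[str, dict]) -> List[dict]:
    """Compare two table states and emit one event per changed record."""
    events: List[dict] = []
    for key in sorted(set(before) | set(after)):
        old: Optional[dict] = before.get(key)
        new: Optional[dict] = after.get(key)
        if old is None:
            op = "i"  # record exists only after the commit: insert
        elif new is None:
            op = "d"  # record exists only before the commit: delete
        elif old != new:
            op = "u"  # record exists in both states with different values: update
        else:
            continue  # unchanged records produce no event
        events.append({"op": op, "key": key, "before": old, "after": new})
    return events


state_before = {"acct1": {"balance": 100}, "acct2": {"balance": 50}}
state_after = {"acct1": {"balance": 40}, "acct3": {"balance": 10}}
events = change_events(state_before, state_after)
```

When the before and after images are already stored in the log, this comparison step is unnecessary, which trades extra storage on write for cheaper reads.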

By providing richer change information, CDC enables use cases such as detailed fraud detection, where understanding every modification to account balances is critical.
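A hypothetical consumer for this use case might look like the sketch below. The event layout (`op`, `key`, `before`, `after`), the `balance` field, and the threshold are all assumptions for illustration and do not come from the original talk.

```python
from typing import List, Tuple


def flag_large_drops(events: List[dict], threshold: int) -> List[Tuple[str, int]]:
    """Return (account, drop) pairs for updates whose balance fell by at least `threshold`."""
    flagged: List[Tuple[str, int]] = []
    for event in events:
        if event["op"] != "u":
            continue  # only updates carry both a before and an after image
        drop = event["before"]["balance"] - event["after"]["balance"]
        if drop >= threshold:
            flagged.append((event["key"], drop))
    return flagged


sample_events = [
    {"op": "u", "key": "acct1", "before": {"balance": 1000}, "after": {"balance": 100}},
    {"op": "u", "key": "acct2", "before": {"balance": 500}, "after": {"balance": 480}},
    {"op": "i", "key": "acct3", "before": None, "after": {"balance": 10}},
]
suspicious = flag_large_drops(sample_events, threshold=500)
```

A plain incremental query would expose only the final balance of each account, so the size of each individual change could not be computed this way.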

Review

The article summarized the concepts of incremental processing and the Medallion architecture, detailed how Hudi implements incremental queries and CDC, and highlighted the business value of CDC for gaining deeper insights from data changes.
