Apache Hudi Write Process: From Zero to One – Part 3 (Understanding Write Flow and Operations)
This article explains the complete Apache Hudi write pipeline, detailing each step from client creation to commit, and describes the various write operations such as Upsert, Insert, Bulk Insert, Delete, Delete Partition, and Insert‑Overwrite, providing a comprehensive overview for data‑lake practitioners.
This is Part 3 of the "Apache Hudi: From Zero to One" series, covering the write flow and write operations. It is a translation of the original English blog post.
Main contents include an overview of the overall write flow, detailed write operations, and a final review.
Guest: Xu Shiyan, Onehouse open‑source project lead. Editor: Liu Jinhui. Community: DataFun.
Overall Write Flow
The diagram below shows the typical high‑level steps involved in a Hudi write operation within the execution engine context.
1. Create Write Client
The Hudi write client is the entry point for write operations. Instances are engine‑specific, e.g., SparkRDDWriteClient for Spark, HoodieFlinkWriteClient for Flink, and HoodieJavaWriteClient for Kafka Connect. User‑provided configurations are merged with existing table properties and passed to the client.
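To make the configuration merging concrete, here is a minimal sketch of the kind of write options a Spark user supplies (option names are from the Hudi configuration reference; the table name and field names are hypothetical). The write client merges options like these with properties already stored in the table.

```python
# Illustrative Hudi write options for the Spark datasource path.
# The write client merges user-provided options such as these with
# the table's existing properties before the write begins.
hudi_write_options = {
    "hoodie.table.name": "trips",                           # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",      # record key column
    "hoodie.datasource.write.partitionpath.field": "city",  # partition path column
    "hoodie.datasource.write.precombine.field": "ts",       # ordering field for dedup
    "hoodie.datasource.write.operation": "upsert",          # write operation type
}
```

In Spark these would typically be passed as `df.write.format("hudi").options(**hudi_write_options)` before saving to the table path.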
2. Convert Input
Before processing, the input data undergoes multiple transformations, constructing HoodieRecord objects and adapting their structure. The HoodieKey (recordKey + partitionPath) uniquely identifies a record and is populated via the KeyGenerator API.
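A simplified sketch of what a key generator does, assuming one key field and one partition-path field (Hudi's real KeyGenerator API is richer, supporting composite keys, timestamp-based partitioning, and more):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HoodieKey:
    """Uniquely identifies a record: record key + partition path."""
    record_key: str
    partition_path: str

def simple_key_generator(record: dict, key_field: str, partition_field: str) -> HoodieKey:
    # Mimics a simple key generator: extract one key column and
    # one partition-path column from the incoming record.
    return HoodieKey(str(record[key_field]), str(record[partition_field]))

key = simple_key_generator({"uuid": "r1", "city": "sf", "ts": 10}, "uuid", "city")
print(key)  # HoodieKey(record_key='r1', partition_path='sf')
```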
3. Start Commit
The client checks the table timeline for failed operations and rolls them back if necessary, then creates a "requested" commit action on the timeline before starting the write.
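The rollback check can be modeled simply: an instant that was requested or inflight but never completed is a candidate for rollback. This is a simplified model of the timeline check; real Hudi also consults heartbeats and the configured rollback strategy.

```python
def find_failed_instants(timeline):
    # timeline: list of (instant_time, latest_state) pairs.
    # Instants that never reached "completed" are rollback candidates
    # before a new commit starts (simplified model).
    completed = {ts for ts, state in timeline if state == "completed"}
    return [ts for ts, state in timeline
            if state in ("requested", "inflight") and ts not in completed]

timeline = [("001", "completed"), ("002", "inflight"), ("003", "requested")]
print(find_failed_instants(timeline))  # ['002', '003']
```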
4. Prepare Records
The provided HoodieRecords may be deduplicated and tagged by the index, depending on user configuration and operation type. If deduplication is enabled, records sharing the same key are merged into one; if index lookup is enabled, each record is tagged with its current location in the table, when one exists.
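A simplified version of the pre-write deduplication: per key, keep the record with the largest precombine (ordering) value. Real Hudi delegates the merge to the configured payload/merger class, but the default ordering-field behavior looks like this:

```python
def deduplicate(records, key_fn, precombine_field):
    # Keep, per key, the record with the largest precombine value
    # (a simplified model of Hudi's pre-write deduplication).
    latest = {}
    for r in records:
        k = key_fn(r)
        if k not in latest or r[precombine_field] > latest[k][precombine_field]:
            latest[k] = r
    return list(latest.values())

records = [
    {"uuid": "r1", "ts": 1, "fare": 10.0},
    {"uuid": "r1", "ts": 2, "fare": 12.5},  # newer version of r1 wins
    {"uuid": "r2", "ts": 1, "fare": 7.0},
]
deduped = deduplicate(records, lambda r: r["uuid"], "ts")
print(deduped)  # r1 (ts=2) and r2 survive
```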
5. Partition Records
This pre‑write step determines which file groups each record belongs to, assigning them to update or insert buckets. Each bucket corresponds to an RDD partition (as in Spark).
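The bucketing logic can be sketched as follows: records whose index lookup found a current file group location become updates against that file group, and the rest become inserts. Small-file handling and bucket sizing are omitted from this sketch.

```python
def partition_records(tagged_records):
    # tagged_records: list of (record, location) pairs, where location
    # is the record's current file group id (or None if untagged).
    # Tagged records go to an update bucket keyed by file group;
    # untagged records go to the insert bucket (simplified model).
    buckets = {"update": {}, "insert": []}
    for record, location in tagged_records:
        if location is not None:
            buckets["update"].setdefault(location, []).append(record)
        else:
            buckets["insert"].append(record)
    return buckets

tagged = [({"uuid": "r1"}, "fg-1"), ({"uuid": "r2"}, None)]
buckets = partition_records(tagged)
```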
6. Write to Storage
Physical data files are created or appended using write handles. Marker files may also be created under .hoodie/.temp/ to indicate the type of write operation, aiding efficient rollback and conflict resolution.
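As an illustration of the marker layout, direct markers live under .hoodie/.temp/ and encode the I/O type of the write (timeline-server-based markers store the same information differently). This sketch assumes the direct-marker path convention; file names here are hypothetical.

```python
def marker_path(instant_time, partition, data_file, io_type):
    # Build a direct-marker path under .hoodie/.temp/. The I/O type
    # records what kind of write produced the file, which helps
    # rollback and conflict resolution (assumed convention).
    assert io_type in {"CREATE", "MERGE", "APPEND"}
    return f".hoodie/.temp/{instant_time}/{partition}/{data_file}.marker.{io_type}"

p = marker_path("20240101120000", "city=sf", "fg1_0-1-2_20240101120000.parquet", "CREATE")
print(p)
```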
7. Update Index
After data is persisted, the index may be immediately updated to ensure read‑write correctness, especially for index types that are not updated synchronously during the write (e.g., HBase index).
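A toy model of that explicit post-write index update: record each written key's new file location so later lookups can find it. This is a simplification of what, for example, the HBase index does in its dedicated update step.

```python
def update_index(index, write_statuses):
    # index: mapping of record key -> file group id.
    # write_statuses: per-file results of the write, here modeled as
    # dicts with the file id and the keys written into it (simplified).
    for ws in write_statuses:
        for record_key in ws["written_keys"]:
            index[record_key] = ws["file_id"]
    return index

index = update_index({}, [{"file_id": "fg-1", "written_keys": ["r1", "r2"]}])
```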
8. Commit Changes
The client performs several tasks to finalize the transaction: running pre-commit validation, checking for conflicts with concurrent writers, aggregating WriteStatus results into commit metadata, persisting that metadata to the timeline, and reconciling the marker files.
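The finalization step can be sketched as: fail the commit if any write reported errors, otherwise aggregate the results into metadata and complete the instant on the timeline. Conflict resolution and marker reconciliation are omitted from this sketch.

```python
def commit(timeline, instant_time, write_statuses):
    # Fail fast if any write errored (a stand-in for pre-commit
    # validation), then mark the instant completed with aggregated
    # metadata (simplified model of finalizing a Hudi commit).
    if any(ws["errors"] for ws in write_statuses):
        raise RuntimeError("pre-commit validation failed")
    metadata = {"files": [ws["file_id"] for ws in write_statuses]}
    timeline[instant_time] = ("completed", metadata)
    return timeline

tl = commit({}, "20240101120000", [{"file_id": "fg-1", "errors": []}])
```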
Write Operations
1. Upsert
The client starts a commit, prepares records (deduplication and indexing), partitions them into update and insert buckets, writes them using appropriate handles (merge for updates, create for inserts), and finally aggregates WriteStatus results to generate commit metadata.
2. Insert & Bulk Insert
Insert follows the same flow as Upsert but skips the indexing step, making it faster but potentially leaving duplicates. Bulk Insert also skips small‑file handling and can use row‑based writing for higher throughput.
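For reference, the two operations are selected via the datasource operation option (option names from the Hudi configuration reference; whether row-based writing is enabled by default depends on the Hudi version):

```python
# Selecting Insert vs. Bulk Insert via Hudi write options.
insert_opts = {"hoodie.datasource.write.operation": "insert"}

bulk_insert_opts = {
    "hoodie.datasource.write.operation": "bulk_insert",
    # Row-based writing avoids intermediate record conversion for
    # higher throughput on the bulk-insert path.
    "hoodie.datasource.write.row.writer.enable": "true",
}
```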
3. Delete
Delete is a special case of Upsert in which the input records are reduced to HoodieKeys only; the operation performs hard deletes, removing the matching records from subsequent FileSlices.
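The effect on a rewritten file slice can be sketched as a filter: the new slice simply omits records whose keys were marked for deletion (a simplified model; real Hudi performs this inside the merge handle).

```python
def apply_hard_deletes(file_slice_records, delete_keys):
    # Rewrite a file slice without the deleted keys -- the next
    # FileSlice no longer contains them (simplified model).
    delete_keys = set(delete_keys)
    return [r for r in file_slice_records if r["uuid"] not in delete_keys]

remaining = apply_hard_deletes(
    [{"uuid": "r1"}, {"uuid": "r2"}, {"uuid": "r3"}], ["r2"])
print(remaining)  # r1 and r3 survive; r2 is hard-deleted
```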
4. Delete Partition
Instead of input records, a list of physical partition paths is provided via hoodie.datasource.write.partitions.to.delete. The operation records a .replacecommit on the timeline, marking all file groups in those partitions as deleted.
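A configuration sketch for deleting whole partitions (option names from the Hudi configuration reference; the partition values are hypothetical):

```python
# Delete Partition takes no input records -- only the partition paths
# to drop, as a comma-separated list.
delete_partition_opts = {
    "hoodie.datasource.write.operation": "delete_partition",
    "hoodie.datasource.write.partitions.to.delete": "city=sf,city=la",
}
```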
5. Insert Overwrite & Insert Overwrite Table
Insert Overwrite rewrites affected partitions by marking existing file groups as deleted and writing new ones. Insert Overwrite Table applies the same logic to all partitions of the table.
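The two overwrite variants are likewise selected via the operation option (values from the Hudi configuration reference):

```python
# Overwrite only the partitions touched by the input records:
overwrite_opts = {"hoodie.datasource.write.operation": "insert_overwrite"}

# Overwrite every partition of the table:
overwrite_table_opts = {"hoodie.datasource.write.operation": "insert_overwrite_table"}
```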
Review
This article explored the main steps of the Hudi write path, delved into the CoW Upsert flow, explained record partitioning logic, and covered all other write operations. For more information, follow the Apache Hudi community.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.