Apache Hudi Write Process: From Zero to One – Part 3 (Understanding Write Flow and Operations)
This article explains the complete Apache Hudi write pipeline, detailing each step from client creation to commit, and describes the various write operations such as Upsert, Insert, Bulk Insert, Delete, Delete Partition, and Insert‑Overwrite, providing a comprehensive overview for data‑lake practitioners.
This is Part 3 of the "Apache Hudi: From Zero to One" series, covering the write flow and write operations. It is a translation of the original English blog post.
Main contents include an overview of the overall write flow, detailed write operations, and a final review.
Guest: Xu Shiyan, Onehouse open‑source project lead. Editor: Liu Jinhui. Community: DataFun.
Overall Write Flow
The diagram below shows the typical high‑level steps involved in a Hudi write operation within the execution engine context.
1. Create Write Client
The Hudi write client is the entry point for write operations. Instances are engine‑specific, e.g., SparkRDDWriteClient for Spark, HoodieFlinkWriteClient for Flink, and HoodieJavaWriteClient for Kafka Connect. User‑provided configurations are merged with existing table properties and passed to the client.
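To make the configuration merging concrete, here is a minimal sketch of the kind of write options a Spark user supplies (option names are from the Hudi configuration reference; the table name and field names are hypothetical). The write client merges options like these with properties already stored in the table.

```python
# Illustrative Hudi write options for the Spark datasource path.
# The write client merges user-provided options such as these with
# the table's existing properties before the write begins.
hudi_write_options = {
    "hoodie.table.name": "trips",                           # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",      # record key column
    "hoodie.datasource.write.partitionpath.field": "city",  # partition path column
    "hoodie.datasource.write.precombine.field": "ts",       # ordering field for dedup
    "hoodie.datasource.write.operation": "upsert",          # write operation type
}
```

In Spark these would typically be passed as `df.write.format("hudi").options(**hudi_write_options)` before saving to the table path.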
2. Convert Input
Before processing, the input data undergoes multiple transformations, constructing HoodieRecord objects and adapting their structure. The HoodieKey (recordKey + partitionPath) uniquely identifies a record and is populated via the KeyGenerator API.
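A simplified sketch of what a key generator does, assuming one key field and one partition-path field (Hudi's real KeyGenerator API is richer, supporting composite keys, timestamp-based partitioning, and more):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HoodieKey:
    """Uniquely identifies a record: record key + partition path."""
    record_key: str
    partition_path: str

def simple_key_generator(record: dict, key_field: str, partition_field: str) -> HoodieKey:
    # Mimics a simple key generator: extract one key column and
    # one partition-path column from the incoming record.
    return HoodieKey(str(record[key_field]), str(record[partition_field]))

key = simple_key_generator({"uuid": "r1", "city": "sf", "ts": 10}, "uuid", "city")
print(key)  # HoodieKey(record_key='r1', partition_path='sf')
```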
3. Start Commit
The client checks the table timeline for failed operations and rolls them back if necessary, then creates a "requested" commit action on the timeline before starting the write.
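The rollback check can be modeled simply: an instant that was requested or inflight but never completed is a candidate for rollback. This is a simplified model of the timeline check; real Hudi also consults heartbeats and the configured rollback strategy.

```python
def find_failed_instants(timeline):
    # timeline: list of (instant_time, latest_state) pairs.
    # Instants that never reached "completed" are rollback candidates
    # before a new commit starts (simplified model).
    completed = {ts for ts, state in timeline if state == "completed"}
    return [ts for ts, state in timeline
            if state in ("requested", "inflight") and ts not in completed]

timeline = [("001", "completed"), ("002", "inflight"), ("003", "requested")]
print(find_failed_instants(timeline))  # ['002', '003']
```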
4. Prepare Records
The provided HoodieRecords may be deduplicated and tagged by the index, depending on user configuration and operation type. If deduplication is enabled, records sharing the same key are merged into one; if index lookup is enabled, each record is tagged with its current location in the table, when one exists.
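A simplified version of the pre-write deduplication: per key, keep the record with the largest precombine (ordering) value. Real Hudi delegates the merge to the configured payload/merger class, but the default ordering-field behavior looks like this:

```python
def deduplicate(records, key_fn, precombine_field):
    # Keep, per key, the record with the largest precombine value
    # (a simplified model of Hudi's pre-write deduplication).
    latest = {}
    for r in records:
        k = key_fn(r)
        if k not in latest or r[precombine_field] > latest[k][precombine_field]:
            latest[k] = r
    return list(latest.values())

records = [
    {"uuid": "r1", "ts": 1, "fare": 10.0},
    {"uuid": "r1", "ts": 2, "fare": 12.5},  # newer version of r1 wins
    {"uuid": "r2", "ts": 1, "fare": 7.0},
]
deduped = deduplicate(records, lambda r: r["uuid"], "ts")
print(deduped)  # r1 (ts=2) and r2 survive
```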
5. Partition Records
This pre‑write step determines which file groups each record belongs to, assigning them to update or insert buckets. Each bucket corresponds to an RDD partition (as in Spark).
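The bucketing logic can be sketched as follows: records whose index lookup found a current file group location become updates against that file group, and the rest become inserts. Small-file handling and bucket sizing are omitted from this sketch.

```python
def partition_records(tagged_records):
    # tagged_records: list of (record, location) pairs, where location
    # is the record's current file group id (or None if untagged).
    # Tagged records go to an update bucket keyed by file group;
    # untagged records go to the insert bucket (simplified model).
    buckets = {"update": {}, "insert": []}
    for record, location in tagged_records:
        if location is not None:
            buckets["update"].setdefault(location, []).append(record)
        else:
            buckets["insert"].append(record)
    return buckets

tagged = [({"uuid": "r1"}, "fg-1"), ({"uuid": "r2"}, None)]
buckets = partition_records(tagged)
```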
6. Write to Storage
Physical data files are created or appended using write handles. Marker files may also be created under .hoodie/.temp/ to indicate the type of write operation, aiding efficient rollback and conflict resolution.
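As an illustration of the marker layout, direct markers live under .hoodie/.temp/ and encode the I/O type of the write (timeline-server-based markers store the same information differently). This sketch assumes the direct-marker path convention; file names here are hypothetical.

```python
def marker_path(instant_time, partition, data_file, io_type):
    # Build a direct-marker path under .hoodie/.temp/. The I/O type
    # records what kind of write produced the file, which helps
    # rollback and conflict resolution (assumed convention).
    assert io_type in {"CREATE", "MERGE", "APPEND"}
    return f".hoodie/.temp/{instant_time}/{partition}/{data_file}.marker.{io_type}"

p = marker_path("20240101120000", "city=sf", "fg1_0-1-2_20240101120000.parquet", "CREATE")
print(p)
```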
7. Update Index
After data is persisted, the index may be immediately updated to ensure read‑write correctness, especially for index types that are not updated synchronously during the write (e.g., HBase index).
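A toy model of that explicit post-write index update: record each written key's new file location so later lookups can find it. This is a simplification of what, for example, the HBase index does in its dedicated update step.

```python
def update_index(index, write_statuses):
    # index: mapping of record key -> file group id.
    # write_statuses: per-file results of the write, here modeled as
    # dicts with the file id and the keys written into it (simplified).
    for ws in write_statuses:
        for record_key in ws["written_keys"]:
            index[record_key] = ws["file_id"]
    return index

index = update_index({}, [{"file_id": "fg-1", "written_keys": ["r1", "r2"]}])
```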
8. Commit Changes
The client performs several tasks to finalize the transaction: running pre-commit validation, checking for conflicts with concurrent writers, aggregating WriteStatus results into commit metadata, persisting that metadata to the timeline, and reconciling the marker files.
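The finalization step can be sketched as: fail the commit if any write reported errors, otherwise aggregate the results into metadata and complete the instant on the timeline. Conflict resolution and marker reconciliation are omitted from this sketch.

```python
def commit(timeline, instant_time, write_statuses):
    # Fail fast if any write errored (a stand-in for pre-commit
    # validation), then mark the instant completed with aggregated
    # metadata (simplified model of finalizing a Hudi commit).
    if any(ws["errors"] for ws in write_statuses):
        raise RuntimeError("pre-commit validation failed")
    metadata = {"files": [ws["file_id"] for ws in write_statuses]}
    timeline[instant_time] = ("completed", metadata)
    return timeline

tl = commit({}, "20240101120000", [{"file_id": "fg-1", "errors": []}])
```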
Write Operations
1. Upsert
The client starts a commit, prepares records (deduplication and indexing), partitions them into update and insert buckets, writes them using appropriate handles (merge for updates, create for inserts), and finally aggregates WriteStatus results to generate commit metadata.
2. Insert & Bulk Insert
Insert follows the same flow as Upsert but skips the indexing step, making it faster but potentially leaving duplicates. Bulk Insert also skips small‑file handling and can use row‑based writing for higher throughput.
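For reference, the two operations are selected via the datasource operation option (option names from the Hudi configuration reference; whether row-based writing is enabled by default depends on the Hudi version):

```python
# Selecting Insert vs. Bulk Insert via Hudi write options.
insert_opts = {"hoodie.datasource.write.operation": "insert"}

bulk_insert_opts = {
    "hoodie.datasource.write.operation": "bulk_insert",
    # Row-based writing avoids intermediate record conversion for
    # higher throughput on the bulk-insert path.
    "hoodie.datasource.write.row.writer.enable": "true",
}
```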
3. Delete
Delete is a special case of Upsert in which the input records are reduced to HoodieKeys only; the operation performs hard deletes, removing the matching records from subsequent FileSlices.
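The effect on a rewritten file slice can be sketched as a filter: the new slice simply omits records whose keys were marked for deletion (a simplified model; real Hudi performs this inside the merge handle).

```python
def apply_hard_deletes(file_slice_records, delete_keys):
    # Rewrite a file slice without the deleted keys -- the next
    # FileSlice no longer contains them (simplified model).
    delete_keys = set(delete_keys)
    return [r for r in file_slice_records if r["uuid"] not in delete_keys]

remaining = apply_hard_deletes(
    [{"uuid": "r1"}, {"uuid": "r2"}, {"uuid": "r3"}], ["r2"])
print(remaining)  # r1 and r3 survive; r2 is hard-deleted
```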
4. Delete Partition
Instead of input records, a list of physical partition paths is provided via hoodie.datasource.write.partitions.to.delete. The operation records a .replacecommit on the timeline, marking all file groups in those partitions as deleted.
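A configuration sketch for deleting whole partitions (option names from the Hudi configuration reference; the partition values are hypothetical):

```python
# Delete Partition takes no input records -- only the partition paths
# to drop, as a comma-separated list.
delete_partition_opts = {
    "hoodie.datasource.write.operation": "delete_partition",
    "hoodie.datasource.write.partitions.to.delete": "city=sf,city=la",
}
```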
5. Insert Overwrite & Insert Overwrite Table
Insert Overwrite rewrites affected partitions by marking existing file groups as deleted and writing new ones. Insert Overwrite Table applies the same logic to all partitions of the table.
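The two overwrite variants are likewise selected via the operation option (values from the Hudi configuration reference):

```python
# Overwrite only the partitions touched by the input records:
overwrite_opts = {"hoodie.datasource.write.operation": "insert_overwrite"}

# Overwrite every partition of the table:
overwrite_table_opts = {"hoodie.datasource.write.operation": "insert_overwrite_table"}
```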
Review
This article explored the main steps of the Hudi write path, delved into the CoW Upsert flow, explained record partitioning logic, and covered all other write operations. For more information, follow the Apache Hudi community.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.