
Understanding Hudi: Enabling Record‑Level Updates in Data Lakes

The article explains how Hudi enables efficient record‑level updates in data lakes by adapting database update concepts such as copy‑on‑write and merge‑on‑read, contrasting them with traditional RDBMS and NoSQL storage mechanisms and their trade‑offs.

Big Data Technology Architecture

Before diving into Hudi's mechanism, let us first understand the problem Hudi is solving.

Customers using a data lake often ask: when a source record is updated, how can the data lake be updated? This is a difficult problem because once you write CSV or Parquet files, the only option is to rewrite them; there is no simple mechanism to open these files, locate a record, and update it with the latest value from the source. The problem becomes more severe when multiple layers of datasets exist in the lake, as each dataset's output serves as the input for the next computation.

In a database, a user can simply issue an update‑record command to accomplish the task, so from a database mindset it is hard to understand the above limitation—why can’t the same be done in a data lake?

RDBMS Update Principle

RDBMS stores data in a B‑Tree storage model, with data placed in data pages that can be located via indexes built on table columns. When an update command is issued, the RDBMS engine finds the exact page containing the record and updates the data within that page. This is a simplified description; most modern RDBMS engines have additional complexities such as multi‑version concurrency control, but the basic idea remains the same.

The figure below illustrates how a B‑Tree index finds the data page containing the value 13; the leaf nodes (third level) represent data pages, while the upper levels (first and second) contain index values.
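The index lookup described above can be sketched in a few lines. This is a deliberately toy, single-level version of a B-Tree (real engines use multi-level trees and fixed-size pages); the page contents and keys are made up for illustration:

```python
from bisect import bisect_right

# Toy B-Tree idea: an "index" of page boundary keys lets us jump straight
# to the data page holding a key, then update that page in place.
pages = [
    {1: "a", 5: "b", 9: "c"},      # page 0 holds keys 1..9
    {10: "d", 13: "e", 17: "f"},   # page 1 holds keys 10..17
    {20: "g", 25: "h"},            # page 2 holds keys 20..25
]
fences = [min(p) for p in pages]   # smallest key on each page (the "index")

def update(key, value):
    page = pages[bisect_right(fences, key) - 1]  # index lookup -> one page
    page[key] = value                            # in-place update of that page

update(13, "updated")
print(pages[1][13])  # -> updated; pages 0 and 2 are untouched
```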

Next, consider how some NoSQL databases (e.g., Cassandra) handle updates:

Many NoSQL databases store data using an LSM‑Tree model, which is a log‑based storage model. New data (inserts/updates/deletes) are appended to an append‑only log, and the log is periodically merged back into data files so that the files stay up‑to‑date with all changes. This merge process is called compaction. When a record is updated, it is simply written to the append‑only log; the database engine then combines the log and data files to serve read queries. This is also a simplified description, but the core idea is similar.

The diagram below shows how new and updated data are added to the append‑only log (level 0) and eventually merged into larger files (levels 1 and 2).
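The append-then-merge cycle can be shown with a minimal sketch. This is not how any particular database implements it (real LSM engines use memtables, SSTables, and leveled compaction); the key/value names are illustrative:

```python
# Toy LSM-style store: updates go to an append-only log (level 0); reads
# overlay the log on the data files; compaction folds the log into the files.
base = {"k1": "v1", "k2": "v2"}   # "data files" (the larger merged levels)
log = []                          # append-only log (level 0)

def update(key, value):
    log.append((key, value))      # an update is just an append

def read(key):
    for k, v in reversed(log):    # newest log entry wins
        if k == key:
            return v
    return base.get(key)

def compact():
    for k, v in log:              # merge the log back into the data files
        base[k] = v
    log.clear()

update("k1", "v1-new")
print(read("k1"))   # -> v1-new, served by combining log and data files
compact()
print(base["k1"])   # -> v1-new, now in the data files themselves
```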

Now that we have a basic understanding of how databases handle record‑level updates, let us see how Hudi works. Before Hudi (and similar frameworks like Delta Lake) appeared, the only way to apply updates to a data lake was to recompute and rewrite the entire CSV/Parquet file. As mentioned earlier, there is no simple mechanism to open a file and update a single record. Several reasons cause this limitation: we do not know which file contains the record to be updated, there is no efficient way to scan a file to find the desired record, and columnar formats such as Parquet cannot be updated in place—they must be recreated. Moreover, a data lake often contains multiple transformed layers, where a set of files is fed into the next set of computations, making it almost impossible to manage these dependencies during a single‑record update.

Hudi

The basic idea of the Hudi framework is to adopt the database update concept and apply it to a data lake. Hudi provides two "update" mechanisms:

Copy on Write (COW) – similar to RDBMS B‑Tree updates

Merge on Read (MOR) – similar to NoSQL LSM‑Tree updates
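In Hudi's Spark datasource, the choice between the two mechanisms is made at write time through a table-type option. The option keys below are Hudi's documented configuration names; the table name and field names are made up for illustration, so treat this as a hedged sketch rather than a complete write job:

```python
# Hudi Spark datasource options selecting COW vs MOR (option keys are Hudi's;
# "orders", "order_id", and "updated_at" are hypothetical).
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # or "MERGE_ON_READ"
}
# With a SparkSession, this would typically be applied as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```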

In addition, Hudi maintains the following:

Mapping of data records to files (similar to database indexes)

Tracking of the latest commit for each logical table in the lake

Ability to identify individual records in files based on a "record_key", which is required for all Hudi datasets and is analogous to a primary key in a database table

Hudi uses the above mechanisms together with a "precombine_key" to prevent duplicates: when several incoming records share the same record_key, the record with the latest precombine_key value wins.
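The precombine behavior can be sketched as a simple reduction over an incoming batch. The field names (`order_id`, `updated_at`) are hypothetical; this is a conceptual sketch, not Hudi's actual implementation:

```python
# Sketch of precombine: among records sharing a record_key, the one with
# the largest precombine_key value is kept.
incoming = [
    {"order_id": 1, "updated_at": 100, "status": "created"},
    {"order_id": 1, "updated_at": 300, "status": "shipped"},   # latest wins
    {"order_id": 1, "updated_at": 200, "status": "paid"},
]

def precombine(records, record_key, precombine_key):
    winners = {}
    for r in records:
        k = r[record_key]
        if k not in winners or r[precombine_key] > winners[k][precombine_key]:
            winners[k] = r
    return list(winners.values())

print(precombine(incoming, "order_id", "updated_at"))
# -> [{'order_id': 1, 'updated_at': 300, 'status': 'shipped'}]
```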

Copy on Write

In this model, when a record is updated, Hudi finds the file containing the record and rewrites only that file with the updated values, while all other files remain unchanged. Because just one file is rewritten instead of the whole dataset, updates are far cheaper than a full recomputation, and read queries are fast: they simply read the latest data files to see the most recent updates. This model suits workloads where read performance matters most. Its drawback is that sudden write bursts can cause many files to be rewritten, leading to heavy write amplification.
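The copy-on-write flow, combined with the record-to-file mapping mentioned earlier, can be sketched as follows. File names and keys are made up; real Hudi tracks this mapping in its index and metadata, not in a plain dict:

```python
# Copy-on-Write sketch: a record-to-file mapping pinpoints the one file that
# holds the updated key; only that file is rewritten, the rest are untouched.
files = {
    "file_a": {1: "x", 2: "y"},
    "file_b": {3: "z", 4: "w"},
}
key_to_file = {k: f for f, recs in files.items() for k in recs}

def cow_update(key, value):
    target = key_to_file[key]         # "index": which file holds this key?
    new_file = dict(files[target])    # copy the old file...
    new_file[key] = value             # ...apply the update...
    files[target] = new_file          # ...and swap in the rewritten file

old_file_b = files["file_b"]
cow_update(1, "x-new")
print(files["file_a"][1])            # -> x-new
print(files["file_b"] is old_file_b) # -> True: other files were not rewritten
```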

Merge on Read

In this model, when a record is updated, Hudi appends the change to a log associated with the table. As more writes occur, they are also appended to the log. Queries read data by merging the log with the base data files, or, if configured, they may read only the base files. If a user wants real‑time data, the query reads from the log; otherwise, for a read‑optimized table, it reads from the base files (which may be slightly stale). Hudi periodically compacts the log into the data files to keep them up‑to‑date, a process that can be scheduled based on use‑case requirements.
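The merge-on-read flow, including the real-time versus read-optimized distinction, can be sketched like this. Again a conceptual toy, not Hudi's implementation; names are illustrative:

```python
# Merge-on-Read sketch: updates land in a per-table log; a real-time read
# merges the log with base files, a read-optimized read sees only base files.
base_files = {1: "v1", 2: "v2"}
delta_log = []

def mor_update(key, value):
    delta_log.append((key, value))        # cheap append, no file rewrite

def read(key, realtime=True):
    if realtime:
        for k, v in reversed(delta_log):  # merge log with base files
            if k == key:
                return v
    return base_files.get(key)            # read-optimized: possibly stale

def compact():
    for k, v in delta_log:                # scheduled merge into base files
        base_files[k] = v
    delta_log.clear()

mor_update(1, "v1-new")
print(read(1))                  # -> v1-new  (real-time view)
print(read(1, realtime=False))  # -> v1      (read-optimized, slightly stale)
compact()
print(read(1, realtime=False))  # -> v1-new  (fresh after compaction)
```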

If your data lake contains multiple layers of datasets, and each layer’s output feeds the next computation, then as long as all those datasets are Hudi tables, record‑level updates can propagate automatically across the layers without needing to rewrite entire datasets.

All of the above concepts also apply to inserts and deletes. For deletions, Hudi offers soft-delete and hard-delete options: a soft delete retains the record key and nulls out the remaining fields, marking the record as deleted, while a hard delete physically removes the record, discarding both the key and the data.
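The difference between the two delete styles can be shown in miniature. The table layout and field names are hypothetical; this only illustrates the resulting state, not Hudi's delete API:

```python
# Sketch of the two delete styles: soft delete keeps the record key and nulls
# the payload fields; hard delete removes the record from the table entirely.
table = {
    "k1": {"status": "active", "amount": 10},
    "k2": {"status": "active", "amount": 20},
}

def soft_delete(key):
    table[key] = {field: None for field in table[key]}  # key survives

def hard_delete(key):
    del table[key]                                      # key and data gone

soft_delete("k1")
hard_delete("k2")
print(table)  # -> {'k1': {'status': None, 'amount': None}}
```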

Recommended Reading

Data Lake | Multi‑Engine Integration to Extract Value from Lake Data

Kylin Practice at Beike Zhaofang and HBase Optimization

NetEase Cloud Music Real‑Time Data Warehouse Practice with Flink + Kafka

Tags: Big Data · Data Lake · Hudi · Copy-on-Write · Merge-on-Read · Record Update
Written by Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies