Big Data 9 min read

Understanding Apache Hudi Table Types: Copy On Write (COW) vs Merge On Read (MOR)

This article explains Apache Hudi's two table formats—Copy On Write and Merge On Read—by introducing key terminology, describing their file structures and versioning, comparing write and read latency, I/O cost, and write amplification, and concluding with guidance on choosing the appropriate format.

Big Data Technology Architecture

Oct 26, 2021

Understanding Apache Hudi Table Types: Copy On Write (COW) vs Merge On Read (MOR)

1. Abstract

Apache Hudi provides different table types to suit various use cases, namely Copy On Write (COW) and Merge On Read (MOR).

2. Terminology Introduction

Before diving into COW and MOR, it is helpful to understand several Hudi-specific terms.

2.1 Data File / Base File

Hudi stores data in columnar formats such as Parquet or ORC, referred to as data files or base files. These formats are industry‑standard and the terms are interchangeable.

2.2 Incremental Log File

In MOR tables, updates are written to incremental log files in Avro format. Each log file is associated with a base file; for example, updates to data_file_1 are recorded in a new log file that is merged with the base file at read time.

2.3 File Group (FileGroup)

A file group consists of a data file and its corresponding incremental log files. In COW tables each group contains only a base file, making the structure simpler.

2.4 File Version

When a data file is updated, a new version of that file is created that merges old records with the newly incoming ones.

2.5 File Slice (FileSlice)

A file slice represents a specific version of a data file together with its associated log files. The latest file slice for COW includes only the newest base files, while for MOR it includes both the newest base files and their log files.

With this context, we can now examine the COW and MOR table types.

3. COW Table

Each new batch write creates a new version of the corresponding data file, containing both existing records and the new batch. An example with three file groups illustrates how new versions are generated and how a new file group may be created for inserts.

During the write, COW merges records, which introduces some write latency, but the approach is simple, requires no additional services, and is easier to debug.

4. MOR Table

In MOR, the merge cost is shifted to the read side. Writes only create incremental log files; after indexing, Hudi creates appropriately named log files that belong to a file group.

At read time, the base files and their log files are merged in real time, which can increase read latency. Hudi mitigates this by periodically compacting log files into new base files. Users can configure inline or asynchronous compaction and set policies such as compacting after a certain number of commits.

5. Comparison

5.1 Write Latency

COW incurs higher write latency because it performs synchronous merges during writes, whereas MOR does not.

5.2 Read Latency

MOR typically has higher read latency due to real‑time merging, but appropriate compaction can reduce this impact.

5.3 Update Cost

COW creates new data file versions for each batch, leading to higher I/O cost; MOR writes only to log files, keeping I/O cost low.

5.4 Write Amplification

COW can generate multiple full‑size data files, increasing storage usage (e.g., five 100 MB files ≈ 500 MB). MOR stores a single base file plus small log files, resulting in much lower amplification (≈ 140 MB in the same scenario).

6. Conclusion

Although MOR has some drawbacks, it offers query flexibility such as read‑optimized queries. With properly configured asynchronous compaction, MOR can deliver its benefits without incurring significant latency trade‑offs.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Apache Hudi COW MOR Table Types

Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.