
Apache Hudi Clustering: Workflow and Layout Optimization Strategies (Part 6)

This article explains Apache Hudi's clustering service, detailing its workflow, three execution modes, and layout optimization strategies—including linear, Z‑order, and Hilbert space‑filling curves—to improve storage locality and query performance in large‑scale data lake environments.

DataFunSummit

Aug 31, 2024


This article, translated from the original English blog, introduces Apache Hudi clustering from zero to one, focusing on the clustering service, its workflow, and layout‑optimization strategies.

Overview – Clustering groups "nearby" records into the same physical file to reduce read latency, enable file skipping, and increase cache hit rates.

Motivations for clustering include consolidating the small files that low-latency writes tend to produce, aligning record locality with file-level statistics so that queries can skip irrelevant files, and exploiting spatial locality for block caching.

The clustering workflow consists of two phases. In the scheduling phase, a ClusteringPlanStrategy selects eligible partitions and file slices and produces a clustering plan. In the execution phase, the plan is deserialized, the input file slices are loaded, their records are merged and written to new file groups, and write statistics are reported. Users can customize execution via a ClusteringExecutionStrategy, and each HoodieClusteringGroup is submitted as an independent parallel task.
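The two phases can be sketched as a toy model in plain Python. This is not Hudi's API: `FileSlice`, `schedule`, `execute_group`, `run_clustering`, and the size thresholds are hypothetical names chosen for illustration, and the selection rule (pick slices below a record-count limit) is an assumption standing in for whatever a real plan strategy does.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice: a partition path plus its records."""
    partition: str
    records: list


def schedule(slices, small_file_limit, max_group_records):
    """Scheduling phase: select small slices and pack them into groups per partition."""
    by_partition = {}
    for s in slices:
        if len(s.records) < small_file_limit:
            by_partition.setdefault(s.partition, []).append(s)

    plan = []
    for eligible in by_partition.values():
        group, size = [], 0
        for s in eligible:
            # Start a new group once the current one would grow past the cap.
            if group and size + len(s.records) > max_group_records:
                plan.append(group)
                group, size = [], 0
            group.append(s)
            size += len(s.records)
        if group:
            plan.append(group)
    return plan


def execute_group(group):
    """Execution phase for one group: merge records, write a new slice, report stats."""
    merged = sorted(r for s in group for r in s.records)
    new_slice = FileSlice(group[0].partition, merged)
    stats = {"input_slices": len(group), "records_written": len(merged)}
    return new_slice, stats


def run_clustering(slices, small_file_limit=4, max_group_records=8):
    """Build the plan, then run every group as an independent parallel task."""
    plan = schedule(slices, small_file_limit, max_group_records)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(execute_group, plan))
```

In this sketch a slice with four or more records is left alone, while smaller slices in the same partition are merged into one sorted output; each group runs on its own worker, mirroring the one-task-per-group behavior described above.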

Three execution modes are supported: inline, semi‑asynchronous, and fully asynchronous, controlled by configuration keys such as hoodie.clustering.inline, hoodie.clustering.schedule.inline, and hoodie.clustering.async.enabled.
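As a rough illustration, the three modes can be expressed as option maps. The keys are the ones named above; pairing each mode with exactly one enabled key, and the helper name `clustering_options`, are assumptions of this sketch and should be checked against the Hudi configuration reference for the version in use.

```python
# Option keys as named in the article; the per-mode values are an assumption.
MODE_TO_KEY = {
    "inline": "hoodie.clustering.inline",               # schedule and execute with the write
    "semi_async": "hoodie.clustering.schedule.inline",  # schedule with the write, execute elsewhere
    "fully_async": "hoodie.clustering.async.enabled",   # schedule and execute asynchronously
}


def clustering_options(mode):
    """Return an option map with exactly one clustering mode switched on."""
    if mode not in MODE_TO_KEY:
        raise ValueError(f"unknown clustering mode: {mode!r}")
    enabled = MODE_TO_KEY[mode]
    return {key: ("true" if key == enabled else "false") for key in MODE_TO_KEY.values()}
```

A map like this could then be merged into the writer options of whichever engine performs the write.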

Layout optimization strategies – Hudi provides three strategies for ordering records during bulk inserts: linear (lexicographic order), Z-order, and Hilbert. Linear works well when proximity is defined by a single column (e.g., timestamp). Z-order and Hilbert map multi-dimensional points to a single dimension while preserving spatial locality, which suits datasets whose queries filter on several columns at once (e.g., latitude/longitude).
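The difference between linear and Z-order sorting can be seen in a small, self-contained sketch. It is not Hudi's implementation: `z_value`, `chunk`, and `files_touched` are hypothetical helpers, the 8x8 grid stands in for a two-column dataset, and "files" are simply fixed-size chunks of the sorted records with min/max statistics used for skipping.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def chunk(records, size):
    """Split sorted records into fixed-size 'files'."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def files_touched(files, x_range, y_range):
    """Count files whose per-column min/max ranges overlap the query box."""
    touched = 0
    for f in files:
        xs = [x for x, _ in f]
        ys = [y for _, y in f]
        if (min(xs) <= x_range[1] and max(xs) >= x_range[0]
                and min(ys) <= y_range[1] and max(ys) >= y_range[0]):
            touched += 1
    return touched


# 64 points on an 8x8 grid, written as 4 files of 16 records each.
grid = [(x, y) for x in range(8) for y in range(8)]
linear_files = chunk(sorted(grid), 16)                               # sort by x, then y
z_files = chunk(sorted(grid, key=lambda p: z_value(p[0], p[1])), 16)
```

With this layout, a query on the box x in [0, 3], y in [0, 3] overlaps two linear files but only one Z-ordered file, and a filter on y alone (y in [0, 3]) overlaps all four linear files but only two Z-ordered ones. Linear sorting only helps the leading sort column, while the Z-order layout keeps both columns' ranges narrow within each file.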

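Space-filling curves such as Z-order and Hilbert visit every point of an N-dimensional space along a single path. Sorting records by their position on the curve tends to keep points that are close in the original space close to each other in the sorted order, so they land in the same files, which improves file locality and read efficiency.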
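For the Hilbert curve, a minimal two-dimensional sketch follows. It uses the widely published iterative conversion from grid coordinates to a curve index; it is not taken from Hudi's code, and the function name `hilbert_index` and the 4x4 grid are assumptions made for illustration.

```python
def hilbert_index(n, x, y):
    """Map (x, y) on an n x n grid (n a power of two) to its Hilbert curve index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Walk a 4x4 grid in Hilbert order and measure the step between neighbours.
n = 4
path = sorted(((x, y) for x in range(n) for y in range(n)),
              key=lambda p: hilbert_index(n, p[0], p[1]))
steps = [abs(ax - bx) + abs(ay - by)
         for (ax, ay), (bx, by) in zip(path, path[1:])]
```

Every entry in `steps` is 1 here: each point on the curve is a direct grid neighbour of the one before it. The Z-order curve, by contrast, contains jumps between consecutive indices, which is the usual argument for Hilbert ordering preserving locality somewhat better at the cost of a more involved computation.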

Review – The article recaps clustering as one of Hudi's table services, highlights how layout strategies optimize storage, and invites readers to join the Apache Hudi community for further discussion.
