
Apache Hudi 1.0: Design Reconsiderations and Key New Features

This article gives an overview of Apache Hudi 1.0: the rethinking of its architecture, five development directions, and the main new capabilities (an LSM‑tree timeline, function indexes, file‑group readers and writers, partial updates, and non‑blocking concurrency control), together with performance results and links to further resources.

DataFunTalk

Apache Hudi is a transactional data‑lake platform that provides a table format, transaction support, indexing, and change‑data‑capture. It integrates with compute engines such as Spark and Flink and with query engines such as Presto and Trino.

The 1.0 release revisits the original design and sets out five development directions: deeper query‑engine integration, a generalized relational data model, an architecture that combines server‑based and serverless components, support for unstructured data, and stronger table self‑management.

Key new features include:

LSM‑tree based timeline storage, replacing linear logs with a hierarchical structure to improve write and read efficiency.

Function indexes that enable data skipping by indexing transformed fields (e.g., hour‑level timestamps) without fine‑grained physical partitioning.

File‑group readers and writers that support merge‑on‑read tables, partial updates, and position‑based merges, reducing storage overhead and improving query performance.

Partial update support that records only changed fields, cutting write size dramatically and accelerating merges.

Non‑blocking concurrency control using MVCC and global monotonic timestamps to allow concurrent writes and GDPR‑style deletions without frequent conflicts.
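The data‑skipping idea behind function indexes in the list above can be sketched in a few lines of Python. This is a toy model, not Hudi's implementation or API: the names `hour_bucket`, `FileStats`, `build_index`, and `files_to_scan` are hypothetical, and the index is assumed to hold only a per‑file minimum and maximum of the transformed value.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> datetime:
    """The indexed expression: truncate a timestamp to the hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


@dataclass
class FileStats:
    """Hypothetical index entry: min/max of hour_bucket(ts) for one file."""
    path: str
    min_hour: datetime
    max_hour: datetime


def build_index(files: dict[str, list[datetime]]) -> list[FileStats]:
    """Compute min/max of the transformed value for each data file."""
    index = []
    for path, timestamps in files.items():
        hours = [hour_bucket(t) for t in timestamps]
        index.append(FileStats(path, min(hours), max(hours)))
    return index


def files_to_scan(index: list[FileStats], hour: datetime) -> list[str]:
    """Keep only files whose [min, max] range can contain the requested hour."""
    return [s.path for s in index if s.min_hour <= hour <= s.max_hour]


def ts(h: int, m: int) -> datetime:
    """Helper for building sample timestamps on one (made-up) day."""
    return datetime(2024, 1, 1, h, m, tzinfo=timezone.utc)


# Made-up sample data: three files, none partitioned by hour.
files = {
    "f1.parquet": [ts(0, 5), ts(1, 40)],
    "f2.parquet": [ts(2, 10), ts(3, 55)],
    "f3.parquet": [ts(3, 1), ts(5, 30)],
}
index = build_index(files)
print(files_to_scan(index, ts(3, 0)))  # ['f2.parquet', 'f3.parquet']
```

A query for the 03:00 hour skips `f1.parquet` without opening it, even though the files are not physically partitioned by hour. The table layout stays coarse while the index carries the fine‑grained pruning information.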

Performance tests show significant gains: the LSM‑tree timeline loads 1 million commits in 367 ms, partial updates reduce write size by 70× and improve update latency by 1.4×, and merges on large tables run 12‑20% faster overall.
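The following Python sketch illustrates why recording only changed fields shrinks writes. It is a simplified model under stated assumptions: records are plain dictionaries, JSON stands in for the real log format, and `merge_partial` and `encoded_size` are hypothetical helpers rather than Hudi APIs. The sizes it prints describe this toy record only and say nothing about the benchmark figures above.

```python
import json


def merge_partial(base: dict, update: dict) -> dict:
    """Overlay only the fields present in the update onto the base record."""
    merged = dict(base)      # copy so the base record is left untouched
    merged.update(update)    # changed fields win; all others are kept
    return merged


def encoded_size(record: dict) -> int:
    """Bytes in a JSON encoding; a stand-in for a real log block format."""
    return len(json.dumps(record, sort_keys=True).encode("utf-8"))


# Made-up sample record.
base = {
    "id": 42,
    "name": "sensor-a",
    "location": "building-7/floor-3",
    "firmware": "1.8.2",
    "status": "ok",
}

# Full-row update: every column is logged although only one changed.
full_update = dict(base, status="degraded")
# Partial update: only the record key and the changed column are logged.
partial_update = {"id": 42, "status": "degraded"}

# Both paths produce the same merged record.
assert merge_partial(base, partial_update) == full_update
print(encoded_size(full_update), encoded_size(partial_update))
```

The wider the table and the fewer columns an update touches, the larger the gap between the two sizes becomes. The merge step also has less data to read.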

Additional resources such as the 1.0 technical spec, documentation, blogs, Slack community, and GitHub repository are provided for deeper exploration.
