
Reframing Apache Hudi as a Data Lake Platform: Vision, Capabilities, and Future Directions

Apache Hudi is being repositioned from a simple table format into a full-featured data lake platform: it offers a transactional storage layer, MVCC-based concurrency control, metadata services, and Deltastreamer ingestion, with cache and timeline metadata services planned, aligning its vision with modern lakehouse architectures.


With the rise of the data‑lake concept, many articles treat Apache Hudi merely as a table format, prompting the community to reconsider Hudi’s true vision and to discuss redefining it as a data‑lake platform.

We now view Hudi as a data‑lake platform that not only provides table formats but also includes a transactional storage layer, and we have redesigned its ecosystem diagram to reflect this broader vision.

Hudi currently offers the following capabilities:

Table format: stores table schema and metadata (file lists, with future extensions for column information and query-optimization hints).

Auxiliary metadata: bloom filters, record-level indexes, bitmap/interval trees, and other advanced on-disk data structures.

Concurrency control: MVCC for serializing writes, and since version 0.8.0, optimistic concurrency control (OCC) for batch merge workloads; future plans include multi-table and fully non-blocking writes.

Update/Delete: primary-key/unique-key enforcement, with future cross-table transaction support that could enable foreign-key constraints.

Table services: self-managed pipeline features such as file-size management, automatic cleaning, compaction, clustering, and cold-start handling, all of which can run independently.

Data services: the Deltastreamer tool for ingesting from DFS, Kafka (and upcoming Pulsar) sources, incremental ETL, deduplication, commit callbacks, and forthcoming pre-commit validation and error-table features; also extensible toward streaming sinks and data monitoring.
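To make the Update/Delete and concurrency-control capabilities above concrete, here is a hedged sketch of the write options a Spark job would pass to the Hudi datasource. The option keys reflect recent Hudi releases; the table and field names (`trips`, `uuid`, `ts`, `region`) are illustrative assumptions, not anything prescribed by the project.

```python
# Sketch of Hudi write options for an upsert with optimistic concurrency
# control (OCC, available since 0.8.0). Table and field names are
# illustrative assumptions.

def hudi_upsert_options(table, record_key, precombine_field, partition_field):
    """Build the option map a Spark writer would pass to the Hudi datasource."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",  # or "delete"
        # Unique-key enforcement: records sharing this key are merged
        "hoodie.datasource.write.recordkey.field": record_key,
        # Deduplication: keep the record with the latest precombine value
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # OCC for concurrent batch writers, coordinated via an external lock
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    }

opts = hudi_upsert_options("trips", "uuid", "ts", "region")

# With PySpark and the hudi-spark bundle on the classpath, this would be used as:
#   df.write.format("hudi").options(**opts).mode("append").save("/data/trips")
```

Switching `hoodie.datasource.write.operation` to `delete` turns the same pipeline into a record-level delete, which is what the Update/Delete capability above refers to.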

Potential future extensions (subject to discussion/RFC) include:

Cache service : a Hudi‑specific cache that stores mutable data and provides cross‑engine query capabilities.

Timeline metadata server : currently supported in Spark via RocksDB or Hudi metadata tables, which could evolve into a scalable, sharded metadata storage service usable by all engines.
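The timeline at the heart of these services can be illustrated with a small conceptual sketch. This is not Hudi's actual implementation, only the idea: each write is an instant that moves through requested, inflight, and completed states, and snapshot readers see only completed instants, which is the essence of MVCC on the timeline.

```python
# Conceptual sketch of a Hudi-style timeline: instants move through
# REQUESTED -> INFLIGHT -> COMPLETED, and snapshot reads see only
# COMPLETED instants. Illustrative only, not Hudi's real code.

from dataclasses import dataclass, field

@dataclass
class Instant:
    ts: str        # monotonically increasing commit time
    action: str    # e.g. "commit", "compaction", "clean"
    state: str = "REQUESTED"

@dataclass
class Timeline:
    instants: list = field(default_factory=list)

    def begin(self, ts, action):
        inst = Instant(ts, action, "INFLIGHT")
        self.instants.append(inst)
        return inst

    def commit(self, inst):
        inst.state = "COMPLETED"

    def snapshot(self):
        # MVCC: readers see only completed instants, in timeline order
        return [i.ts for i in self.instants if i.state == "COMPLETED"]

tl = Timeline()
w1 = tl.begin("001", "commit")
w2 = tl.begin("002", "commit")
tl.commit(w1)
# w2 is still inflight, so a snapshot read sees only instant "001"
```

A scalable timeline metadata server, as proposed above, would serve exactly this kind of instant state to all engines instead of each of them re-reading it from storage.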

We propose describing the project as a "data lake platform" rather than with the longer tagline "storage and management of large analytical datasets via DFS (HDFS or cloud storage)". The new framing better conveys our vision and gives newcomers a clearer picture of the project.

This evolution is analogous to how Kafka grew from a simple pub-sub system into a full-featured event streaming platform with MirrorMaker, Connect, and more.

For the detailed discussion, see the Apache Hudi dev mailing list thread.

Tags: big data, metadata, data lake, lakehouse, Apache Hudi, transactional storage
Written by Big Data Technology Architecture (Exploring Open Source Big Data and AI Technologies)
