Ozone vs HDFS: Why Ozone Cannot Replace Hadoop’s Core Storage
In this article, senior Alibaba engineer Zheng Kai analyzes Ozone’s role in the Hadoop ecosystem, arguing that despite its usefulness, Ozone cannot solve Hadoop’s core challenges of complexity, cost, and performance, and that Hadoop must focus on storage innovation, compute‑storage separation, and cloud integration to stay relevant.
Author Zheng Kai, a senior technical expert at Alibaba and founder of Apache Kerby, shares his perspective on the Hadoop ecosystem, particularly the emerging Ozone project, after years of optimizing Hadoop/Spark for large customers on Alibaba Cloud.
He acknowledges that Ozone is a valuable object‑storage system but argues that it cannot rescue Hadoop, which faces far larger challenges than Ozone can address.
Looking ahead, he predicts that while newer engines such as Spark and Flink will continue to evolve, Hadoop’s future depends on its core positioning as a big‑data platform, with storage and scheduling as its two pillars.
He explains Hadoop’s core storage role and notes that recent community work has focused on Ozone, an object‑storage system similar to AWS S3, introduced about five years ago.
He critiques Ozone’s relevance to Hadoop’s primary pain points—complex deployment, high cost, and performance—highlighting that Hadoop’s typical users are medium‑to‑large clusters where HDFS’s design still makes sense.
He describes the operational complexity of an HA HDFS deployment (active and standby NameNodes, a ZooKeeper quorum, JournalNodes, etc.) and argues that this complexity pushes users toward simpler alternatives such as Spark‑centric platforms and Databricks.
He discusses cost reduction via Erasure Coding (EC) and the historical attempts to integrate EC into HDFS, noting delays and community fragmentation that slowed adoption.
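The cost argument for erasure coding comes down to simple arithmetic. As a rough sketch (the RS(6,3) parameters below match the `RS-6-3-1024k` policy HDFS ships with, but the helper functions are illustrative, not part of any Hadoop API):

```python
# Storage-overhead comparison: 3x replication vs Reed-Solomon erasure coding.
# RS(6,3) stripes data across 6 data units plus 3 parity units, tolerating
# the loss of any 3 units while storing only 1.5x raw bytes instead of 3x.

def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte under N-way replication."""
    return float(replicas)

def ec_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte under RS(data, parity)."""
    return (data_units + parity_units) / data_units

rep = replication_overhead(3)  # 3.0x raw storage (200% extra space)
ec = ec_overhead(6, 3)         # 1.5x raw storage (50% extra space)

print(f"3x replication: {rep:.1f}x raw storage")
print(f"RS(6,3) EC:     {ec:.1f}x raw storage")
print(f"Savings:        {1 - ec / rep:.0%}")  # -> 50%
```

Halving raw storage at cluster scale is exactly why the delays in landing EC that the author describes were so costly to Hadoop's competitive position.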
He points out that HDFS has struggled to adopt modern storage formats (Parquet, ORC, Arrow) and to support storage‑compute separation via caching layers such as Alluxio.
He argues that Hadoop’s ecosystem is becoming bloated with overlapping projects, and that the lack of unified leadership (e.g., the rivalry between Cloudera and Hortonworks) hampers coordinated progress.
Regarding Ozone, he concludes that it adds further complexity without substantially reducing cost or improving performance, and that most cloud providers already offer mature object‑storage services.
He recommends three focus areas for Hadoop: (1) strengthen big‑data storage solutions by accelerating EC and simplifying HDFS architecture; (2) support storage‑compute separation with caching layers like Alluxio and modern file formats; (3) embrace public‑cloud and cloud‑native technologies, integrating YARN with Kubernetes rather than building parallel object‑storage stacks.
He encourages community participation in emerging projects such as Ozone, Submarine, and JindoFS, emphasizing that open‑source contributions remain vital for Hadoop’s continued evolution.