Mastering HDFS Disk Balancer: Optimize DataNode Storage in Hadoop 3
This article explains the new HDFS disk balancer feature introduced in Hadoop 3, covering its purpose, supported volume‑selection policies, step‑by‑step usage, planning and execution commands, and how it helps maintain balanced storage across DataNode disks.
1. Introduction
HDFS now includes a comprehensive storage capacity management approach, shipping in CDH 5.8.2 and later. A DataNode stores blocks in local file system directories specified by <code>dfs.datanode.data.dir</code>. Typically, each directory (a volume, in HDFS terminology) resides on a separate device, such as an HDD or SSD.
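For context, a multi-disk DataNode is typically configured with one data directory per disk in hdfs-site.xml; the mount points below are hypothetical:

```xml
<!-- hdfs-site.xml: one data directory per physical disk (hypothetical paths) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data</value>
</property>
```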
When writing new blocks, DataNode selects a disk using a volume‑selection policy. Two policies are supported:
Round‑robin
Available space (HDFS‑1804)
The round‑robin policy distributes new blocks evenly across available disks, while the available‑space policy prefers disks with the highest free space percentage.
By default, the DataNode uses round-robin, but on a long-running cluster the volumes can still become imbalanced, for example after large file deletions or when a new disk is added. Switching to the available-space policy after the fact does not fully solve the problem either: it directs most new blocks to the emptiest disk, concentrating write I/O on that one device.
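For reference, the volume-selection policy is itself configurable in hdfs-site.xml; the following fragment switches a DataNode from the default round-robin to the available-space policy:

```xml
<!-- select the available-space policy instead of the default round-robin -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
```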
To address this, the Apache Hadoop community developed an online disk balancer (HDFS‑1312) that rebalances volumes on a running DataNode without taking it offline.
2. How to Use the Disk Balancer
First, ensure that <code>dfs.disk.balancer.enabled</code> is set to <code>true</code> on all DataNodes. In CDH 5.8.2 and later, this can be configured via the HDFS section in Cloudera Manager.
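Outside Cloudera Manager, the same switch can be set directly in hdfs-site.xml:

```xml
<!-- enable the disk balancer on the DataNode -->
<property>
  <name>dfs.disk.balancer.enabled</name>
  <value>true</value>
</property>
```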
Example scenario: a DataNode already stores data on <code>/mnt/disk1</code>, and a new disk is added and mounted as <code>/mnt/disk2</code>. Each HDFS data directory resides on a separate disk, which can be verified with <code>df</code>.
The disk balancer workflow consists of three steps, executed via the <code>hdfs diskbalancer</code> command: plan, execute, and query.
During planning, the HDFS client reads DataNode information from the NameNode and generates a JSON plan file that lists the source and target volumes plus the amount of data to move. The default planner is GreedyPlanner, which moves data from the most-used device to the least-used one until the distribution is even.
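The greedy idea can be illustrated with a small sketch. This is an illustration of the algorithm only, not the actual GreedyPlanner code; the volume names and byte counts are made up:

```python
def greedy_plan(volumes, threshold=0.10):
    """Sketch of the greedy rebalancing idea: repeatedly move data from the
    most-used volume to the least-used one until the utilization spread is
    within the threshold. `volumes` maps name -> (used_bytes, capacity_bytes)
    and is updated in place; the returned list is a simplified "plan"."""
    moves = []
    while True:
        util = {name: used / cap for name, (used, cap) in volumes.items()}
        src = max(util, key=util.get)   # most-used volume
        dst = min(util, key=util.get)   # least-used volume
        if util[src] - util[dst] <= threshold:
            break                       # considered balanced
        avg = (util[src] + util[dst]) / 2
        # move both volumes toward the average utilization without overshooting
        bytes_to_move = int(min(volumes[src][1] * (util[src] - avg),
                                volumes[dst][1] * (avg - util[dst])))
        if bytes_to_move == 0:
            break
        volumes[src] = (volumes[src][0] - bytes_to_move, volumes[src][1])
        volumes[dst] = (volumes[dst][0] + bytes_to_move, volumes[dst][1])
        moves.append({"source": src, "dest": dst, "bytes": bytes_to_move})
    return moves
```

The real planner works on block-level metadata and emits the JSON plan file described above; the sketch only captures the "most-used to least-used until even" loop.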
Users can set a space-utilization threshold: if the utilization difference between disks falls below the threshold, the planner considers them balanced. An optional <code>-bandwidth</code> flag limits the I/O impact. A plan is generated per DataNode, for example (with an illustrative hostname): <code>hdfs diskbalancer -plan datanode1.example.com</code>.
The generated plan is stored under <code>/system/diskbalancer</code>. To execute the plan on a DataNode, run:
<code>hdfs diskbalancer -execute /system/diskbalancer/plan.json</code>
This submits the JSON plan to the DataNode, which runs it in a background <code>BlockMover</code> thread.
To check the task status, run the query command against the DataNode (hostname again illustrative):
<code>hdfs diskbalancer -query datanode1.example.com</code>
An output of <code>PLAN_DONE</code> indicates completion. Verify the result by running <code>df -h</code> again; the disk usage difference should drop below 10%.
3. Summary
With the internal DataNode disk balancer introduced in HDFS‑1312, CDH 5.8.2+ provides a full storage capacity management solution that supports three types of data movement: across nodes (balancer), across storage types (Mover), and between disks within a single DataNode (disk balancer).
4. Acknowledgements
HDFS‑1312 was developed by Anu, Zhou Xiaobin, and Arpit Agarwal from Hortonworks, together with Lei (Eddy) Xu and Manoj Govindasamy from Cloudera.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations and hope to accompany you throughout your operations career, growing together.