
How to Diagnose and Resolve HDFS Safe Mode Issues

This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.

WeiLi Technology Team

Problem Phenomenon

When a DataNode crashes, the blocks it stored become missing or under-replicated. If the proportion of available blocks falls below the NameNode's safety threshold, HDFS automatically switches to safe mode, which can be observed on the HDFS web UI as "Safe mode is ON".

What Is Safe Mode?

HDFS safe mode is a special read‑only state where the file system accepts only read requests; delete, modify, or block‑replication operations are blocked. The purpose is to guarantee data consistency and prevent data loss while the cluster stabilises.

How Safe Mode Is Entered

Manual entry – an administrator triggers safe mode explicitly, typically for maintenance or expansion, using the command

hdfs dfsadmin -safemode enter

and later exits with

hdfs dfsadmin -safemode leave
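The current state can also be queried from a script. The following is a minimal sketch, assuming the hdfs CLI is on the PATH and pointed at a running cluster; the status strings "Safe mode is ON" / "Safe mode is OFF" are the ones dfsadmin prints:

```shell
# Check safe-mode state by parsing `hdfs dfsadmin -safemode get`.
# safemode_status returns 0 when the NameNode reports safe mode ON.
safemode_status() {
  hdfs dfsadmin -safemode get 2>/dev/null | grep -q "Safe mode is ON"
}

if safemode_status; then
  echo "NameNode is in safe mode"
else
  echo "NameNode is not in safe mode"
fi
```

For startup scripts there is also hdfs dfsadmin -safemode wait, which blocks until the NameNode leaves safe mode on its own.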

Automatic entry – the NameNode enters safe mode on its own during startup, or whenever the cluster no longer meets the required safety thresholds. It leaves safe mode only after all of the following conditions are satisfied:

The number of live DataNodes meets the threshold defined by dfs.namenode.safemode.min.datanodes.

The percentage of blocks that have reached the minimum replication factor exceeds dfs.namenode.safemode.threshold-pct (default 0.999, i.e., 99.9%).

Each block meets the minimum replication count set by dfs.namenode.replication.min (default 1).

After the above are met, the cluster must remain stable for the period set by dfs.namenode.safemode.extension (default 30000 ms, i.e., 30 seconds).
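For reference, these thresholds live in hdfs-site.xml; the fragment below is a sketch showing them with their default values (tune with caution, as noted in the summary):

```xml
<!-- hdfs-site.xml: safe-mode thresholds, shown with default values -->
<property>
  <name>dfs.namenode.safemode.min.datanodes</name>
  <value>0</value> <!-- minimum live DataNodes required to leave safe mode -->
</property>
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999</value> <!-- fraction of blocks that must meet minimal replication -->
</property>
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value> <!-- minimal replication count per block -->
</property>
<property>
  <name>dfs.namenode.safemode.extension</name>
  <value>30000</value> <!-- milliseconds of stability before leaving safe mode -->
</property>
```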

Typical direct causes for entering safe mode include:

Failed DataNode startup or loss of heartbeat to the NameNode.

Disk failures on DataNode storage volumes.

Disk partitions running out of space.

How to Resolve It

Analyze the cause

1. Check the HDFS Web UI for cluster and DataNode status.

2. Review the NameNode and DataNode logs (usually under /var/log/) for error details.
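Dead DataNodes can also be spotted from a script by parsing the dfsadmin report. A minimal sketch, assuming the hdfs CLI is available and the summary line format "Dead datanodes (N):" that current Hadoop releases print:

```shell
# Extract the dead-DataNode count from `hdfs dfsadmin -report` output.
dead_datanodes() {
  hdfs dfsadmin -report 2>/dev/null | sed -n 's/^Dead datanodes (\([0-9]*\)).*/\1/p'
}
```

A non-zero count points at a failed node; its hostname appears in the detailed per-node section of the same report.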

Fix the issue

Depending on the identified problem, take appropriate actions:

If a DataNode failed to start, repair and restart it.

If a disk partition is full, expand the storage.

If a storage volume is faulty, repair or replace it (note that data on the failed volume may be lost).

If data loss has occurred, list the corrupted blocks and the files they belong to with

hdfs fsck / -list-corruptfileblocks

or

hdfs fsck / -files -blocks -locations

Then exit safe mode and delete the affected files:

Exit safe mode:

sudo -u hdfs hdfs dfsadmin -safemode leave

Delete corrupted files:

sudo -u hdfs hdfs fsck / -delete

After fixing or deleting the problematic blocks, restart the cluster; HDFS should exit safe mode and resume normal read/write operations.
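Whether the cleanup worked can be verified from a script by parsing the fsck summary. This is a sketch assuming the hdfs CLI and fsck's closing line ("The filesystem under path '/' is HEALTHY" or "... is CORRUPT"):

```shell
# Return 0 if `hdfs fsck /` reports a healthy filesystem, 1 otherwise.
hdfs_is_healthy() {
  hdfs fsck / 2>/dev/null | grep -q "is HEALTHY"
}
```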

Production Complete Process

All commands must be executed as the hdfs user (e.g., su - hdfs).

Leave safe mode:

sudo -u hdfs hdfs dfsadmin -safemode leave

Check cluster status:

hdfs dfsadmin -report

List corrupted blocks:

hdfs fsck / -list-corruptfileblocks

Run a health check:

hdfs fsck /

Inspect specific corrupted blocks:

hdfs fsck /path/to/corrupt/file -locations -blocks -files

Delete bad blocks:

hdfs fsck / -delete

Verify health again; if still unhealthy, repeat after some time.

If blocks remain, manually remove their files:

hdfs dfs -rm "/File/Path/of/the/missing/blocks"

Following these steps should restore HDFS to a healthy state.
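The steps above can be strung together into one helper. This is a sketch only — it runs the destructive fsck -delete, so it assumes the corrupted files have already been confirmed irrecoverable and that upstream copies exist:

```shell
# Run the production recovery sequence end to end.
# DESTRUCTIVE: deletes files containing corrupt blocks.
# Assumes the hdfs CLI on PATH, run as the hdfs user.
recover_hdfs() {
  hdfs dfsadmin -safemode leave           # 1. exit safe mode
  hdfs dfsadmin -report                   # 2. cluster status overview
  hdfs fsck / -list-corruptfileblocks     # 3. list corrupt blocks and their files
  hdfs fsck / -delete                     # 4. delete the affected files
  hdfs fsck / | grep -q "is HEALTHY"      # 5. verify; non-zero status if unhealthy
}
```

If step 5 fails, wait and re-run the check; remaining files can then be removed manually with hdfs dfs -rm as shown above.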

Summary

Maintain at least two replicas for production HDFS blocks.

Monitor DataNode disk usage and expand storage before thresholds are exceeded.

Always analyze the root cause before forcibly exiting safe mode.

If block corruption occurs, attempt replication recovery first; only delete irrecoverable blocks and restore data from upstream sources.

Modify safe‑mode parameters with caution; avoid changing them just to force an exit.
