Ceph Storage Failure Recovery: Analysis and Step‑by‑Step Procedures
This article describes a real‑world Ceph storage incident caused by disk bad sectors, analyzes its impact, and presents two practical recovery methods—full disk copy with dd+nc and skipping the faulty sector during service start—along with detailed commands and post‑recovery steps.
Background
In Ceph clusters, hardware failures such as the loss of a single disk or an entire host can trigger storage outages despite Ceph's multi-replica and failure-domain mechanisms. Such failures are rare, but they must be prepared for so that service can be restored quickly.
Incident Details
The cluster runs Ceph Hammer 0.94.5 with a host‑level failure domain and a replication factor of 2. Two disks on different hosts failed sequentially, causing 57 client requests to be blocked for over 30 seconds. Key status information included:
2 pgs down; 2 pgs peering;
57 requests are blocked > 32 sec;
recovery 4292086/381750087 objects degraded (1.124%);
recovery 7/137511185 unfound (0.000%);
21/967 in osds are down
Here ‘down’ means that both replicas of some data are offline; ‘unfound’ marks objects for which no OSD holds a locatable copy of the required version.
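Figures like these come from the standard cluster status commands. The sketch below assumes an admin keyring is available on the node; the DRY=echo prefix turns it into a dry run that only prints each command, so remove it on a real cluster:

```shell
# Dry run (DRY=echo only prints the commands; drop it on a live cluster).
DRY=echo
$DRY ceph -s                  # overall state, blocked requests, recovery %
$DRY ceph health detail       # per-PG detail: down, peering, unfound
$DRY ceph osd tree            # which of the OSDs are down, and on which host
```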
Fault Recovery Analysis
The primary goal is to bring all OSD services back online, then determine whether data loss occurred and decide on the appropriate restoration action. The failure was caused by bad sectors on the disks, which prevented the OSD daemons from starting.
Method 1 – Physical Disk Copy (dd + nc)
The simplest approach is to copy the entire faulty disk to a new one, ignoring read errors from bad sectors.
Prepare an idle server in the same data center, then install and format a new disk of equal or larger capacity.
Use dd piped through nc to transfer data:
# Backup machine, new disk
nc -lp {port} | dd of=/dev/sde1
# Faulty machine
dd if=/dev/sdX conv=noerror | nc -q 10 {backup_ip} {port}
After the copy finishes, replace the failed disk with the new one and restart the OSD service. This method is reliable, but copying a 1.2 TB disk at 200 MB/s takes up to two hours, which may be unacceptable for production workloads.
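One caveat about dd's error handling is worth knowing: conv=noerror alone drops unreadable blocks, which shifts all later data to the wrong offsets on the target; adding sync pads failed reads with zeros so every block keeps its original offset. A self-contained demo on a scratch file (file names here are illustrative, not from the incident):

```shell
# Demo of the dd flags on a scratch file instead of a real /dev/sdX.
dd if=/dev/urandom of=src.img bs=4k count=256 2>/dev/null   # 1 MiB source
# noerror: keep going past read errors; sync: pad short/failed reads with
# zeros so every later block lands at its original offset.
dd if=src.img of=dst.img bs=4k conv=noerror,sync 2>/dev/null
cmp -s src.img dst.img && echo "copies match"   # prints "copies match"
rm -f src.img dst.img
```

On a disk with real bad sectors, `conv=noerror,sync` is what keeps the surviving data readable at the right place on the new disk.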
Method 2 – Skip Bad Sectors During Service Start
Solution
When the bad sector range is small, the following steps can avoid the OSD crash:
Increase the OSD debug_filestore log level to 20/20 to locate the file containing the bad sector.
Start the OSD on the faulty disk; the log will reveal the offending file, e.g. rb.0.8e1ad1d.238e1f29.00000000a418.
Move that file out of the OSD data directory so the daemon no longer reads the bad sector.
mv /home/ceph/var/lib/osd/ceph-387/current/28.7cb_head/DIR_B/DIR_C/DIR_7/DIR_6/rb.0.8e1ad1d.238e1f29.00000000a418__head_B7E767CB__1c \
/home/ceph/var/lib/osd/ceph-387/
After moving the file, restart the OSD. The cluster will then report one unfound object, which can be recovered with:
ceph pg 28.7cb mark_unfound_lost revert
revert rolls the object back to the most recent surviving version. If no replica of the object survives, there is nothing to revert to, and the object must instead be marked lost with the delete option of the same command.
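Because PG contents are nested several DIR_* levels deep, locating the offending object by name is usually easier than walking the tree by hand. A minimal sketch on a mock directory (the mock-osd root is hypothetical; the PG and object names are the ones from this incident):

```shell
# Mock OSD data dir mirroring the layout described above (mock root is
# hypothetical; real data lives under /home/ceph/var/lib/osd/ceph-387).
root=./mock-osd/ceph-387
mkdir -p "$root/current/28.7cb_head/DIR_B/DIR_C/DIR_7/DIR_6"
obj="rb.0.8e1ad1d.238e1f29.00000000a418__head_B7E767CB__1c"
touch "$root/current/28.7cb_head/DIR_B/DIR_C/DIR_7/DIR_6/$obj"
# Locate the object by name, then move it out of the data dir so the
# OSD no longer touches the bad sector on startup.
f=$(find "$root/current" -type f -name 'rb.0.8e1ad1d.238e1f29.00000000a418*')
mv "$f" "$root/"
ls "$root/$obj"    # the file now sits outside the PG tree
```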
Recovery Process Summary
Identify the bad‑sector file via increased debug_filestore logging.
Move the file out of the OSD directory.
Restart the OSD service.
Use ceph pg … mark_unfound_lost revert to roll back the object to a healthy replica.
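The steps above can be strung together as a short script. This is a sketch, not the article's verbatim procedure: it assumes the debug level is raised by starting the daemon manually with the option on the command line (editing ceph.conf works too), and the init-script path is typical of Hammer-era installs and may differ. DRY=echo makes it a dry run that only prints each command:

```shell
# Dry run (DRY=echo) of the recovery sequence; remove DRY on a real node.
# OSD id, PG id and object path are the ones from this incident.
DRY=echo
$DRY ceph-osd -i 387 --debug-filestore 20/20   # start with verbose filestore log
$DRY mv /home/ceph/var/lib/osd/ceph-387/current/28.7cb_head/DIR_B/DIR_C/DIR_7/DIR_6/rb.0.8e1ad1d.238e1f29.00000000a418__head_B7E767CB__1c /home/ceph/var/lib/osd/ceph-387/
$DRY /etc/init.d/ceph start osd.387            # restart (init system may differ)
$DRY ceph pg 28.7cb mark_unfound_lost revert   # roll back the unfound object
```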
Conclusion
A single disk failure can cascade into a whole-host outage, for example when the failing disk causes the RAID controller to reset.
Concurrent failures on multiple replicas dramatically increase outage severity; three‑replica configurations mitigate this risk.
Never delete or format a failed disk before full recovery, as it may eliminate the chance to roll back.
Removing the problematic file is an effective way to unblock OSD startup.
Ceph provides detailed logs; combining them with knowledge of Ceph’s storage layout enables targeted fixes.
Older clusters with aging hardware are more prone to such double‑failure scenarios.
For clusters with deep‑scrub disabled, schedule manual deep‑scrubs to detect latent disk issues.
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.