
Recovering a Ceph 16 Cluster After System Disk Failure

This guide walks through the step‑by‑step process of restoring a Ceph 16 cluster when a node's system disk fails, covering host removal, node re‑initialization, Docker and Cephadm installation, host addition, labeling, OSD recreation, and final verification.

Ops Development Stories

This article explains how to recover a Ceph 16 cluster after a node's system disk fails. Even with RAID1, unexpected failures can bring down the MON, OSD, and MGR services on the failed node. If the MGR on that node was active, a standby node will take over.

Remove the Faulty Host

When a node cannot boot, remove it from the cluster by running the following on a healthy node (the failed host in this example is node4):

<code>ceph orch host rm node4 --offline --force</code>
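Before moving on, it can help to confirm that the orchestrator no longer lists the failed host. The helper below is a minimal sketch, not part of the original procedure: `host_in_list` is a hypothetical function name, and it assumes the plain-text table printed by `ceph orch host ls` keeps the host name in the first column.

```shell
# host_in_list HOST: reads `ceph orch host ls` text on stdin and succeeds
# only if HOST appears in the first column (assumed to be the HOST column).
host_in_list() {
    awk -v h="$1" '$1 == h { found = 1 } END { exit !found }'
}

# Example (run on a healthy node that holds the admin keyring):
#   ceph orch host ls | host_in_list node4 && echo "node4 is still registered"
```

If the function still finds the host, the removal has not taken effect yet and the later steps should wait.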

Re‑initialize the Node

Replace the failed system disk, reinstall the OS, rename the host (e.g., to node1), assign a new IP, and update /etc/hosts on all three Ceph nodes:

<code>192.168.1.1 node1
192.168.1.2 node2
192.168.1.3 node3</code>
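Since the same entries go onto every node, an idempotent helper avoids duplicate lines when the step is repeated. This is a sketch under assumptions: `ensure_hosts_entry` is a hypothetical name, and it only appends missing names; it does not rewrite an existing entry that still points at an old IP, so stale lines need to be fixed by hand.

```shell
# ensure_hosts_entry FILE IP NAME: append "IP NAME" to FILE unless some
# non-comment line already lists NAME as a host name.
ensure_hosts_entry() {
    local file=$1 ip=$2 name=$3
    if [ -f "$file" ] && awk -v n="$name" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file"; then
        return 0    # name already present, leave the file untouched
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Example (as root, on each Ceph node):
#   ensure_hosts_entry /etc/hosts 192.168.1.1 node1
#   ensure_hosts_entry /etc/hosts 192.168.1.2 node2
#   ensure_hosts_entry /etc/hosts 192.168.1.3 node3
```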

Add the Ceph public key to the new host:

<code>ssh-copy-id -f -i /etc/ceph/ceph.pub node1</code>

Install Docker

<code>curl -sSL https://get.daocloud.io/docker | sh
systemctl daemon-reload
systemctl restart docker
systemctl enable docker</code>

Install cephadm and ceph‑common

<code>curl --silent --remote-name --location https://github.com/ceph/ceph/raw/pacific/src/cephadm/cephadm
chmod +x cephadm
./cephadm add-repo --release pacific
./cephadm install
./cephadm install ceph-common</code>

Add the New Host to the Cluster

<code>ceph orch host add node1</code>

Verify the host list:

<code>ceph orch host ls</code>

The host will receive MON and crash services automatically, but it cannot manage the cluster until it has the admin keyring. Add the special _admin label to the host so cephadm distributes ceph.conf and the admin keyring:

<code>ceph orch host label add node1 _admin
# or during addition
ceph orch host add node1 --labels=_admin</code>
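Once the label is applied, the two files should appear on the new host. The check below is a small sketch; `has_admin_files` is a hypothetical name, and it assumes the default locations /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring.

```shell
# has_admin_files [DIR]: succeeds when DIR (default /etc/ceph) holds a
# non-empty ceph.conf and a non-empty admin keyring.
has_admin_files() {
    local dir=${1:-/etc/ceph}
    [ -s "$dir/ceph.conf" ] && [ -s "$dir/ceph.client.admin.keyring" ]
}

# Example (run on node1 after labeling):
#   has_admin_files && echo "admin files present" || echo "not distributed yet"
```

Distribution is not instantaneous, so a failed check right after labeling may only mean cephadm has not caught up yet.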

Create and Activate a New OSD

Create an empty OSD (returns ID 2):

<code># ceph osd create
2</code>

Activate the OSD, which populates its BlueStore tmpfs directory. Replace (osdid) and (fsid) with the OSD's ID and its OSD fsid:

<code>ceph-volume lvm activate (osdid) (fsid)</code>
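The ID and fsid needed for activation can usually be read from `ceph-volume lvm list`. The filter below is a rough sketch: `osd_ids_and_fsids` is a hypothetical name, and it assumes the listing contains lines beginning with `osd id` and `osd fsid`, which may differ between releases. On a containerized host the listing may have to be run through cephadm rather than directly.

```shell
# osd_ids_and_fsids: reads `ceph-volume lvm list` text on stdin and prints
# one "<osd id> <osd fsid>" pair per OSD found in the listing.
osd_ids_and_fsids() {
    awk '
        $1 == "osd" && $2 == "fsid" { fsid = $3 }
        $1 == "osd" && $2 == "id"   { id = $3 }
        id != "" && fsid != ""      { print id, fsid; id = ""; fsid = "" }
    ' | sort -u    # an OSD with several devices is listed more than once
}

# Example:
#   ceph-volume lvm list | osd_ids_and_fsids
```

Each output line supplies the two arguments for one activate call, in that order.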

Add authentication and crush map, then start the OSD:

<code>ceph auth add osd.2 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-2/keyring</code>

If the OSD daemon is not managed by cephadm, delete the OSD, format the underlying disk, and re‑add it.

<code># View OSD container ID (optional)
ceph orch ps --daemon_type osd
# Remove OSD from cluster
ceph osd out 2
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2
# Clean up device mappings
# dmsetup status
# dmsetup remove_all
# Format the disk
mkfs -t ext4 /dev/vdb</code>

Re‑add the OSD to the cluster:

<code>ceph orch daemon add osd node1:/dev/vdb</code>

After these steps the Ceph cluster returns to normal operation.
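Recovery and backfill take time, so the final check is often a matter of waiting. The loop below is a minimal sketch that polls `ceph health` until it reports HEALTH_OK; `wait_for_health_ok` is a hypothetical name, and the default of 30 attempts at 10-second intervals is an arbitrary assumption to adjust for the cluster's size.

```shell
# wait_for_health_ok [TRIES] [DELAY]: polls `ceph health` until the status
# starts with HEALTH_OK, or fails after TRIES attempts DELAY seconds apart.
wait_for_health_ok() {
    local tries=${1:-30} delay=${2:-10} i=1 status
    while [ "$i" -le "$tries" ]; do
        status=$(ceph health 2>/dev/null)
        case $status in
            HEALTH_OK*)
                echo "cluster healthy after $i check(s)"
                return 0 ;;
        esac
        echo "attempt $i/$tries: ${status:-no response}"
        [ "$i" -lt "$tries" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Example (run on a node with the admin keyring):
#   wait_for_health_ok 60 10 || ceph health detail
```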

Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
