How to Safely Backup and Restore etcd in a Kubernetes Cluster
This guide explains why etcd backups are critical for Kubernetes disaster recovery. It walks through creating a snapshot, distributing it to the other etcd nodes, and scheduling backups with cron, then gives a step-by-step procedure for restoring the snapshot on every node so that cluster services resume correctly.
1. etcd Cluster Backup
etcd stores all Kubernetes cluster state; backing up its data is essential for disaster recovery.
Key points:
Backup can be performed on any node of the etcd cluster.
Use the v3 API (ETCDCTL_API=3) because Kubernetes 1.13+ no longer supports v2.
Example environment uses binary‑deployed k8s v1.18.6 with Calico.
1) View etcd data directories
<code># etcd data and WAL directories used in this environment:
export ETCD_DATA_DIR="/data/k8s/etcd/data"
export ETCD_WAL_DIR="/data/k8s/etcd/wal"
</code>2) Create backup directory and take snapshot
<code># mkdir -p /data/etcd_backup_dir
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--endpoints=https://172.16.60.231:2379 \
snapshot save /data/etcd_backup_dir/etcd-snapshot-$(date +%Y%m%d).db
</code>Copy the snapshot to the other etcd nodes:
<code>rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/
</code>Schedule daily backups with cron:
<code># chmod 755 /data/etcd_backup_dir/etcd_backup.sh
# crontab -l
0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh > /dev/null 2>&1
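# The etcd_backup.sh referenced above is not shown; here is a minimal
# sketch of what it might contain (the endpoint, cert paths, and 7-day
# retention are illustrative assumptions based on this environment):
#!/bin/bash
BACKUP_DIR=/data/etcd_backup_dir
SNAPSHOT=${BACKUP_DIR}/etcd-snapshot-$(date +%Y%m%d).db
# Take today's snapshot from the local etcd member.
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --endpoints=https://172.16.60.231:2379 \
  snapshot save "${SNAPSHOT}"
# Prune snapshots older than 7 days so the backup directory does not grow unbounded.
find "${BACKUP_DIR}" -name 'etcd-snapshot-*.db' -mtime +7 -delete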
</code>2. etcd Cluster Restore
Restoration must be performed on every etcd node.
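Before wiping any data, it is worth verifying the integrity of the snapshot on each node; `etcdctl snapshot status` reports the snapshot's hash, revision, key count, and size (a sketch, using the backup path from the previous section):

```shell
# Inspect the snapshot before restoring; a corrupt file fails here
# rather than partway through the restore.
ETCDCTL_API=3 etcdctl snapshot status \
  /data/etcd_backup_dir/etcd-snapshot-20200820.db -w table
```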
Simulate data loss:
<code># rm -rf /data/k8s/etcd/data/*
</code>Stop services on all masters and etcd nodes:
<code># systemctl stop kube-apiserver
# systemctl stop etcd
</code>Delete old data and WAL directories on each node:
<code># rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal
</code>Restore snapshot on each node (example for 172.16.60.231):
<code>ETCDCTL_API=3 etcdctl \
--name=k8s-etcd01 \
--endpoints="https://172.16.60.231:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.231:2380 \
--initial-cluster="k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380" \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
</code>Repeat the command on the other two nodes, adjusting the IP address and node name accordingly.
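For example, on k8s-etcd02 the command might look like this (a sketch: only `--name`, `--endpoints`, and `--initial-advertise-peer-urls` change, and the third member's address is assumed to sit on the same 172.16.60.x subnet; `--initial-cluster` must be identical on all nodes):

```shell
ETCDCTL_API=3 etcdctl \
  --name=k8s-etcd02 \
  --endpoints="https://172.16.60.232:2379" \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --cacert=/etc/kubernetes/cert/ca.pem \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-advertise-peer-urls=https://172.16.60.232:2380 \
  --initial-cluster="k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380" \
  --data-dir=/data/k8s/etcd/data \
  --wal-dir=/data/k8s/etcd/wal \
  snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
```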
Start the etcd service on every node and verify cluster health; then start kube-apiserver on each master and check cluster status.
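Concretely, the sequence might look like the following (a sketch: the endpoint list assumes the second and third members sit on the same 172.16.60.x subnet, and `kubectl get cs` still works on the v1.18 environment used here, though it is deprecated in newer releases):

```shell
# On every etcd node:
systemctl start etcd

# From any node, confirm all three members are healthy:
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" \
  endpoint health

# On every master, bring the API server back and check cluster state:
systemctl start kube-apiserver
kubectl get cs
kubectl get nodes
kubectl get pods -A
```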
After restoration, pods gradually return to the Running state, confirming a successful recovery.
3. Summary
Backing up the etcd cluster is the key to protecting a Kubernetes cluster. Restoration requires stopping kube-apiserver, stopping etcd, restoring the snapshot on every etcd node, starting etcd, and finally restarting kube-apiserver.
Only one etcd node needs to be backed up; the snapshot is then copied to the other nodes.
A single node's snapshot is sufficient to restore the whole cluster, provided the restore command is run on every node.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles on operations and operations transformation.