How to Recover a Failed TiDB PD Cluster with pd-recover: Step‑by‑Step Guide
This article walks through a real-world TiDB PD cluster outage: diagnosing the failure, retrieving the necessary IDs, installing the pd-recover tool, and restoring the cluster to a healthy state, with detailed commands throughout.
Disaster Description
On Sunday a colleague reported that an online TiDB business line's cluster was unavailable. The TiDB/PD cluster was a mixed deployment with three PD nodes, but only one PD node remained alive, rendering the whole cluster unusable.
Many PD and TiDB instances were down and TiKV was unreachable, causing considerable alarm.
The cluster had been down for some time; restarting the cluster, forcibly removing PD nodes, and scaling PD in and out had all been attempted without success.
On‑Site Inspection
Only PD1 responded to ping and its process was running. PD2 was unreachable, and PD3 could be logged into but failed to start the service (tiup start hung, manual start reported errors).
Problem Analysis
In a three‑node PD cluster, a majority of PD nodes must be alive. The goal was to bring at least one of PD2 or PD3 back online.
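The majority requirement above is the standard Raft quorum: for n voting members, at least floor(n/2) + 1 must be alive. A quick sanity check of the arithmetic:

```shell
# Raft quorum for an n-member PD cluster is floor(n/2) + 1 voters:
n=3
quorum=$(( n / 2 + 1 ))
echo "$quorum"   # with 3 PDs, 2 must be alive; a single survivor cannot serve
```

This is why the cluster became unusable with only PD1 running, and why reviving either PD2 or PD3 was enough.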
Two approaches were taken:
Trigger an automatic restart on the unreachable PD2 server, hoping that with PD2 back, two of the three PDs would form a majority again.
Attempt to restart PD3, which had been cleaned with tiup scale‑in --force but still retained its data directory and binaries. The startup log showed an "etcd cluster ID mismatch" error.
The error was resolved by removing the stale member entry from PD3's --initial-cluster parameter; meanwhile, PD2 came back up on its own.
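A minimal sketch of that fix, with placeholder member names and addresses (everything below is illustrative, not the actual cluster's topology): the stale entry is filtered out of the comma-separated --initial-cluster list before it is passed back to pd-server.

```shell
# Hypothetical --initial-cluster value; member names and URLs are placeholders.
INITIAL_CLUSTER="pd-1=http://10.0.0.1:2380,pd-2=http://10.0.0.2:2380,pd-3=http://10.0.0.3:2380"
# Drop the stale member (pd-3 here) whose etcd cluster ID no longer matches,
# then hand the trimmed list to pd-server via --initial-cluster:
FIXED=$(echo "$INITIAL_CLUSTER" | tr ',' '\n' | grep -v '^pd-3=' | paste -sd, -)
echo "$FIXED"
```

The same edit can of course be made by hand in the startup script or topology file; the point is only that the member list must name exactly the members that share the surviving etcd cluster ID.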
"Solution"
After PD2 recovered, the cluster became functional, but a more robust recovery method was needed for cases where multiple PD nodes fail.
PD Cluster Recovery Tool: pd‑recover
pd‑recover is a disaster‑recovery utility for PD clusters that can restore PD nodes that cannot start normally.
Installation
wget https://download.pingcap.org/tidb-v5.3.0-linux-amd64.tar.gz
tar -zxvf tidb-v5.3.0-linux-amd64.tar.gz
cd tidb-v5.3.0-linux-amd64/bin/
Obtaining the Cluster ID
The earliest Cluster ID can be found in PD, TiKV, or TiDB logs. Example command:
cat pd.log | grep "init cluster id"
If PD logs are unavailable, the same query can be run against TiDB or TiKV logs.
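To show what to look for, here is an extraction sketch against a sample "init cluster id" line in PD's log format (the ID is the one from this incident; the log path in the comment and the exact line layout in TiDB/TiKV logs may differ in your deployment):

```shell
# A sample "init cluster id" log line in PD's format:
LINE='[2022/01/25 13:00:00.000 +08:00] [INFO] [server.go:350] ["init cluster id"] [cluster-id=6917597403461510168]'
# Against a real deployment you would instead run something like:
#   grep "init cluster id" /tidb-deploy/pd-2379/log/pd*.log
# Pull out just the numeric ID:
echo "$LINE" | grep -o 'cluster-id=[0-9]*' | cut -d= -f2
```

The extracted number is the value passed to pd-recover's -cluster-id flag later on.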
Getting the Alloc‑ID
Alloc-ID must be larger than the maximum ID the cluster has already allocated. It can be read from the cluster's Prometheus/Grafana monitoring, or obtained by running the following on each PD node:
cat pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1

Example result: 1609000. Multiply by 100 to obtain the alloc-id (e.g., 160900000).
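The multiplication is just a safety margin: padding the observed maximum by two orders of magnitude guarantees the rebuilt allocator never hands out an ID that is already in use. As a worked example with the value from this incident:

```shell
# Pad the observed maximum allocated ID (1609000 in this incident) by
# two orders of magnitude so no already-used ID can ever be re-issued:
MAX_SEEN=1609000
ALLOC_ID=$(( MAX_SEEN * 100 ))
echo "$ALLOC_ID"   # 160900000, the value passed to pd-recover's -alloc-id
```

Any value strictly larger than the observed maximum works; erring on the high side is harmless, while erring low risks duplicate IDs.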
Deploying a New PD Cluster (Optional)
If all original PD nodes are healthy, you can delete their data directories and restart them, which effectively creates a new PD cluster without adding new machines.
systemctl stop pd-2379.service
mv /data/pd-2379 /data/bak/pd2379
tiup cluster start zhanshi_2 -R pd
Using pd‑recover to Restore the Cluster
./pd-recover -endpoints http://pd-id:2379 -cluster-id 6917597403461510168 -alloc-id 160900000

Expected output:

recover success! please restart the PD cluster
Restart the Whole Cluster
After the success message, restart the entire TiDB cluster and verify the status.
Common Issues
Multiple Cluster IDs found: use the earliest one from logs for pd‑recover.
pd‑recover fails because PD is not running: ensure the PD service is deployed and started before running pd‑recover.
Typical error output in that case:

{"level":"warn","ts":"2022-01-25T13:36:48.549+0800","msg":"retrying of unary invoker failed","error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: transport: Error while dialing dial tcp XXX:2379: connect: connection refused"}
context deadline exceeded
Reflection
1. If the cluster is blocked by problematic SQL, consider using VIP throttling or setting max_execution_time to quickly relieve pressure.
2. Avoid panic‑driven actions; plan multiple solutions before executing.
3. Record all operations for post‑mortem analysis.
4. Regularly report slow SQL and perform pre‑deployment reviews.
5. For critical clusters, ship logs to external storage (e.g., S3) to ensure availability during outages.
The article focuses on recovering a PD majority failure; future posts will address scenarios where most TiKV replicas are unavailable.
Xiaolei Talks DB
Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.