Tag

Cluster Recovery


Sohu Tech Products
Feb 21, 2024 · Operations

Troubleshooting and Recovery of ZooKeeper Election Port Failure in a Codis Cache Cluster

While adding a ZooKeeper observer to a Codis cache cluster, the election port (3888) proved unreachable because the QuorumCnxManager listener thread had vanished. After telnet and log checks, the cluster was recovered by a rolling upgrade to ZooKeeper 3.4.13, rebuilding the data directory, performing a rolling restart, and decommissioning the temporary node, which restored full cluster quorum and normal Codis‑Proxy operation.

Cluster Recovery · QuorumCnxManager · Version Upgrade
0 likes · 10 min read
Zhuanzhuan Tech
Feb 7, 2024 · Operations

Recovering a ZooKeeper Cluster with Codis: Diagnosis, Testing, and Migration Strategies

This article details a real‑world investigation of a ZooKeeper election‑port failure that prevented adding observer nodes to a Codis cache cluster, outlines systematic connectivity checks, log analysis, and two migration plans, and finally presents step‑by‑step procedures for rolling upgrades, configuration adjustments, and successful cluster restoration.

Cluster Recovery · Codis · Log Management
0 likes · 12 min read
Xiaolei Talks DB
Mar 16, 2022 · Operations

How to Recover a TiKV Cluster After Multiple Node Failures

This article demonstrates how to simulate and recover TiKV cluster failures by shutting down one, two, or three nodes, explains the impact on Raft groups and region availability, and provides step‑by‑step commands for disabling PD scheduling, using tikv‑ctl, and restoring data integrity.

Cluster Recovery · Data Loss · PD
0 likes · 28 min read
Ops Development Stories
Feb 25, 2022 · Operations

Recovering a Ceph 16 Cluster After System Disk Failure

This guide walks through the step‑by‑step process of restoring a Ceph 16 cluster when a node's system disk fails, covering host removal, node re‑initialization, Docker and Cephadm installation, host addition, labeling, OSD recreation, and final verification.

Ceph · Cluster Recovery · Operations
0 likes · 7 min read
Xiaolei Talks DB
Jan 25, 2022 · Databases

How to Recover a Failed TiDB PD Cluster with pd-recover: Step‑by‑Step Guide

This article walks through a real‑world TiDB PD cluster outage, explains how to diagnose the failure, retrieve necessary IDs, install and use the pd‑recover tool, and finally restore the cluster to a healthy state with detailed commands and screenshots.

Cluster Recovery · Database Operations · PD
0 likes · 12 min read
Tencent Database Technology
Feb 27, 2019 · Operations

Elasticsearch Cluster Recovery Pitfall: Excessive Shard Recovery Concurrency Leads to Cluster Hang

This article details a real‑world Elasticsearch cluster recovery issue where setting the shard recovery concurrency too high saturated the generic thread pool, causing the entire cluster to hang, and explains the underlying concepts, reproduction steps, analysis, and mitigation measures.

Cluster Recovery · Elasticsearch · Troubleshooting
0 likes · 10 min read