Master Online TiDB Migration: Step‑by‑Step Guide for Cross‑Data‑Center Moves
This guide details three online TiDB migration scenarios—including placement‑rule replica placement, TiCDC with BR backup, and hybrid strategies—covering preparation, network and resource requirements, configuration commands, region balancing, PD leader transfer, and post‑migration cleanup for seamless cross‑data‑center database relocation.
Online Data Center Preparation Phase
A typical same‑city migration requires the following conditions:
Data centers within 150 km, usually in the same or adjacent cities.
At least two optical fiber dedicated lines between data centers, latency around 3 ms, stable long‑term operation.
Dual lines with bandwidth greater than 200 Gbps.
Resource requirements include better physical machines (especially high‑density models); appropriate disk, CPU, and memory planning; a Kubernetes version and environment ready for binding physical node resources; and a recent tidb‑operator release that has been tested and is ready for use.
Online TiDB Cluster Migration Switch Plans
The migration architecture consists of the existing “online” Kubernetes cluster hosting the current TiDB components and a new Kubernetes cluster with a TiDB deployment. Two synchronization links are used: one based on TiDB placement‑rule replica placement and another using TiCDC to sync two independent TiDB clusters.
1. Cross‑cloud/k8s TiDB placement‑rule replica placement migration (≈60% of clusters)
Placement Rules, introduced in PD 4.0, allow fine‑grained control of replica count, location, host type, Raft voting rights, and leader eligibility. The feature is enabled by default in TiDB v5.0 and later.
<code># ./pd-ctl -i
config placement-rules show
[
  {
    "group_id": "pd",
    "id": "default",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "is_witness": false,
    "count": 5
  }
]
</code>Example configuration to place three voter replicas in the old data center (zone1) and two follower replicas in the new data center (zone2):
<code>[
  {
    "group_id": "pd",
    "id": "default",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 3,
    "label_constraints": [{"key": "zone", "op": "in", "values": ["zone1"]}],
    "location_labels": ["rack", "host"]
  },
  {
    "group_id": "pd",
    "id": "online2",
    "start_key": "",
    "end_key": "",
    "role": "follower",
    "count": 2,
    "label_constraints": [{"key": "zone", "op": "in", "values": ["zone2"]}],
    "location_labels": ["rack", "host"]
  }
]
</code>After the two follower replicas finish syncing, promote them to voters and gradually shift all five voters to the new data center:
<code>[
  {
    "group_id": "pd",
    "id": "default",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 0,
    "label_constraints": [{"key": "zone", "op": "in", "values": ["zone1"]}],
    "location_labels": ["rack", "host"]
  },
  {
    "group_id": "pd",
    "id": "online2",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 5,
    "label_constraints": [{"key": "zone", "op": "in", "values": ["zone2"]}],
    "location_labels": ["rack", "host"]
  }
]
</code>Advantages: automatic TiKV replica placement, transparency to applications, and minimal write‑performance impact because most voters remain in the old data center. Drawbacks: both data centers must run the same cluster version (the migration stretches a single cluster), and reads served by leaders in the other data center add cross‑data‑center latency.
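Assuming the rule set above is saved locally as rules.json, it can be applied in one call with pd-ctl (a sketch; pod and namespace names are illustrative and follow the examples in this guide):

```shell
# Copy the rule file into the PD pod, then load it into PD (pd-ctl v4.0+).
kubectl cp rules.json tidb-test/tidb-test-pd-0:/rules.json
kubectl exec tidb-test-pd-0 -n tidb-test -- \
    ./pd-ctl config placement-rules save --in=/rules.json
# Verify the active rules afterwards.
kubectl exec tidb-test-pd-0 -n tidb-test -- \
    ./pd-ctl config placement-rules show
```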
Multi‑cloud homogeneous cluster migration steps
Create a TiDB cluster in online2 with the same version. Copy the old cluster’s tc.yaml and adjust clusterDomain and the PD leader priority so that the PD leader stays in the old data center during migration.
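One way to pin the PD leader is via member leader priorities, where higher numbers win (a sketch; the member names are assumptions):

```shell
# Raise priority for an old-DC PD member, lower it for a new-DC one,
# so the PD leader stays in the old data center during sync.
kubectl exec tidb-test-pd-0 -n tidb-test -- \
    ./pd-ctl member leader_priority tidb-test-pd-0 5
kubectl exec tidb-test-pd-0 -n tidb-test -- \
    ./pd-ctl member leader_priority tidb-test-pd-0.online2 1
# Verify member priorities.
kubectl exec tidb-test-pd-0 -n tidb-test -- ./pd-ctl member
```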
Create the same number of TiKV nodes and apply the placement‑rule. Decide voter/follower/learner roles according to the 3:2 example.
Increase region creation speed. Example command to generate higher store‑limit commands for the new TiKV stores (run the printed commands through pd‑ctl): <code>kubectl exec tidb-test-pd-0 -n tidb-test -- ./pd-ctl store \
  | grep -B 2 'tidb-test.online2.com' \
  | grep 'id' \
  | awk -F':' '{print $2}' \
  | awk -F',' '{print "store limit " $1 " 40 add-peer"}'</code>
Accelerate scheduling. Adjust PD limits: <code>config set leader-schedule-limit 16      # control transfer-leader concurrency
config set region-schedule-limit 2048    # control add/remove peer concurrency
config set replica-schedule-limit 64     # concurrent replica tasks</code>
Balance region leaders to the new data center. Promote followers to voters as described above.
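Leader balancing is normally handled by PD’s balance‑leader scheduler, but an individual region leader can also be moved explicitly with an operator (a sketch; the region and store IDs are illustrative):

```shell
# Move the leader of region 1001 to store 20, a store in the new data center.
kubectl exec tidb-test-pd-0 -n tidb-test -- \
    ./pd-ctl operator add transfer-leader 1001 20
# Inspect leader counts per store afterwards.
kubectl exec tidb-test-pd-0 -n tidb-test -- ./pd-ctl store
```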
Scale down the old data‑center TiKV. Use remove‑peer operators for stubborn regions, e.g. generate a script with: <code>kubectl exec tidb-test-pd-0 -n tidb-test -- ./pd-ctl region store 10 \
  | jq '.regions[] | "\(.id)"' \
  | awk -F'"' '{print "kubectl exec tidb-test-pd-0 -n tidb-test -- ./pd-ctl operator add remove-peer " $2 " 10"}' \
  > /home/daixiaolei/remove_peer.sh</code>
Transfer PD leader to the new data center. Either raise the leader_priority of the new PD or manually execute a leader transfer command.
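The manual transfer can be done through pd-ctl (a sketch; the target member name is an assumption):

```shell
# Hand PD leadership to a member in the new data center.
kubectl exec tidb-test-pd-0 -n tidb-test -- \
    ./pd-ctl member leader transfer tidb-test-pd-0.online2
# Confirm the new leader.
kubectl exec tidb-test-pd-0 -n tidb-test -- ./pd-ctl member leader show
```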
Switch business read/write traffic to the new data center. Update DNS to point to the new PD/TiDB endpoints; existing connections will be drained automatically.
Shrink old TiDB servers, PD nodes, PVCs, and finally delete the old cluster. Verify no active connections before scaling down each component.
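To verify there are no active connections before scaling down, per‑instance session counts can be pulled from information_schema (a sketch; the host, port, and credentials are assumptions):

```shell
# Count live sessions per TiDB instance; connection details are illustrative.
mysql -h tidb-test-tidb.online1.com -P 4000 -u root -p -e \
  "SELECT instance, COUNT(*) AS sessions FROM information_schema.cluster_processlist GROUP BY instance;"
```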
2. TiCDC‑based migration (≈30% of clusters)
This method restores a full backup of the old cluster using BR, then uses TiCDC to sync incremental changes.
Advantages: downstream cluster can run a newer TiDB version, isolation between old and new clusters, and easy rollback by creating a reverse TiCDC task. Drawbacks: higher TiCDC latency, limited sync throughput (≈30 k ops/s per worker), and potentially long BR restore times for large clusters.
TiCDC migration steps
Create a new TiDB cluster in online2. Version may be the same or upgraded.
Adjust tikv_gc_life_time on the old cluster so the BR snapshot is not garbage‑collected before TiCDC catches up. Example SQL: <code>mysql> select * from mysql.tidb where VARIABLE_NAME like '%gc_life_time%';
mysql> update mysql.tidb set VARIABLE_VALUE='72h' where VARIABLE_NAME='tikv_gc_life_time';</code>
Back up the old databases to S3 using BR. <code>br backup db \
    --pd "${PDIP}:2379" \
    --db test \
    --storage "s3://backup-data/db-test/2024-06-30/" \
    --ratelimit 128 \
    --log-file backuptable.log</code>
Restore the backup to the new cluster. <code>br restore db \
    --pd "${PDIP}:2379" \
    --db "test" \
    --ratelimit 128 \
    --storage "s3://backup-data/db-test/2024-06-30/" \
    --log-file restore_db.log</code>
Create a TiCDC changefeed from the old to the new cluster. Retrieve the start TS (the backupts recorded by BR) from the backup logs, then create the changefeed: <code># find the backupts in the backup job logs
kubectl logs -f backup-tidb-test-backup-06301455 -n tidb-test

./cdc cli changefeed create --pd=http://tidb-test-pd:2379 \
    --sink-uri="tidb://test_wr:[email protected]:4000/" \
    --start-ts=434373684418314309 \
    --config service_tree.toml \
    --changefeed-id=tidb-test-migration</code>
Validate synchronization. Monitor CDC metrics in Grafana or query the changefeed status; ensure no lag and data consistency.
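Changefeed health can also be checked from the command line (a sketch; the PD address and changefeed ID follow the create example above):

```shell
# List all changefeeds and their state.
./cdc cli changefeed list --pd=http://tidb-test-pd:2379
# Query one changefeed; a checkpoint close to the current time means low lag.
./cdc cli changefeed query --pd=http://tidb-test-pd:2379 \
    --changefeed-id=tidb-test-migration
```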
Gradually shift read traffic to the new cluster, then write traffic during a low‑traffic window.
After a week of stable operation, stop the TiCDC reverse sync and delete the old cluster.
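Tearing down the reverse link can be done with the cdc CLI (a sketch; the rollback changefeed ID is an assumption):

```shell
# Pause the reverse changefeed first, then remove it once confirmed unused.
./cdc cli changefeed pause --pd=http://tidb-test-pd:2379 \
    --changefeed-id=tidb-test-rollback
./cdc cli changefeed remove --pd=http://tidb-test-pd:2379 \
    --changefeed-id=tidb-test-rollback
```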
3. Other Scenarios
Dual‑write for log‑type workloads: deploy a new cluster and write to both for 7‑30 days before cut‑over.
BR/Dumpling backup for clusters with only night‑time writes: restore with Lightning and switch the primary after verification.
Summary
Over three months, dozens of TiDB clusters holding petabytes of data were migrated to a new data center using the above methods, supported by a platform‑wide DTS and migration tracking module. The project succeeded thanks to close collaboration between business teams and diligent engineering effort.
Xiaolei Talks DB
Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.