Why TiDB’s GC Stalled and How to Fix Disk Space Alarms
This article walks through a real‑world TiDB/TiKV disk‑space alarm: diagnosing why the GC worker was stuck, how a lingering TiCDC changefeed let outdated MVCC versions pile up, and the step‑by‑step commands and monitoring used to restore normal GC and reclaim storage.
The title already gives away the three key points: (1) the problem – TiKV nodes exceeded 80% disk usage, triggering repeated midnight alerts; (2) the symptom – the GC worker sat idle; (3) the root cause – a lingering TiCDC changefeed.
What follows are the concrete actions taken to reduce disk usage, an explanation of how GC works internally, and why TiCDC caused the issue.
Midnight Alerts and Initial Handling
At 11 PM multiple TiKV disk‑space alerts were received. Common reasons for high disk usage include tight disk capacity, sudden data ingestion, or log explosion. In this case the cluster was a test environment, so the team simply removed old 2021 TiKV logs to bring usage below the 80% threshold.
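The log cleanup itself can be scripted rather than done by hand. A minimal sketch, assuming rotated TiKV logs follow the usual `tikv.log.<date>` naming; the log directory path is deployment‑specific and the one shown is an assumption:

```python
import os
import time

def find_old_logs(log_dir: str, max_age_days: int = 365):
    """Return paths of rotated log files older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    old = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        # Only match rotated logs (e.g. tikv.log.2021-04-16), never the live log file.
        if os.path.isfile(path) and ".log." in name and os.path.getmtime(path) < cutoff:
            old.append(path)
    return old

# Review the list before deleting anything, e.g.:
# for p in find_old_logs("/tidb-deploy/tikv-20160/log"):  # hypothetical path
#     os.remove(p)
```

Printing the candidate list before removal is a good habit even on a test cluster.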
Cleaning the Test TiDB Cluster
Two approaches were considered: (1) expand KV resources, (2) investigate oversized tables. Since it was a test cluster, the team first inspected table sizes using SQL:
<code>select TABLE_SCHEMA,TABLE_NAME,TABLE_ROWS,(DATA_LENGTH+INDEX_LENGTH)/1024/1024/1024 as table_size from information_schema.tables order by table_size desc limit 20;</code>

For partitioned tables:
<code>select TABLE_SCHEMA,TABLE_NAME,PARTITION_NAME,TABLE_ROWS,(DATA_LENGTH+INDEX_LENGTH)/1024/1024/1024 as table_size from information_schema.PARTITIONS order by table_size desc limit 20;</code>

The built‑in tidb-ctl table disk-usage command was also mentioned as a more accurate alternative, since the statistics in information_schema can be stale.
After identifying several huge tables (some with billions of rows, some with zero rows but non‑zero size), the team refreshed table statistics, dropped an unused partition, and truncated the remaining oversized tables. However, GC made no progress and disk usage did not drop, prompting a deeper investigation.
Analyzing and Solving the GC Issue
Inspecting the TiDB server logs revealed that the GC worker kept waiting for an earlier safe point:
<code>{"level":"INFO","time":"2022/03/16 22:58:22.637 +08:00","caller":"gc_worker.go:329","message":"[gc worker] starts the whole job","safePoint":424280964185194498}</code>

The safe point corresponded to a TSO timestamp of 2021‑04‑16 00:17:13 – almost a year in the past – confirmed via tiup ctl pd tso:
<code>$ tiup ctl:v5.4.0 pd -u http://10.203.178.96:2379 tso 424280964185194498
system: 2021-04-16 00:17:13.934 +0800 CST</code>

Running pd-ctl service-gc-safepoint showed that the TiCDC service held the same stale safe point:
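The TSO-to-time conversion can also be done by hand: a TiDB TSO packs a physical wall-clock timestamp in milliseconds into the upper bits, shifted left by 18 bits to leave room for a logical counter. A minimal Python sketch of that decoding:

```python
from datetime import datetime, timedelta, timezone

def decode_tso(tso: int):
    """Split a TiDB TSO into its physical time and 18-bit logical counter."""
    physical_ms = tso >> 18          # upper bits: wall-clock milliseconds since epoch
    logical = tso & ((1 << 18) - 1)  # lower 18 bits: logical counter
    tz = timezone(timedelta(hours=8))
    ts = datetime.fromtimestamp(physical_ms // 1000, tz=tz) \
        + timedelta(milliseconds=physical_ms % 1000)
    return ts, logical

ts, logical = decode_tso(424280964185194498)
print(ts)  # 2021-04-16 00:17:13.934000+08:00 -- matches the pd tso output
```

This matches the `tiup ctl pd tso` output above and makes it easy to sanity-check safe points pulled from logs.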
<code>{"service_gc_safe_points":[{"service_id":"gc_worker","safe_point":431880793651412992},{"service_id":"ticdc","safe_point":424280964185194498}],"gc_safe_point":424280964185194498}</code>

The offending TiCDC changefeed (ID dxl-replication-task) was removed:
<code>tiup cdc:v5.4.0 cli changefeed remove --pd='http://10.xxxx.96:2379' --changefeed-id=dxl-replication-task</code>

After a wait, the GC safe point began to advance, indicating progress. A second stuck changefeed (ticdc-demo-xxxx-test-shbt-core) was removed with the same command, after which GC fully recovered.
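The `gc_safe_point` in the pd-ctl output above is simply the minimum over all registered service safe points, which is why a single stalled changefeed can pin GC for the entire cluster. A toy illustration of PD's behavior, using the values from that output:

```python
# Service safe points as reported by `pd-ctl service-gc-safepoint` above.
service_safe_points = {
    "gc_worker": 431880793651412992,
    "ticdc": 424280964185194498,
}

# PD advances the cluster-wide GC safe point only to the smallest
# safe point that any registered service still holds.
gc_safe_point = min(service_safe_points.values())
print(gc_safe_point)  # 424280964185194498 -- pinned by the stalled TiCDC task

# Once the stalled changefeed is removed, TiCDC's entry disappears
# and the cluster safe point can catch up to the GC worker's.
del service_safe_points["ticdc"]
print(min(service_safe_points.values()))  # 431880793651412992
```

This min-over-services rule is the core of the incident: the GC worker itself was healthy; it was simply forbidden from advancing past TiCDC's year-old safe point.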
Post‑GC, KV storage dropped from ~800 GB to ~200 GB, and the write CF size fell from 3.2 TB to ~790 GB, confirming that the lingering TiCDC changefeeds had been blocking GC and allowing stale MVCC versions to accumulate.
GC Principle
The GC module is part of the TiDB server. Because TiKV stores data in an LSM tree and writes a new timestamped version for every update (MVCC), old versions accumulate until GC removes them. tikv_gc_life_time (default 10 min) controls how long historical versions are retained, and tikv_gc_run_interval controls how often GC runs.
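Concretely, each GC round computes a safe point of roughly "now minus tikv_gc_life_time" and encodes it as a TSO. A sketch of that computation, assuming the default 10-minute retention (the function name is illustrative, not TiDB's actual code):

```python
from datetime import datetime, timedelta, timezone

GC_LIFE_TIME = timedelta(minutes=10)  # default tikv_gc_life_time

def safe_point_tso(now: datetime, life_time: timedelta = GC_LIFE_TIME) -> int:
    """Encode (now - life_time) as a TiDB TSO: physical milliseconds << 18."""
    physical_ms = int((now - life_time).timestamp() * 1000)
    return physical_ms << 18

# With the defaults, versions older than ten minutes become GC
# candidates on the next run of the GC worker.
now = datetime(2022, 3, 16, 22, 58, 22, tzinfo=timezone(timedelta(hours=8)))
print(safe_point_tso(now) >> 18)  # physical part = 2022-03-16 22:48:22 +08:00, in ms
```

In the incident above, this computed safe point could not take effect because TiCDC's registered service safe point was older, and PD always honors the oldest one.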
GC Process
A GC leader is elected in the TiDB cluster; its worker executes the following stages:
Resolve Locks – clears locks older than the safe point.
Delete Ranges – physically removes whole contiguous ranges produced by DROP/TRUNCATE via RocksDB’s UnsafeDestroyRange, tracking progress in the mysql.gc_delete_range table.
Do GC – removes historical MVCC data for keys before the safe point; actual space reclamation depends on RocksDB compaction.
Send Safe Point to PD – finalizes the round.
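The "Do GC" stage can be illustrated with a toy model: for each key, versions at or before the safe point are dropped, except the newest of them, which is kept so reads at the safe point still succeed, unless that newest version is a delete tombstone. A simplified Python sketch (not TiKV's actual implementation):

```python
def do_gc(versions, safe_point):
    """versions: list of (commit_ts, value) sorted newest-first; value None = delete.
    Returns the versions that survive a GC round at safe_point."""
    kept = []
    seen_latest_before_sp = False
    for ts, value in versions:  # newest first
        if ts > safe_point:
            kept.append((ts, value))       # newer than the safe point: always kept
        elif not seen_latest_before_sp:
            seen_latest_before_sp = True
            if value is not None:          # newest version <= safe point:
                kept.append((ts, value))   # kept, unless it is a tombstone
        # all older versions <= safe point are discarded
    return kept

history = [(105, "v3"), (90, "v2"), (80, "v1")]
print(do_gc(history, 100))  # [(105, 'v3'), (90, 'v2')] -- only v1 is reclaimed
```

Note that even after "Do GC" logically deletes a version, the space comes back only when RocksDB compaction rewrites the affected SST files, which is why the disk drop in this incident lagged behind the safe point advancing.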
GC Monitoring
Key metrics to watch in the TiKV‑Details dashboard include total-unsafe-destroy-range, total-gc-keys, GC task duration, and GC speed (keys/s). The original post includes screenshots of these panels.
For deeper insights, refer to the official TiDB documentation and the “GC three‑part series” on AskTUG.
Xiaolei Talks DB
Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.