
TiDB Cluster Write‑Write Conflict Investigation and Resolution

This article analyzes a TiDB cluster performance incident where QPS dropped and duration spiked due to write‑write conflicts, detailing the monitoring data, root‑cause investigation of server‑busy and scheduler latch issues, and the attempted mitigation steps such as enabling txn‑local‑latches and adjusting insert statements.

360 Tech Engineering

Problem Background – A production TiDB cluster imported data into new physical partitions, using SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS to avoid write hotspots. A few days later, at around 01:24 on June 21, QPS dropped sharply and query duration surged.
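The hotspot-avoidance setup described above looks roughly like the following. The table and column names here are hypothetical, not taken from the incident; only the two table options are from the source:

```sql
-- Hypothetical table; names are illustrative.
-- SHARD_ROW_ID_BITS = 4 scatters the implicit _tidb_rowid across 2^4 = 16
-- shards (it applies only when the table has no integer primary key), and
-- PRE_SPLIT_REGIONS = 3 pre-splits the table into 2^3 = 8 regions at
-- creation, so bulk inserts do not all land on a single region.
CREATE TABLE order_detail (
    id BIGINT NOT NULL,
    sku VARCHAR(64),
    KEY idx_sku (sku)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS = 3;
```

Note that PRE_SPLIT_REGIONS must not exceed SHARD_ROW_ID_BITS, otherwise the pre-split regions cannot all receive writes.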

Observed Symptoms – Monitoring showed a sudden drop in QPS, a spike in duration, and alerts such as "server is busy" and "kv:9007 Write conflict". Region count grew slowly with no large-scale rebalancing, and PD reported stagnant disk usage.

Cluster Configuration

Cluster version: v3.0.5
Cluster hardware: standard SSD disks, 128 GB RAM, 40-core CPUs
TiDB nodes: tidb21, tidb22, …
TiKV nodes: tidb01‑tidb20, wtidb29, wtidb30

Analysis Process – The "server is busy" alert pointed to a single TiKV node (IP ending in 218). Its logs showed no obvious errors, so the node was restarted, which shifted the pending-command backlog and scheduler worker CPU load away from it.

Log extraction command used:

grep conflict 218.log | awk -F 'tableID=' '{print $2}'

The extracted logs revealed 1,147 write-write conflicts within ten minutes, each containing fields such as kv:9007, txnStartTS, conflictStartTS, conflictCommitTS, and the conflicting key (tableID, indexID, indexValues).
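To see which tables the conflicts concentrate on, the extracted tableID values can be aggregated. A minimal sketch; the sample log lines below are synthetic stand-ins for the real 218.log entries:

```shell
# Synthetic sample of conflict log lines (the real ones come from 218.log).
printf '%s\n' \
  'err="write conflict" tableID=1299, indexID=2' \
  'err="write conflict" tableID=1299, indexID=2' \
  'err="write conflict" tableID=1301, indexID=1' > conflicts.log

# Same field extraction as the command above, then count conflicts per table,
# most-conflicted table first.
awk -F 'tableID=' '{print $2}' conflicts.log |
  awk -F ',' '{print $1}' |
  sort | uniq -c | sort -rn
```

With the sample input, table 1299 surfaces at the top with two conflicts, which is exactly the shape of evidence used here to find the hot tables.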

Using pd-ctl to convert the TSO timestamps to wall-clock times and curl against the TiDB status port to map each tableID to a table name, the investigation confirmed that conflicts were concentrated on a few specific keys and regions.
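The TSO fields can also be decoded without a live cluster: a TiDB TSO packs a millisecond physical timestamp in the high bits (shifted left by 18) and a logical counter in the low 18 bits. A sketch, where the TSO value is synthetic and the cluster addresses in the comments are placeholders:

```shell
# Decode a TSO locally: physical ms = tso >> 18, logical = low 18 bits.
# This TSO is synthetic; its physical part is 2020-09-13 12:26:40 UTC.
tso=419430400000000005
physical_ms=$(( tso >> 18 ))           # 1600000000000
logical=$(( tso & ((1 << 18) - 1) ))   # 5
echo "physical=${physical_ms}ms logical=${logical}"
date -u -d "@$(( physical_ms / 1000 ))"   # GNU date: render as wall-clock time

# Equivalent lookups against a live cluster (hosts/ports are placeholders):
#   pd-ctl -u http://<pd-host>:2379 tso 419430400000000005
#   curl http://<tidb-host>:10080/db-table/1299   # map tableID -> schema.table
```

This makes it easy to line up txnStartTS, conflictStartTS, and conflictCommitTS against the 01:24 incident window from monitoring.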

Version Difference – Prior to TiDB v3.0.8, the default optimistic transaction model detected conflicts only at commit time, so workloads with many write-write conflicts saw long durations from repeated retries. From v3.0.8 onward, new clusters default to the pessimistic model, which takes locks during each DML statement and surfaces conflicts as ordinary lock waits, requiring no application changes.
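On versions that support it, the transaction mode can also be switched explicitly instead of relying on the version default. A sketch of the two common forms:

```sql
-- Cluster-wide default for new sessions (system variable available
-- alongside pessimistic transactions from the v3.0 line onward):
SET GLOBAL tidb_txn_mode = 'pessimistic';

-- Or opt in per transaction:
BEGIN PESSIMISTIC;
-- DML here takes locks immediately, so a competing writer blocks
-- instead of failing at COMMIT with kv:9007 Write conflict.
COMMIT;
```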

Root Cause – Write-write conflicts triggered TiKV scheduler latch waiting, especially on the overloaded node (IP ending in 218). Scheduler worker CPU spiked, and the node reported many "not leader" errors because its regions were busy handling conflicting writes.

Mitigation Attempts

Enabled the txn-local-latches option in the TiDB configuration to move latch waiting from the TiKV scheduler into TiDB, hoping to relieve TiKV pressure.
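In the TiDB configuration file this is a dedicated section; the capacity value below is illustrative, not the value used in the incident:

```toml
# tidb.toml -- values here are illustrative.
[txn-local-latches]
# Queue conflicting transactions in TiDB-side memory latches before they
# reach TiKV's scheduler, trading TiDB memory for TiKV scheduler pressure.
enabled = true
# Number of latch slots; sized relative to the hot-key working set.
capacity = 2048000
```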

Adjusted application SQL from plain INSERT to INSERT IGNORE so duplicate key errors (error 1062) are handled by the database instead of causing repeated conflicts.
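The statement change amounts to the following; the table and column names are hypothetical:

```sql
-- Hypothetical table/columns, shown only to contrast the two forms.
-- Plain INSERT: a duplicate on the primary/unique key returns error 1062,
-- and the application's retry re-drives the same conflicting write.
INSERT INTO order_detail (id, sku) VALUES (10001, 'A-1');

-- INSERT IGNORE: the duplicate row is skipped and reported as a warning,
-- so the conflicting key is no longer written over and over.
INSERT IGNORE INTO order_detail (id, sku) VALUES (10001, 'A-1');
```

The trade-off is that INSERT IGNORE silently drops duplicates, so it is only safe where the insert is genuinely idempotent, as the conclusion below notes.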

The latch parameter change alone did not resolve the issue; switching to INSERT IGNORE eliminated the conflict spike and restored normal QPS and duration.

Conclusion – The incident was caused by concentrated write‑write conflicts on a few hot keys/regions, leading to scheduler latch bottlenecks and server‑busy errors. Proper transaction mode (pessimistic) and idempotent insert logic are effective preventive measures.

Tags: performance monitoring, TiDB, database operations, cluster troubleshooting, write conflict
Written by

360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
