Handling Replication Anomalies in MySQL Slave IO Thread
This article analyzes MySQL replication anomalies caused by master failures or network interruptions that lead to incomplete transaction replay on slaves, demonstrates a reproducible experiment using network delay and iptables, and provides practical guidance for both recovering and permanently handling stalled slave IO threads.
Replication Anomaly
During replication, if the master crashes or the network is interrupted, the slave IO thread may stop, causing the slave to receive only a part of a transaction. For example, the master executes:
begin;
insert 1;
insert 2;
commit;The slave might receive only the BEGIN statement, or BEGIN; INSERT 1; , etc., and the slave SQL thread will replay this half‑transaction without committing or rolling back.
When the slave IO thread recovers, the SQL thread waits for the remaining binlog; until then the half‑transaction behaves like a manually executed incomplete transaction.
Experiment Process
Test steps:
##1. On the slave, use tc to simulate network latency, slowing binlog reading
tc qdisc add dev eth0 root netem delay 3000ms 3000ms 100%
##2. On the master, execute a multi‑statement transaction
begin;
update t2 set pad='4' where id < 40;
update t2 set pad='5' where id < 50;
update t2 set pad='6' where id < 60;
commit;
##3. Immediately after commit, cut the network between master and slave
iptables -A OUTPUT -d 172.16.21.4 -j DROP
iptables -A INPUT -s 172.16.21.4 -j DROPObservations on the slave include a worker thread showing Waiting for an event from Coordinator while simultaneously executing UPDATE t2 SET pad='5' WHERE id < 50 , indicating a contradiction. Additionally, SHOW SLAVE STATUS reports Retrieved_Gtid_Set equal to Executed_Gtid_Set , yet the worker thread is still replaying SQL, revealing another inconsistency.
When Slave IO Thread Cannot Recover
If the slave IO thread remains down for a long time, the SQL thread cannot obtain the remaining binlog, so the half‑transaction stays open and holds its locks, potentially blocking business queries. In a master‑failover scenario, the promoted master may inherit these locks.
The recommended action is to run STOP SLAVE , which forces the SQL thread to roll back the transaction and release locks. Note that STOP SLAVE will wait up to 60 seconds for the IO thread to fetch the remaining binlog before terminating the SQL thread.
When Slave IO Thread Recovers
After the IO thread resumes, if GTID‑based replication is used, the slave will start fetching the full binlog from the current GTID, roll back the partially applied transaction, and then replay the complete set of events.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.