Databases 6 min read

Handling Replication Anomalies in MySQL Slave IO Thread

This article analyzes MySQL replication anomalies caused by master failures or network interruptions that lead to incomplete transaction replay on slaves, demonstrates a reproducible experiment using network delay and iptables, and provides practical guidance for both recovering and permanently handling stalled slave IO threads.

Aikesheng Open Source Community

Jun 30, 2022

Handling Replication Anomalies in MySQL Slave IO Thread

Replication Anomaly

During replication, if the master crashes or the network is interrupted, the slave IO thread may stop, causing the slave to receive only a part of a transaction. For example, the master executes:

begin;<br/>insert 1;<br/>insert 2;<br/>commit;<br/>

The slave might receive only the BEGIN statement, or BEGIN; INSERT 1;, etc., and the slave SQL thread will replay this half‑transaction without committing or rolling back.

When the slave IO thread recovers, the SQL thread waits for the remaining binlog; until then the half‑transaction behaves like a manually executed incomplete transaction.

Experiment Process

Test steps:

##1. On the slave, use tc to simulate network latency, slowing binlog reading<br>tc qdisc add dev eth0 root netem delay 3000ms 3000ms 100%<br><br>##2. On the master, execute a multi‑statement transaction<br>begin;<br>update t2 set pad='4' where id < 40;<br>update t2 set pad='5' where id < 50;<br>update t2 set pad='6' where id < 60;<br>commit;<br><br>##3. Immediately after commit, cut the network between master and slave<br>iptables -A OUTPUT -d 172.16.21.4 -j DROP<br>iptables -A INPUT -s 172.16.21.4 -j DROP<br/>

Observations on the slave include a worker thread showing Waiting for an event from Coordinator while simultaneously executing UPDATE t2 SET pad='5' WHERE id < 50, indicating a contradiction. Additionally, SHOW SLAVE STATUS reports Retrieved_Gtid_Set equal to Executed_Gtid_Set, yet the worker thread is still replaying SQL, revealing another inconsistency.

When Slave IO Thread Cannot Recover

If the slave IO thread remains down for a long time, the SQL thread cannot obtain the remaining binlog, so the half‑transaction stays open and holds its locks, potentially blocking business queries. In a master‑failover scenario, the promoted master may inherit these locks.

The recommended action is to run STOP SLAVE, which forces the SQL thread to roll back the transaction and release locks. Note that STOP SLAVE will wait up to 60 seconds for the IO thread to fetch the remaining binlog before terminating the SQL thread.

When Slave IO Thread Recovers

After the IO thread resumes, if GTID‑based replication is used, the slave will start fetching the full binlog from the current GTID, roll back the partially applied transaction, and then replay the complete set of events.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Database MySQL replication Troubleshooting GTID slave_io_thread

Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.