Investigating Data Loss with gh-ost in MySQL AFTER_SYNC Semi‑Sync Replication and Applying a Fix
This article documents a reproducible test showing that gh-ost can lose rows when it runs against a MySQL 5.7 setup using AFTER_SYNC semi‑synchronous replication, explains the underlying cause, and presents a source‑code modification that prevents the loss.
Background – A recent post claimed that using gh-ost for online DDL in MySQL AFTER_SYNC mode may cause data loss. The author reproduced the issue by configuring a MySQL 5.7 primary‑secondary setup with semi‑sync replication and a 60‑second artificial delay in gh‑ost.
Environment Preparation
Clone the latest gh‑ost source (v1.1.2) with git clone https://github.com/github/gh-ost.git and build it using the provided build.sh script.
Deploy a MySQL 5.7 master‑slave cluster (1 master, 1 slave) and enable AFTER_SYNC semi‑sync replication.
Configure the master’s rpl_semi_sync_master_timeout to a value larger than the artificial delay (e.g., 120 000 ms).
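The semi‑sync part of this setup can be sketched with the statements below. This is a minimal outline, not the original test script: it assumes the stock MySQL 5.7 semi‑sync plugins on Linux (the .so file names are an assumption about the platform) and an asynchronous replication channel that is already working.

```sql
-- On the master: load and enable the semi-sync master plugin.
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
-- AFTER_SYNC is already the default wait point in 5.7; set it explicitly for clarity.
SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';
-- Timeout in milliseconds, larger than the 60-second artificial delay.
SET GLOBAL rpl_semi_sync_master_timeout = 120000;

-- On the slave: load and enable the semi-sync slave plugin.
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
-- Restart the IO thread so the slave registers as a semi-sync client.
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
```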
Validation Steps
Insert a 60‑second sleep at the start of addDMLEventsListener in ./gh-ost-master/go/logic/migrator.go.
Set the master’s semi‑sync timeout to 120 s.
Create a test table t and insert a row (id=1).
Run gh‑ost to execute ALTER TABLE t ENGINE=InnoDB;
Stop the slave’s IO thread to simulate a lost ACK.
Insert a second row (id=2) on the master while gh‑ost is waiting.
The DDL completes after about 120 seconds, but the newly inserted row (id=2) is missing, confirming data loss.
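The SQL side of these steps might look like the following. The database name test and the single‑column schema are hypothetical; the article only states that table t holds a row with id=1. The gh‑ost invocation itself is left as a comment because its exact arguments are not given.

```sql
-- Master session: create the test table and the first row.
CREATE DATABASE IF NOT EXISTS test;
CREATE TABLE test.t (id INT NOT NULL PRIMARY KEY);  -- hypothetical schema
INSERT INTO test.t (id) VALUES (1);

-- Now start gh-ost (built with the 60-second sleep) for
-- ALTER TABLE t ENGINE=InnoDB; against test.t.

-- Slave session: stop the IO thread so no ACK reaches the master.
STOP SLAVE IO_THREAD;

-- Master session, while gh-ost is still sleeping. This statement blocks
-- until an ACK arrives or rpl_semi_sync_master_timeout expires.
INSERT INTO test.t (id) VALUES (2);

-- Master session, after gh-ost has finished: check which rows survived.
-- The article reports that id=2 is missing at this point.
SELECT id FROM test.t ORDER BY id;
```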
Principle Analysis
The loss occurs because gh‑ost reads the table’s primary‑key range before the transaction that inserted id=2 is fully committed. In AFTER_SYNC mode the master writes the transaction to the binary log and then waits for an ACK from the slave before committing it in the storage engine. Until the ACK arrives or the timeout expires, the new row is invisible to other sessions, so gh‑ost never sees the new key value when it determines the range.
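The visibility gap can be observed from a second session on the master while the INSERT of id=2 is still blocked. This is an illustrative check rather than the query gh‑ost itself issues; the MIN/MAX form is a simplification of its range lookup.

```sql
-- Session B on the master, run while the INSERT of id=2 is waiting for an ACK.

-- A non-zero value means at least one session is blocked waiting for a slave ACK.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_wait_sessions';

-- A plain (non-locking) read sees only committed rows, so the pending
-- id=2 is not expected to appear in the range yet.
SELECT MIN(id), MAX(id) FROM t;
```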
Fix Implementation
A pull request adds a shared read lock and a retry mechanism when gh‑ost fetches the range. The changes are made in ./gh-ost-master/go/sql/builder.go and ./gh-ost-master/go/logic/migrator.go. After recompiling and re‑running the test with the same configuration, the second row persists, which shows that the fix works in this scenario.
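The idea behind the shared read lock can be illustrated in plain SQL. The queries below are a sketch of the technique and are not copied from the pull request; the column and table names follow the test table above.

```sql
-- Non-locking read: returns the highest committed id and ignores the
-- row whose transaction is still waiting for an ACK.
SELECT id FROM t ORDER BY id DESC LIMIT 1;

-- Locking read (MySQL 5.7 syntax): has to wait for the row lock held by
-- the pending transaction, so it would be expected to block until that
-- transaction commits and only then return its id.
SELECT id FROM t ORDER BY id DESC LIMIT 1 LOCK IN SHARE MODE;
```

If the wait exceeds innodb_lock_wait_timeout, the locking read fails with a lock wait timeout error, which is presumably why the fix pairs the lock with a retry.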
Precautions
Adjust rpl_semi_sync_master_timeout only on the master.
Set rpl_semi_sync_master_wait_no_slave=ON so that the master keeps waiting for an ACK for the full timeout even if the number of connected semi‑sync slaves drops during the wait.
When multiple slaves exist, consider rpl_semi_sync_master_wait_for_slave_count for ACK behavior.
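These settings can be applied and checked on the master with statements such as the following. The slave count of 1 matches the one‑master, one‑slave setup used here and is only an example value.

```sql
-- Keep waiting for the full timeout even if semi-sync slaves disconnect.
SET GLOBAL rpl_semi_sync_master_wait_no_slave = ON;
-- Number of slave ACKs required per transaction (example value).
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1;

-- Review all semi-sync master settings in one place.
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master%';

-- Confirm semi-sync is actually active and how many semi-sync slaves are connected.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients';
```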
Conclusion
The experiment confirms that gh‑ost can lose data under specific AFTER_SYNC timing conditions. With the source‑code fix applied, the loss no longer occurs in the tested scenario, which makes gh‑ost safer to use in semi‑synchronous environments.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.