Why MySQL Replication Lag Isn’t Just a Network Issue
The article explains MySQL master‑slave replication fundamentals, shows how to monitor replication status, enumerates common delay causes such as network latency, master write pressure, SQL thread bottlenecks, large transactions, missing primary keys, slave overload, replication conflicts and GTID quirks, and provides scripts, configuration tips, and real‑world case studies for troubleshooting and prevention.
1 MySQL Replication Basics
MySQL master‑slave replication copies write operations from the master to one or more slaves, providing data redundancy and read/write separation. Replication relies on the binary log (binlog) and can operate in asynchronous or semi‑synchronous mode.
1.1 Replication Architecture
The replication flow consists of the following threads:
Master Binlog Dump Thread : reads the binlog and sends it to requesting slaves.
Slave I/O Thread : connects to the master, requests the binlog, and writes it to the relay log.
Slave SQL Thread : reads the relay log and applies the events (SQL statements or row changes, depending on the binlog format).
Master                              Slave
  |                                   |
  | Write transaction                 |
  | --> record binlog                 |
  |                                   |
  | Dump Thread sends binlog -------> |
  |                                   | I/O Thread receives
  |                                   | --> write to relay log
  |                                   |
  |                                   | SQL Thread reads relay log
  |                                   | --> execute SQL
  |                                   |
  | Commit                            | Apply
1.2 Replication Modes
Asynchronous : the master returns to the client immediately after committing, without waiting for the slave.
Semi‑synchronous : the master waits until at least one slave acknowledges receipt of the binlog before returning.
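MySQL 5.7 and later default to lossless semi‑synchronous replication (rpl_semi_sync_master_wait_point = AFTER_SYNC): the master waits for the slave's acknowledgment before it commits in the storage engine. The acknowledgment still means only that the slave has received the transaction and written it to its relay log, not that it has applied it.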
1.3 Binlog Formats
STATEMENT : logs the SQL statement. Small log size, but non‑deterministic functions such as UUID() or SYSDATE() may produce different results on the slave.
ROW : logs the changed rows. Accurate results but larger log size.
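MIXED : uses STATEMENT and switches to ROW when a statement is unsafe for statement‑based logging. The default format is ROW since MySQL 5.7.7 (STATEMENT before that).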
-- View current binlog format
SHOW VARIABLES LIKE 'binlog_format';
-- Set binlog format (requires SUPER; pick one; applies to sessions opened afterwards)
SET GLOBAL binlog_format = 'ROW';
-- SET GLOBAL binlog_format = 'STATEMENT';
-- SET GLOBAL binlog_format = 'MIXED';
2 Replication Monitoring
2.1 View Replication Status
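-- On the master: list connected slaves
SHOW SLAVE HOSTS;
-- On the slave, MySQL 5.7 (\G already terminates the statement, so no trailing semicolon)
SHOW SLAVE STATUS\G
-- On the slave, MySQL 8.0.22+ (recommended)
SHOW REPLICA STATUS\G
Key fields include Seconds_Behind_Master (expected 0), Slave_IO_Running and Slave_SQL_Running (both expected Yes), and the log positions. SHOW REPLICA STATUS reports the same fields as Seconds_Behind_Source, Replica_IO_Running, and Replica_SQL_Running.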
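Because the \G output pads every field name with spaces and several names share a prefix (Last_Errno and Last_Error, for example), scripts are easier to get right with a helper that extracts one field by its exact name. The following is a minimal sketch; the function name get_status_field is an assumption made for illustration and is not part of MySQL.

```shell
# Print the value of one field from `SHOW SLAVE STATUS\G` output read on stdin.
# $1 = exact field name, e.g. Seconds_Behind_Master
get_status_field() {
  awk -v key="$1" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)            # \G output is left-padded with spaces
      prefix = key ":"
      if (index(line, prefix) == 1) {     # exact name match, not a shared prefix
        value = substr(line, length(prefix) + 1)
        sub(/^[ \t]+/, "", value)
        print value
        exit
      }
    }'
}
```

It could be used as mysql -e 'SHOW SLAVE STATUS\G' | get_status_field Seconds_Behind_Master, which prints NULL when the SQL thread is not running.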
2.2 Real‑time Monitoring Script
#!/bin/bash
# filename: check_replication_delay.sh
MYSQL_HOST="192.168.1.100"
MYSQL_PORT="3306"
MYSQL_USER="root"
MYSQL_PASS="your_password"
THRESHOLD=30 # seconds

# Fetch SHOW SLAVE STATUS once per iteration
fetch_status() {
    mysql -h "$MYSQL_HOST" -P "$MYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" \
        -e "SHOW SLAVE STATUS\G" 2>/dev/null
}

while true; do
    STATUS=$(fetch_status)
    DELAY=$(echo "$STATUS" | awk '/Seconds_Behind_Master:/ {print $2}')
    if [ -z "$DELAY" ]; then
        # Empty output: the connection failed or the server is not a slave
        echo "$(date) [WARN] No replication status returned"
    elif [ "$DELAY" = "NULL" ]; then
        echo "$(date) [WARN] Replication threads not running"
    elif [ "$DELAY" -gt "$THRESHOLD" ]; then
        echo "$(date) [ERROR] Delay: ${DELAY}s exceeds threshold ${THRESHOLD}s"
        echo "$STATUS" | grep -E "(Slave_IO_Running|Slave_SQL_Running|Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Relay_Log_Space|Last_Error)"
    else
        echo "$(date) [OK] Delay: ${DELAY}s"
    fi
    sleep 5
done
2.3 Monitoring with Performance Schema
-- performance_schema is enabled by default and can only be changed at server startup; verify it:
SHOW VARIABLES LIKE 'performance_schema';
USE performance_schema;
SELECT * FROM replication_connection_status\G
SELECT * FROM replication_applier_status\G
SELECT * FROM replication_applier_status_by_worker\G
-- Group Replication only
SELECT * FROM replication_group_members;
-- Workers that have reported an error
SELECT * FROM replication_applier_status_by_worker WHERE LAST_ERROR_NUMBER != 0;
3 Common Delay Causes & Troubleshooting
3.1 Network Issues
Investigation steps :
# Test latency
ping -c 100 192.168.1.100 | tail -5
# Test bandwidth
iperf3 -c 192.168.1.100
# Check MySQL processlist for binlog dump state
mysql -e "SHOW PROCESSLIST;" | grep -E "Binlog Dump|Has sent all"
# System network stats
ss -tan | grep 3306
netstat -i | grep eth0
Solutions :
Check latency and bandwidth between master and slave.
Enable compression: SET GLOBAL slave_compressed_protocol = 1; (deprecated as of MySQL 8.0.18 in favor of the MASTER_COMPRESSION_ALGORITHMS option of CHANGE MASTER TO).
If cross‑datacenter, consider dedicated lines or optimized topology.
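Before tuning the network, it helps to confirm that the network side is where the backlog sits. When lag is non‑zero, comparing the position the I/O thread has read with the position the SQL thread has executed indicates which side to investigate first. A minimal sketch, assuming the four values have been copied from SHOW SLAVE STATUS; the function name classify_backlog and its two labels are hypothetical:

```shell
# Compare what the I/O thread has fetched with what the SQL thread has applied.
# Arguments, all taken from SHOW SLAVE STATUS:
#   $1 Master_Log_File        $2 Read_Master_Log_Pos
#   $3 Relay_Master_Log_File  $4 Exec_Master_Log_Pos
classify_backlog() {
  if [ "$1" = "$3" ] && [ "$2" -eq "$4" ]; then
    # Everything received has been applied: look at the network or the master
    echo "upstream"
  else
    # Events are waiting in the relay log: look at the SQL thread or slave load
    echo "applier"
  fi
}
```

A result of applier while ping and iperf3 look healthy suggests the later sections (SQL thread, large transactions, missing keys, slave load) are more relevant than the network.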
3.2 Master Write Pressure
Investigation steps :
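# Check QPS (counters are cumulative; sample twice and take the difference)
SHOW GLOBAL STATUS LIKE 'Com_insert';
SHOW GLOBAL STATUS LIKE 'Com_update';
SHOW GLOBAL STATUS LIKE 'Com_delete';
SHOW GLOBAL STATUS LIKE 'Questions';
# Check binlog cache pressure (transactions that spilled to a temporary file)
SHOW GLOBAL STATUS LIKE 'Binlog_cache_disk_use';
SHOW GLOBAL STATUS LIKE 'Binlog_stmt_cache_disk_use';
# Check relay log growth on slave: Relay_Log_Space is a field of SHOW SLAVE STATUS, not a status variable
SHOW SLAVE STATUS\G
# Inspect write threads on master
SHOW PROCESSLIST;
Solutions :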
Optimize master writes, avoid bulk writes during peak hours.
Use parallel replication to speed up slave apply.
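Adjust sync_binlog (e.g., 1000) and innodb_flush_log_at_trx_commit (e.g., 2) to trade durability for throughput: with these values a crash of the master host can lose the most recent transactions, so use them only where that risk is acceptable.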
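The Com_* and Questions counters are cumulative since server start, so a single reading says little about current write pressure. A rate needs two samples and the time between them. A minimal sketch of that arithmetic; the function name counter_rate is hypothetical:

```shell
# Per-second rate of a cumulative counter between two samples (integer division).
# $1 = first sample, $2 = second sample, $3 = seconds between the samples (> 0)
counter_rate() {
  echo $(( ($2 - $1) / $3 ))
}
```

For example, reading Com_insert as 1000 and then 1600 one minute later gives counter_rate 1000 1600 60, that is, 10 inserts per second. The samples themselves could come from mysql -N -e "SHOW GLOBAL STATUS LIKE 'Com_insert'" | awk '{print $2}'.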
3.3 Slave SQL Thread Bottleneck
Investigation steps :
# View SQL thread state
SHOW PROCESSLIST;
# Look for states such as "Reading event from the relay log" or "Waiting for an event from Coordinator"
# Examine worker threads
USE performance_schema;
SELECT * FROM replication_applier_status_by_worker;
# Check GTID progress: Retrieved_Gtid_Set and Executed_Gtid_Set are fields of SHOW SLAVE STATUS, not status variables
SHOW SLAVE STATUS\G
Common problems :
Single‑threaded apply (MySQL 5.5 and earlier; MySQL 5.6 parallelizes only across databases).
Large transactions blocking the SQL thread.
Solutions :
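Upgrade to MySQL 5.7+ and enable multi‑threaded parallel replication. slave_parallel_type can only be changed while the SQL thread is stopped, and a new worker count takes effect when the thread restarts, e.g., STOP SLAVE SQL_THREAD; SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK'; SET GLOBAL slave_parallel_workers = 16; START SLAVE SQL_THREAD;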
Split large transactions into smaller batches.
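Splitting a large transaction into batches usually means repeating a bounded statement until it affects no more rows, with a short pause so the slave can keep up. The following is a sketch under stated assumptions: batch_delete and PAUSE_SECONDS are hypothetical names, the table and predicate are placeholders, and the first argument is a caller‑supplied command that runs the SQL it is given and prints the affected‑row count.

```shell
# Run a batched DELETE until a batch removes no rows; print the total deleted.
# $1 = command that executes the SQL passed as its argument and prints the
#      number of affected rows (hypothetical wrapper around the mysql client)
# $2 = batch size
batch_delete() {
  run_sql=$1
  batch_size=$2
  total=0
  while :; do
    deleted=$("$run_sql" "DELETE FROM large_table WHERE created_at < '2024-01-01' LIMIT ${batch_size}; SELECT ROW_COUNT();")
    # Stop when the batch affected nothing (or the wrapper printed nothing)
    [ "${deleted:-0}" -gt 0 ] || break
    total=$(( total + deleted ))
    sleep "${PAUSE_SECONDS:-1}"   # let the slave apply this batch before the next
  done
  echo "$total"
}
```

One possible wrapper is run_mysql() { mysql -N -B -e "$1"; }, after which batch_delete run_mysql 10000 loops in 10,000‑row transactions. Each batch commits separately, so the slave applies many small transactions instead of one long one.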
3.4 Large Transactions
Symptoms : Master writes quickly, but slave delay spikes then slowly declines.
Investigation steps :
# Monitor relay log position changes
SHOW MASTER STATUS; # on master
SHOW SLAVE STATUS; # on slave, compare Exec_Master_Log_Pos
# Identify long‑running transactions
SHOW ENGINE INNODB STATUS; # look at History list length
Solution : Split large DELETE/UPDATE operations into batches, e.g.:
-- Bad: delete all rows at once
DELETE FROM large_table WHERE created_at < '2024-01-01';
-- Good: batch delete
DELETE FROM large_table WHERE created_at < '2024-01-01' LIMIT 10000;
-- Repeat the batch until it affects 0 rows
3.5 Missing Primary Key or Unique Index
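When using ROW format, the slave must locate each changed row itself. On a table without a primary key (or another usable index) it may scan the whole table for every row an UPDATE or DELETE touches, so a statement that is fast on the master can take far longer to apply.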
Investigation steps :
# Find tables without primary key
SELECT t.TABLE_SCHEMA, t.TABLE_NAME
FROM information_schema.TABLES t
LEFT JOIN information_schema.TABLE_CONSTRAINTS c
  ON c.TABLE_SCHEMA = t.TABLE_SCHEMA
 AND c.TABLE_NAME = t.TABLE_NAME
 AND c.CONSTRAINT_TYPE = 'PRIMARY KEY'
WHERE t.TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')
  AND t.TABLE_TYPE = 'BASE TABLE'
  AND c.CONSTRAINT_NAME IS NULL;
Solution :
Add a primary key to every table.
If a true primary key is impossible, add a unique index on NOT NULL columns, which the slave can use to locate rows.
3.6 Slave Server Load
Investigation steps :
# CPU
top
htop
# Disk I/O
iostat -x 1
iotop
# Memory
free -h
# InnoDB buffer pool hit rate
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep "Buffer pool hit rate"
# Connection count
mysql -e "SHOW STATUS LIKE 'Threads_connected';"
mysql -e "SHOW STATUS LIKE 'Max_used_connections';"
Solutions :
Move the slave off a heavily loaded host.
Increase hardware resources (CPU, RAM, I/O).
Optimize queries if the slave also serves reads.
Adjust InnoDB settings (e.g., innodb_buffer_pool_size, innodb_thread_concurrency, I/O thread counts).
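InnoDB reports the buffer pool hit rate as a ratio per thousand ("Buffer pool hit rate 987 / 1000"), which is easy to misread when scanning output. A small sketch that converts that line to a percentage; the function name buffer_pool_hit_pct is hypothetical, and the parsing assumes the line layout shown in the comment:

```shell
# Convert the InnoDB "Buffer pool hit rate H / T, ..." line on stdin to a percentage.
# Prints nothing when the line is absent (for example, no page reads since the last printout).
buffer_pool_hit_pct() {
  awk '/Buffer pool hit rate/ {
    # Fields: Buffer pool hit rate <hits> / <total>, ...
    total = $7
    sub(/,/, "", total)
    printf "%.1f\n", ($5 / total) * 100
    exit
  }'
}
```

It can be fed from the command used in the investigation steps: mysql -e "SHOW ENGINE INNODB STATUS\G" | buffer_pool_hit_pct. A value that stays well below 100 on a slave that also serves reads suggests the buffer pool is too small for the combined working set.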
3.7 Replication Conflicts
Writes performed directly on a slave can cause duplicate‑key errors and stop the SQL thread.
Investigation steps :
# Check replication error fields
SHOW SLAVE STATUS\G # look at Last_Errno and Last_Error
# Examine slave error log (path from SHOW VARIABLES LIKE 'log_error')
# Verify no write queries are running on the slave
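SHOW PROCESSLIST;
Solutions :
Ensure the slave is read‑only ( SET GLOBAL read_only = ON; and SET GLOBAL super_read_only = ON;).
Fix the conflicting data or, if acceptable, skip the offending transaction. Without GTID: STOP SLAVE; SET GLOBAL sql_slave_skip_counter = 1; START SLAVE; With GTID, the documented method is to commit an empty transaction under the failing GTID ( SET GTID_NEXT = '<failing GTID>'; BEGIN; COMMIT; SET GTID_NEXT = 'AUTOMATIC';); MySQL has no gtid_skip_counter variable.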
Use tools like pt-table-checksum to verify consistency.
3.8 GTID‑Mode Specific Issues
GTID replication can show delay when the I/O thread has received GTIDs that the SQL thread has not yet applied.
Investigation steps :
# View GTID execution state
SHOW SLAVE STATUS\G # Retrieved_Gtid_Set vs Executed_Gtid_Set
# Compare the two sets to see pending GTIDs
# Examine recent statements
SELECT * FROM performance_schema.events_statements_history ORDER BY TIMER_WAIT DESC LIMIT 10;
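# Check the I/O thread connection state and the transactions it has received
SELECT * FROM performance_schema.replication_connection_status\G
Common fixes :
Skip a failing transaction only with care, by committing an empty transaction under its GTID (MySQL has no gtid_skip_counter variable).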
Ensure binlog retention on the master is sufficient ( SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';).
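In GTID mode a transaction is skipped by committing an empty transaction under the same GTID, so that the slave records it as already executed. The following sketch only prints the statements for review rather than running them; the function name empty_trx_sql is hypothetical, and the GTID must be taken from the error reported on the slave.

```shell
# Print the statements that commit an empty transaction under a given GTID.
# $1 = GTID of the failing transaction, in the form <source_uuid>:<transaction_id>
empty_trx_sql() {
  cat <<EOF
STOP SLAVE;
SET GTID_NEXT = '$1';
BEGIN; COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START SLAVE;
EOF
}
```

Printing first and piping to the mysql client only after inspection keeps a mistyped GTID from silently discarding the wrong transaction. Skipping leaves the slave's data different from the master's for that transaction, so a consistency check afterwards is advisable.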
4 Real‑World Troubleshooting Cases
4.1 Case 1 – Fluctuating Slave Delay
Observation : Delay varies between 0 s and 300 s.
Investigation :
# Continuously monitor delay
while true; do
echo "$(date): $(mysql -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master | awk '{print $2}')"
sleep 1
done
# Notice hourly bulk imports on master cause binlog spikes.
# Check master binlog position growth per minute.
# Observe slave SQL thread stuck on "Reading event from the relay log".
Root cause : Hourly bulk imports generate a burst of binlog events that the slave cannot apply quickly.
Resolution :
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 16;
START SLAVE SQL_THREAD;
4.2 Case 2 – Slave Stops Replicating
Observation : Seconds_Behind_Master shows NULL (no lag is reported while the SQL thread is stopped), and Relay_Log_Pos no longer advances.
Investigation :
SHOW SLAVE STATUS\G
# Output shows Slave_IO_Running = Yes, Slave_SQL_Running = No
# Last_Errno = 1062, Last_Error = "Duplicate entry for key 'PRIMARY'"
# Verify the problematic row on master.
SELECT * FROM problem_table WHERE id = xxx;
Root cause : Data conflict on the slave caused the SQL thread to stop.
Resolution (choose one):
Skip the error (acceptable if duplicate rows can be ignored). Without GTID:
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;
With GTID, commit an empty transaction under the failing GTID instead:
STOP SLAVE;
SET GTID_NEXT = '<GTID of the failing transaction>';
BEGIN; COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START SLAVE;
Delete the conflicting row on the slave and restart replication.
DELETE FROM problem_table WHERE id = xxx;
START SLAVE;
Run pt-table-checksum to verify overall consistency.
4.3 Case 3 – Delay Growing Continuously
Observation : Delay starts at a few seconds and climbs to hours.
Investigation :
# Check I/O vs SQL positions
SHOW SLAVE STATUS\G # large gap between Read_Master_Log_Pos and Exec_Master_Log_Pos
# Examine slave load (top, iostat, etc.)
# Look at relay log size (the Relay_Log_Space field in the SHOW SLAVE STATUS output above)
# Check for large transactions via InnoDB status
SHOW ENGINE INNODB STATUS; # large History list length
# Identify slow queries on slave
SHOW VARIABLES LIKE 'slow_query_log%';
Root cause : A specific UPDATE lacking an index runs slowly on the slave.
Resolution :
# Analyze the query on master
EXPLAIN UPDATE problem_table SET status='done' WHERE updated_at < '2024-01-01';
# Add missing index
ALTER TABLE problem_table ADD INDEX idx_updated_at (updated_at);
4.4 Case 4 – High Delay Despite Low Network Latency
Observation : Ping latency is low, yet replication lag is high.
Investigation :
# Capture traffic
tcpdump -i eth0 port 3306 -w /tmp/replication.pcap
# Analyze: many tiny binlog events increase overhead.
# Check slave timeout settings
SHOW VARIABLES LIKE 'net_%';
Root cause : Master issues many small transactions, causing high protocol overhead.
Resolution :
# Reduce row image size
SET GLOBAL binlog_row_image = 'MINIMAL';
# Increase network timeouts
SET GLOBAL net_write_timeout = 300;
SET GLOBAL net_read_timeout = 300;
# Enable compression
SET GLOBAL slave_compressed_protocol = 1;
5 Prevention & Optimization
5.1 Configuration Optimizations
# my.cnf example
[mysqld]
# Replication
log-slave-updates = 1
slave-parallel-workers = 16
slave-parallel-type = LOGICAL_CLOCK
slave-preserve-commit-order = 1
slave_compressed_protocol = 1
# Binlog
sync_binlog = 1000
binlog_format = ROW
binlog_row_image = MINIMAL
binlog_expire_logs_seconds = 604800 # 7 days
# InnoDB
innodb_flush_log_at_trx_commit = 2
innodb_buffer_pool_size = 16G
innodb_thread_concurrency = 32
innodb_read_io_threads = 16
innodb_write_io_threads = 16
innodb_flush_method = O_DIRECT
# Network
net_write_timeout = 300
net_read_timeout = 300
5.2 Monitoring & Alert Scripts
#!/bin/bash
# replication_monitor.sh – alert on thread status, delay, and errors
MYSQL_HOST="192.168.1.100"
MYSQL_PORT="3306"
MYSQL_USER="monitor"
MYSQL_PASS="monitor_password"
ALERT_EMAIL="dba@example.com" # placeholder; the original address was lost to e-mail obfuscation
THRESHOLD_WARNING=30
THRESHOLD_CRITICAL=300
STATUS=$(mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS -e "SHOW SLAVE STATUS\G" 2>/dev/null)
IO_RUNNING=$(echo "$STATUS" | grep "Slave_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
DELAY=$(echo "$STATUS" | grep "Seconds_Behind_Master:" | awk '{print $2}')
LAST_ERROR=$(echo "$STATUS" | grep "Last_Error:" | awk -F': ' '{print $2}')
# Thread checks
if [[ "$IO_RUNNING" != "Yes" || "$SQL_RUNNING" != "Yes" ]]; then
echo "[CRITICAL] Replication threads down" | mail -s "[CRITICAL] MySQL Replication Down" $ALERT_EMAIL
exit 1
fi
# Delay checks
if [[ "$DELAY" == "NULL" ]]; then
echo "[CRITICAL] Unable to fetch delay" | mail -s "[CRITICAL] MySQL Replication Error" $ALERT_EMAIL
exit 1
fi
if (( DELAY >= THRESHOLD_CRITICAL )); then
echo "[CRITICAL] Replication lag ${DELAY}s exceeds ${THRESHOLD_CRITICAL}s" | mail -s "[CRITICAL] MySQL Replication Lag" $ALERT_EMAIL
elif (( DELAY >= THRESHOLD_WARNING )); then
echo "[WARNING] Replication lag ${DELAY}s exceeds ${THRESHOLD_WARNING}s" | mail -s "[WARNING] MySQL Replication Lag" $ALERT_EMAIL
fi
# Error check
if [[ -n "$LAST_ERROR" ]]; then
echo "[ERROR] Replication error: $LAST_ERROR" | mail -s "[ERROR] MySQL Replication Error" $ALERT_EMAIL
fi
5.3 Regular Health‑Check Script
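-- check_replication_health.sql (run on the slave)
-- Thread status comes from performance_schema; information_schema.PROCESSLIST has no replication columns.
-- Seconds_Behind_Master is not exposed as a table column, so lag is read from SHOW SLAVE STATUS at the end.
SELECT c.CHANNEL_NAME,
       CASE
         WHEN c.SERVICE_STATE = 'ON' AND a.SERVICE_STATE = 'ON' THEN 'RUNNING'
         ELSE 'BROKEN'
       END AS replication_status,
       c.SERVICE_STATE AS io_thread_state,
       a.SERVICE_STATE AS sql_thread_state,
       c.LAST_ERROR_MESSAGE AS io_last_error
FROM performance_schema.replication_connection_status c
JOIN performance_schema.replication_applier_status a
  ON a.CHANNEL_NAME = c.CHANNEL_NAME;
-- Applier errors, if any
SELECT CHANNEL_NAME, WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_ERROR_NUMBER != 0;
-- Lag and log positions: Seconds_Behind_Master, Read_Master_Log_Pos, Exec_Master_Log_Pos
SHOW SLAVE STATUS\G
6 Summary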
6.1 Delay Investigation Checklist
Delay seconds – SHOW SLAVE STATUS → Seconds_Behind_Master
I/O thread status – check Slave_IO_Running
SQL thread status – check Slave_SQL_Running
Network latency – ping and iperf3
Master write load – SHOW STATUS LIKE 'Com_%'
Slave load – top, iostat, free
Large transactions – InnoDB History list length
Missing indexes – EXPLAIN slow queries
Parallel replication settings – slave_parallel_workers
6.2 Quick Reference of Common Causes
Network problems : I/O thread lag → check bandwidth and latency.
Master overload : delay correlates with master QPS → optimize writes or enable parallel replication.
SQL thread bottleneck : single‑thread replication → enable multi‑threaded replication.
Large transactions : sudden delay spikes → split transactions.
Missing primary key : relay log stalls → add PK or unique index.
Slave overload : delay grows → scale resources or offload reads.
Replication conflicts : SQL thread stops → fix data or skip error.
6.3 Optimization Recommendations
Use GTID for simpler replication management.
Enable multi‑threaded parallel replication (LOGICAL_CLOCK recommended for MySQL 5.7+).
Consider semi‑synchronous replication for stronger durability.
Run regular data‑consistency checks with pt-table-checksum.
Implement robust monitoring and alerting for delay thresholds.
Avoid bulk writes during peak hours; batch them if necessary.
Tune InnoDB buffer pool, thread concurrency, and I/O threads on the slave.
Replication lag is rarely caused by a single factor; it requires a holistic view of master, slave, network, and configuration. Consistent monitoring and proactive tuning enable rapid root‑cause identification and keep the replication pipeline healthy.
MaGe Linux Operations
