
Root Cause Analysis of MySQL Replication Lag Caused by Large Transactions in a Sharding Scenario

This article examines a real‑world MySQL master‑slave replication lag issue caused by massive multi‑table transactions during a sharding operation, details the investigative steps using system tables, binlog parsing, and my2sql analysis, and proposes three practical mitigation strategies.

Aikesheng Open Source Community

Background

MySQL master‑slave replication is a common architecture for high availability. When replication lag occurs, queries on the slave may return stale data, affecting user experience and failover.

A typical cause of lag is the execution of large transactions, especially those involving writes to many tables.

This article presents a real-world production case in which a sharding job that reads from one large table and writes to 100 shard tables generated massive transactions, severely increasing replication delay.

Problem Description

The development team implemented a sharding job that reads data from one large table, processes it, and writes the results into 100 shard tables. These multi-table writes ran inside single large transactions. Because MySQL never splits a transaction across log files, the slave's relay log grew to 1.6 GB, well past the default cap of roughly 1.1 GB (max_relay_log_size defaults to max_binlog_size, which defaults to 1 GiB), and INFORMATION_SCHEMA.INNODB_TRX on the slave showed transactions that had inserted huge numbers of rows.

Replication Lag Status

The slave lagged 37,325 seconds (about 10.4 hours) behind the master, a gap spanning four binlog files.

Investigation Process

Step 1: Check Slave Transactions

Query the INFORMATION_SCHEMA.INNODB_TRX table on the slave to view active transactions. The result confirmed large transactions with many inserted rows.
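A query along the following lines surfaces those transactions; the column names are the standard ones in INFORMATION_SCHEMA.INNODB_TRX, but treat this as a sketch rather than the exact query used during the incident:

```sql
-- Active InnoDB transactions, largest writers first.
SELECT trx_id,
       trx_state,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS running_seconds,
       trx_rows_modified,
       trx_rows_locked
FROM INFORMATION_SCHEMA.INNODB_TRX
ORDER BY trx_rows_modified DESC;
```

On the slave, the interesting rows are those where trx_rows_modified keeps climbing while the replication applier thread holds the transaction open.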

Step 2: Check Relay Log Size

The relay log file had grown to 1.6 GB, well beyond the default maximum size.

Step 3: Parse Binlog to Count Affected Rows

shell> mysqlbinlog --base64-output=decode-rows -vv mysql-bin.003731 | awk \
'BEGIN {s_type=""; s_count=0; count=0; insert_count=0; update_count=0; delete_count=0; flag=0;} \
{if (match($0, /^#.*Table_map:.*mapped to number/)) {printf "Timestamp : " $1 " " $2 " Table : " $(NF-4); flag=1} \
else if (match($0, /(### INSERT INTO .*..*)/)) {count=count+1; insert_count=insert_count+1; s_type="INSERT"; s_count=s_count+1;} \
else if (match($0, /(### UPDATE .*..*)/)) {count=count+1; update_count=update_count+1; s_type="UPDATE"; s_count=s_count+1;} \
else if (match($0, /(### DELETE FROM .*..*)/)) {count=count+1; delete_count=delete_count+1; s_type="DELETE"; s_count=s_count+1;} \
else if (match($0, /^(# at) /) && flag==1 && s_count>0) {print " Query Type : " s_type " " s_count " row(s) affected"; s_type=""; s_count=0;} \
else if (match($0, /^(COMMIT)/)) {print "[Transaction total : " count " Insert(s) : " insert_count " Update(s) : " update_count " Delete(s) : " delete_count "]"; count=0; insert_count=0; update_count=0; delete_count=0; s_type=""; s_count=0; flag=0;}}' > 003731.txt

Analysis of 003731.txt showed most transactions affected 1‑4 rows, not a single massive row count.

Step 4: Identify Large Transactions in Binlog

shell> mysqlbinlog mysql-bin.003731 | grep "^BEGIN" -A 2 | grep -E '^# at' | awk '{print $3}' | awk 'NR==1 {at=$1} NR>1 {print ($1-at); at=$1}' | sort -n -r | head -n 20 > 003731_top_20.txt

The top two entries correspond to transactions of roughly 800 MB each, which together account for the oversized 1.6 GB relay log:

822733047
822623266
2789
2783
2779
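The arithmetic behind that pipeline is worth making explicit: each `# at` marker is a byte offset into the binlog, so the difference between consecutive offsets approximates the size of the event group (transaction) between them. A minimal Python sketch, using illustrative offsets chosen to reproduce the figures above rather than values read from the real file:

```python
# Differences between consecutive "# at" byte offsets in a binlog
# approximate the size of each transaction's event group.
# Offsets below are illustrative, not taken from the real mysql-bin.003731.
offsets = [4, 120, 822733167, 1645356433, 1645359222]

sizes = [b - a for a, b in zip(offsets, offsets[1:])]
top = sorted(sizes, reverse=True)  # same ordering as `sort -n -r`

for s in top:
    print(f"{s} bytes ({s / 2**30:.2f} GiB)")
```

Two transactions of roughly 0.77 GiB each back-to-back explain why a single relay log file blew past its nominal 1 GiB limit.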

Step 5: Use my2sql to Analyze Large Transactions

shell> ./my2sql -user repuser -password repuserpassword -host 10.235.98.18 -port 3306 -work-type stats -start-file mysql-bin.003731 -stop-file mysql-bin.003735 -big-trx-row-limit 5000 -output-dir /tmp/log/my2sql_output

The resulting biglong_trx.txt listed eight typical large transactions between 2024-01-22 11:00 and 2024-01-22 18:48, each processing about 4 million rows spread across the 100 shard tables.

Solutions

1. Split large transactions: break multi-table writes into smaller batches to reduce per-transaction row count and execution time.

2. Optimize sharding logic: refactor application code to minimize unnecessary multi-table operations.

3. Monitoring and alerting: implement metrics and alerts for large transactions and replication lag to detect issues early.
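The first mitigation can be sketched as follows. This is an illustrative outline, not the team's actual code: the batch size, shard routing by hash, and the `execute_batch` callback are assumptions standing in for the real application logic and database writes.

```python
# Sketch: split one huge multi-table transaction into many small ones.
# Instead of writing to all 100 shard tables inside a single transaction,
# commit one short transaction per (batch, shard) pair.
from itertools import islice

def batched(rows, size):
    """Yield rows in chunks of at most `size` items."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def shard_and_write(rows, n_shards=100, batch_size=1000, execute_batch=None):
    """Route each row to a shard; `execute_batch` (a stand-in for the real
    DB write + COMMIT) is called once per shard per batch. Returns the
    number of commits, i.e. how many small transactions replaced one big one."""
    commits = 0
    for chunk in batched(rows, batch_size):
        per_shard = {}
        for row in chunk:
            per_shard.setdefault(hash(row) % n_shards, []).append(row)
        for shard, shard_rows in per_shard.items():
            if execute_batch:
                execute_batch(shard, shard_rows)  # one short transaction
            commits += 1
    return commits
```

Each transaction now touches one shard table and at most `batch_size` rows, so the replica can apply and acknowledge work continuously instead of stalling on a single 800 MB event group.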

References

[1] my2sql: https://github.com/liuhr/my2sql

Tags: Performance Optimization, Sharding, MySQL, Large Transactions, Replication Lag
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise-grade MySQL open-source tools and services, releases a premium open-source component each year on October 24 ("1024" day), and continuously operates and maintains them.
