
Analysis of MySQL Backup Blocking Issues and Optimization Strategies

The article investigates a production MySQL outage caused by FLUSH NO_WRITE_TO_BINLOG TABLES blocking during backups, presents three experimental scenarios to identify the root cause, and offers practical recommendations such as setting query timeouts, adding indexes, and using Percona XtraBackup options to avoid backup stalls.

Aikesheng Open Source Community

Background

During an early‑morning incident, a middle‑platform (coupon) database could not be accessed; many connections were stuck in the state "Waiting for table flush". The queries were simple SELECTs, yet MySQL reported them as waiting for a table flush.

| Id | User | Host | db | Command | Time | State | Info |
| 55279262 | xxx_app | 192.168.50.143:44623 | xxx_mam | Query | 206 | Waiting for table flush | SELECT count(*) FROM mam_user_coupon_package ... |

To quickly relieve the pressure, the author generated kill statements for the blocked sessions:

select concat('kill ', id, ';') from information_schema.processlist where user = 'xxx_app' and state = 'Waiting for table flush';

The issue recurred repeatedly, so the server was restarted, which temporarily restored normal operation.
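In hindsight, the blocker can be identified before killing the waiters: the sessions queued in "Waiting for table flush" are all stuck behind one long-running query (or a pending FLUSH ... TABLES). A diagnostic query along these lines, sketched here against information_schema.processlist, would have pointed at it:

```sql
-- Sketch: find the session everyone else is queued behind.
-- The longest-running non-idle sessions sort to the top.
SELECT id, user, host, db, time, state, LEFT(info, 80) AS query_head
FROM information_schema.processlist
WHERE command <> 'Sleep'
ORDER BY time DESC
LIMIT 10;
-- The top rows typically show the long-running query and the waiting
-- FLUSH ... TABLES; killing the long query unblocks everything behind it.
```

Killing only the waiters, as was done during the incident, leaves the actual blocker running, which is why the problem kept coming back until the restart.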

Post‑mortem Analysis

The root cause was a long‑running query that blocked the FLUSH NO_WRITE_TO_BINLOG TABLES command issued by Percona XtraBackup at the start of the backup's lock phase. While the flush waited for that query to finish, every new query against the affected tables queued up behind it in the "Waiting for table flush" state.

220425 01:01:48 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...

The NO_WRITE_TO_BINLOG modifier keeps the statement out of the binary log, so replicas do not repeat the flush. FLUSH TABLES itself must close every open table: it waits for all queries currently using those tables to finish, and any query issued after it is queued until the flush completes.

Backup Procedure Overview

The backup workflow includes:

1. Check connectivity and the server version.

2. Read the configuration (data and log file locations).

3. Start the redo‑log copy thread from the latest checkpoint.

4. Copy the InnoDB files (.ibd, ibdata1, undo, etc.).

5. Execute FLUSH NO_WRITE_TO_BINLOG TABLES and then FLUSH TABLES WITH READ LOCK to close all tables and obtain the global read lock.

6. Copy the non‑InnoDB files.

7. Write the binlog position information.

8. Execute FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS to flush the redo logs.

9. Stop the redo‑log copy thread.

10. Release the lock with UNLOCK TABLES and finalize the backup.
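Condensed to the statements that matter for locking, the sequence XtraBackup runs on the server looks roughly like this (a sketch based on the workflow above; exact statements vary by version):

```sql
-- Sketch of the locking phase of an XtraBackup run.
FLUSH NO_WRITE_TO_BINLOG TABLES;        -- close all tables; not binlogged
FLUSH TABLES WITH READ LOCK;            -- take the global read lock
-- ... copy non-InnoDB files, record the binlog position ...
FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS;   -- flush the InnoDB redo logs
UNLOCK TABLES;                          -- release the global read lock
```

The first FLUSH is where the incident started: it cannot finish while any query still holds a table open, and every query arriving after it must wait.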

Experimental Scenarios

Scenario 1 – Uncommitted Write Transaction

A transaction that begins, updates a row, and never commits does not block the backup.

begin;
update tb1 set id =10 where id =1;
/usr/bin/innobackupex ...
... completed OK!

Scenario 2 – Slow Query Inside a Transaction

A long‑running SELECT (using SLEEP(1000)) inside an open transaction blocks the backup until the query times out.

begin;
select id,sleep(1000) from tb1;
/usr/bin/innobackupex ...
... backup completes only after query timeout ...

Scenario 3 – Slow Query Without a Transaction

A slow SELECT executed outside a transaction also blocks the backup, because the table remains locked until the query is killed or times out.

select id,sleep(1000) from tb1;
/usr/bin/innobackupex ...
... backup succeeds after query timeout ...
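The pile‑up seen in the scenarios can be reproduced without XtraBackup at all, using plain client sessions against a throwaway table tb1 (a sketch):

```sql
-- Session 1: a long-running query keeps tb1 open.
select id, sleep(1000) from tb1;

-- Session 2: must wait for session 1 before it can close the table.
flush tables;

-- Session 3: any new query on tb1 now queues behind the pending flush.
select count(*) from tb1;   -- state: Waiting for table flush

-- Session 4: observe the pile-up.
show processlist;
```

This is exactly the production symptom: one slow query, one waiting flush, and an ever-growing queue of ordinary SELECTs behind them.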

FLUSH TABLES Explanation

The statement FLUSH NO_WRITE_TO_BINLOG TABLES is equivalent to FLUSH TABLES except that it is not written to the binary log. Per the MySQL documentation:

The FLUSH statement has several variant forms that clear or reload internal caches, flush tables, or acquire locks.
By default the server writes FLUSH statements to the binary log; adding NO_WRITE_TO_BINLOG (or its alias LOCAL) suppresses this.
FLUSH TABLES closes all open tables, forces tables in use to be closed as soon as their current queries finish, and flushes the query cache and prepared‑statement cache.
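The binlog difference is easy to verify on a test instance (a sketch; the SHOW BINLOG EVENTS output depends on your binlog configuration):

```sql
FLUSH TABLES;                      -- written to the binary log by default
FLUSH NO_WRITE_TO_BINLOG TABLES;   -- not logged; replicas are unaffected
FLUSH LOCAL TABLES;                -- LOCAL is an alias for NO_WRITE_TO_BINLOG
SHOW BINLOG EVENTS;                -- only the first FLUSH appears
```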

Production‑Level Findings

A 10‑hour slow query was identified in the production slow‑query log:

# Time: 2022-04-25T09:12:53.416049+08:00
# User@Host: BlueMoon[BlueMoon] @  [192.168.50.144]  Id: 55128898
# Query_time: 38910.325000  Lock_time: 0.000328  Rows_sent: 0  Rows_examined: 14637241
SELECT d.coupon_master_code FROM mam_coupon_activity_interact a ...

The mam_coupon_send_log table lacked an index on interact_id, which forced the query to examine roughly 14.6 million rows and likely explains the long execution time. Adding the index is recommended:

alter table mam_coupon_send_log add index idx_interact_id(interact_id);
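After adding the index, EXPLAIN can confirm the optimizer actually uses it (a sketch; the WHERE clause below is an illustrative stand‑in for the real join condition):

```sql
-- Verify the new index is chosen (illustrative predicate).
EXPLAIN
SELECT *
FROM mam_coupon_send_log
WHERE interact_id = 12345;
-- Expect key = idx_interact_id and a small "rows" estimate,
-- rather than a full scan of ~14.6 million rows.
```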

The server’s max_execution_time variable was set to 0 (no timeout). The value is in milliseconds and applies only to read‑only SELECT statements; setting a global default (e.g., 120 seconds) helps prevent runaway queries:

set global max_execution_time = 120000;  -- milliseconds, i.e. 120 seconds
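For a single statement, the same limit can be applied with the MAX_EXECUTION_TIME optimizer hint (MySQL 5.7.8+) instead of changing the global default, with the value again in milliseconds:

```sql
-- Cap just this SELECT at 5 seconds; other statements are unaffected.
SELECT /*+ MAX_EXECUTION_TIME(5000) */ count(*)
FROM mam_user_coupon_package;
```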

Backup Optimization Options

Percona XtraBackup provides two useful options to avoid backup stalls caused by long‑running queries:

--ftwrl-wait-timeout=SECONDS          # how long to wait for blocking queries to finish before issuing FLUSH TABLES WITH READ LOCK; if they are still running when the timeout expires, the backup aborts instead of piling up waiters
--kill-long-queries-timeout=SECONDS   # how long to wait after issuing the lock before killing the queries that still block it (0 means never kill)

Example command:

/usr/bin/innobackupex --defaults-file=/etc/my.cnf --user=bigdata_admin --password=123456 \
  --socket=/mysqlData/data/mysql.sock --port=3306 \
  --ftwrl-wait-timeout=5 --kill-long-queries-timeout=5 /data/backup/

Summary

1. Business layer: Regularly optimize slow queries; aim for < 0.2 s execution time for OLTP workloads.

2. Database layer: Set a reasonable max_execution_time (e.g., 120 s) to automatically kill excessively long statements.

3. Backup layer: Use --ftwrl-wait-timeout and --kill-long-queries-timeout to limit lock‑wait time and prevent backup blockage.

Tags: MySQL, Database Optimization, Backup, slow query, Percona XtraBackup, FLUSH TABLES
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
