Testing the Impact of group_replication_member_expel_timeout on MySQL Group Replication under Network Latency
This article investigates how the MySQL 8.0 group_replication_member_expel_timeout parameter influences node expulsion in a group replication cluster when network latency is introduced, describing the test environment, methodology, commands, observations, and configuration recommendations.
The author, with 11 years of experience in database work (mainly Oracle and MySQL) in the financial sector, reports occasional node expulsions in a MySQL Group Replication cluster, suspecting network jitter as the cause. MySQL 8.0.13 introduced the group_replication_member_expel_timeout parameter to tolerate such delays.
Parameter Explanation (translated): group_replication_member_expel_timeout specifies the waiting time (in seconds) before a suspected member is expelled from the group after a suspicion is raised. The initial 5‑second detection period is not counted. The default was 0 (immediate expulsion) up to MySQL 8.0.20; from 8.0.21 onward the default is 5 seconds.
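As a minimal sketch, the parameter can be inspected and changed at runtime from a shell on each member. The login setup is an assumption (credentials supplied through an option file such as ~/.my.cnf), and SET PERSIST is used here only so that the value survives a server restart.

```shell
# Assumes the mysql client can log in without extra options (e.g. via ~/.my.cnf).
# Show the value currently in effect on this member.
mysql -e "SELECT @@GLOBAL.group_replication_member_expel_timeout;"

# Set the expel timeout to 5 seconds and persist it across restarts.
mysql -e "SET PERSIST group_replication_member_expel_timeout = 5;"
```

The statement has to be repeated on every member, since the test method sets the same value on each node.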
To evaluate the parameter’s effect, the author simulated varying network delays and adjusted the timeout value, observing the impact on node expulsion.
Test Environment
Test Method
1. Set group_replication_member_expel_timeout to a chosen value Y on each node.
2. Simulate network disconnection for a duration X on a node.
3. After network recovery, check database logs for state changes and timestamps.
4. Log into MySQL to view cluster status.
5. Record results.
6. Repeat steps 1-5 while varying Y or X.
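The relationship between X and Y that the test explores can be sketched as simple arithmetic. This is a rough model based on the parameter description (a suspicion is raised after the initial 5-second detection period, and expulsion follows once the timeout Y has also elapsed). The helper name predict_outcome is hypothetical and the boundaries are approximate: in the run described below, the suspicion appeared after about 6 seconds rather than exactly 5.

```shell
#!/bin/sh
# Rough model only: predict_outcome is a hypothetical helper, not a MySQL tool.
DETECTION_PERIOD=5   # initial detection period in seconds, not counted in Y

predict_outcome() {
    x=$1   # X: seconds the node stays unreachable
    y=$2   # Y: group_replication_member_expel_timeout in seconds
    if [ "$x" -lt "$DETECTION_PERIOD" ]; then
        echo "no suspicion"
    elif [ "$x" -lt $((DETECTION_PERIOD + y)) ]; then
        echo "suspected, not expelled"
    else
        echo "expelled"
    fi
}

predict_outcome 3 5    # outage shorter than the detection period
predict_outcome 8 5    # outage ends before detection period + Y
predict_outcome 10 5   # outage lasts at least detection period + Y
```

Comparing the predicted outcome with what the logs show for each (X, Y) pair is one way to organize the recorded results.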
Test Commands
-- Simulate network delay
# tc qdisc add dev eth0 root netem delay 10s
-- View cluster status
mysql> select * from performance_schema.replication_group_members;
-- View database logs
# tail -f mysqld.log
Test Process
1. Set timeout to 5 seconds on all nodes.
2. Check cluster status before inducing delay.
3. On node mgr2, use tc to add a 10‑second delay and record the start time.
4. Observe cluster status from mgr1: mgr2 is first reported as UNREACHABLE and, once the timeout expires, is expelled from the group.
5. Logs on mgr2 show that about 6 seconds after the delay starts the node can no longer reach the other members, and about 4 seconds later (10 s in total) it is expelled.
6. Remove the network delay with tc qdisc del dev eth0 root.
7. After recovery, the expelled node automatically attempts to rejoin (auto-rejoin), catches up on the missing transactions from the binary log, and returns to normal operation.
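Steps 3 and 6 can be wrapped in a small script so that the start and end times are always recorded. This is a hedged sketch, not the author's script: the names inject_delay, run, IFACE and DRY_RUN are illustrative, and with DRY_RUN=1 (the default here) the tc commands are only printed instead of executed.

```shell
#!/bin/sh
# Hypothetical wrapper around the tc commands used in the test.
IFACE=${IFACE:-eth0}      # interface to delay (assumption: eth0, as in the article)
DRY_RUN=${DRY_RUN:-1}     # 1 = print commands only, 0 = execute them (needs root)

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

inject_delay() {
    delay=$1   # netem delay, e.g. 10s
    hold=$2    # seconds to keep the delay in place
    echo "start: $(date '+%F %T')"
    run tc qdisc add dev "$IFACE" root netem delay "$delay"
    run sleep "$hold"
    run tc qdisc del dev "$IFACE" root
    echo "end:   $(date '+%F %T')"
}

inject_delay 10s 30
```

To apply the delay for real, the script would be run as root on the test node with DRY_RUN=0. Note that a session connected through the same interface is delayed as well.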
Note: If network latency persists, the faulty node may enter ERROR state and require a group replication restart.
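If a member does end up in ERROR state, restarting group replication on that member might look like the following. This sketch assumes that the mysql client can log in without extra options and that the member's recovery credentials are already stored, so START GROUP_REPLICATION needs no additional arguments.

```shell
# Run on the member that is stuck in ERROR state.
mysql -e "STOP GROUP_REPLICATION; START GROUP_REPLICATION;"

# Check that the member is reported as ONLINE again.
mysql -e "SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;"
```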
Test Results
Parameter Setting Recommendation
Setting group_replication_member_expel_timeout to 5 seconds helps avoid immediate node expulsion during brief network delays, while still allowing timely removal of truly faulty nodes. Adjust the value based on the latency observed in your own network (typically <1 s), and if the timeout is increased, make sure the XCom message cache is large enough by tuning group_replication_message_cache_size.
Additional Considerations
1. A longer timeout requires sufficient XCom cache to hold messages during the detection period plus the initial 5 seconds.
2. From MySQL 8.0.21, the default auto-rejoin attempts are 3 with a 5-minute interval; this can be changed via group_replication_autorejoin_tries.
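Both settings mentioned above can be adjusted at runtime. The values below are illustrative assumptions rather than recommendations from the test; the cache size in particular should be chosen according to the write volume of the group.

```shell
# Assumes the mysql client can log in without extra options (e.g. via ~/.my.cnf).
# Illustrative values only.
mysql -e "SET PERSIST group_replication_message_cache_size = 2147483648;"   # 2 GiB
mysql -e "SET PERSIST group_replication_autorejoin_tries = 3;"

# Verify what is now in effect on this member.
mysql -e "SELECT @@GLOBAL.group_replication_message_cache_size, @@GLOBAL.group_replication_autorejoin_tries;"
```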
References
https://dev.mysql.com/doc/refman/8.0/en/group-replication-responses-failure-expel.html
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html
https://mp.weixin.qq.com/s/DPFmCGmEfubRWpoikbY-XQ
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.