Understanding MHA Health Check Mechanism and Ping Types for MySQL Failover
The article explains why MHA’s default long‑connection ping cannot detect a hung MySQL instance, introduces three configurable ping_type options (select, connect, insert), shows how to modify the configuration and test the behavior with Perl code and shell scripts, and details the internal health‑check workflow.
Background
One night the MySQL database hung, causing application errors such as org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object and login commands becoming unresponsive. The issue was temporarily resolved by forcibly killing the process and restarting the database.
Conclusion
MHA uses a long‑connection ping that runs select 1 as Value ; after four consecutive connection failures it triggers a failover. When the database hangs, new connections cannot be created, but the existing long connection remains alive, so MHA incorrectly assumes the master is healthy.
Solution
Since version 0.53 MHA supports a ping_type parameter to control how master availability is checked. The three possible values are:
select : reuse a long connection to execute select 1 as Value . Simple but may miss many failures.
connect : create a new connection for each ping, execute select 1 as Value , then close it. This can detect TCP‑level problems.
insert : use an existing connection to run an INSERT that records a timestamp, allowing detection of disk‑space or I/O exhaustion.
Setting ping_type=connect forces MHA to open a fresh connection on each health check; if the connection fails, the failover is triggered.
Testing
Example MHA configuration (excerpt):
[server default]
manager_log=/Data/mha/log/workdir/my3306tst.log
manager_workdir=/Data/mha/workdir/my3306tst
remote_workdir=/Data/mysql/my3306/mha
master_binlog_dir=/Data/mysql/my3306/log
password=xxx
ping_interval=5
repl_password=xxx
repl_user=xxx
ssh_user=mysql
ssh_port=xxx
user=mha
master_ip_online_change_script="/usr/local/bin/master_ip_online_change"
master_ip_failover_script="master_ip_failover"
[server1]
hostname=xxx
port=3306
candidate_master=1
[server2]
hostname=xxx
port=3306
candidate_master=1To simulate a server that cannot accept new connections, a simple C program that spins forever is compiled and run on all CPUs, then two mysqlslap load generators are started. With ping_type=connect , four failed connection attempts cause MHA to log an error like:
Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'waiting for initial communication packet', system error: 110)When ping_type=select is used, the same situation does not trigger a failover.
MHA Health Check Mechanism
The call chain is:
MasterMonitor.pm|MHA::MasterMonitor::main()
→ MHA::MasterMonitor::wait_until_master_is_dead()
→ MHA::MasterMonitor::wait_until_master_is_unreachable()
→ MHA::HealthCheck::wait_until_unreachable()
→ MHA::HealthCheck::ping_select() / ping_insert() / ping_connect()The wait_until_unreachable() function repeatedly attempts to connect to the master, counts consecutive failures, and after four failures checks SSH reachability and decides whether the master is down. It forks a child process to run the selected ping function; if the child exits with code 0 the master is considered healthy, otherwise the error counter is incremented and failover logic may be invoked.
Key parts of the implementation (excerpt):
sub wait_until_unreachable($) {
my $self = shift;
my $log = $self->{logger};
my $error_count = 0;
eval {
while (1) {
# try to (re)connect if needed
if ($self->{_need_reconnect}) {
my ($rc, $mysql_err) = $self->connect(...);
if ($rc) {
# ignore known alive error codes
if (grep $_ == $mysql_err, @MHA::ManagerConst::ALIVE_ERROR_CODES) {
$log->info("Got MySQL error $mysql_err, but this is not a crash. Continue health check..");
$self->sleep_until();
next;
}
$error_count++;
$log->warning("Connection failed $error_count time(s)..");
$self->handle_failing();
if ($error_count >= 4) {
$self->{_need_reconnect} = 1;
last if $self->is_secondary_down();
$error_count = 0;
}
$self->sleep_until();
next;
}
$self->{_need_reconnect} = 0;
$log->info("Ping($self->{ping_type}) succeeded, waiting until MySQL doesn't respond..");
}
# disconnect if ping_type is connect
$self->disconnect_if() if $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT;
# fork child to run the selected ping
my $child_exit_code;
eval {
if ($self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT) {
$child_exit_code = $self->fork_exec(sub { $self->ping_connect() }, "MySQL Ping($self->{ping_type})");
} elsif ($self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_SELECT) {
$child_exit_code = $self->fork_exec(sub { $self->ping_select() }, "MySQL Ping($self->{ping_type})");
} elsif ($self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_INSERT) {
$child_exit_code = $self->fork_exec(sub { $self->ping_insert() }, "MySQL Ping($self->{ping_type})");
} else { die "Not supported ping_type!\n"; }
};
if ($@) { $log->error("Unexpected error when pinging! $@"); $child_exit_code = 1; }
if ($child_exit_code == 0) {
$self->update_status_ok();
$error_count = 0;
$self->kill_sec_check();
$self->kill_ssh_check();
} else {
$error_count++;
$self->{_need_reconnect} = 1;
$self->handle_failing();
}
$self->sleep_until();
}
$log->warning("Master is not reachable from health checker!");
};
# final return values omitted for brevity
}The article also lists MySQL error codes that are considered “alive” and therefore do not trigger a failover (e.g., 1040, 1045, 1129, etc.).
Key Takeaways
Default MHA ping may miss failures because it reuses a long‑lived connection.
Changing ping_type to connect or insert provides more reliable detection of master unavailability.
Proper configuration of ping_interval and testing with simulated load helps verify failover behavior.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.