
Analysis of MHA Master Crash Failover Process and Source Code

This article examines MHA, a long‑standing open‑source MySQL high‑availability solution, and walks through its master‑crash failover workflow at the source‑code level: configuration checks, binlog handling, new‑master selection, and slave recovery. It closes with a concise step‑by‑step checklist for practitioners.

Aikesheng Open Source Community

Introduction: MHA has been an open‑source MySQL HA solution for nearly a decade. Although it is showing its age, its master‑crash failover logic is still instructive, and this article examines it in detail.

Source code analysis: The main failover routine is shown below, highlighting configuration checks, SSH connectivity, binlog handling, the GTID vs. non‑GTID paths, selection of a new master, and recovery of the slaves. Key functions such as do_master_failover, init_config, force_shutdown, save_master_binlog, and recover_slaves are described.

sub main {
    ...
    # Run the failover; any die() inside is caught by eval.
    eval { $error_code = do_master_failover(); };
    if ($@) { $error_code = 1; }
    if ($error_code) { finalize_on_error(); }
    return $error_code;
}

sub do_master_failover {
    my $error_code = 1;
    my ($dead_master, $new_master);
    eval {
        # Phase 1: read the config and check server/replication settings
        my ($servers_config_ref, $binlog_server_ref) = init_config();
        $log->info("Starting master failover.");
        $log->info("* Phase 1: Configuration Check Phase..\n");
        MHA::ServerManager::init_binlog_server($binlog_server_ref, $log);
        $dead_master = check_settings($servers_config_ref);
        if ($_server_manager->is_gtid_auto_pos_enabled()) {
            $log->info("Starting GTID based failover.");
        }
        else {
            $_server_manager->force_disable_log_bin_if_auto_pos_disabled();
            $log->info("Starting Non-GTID based failover.");
        }
        $log->info("* Phase 1: Configuration Check Phase completed.\n");

        # Phase 2: make sure the dead master is really down
        $log->info("* Phase 2: Dead Master Shutdown Phase..\n");
        force_shutdown($dead_master);
        $log->info("* Phase 2: Dead Master Shutdown Phase completed.\n");

        # Phase 3: recover the master side
        $log->info("* Phase 3: Master Recovery Phase..\n");
        check_set_latest_slaves();
        if (!$_server_manager->is_gtid_auto_pos_enabled()) {
            # non-GTID only: copy unapplied binlog events off the dead master
            $log->info("* Phase 3.2: Saving Dead Master's Binlog Phase..\n");
            save_master_binlog($dead_master);
        }
        $log->info("* Phase 3.3: Determining New Master Phase..\n");
        my $latest_base_slave;
        if ($_server_manager->is_gtid_auto_pos_enabled()) {
            $latest_base_slave = $_server_manager->get_most_advanced_latest_slave();
        }
        else {
            $latest_base_slave = find_latest_base_slave($dead_master);
        }
        $new_master = select_new_master($dead_master, $latest_base_slave);
        my ($master_log_file, $master_log_pos, $exec_gtid_set) =
          recover_master($dead_master, $new_master, $latest_base_slave,
            $binlog_server_ref);
        $new_master->{activated} = 1;
        $log->info("* Phase 3: Master Recovery Phase completed.\n");

        # Phase 4: repoint and restart replication on the surviving slaves
        $log->info("* Phase 4: Slaves Recovery Phase..\n");
        $error_code = recover_slaves($dead_master, $new_master, $latest_base_slave,
            $master_log_file, $master_log_pos, $exec_gtid_set);
        if ($g_remove_dead_master_conf && $error_code == 0) {
            MHA::Config::delete_block_and_save($g_config_file, $dead_master->{id}, $log);
        }
        cleanup();
    };
    if ($@) {
        if ($dead_master && $dead_master->{not_error}) { $log->info($@); }
        else { MHA::ManagerUtil::print_error("Got ERROR: $@", $log); }
        $_server_manager->disconnect_all() if $_server_manager;
        undef $@;
    }
    eval { send_report($dead_master, $new_master); };
    return $error_code;
}
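In the non‑GTID path, finding the latest slave comes down to comparing the dead master's binlog coordinates that each slave has received: first the binlog file name, then the position within it. The following is a minimal Python sketch of that comparison (the dictionary fields loosely mirror SHOW SLAVE STATUS output; it is an illustration, not MHA's actual Perl code):

```python
def most_advanced_slave(slaves):
    """Return the slave that has read furthest into the dead master's
    binlog, comparing (binlog file sequence, read position) tuples."""
    def coords(s):
        # "mysql-bin.000042" -> sequence number 42
        seq = int(s["master_log_file"].rsplit(".", 1)[1])
        return (seq, s["read_master_log_pos"])
    return max(slaves, key=coords)

slaves = [
    {"host": "db2", "master_log_file": "mysql-bin.000042", "read_master_log_pos": 120},
    {"host": "db3", "master_log_file": "mysql-bin.000042", "read_master_log_pos": 981},
    {"host": "db4", "master_log_file": "mysql-bin.000041", "read_master_log_pos": 5550},
]
print(most_advanced_slave(slaves)["host"])  # db3
```

A later file always wins over any position in an earlier file, which is why the file sequence number is compared before the position.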

Step‑by‑step summary: (1) Check configuration, node versions, SSH reachability, and slave status; (2) Confirm the dead master is truly down, invoking the master IP failover script to release the VIP and, if configured, the shutdown script to power it off; (3) Gather slave status, save the dead master's unapplied binlog events, and determine the most advanced slave; (4) Choose a new master based on GTID/log position, replication lag, and candidate flags; (5) Recover the new master, repoint and start replication on the remaining slaves, and clean up.
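Step (4) above can be sketched in a few lines. The flag names candidate_master and no_master come from MHA's server configuration; the priority ordering below is a simplified illustration of the idea behind select_new_master, not a faithful port of its Perl implementation:

```python
def select_new_master(all_slaves, latest_slaves):
    """Pick a new master: no_master hosts are excluded, candidate_master
    hosts are preferred, and among ties a slave that already has the
    latest relay logs wins (less differential log to apply)."""
    eligible = [s for s in all_slaves if not s.get("no_master")]
    latest = [s for s in eligible if s in latest_slaves]
    for pool in (
        [s for s in latest if s.get("candidate_master")],    # candidate and latest
        [s for s in eligible if s.get("candidate_master")],  # candidate but behind
        latest,                                              # latest, no flag
        eligible,                                            # anything not excluded
    ):
        if pool:
            return pool[0]
    return None

slaves = [
    {"host": "db2"},
    {"host": "db3", "candidate_master": True},
    {"host": "db4", "no_master": True},
]
latest = [slaves[0]]  # only db2 has the newest relay logs
print(select_new_master(slaves, latest)["host"])  # db3
```

Here db3 is chosen even though db2 is more up to date, because an explicit candidate_master outranks freshness; db2's extra events are then applied to db3 during master recovery.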

The article concludes with a concise checklist of the failover process and references to related MySQL troubleshooting posts.

Tags: High Availability, MySQL, Database Replication, MHA, failover, Perl, master crash
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year on 1024 (Programmers' Day), and continuously operates and maintains them.
