MySQL High Availability Architecture and Practices at AutoHome
This article explains MySQL high‑availability concepts, defines HA, RPO and RTO, outlines common HA architectures such as master‑slave+VIP, MHA and MGR+Proxy, and details AutoHome's evolution from simple master‑slave setups to a container‑based MGR solution with automated failover and monitoring platforms.
MySQL is open-source, easy to operate, and high-performance, and it is the most widely used database at AutoHome. Because it is a critical backend storage component, its high availability (HA) is essential.
Unlike commercial databases, open-source MySQL leaves users to design and build the HA solution themselves. This article introduces the development history and implementation practice of AutoHome's MySQL HA architecture.
1. HA definition and metrics
High availability (HA) is a system's ability to keep operating without interruption. Two key metrics are the Recovery Point Objective (RPO), the maximum amount of data loss that can be tolerated in a disaster, and the Recovery Time Objective (RTO), the maximum acceptable time to restore the system to a running state.
Figure 1: RPO calculation
Figure 2: RTO calculation
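As a rough illustration of the two metrics, the sketch below computes the data-loss window and the downtime of a single incident, which can then be compared against the RPO and RTO targets. The function names and the timestamps are hypothetical and do not come from AutoHome's platform.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recoverable_point: datetime, failure_time: datetime) -> timedelta:
    """Time between the last recoverable data point and the failure (compare with RPO)."""
    return failure_time - last_recoverable_point


def downtime(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Time between the failure and the moment service runs again (compare with RTO)."""
    return service_restored - failure_time


# Hypothetical incident timeline, for illustration only.
last_replicated = datetime(2024, 1, 1, 12, 0, 0)
failure = datetime(2024, 1, 1, 12, 0, 5)
restored = datetime(2024, 1, 1, 12, 2, 5)

loss = data_loss_window(last_replicated, failure)
outage = downtime(failure, restored)
print(f"data-loss window: {loss.total_seconds():.0f}s, downtime: {outage.total_seconds():.0f}s")
```

An incident meets its objectives when the measured data-loss window is no larger than the RPO and the measured downtime is no larger than the RTO.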
2. MySQL HA challenges
The core HA problem is keeping the service available when a MySQL instance crashes, with no data loss (RPO) and a short recovery time (RTO). The challenges are preventing data loss when the master fails suddenly, keeping data consistent across nodes, and failing over automatically with minimal business impact.
3. Common MySQL HA architectures
3.1 Master-Slave Replication + VIP – a virtual IP (VIP) that points at the current master provides automatic failover, and DBA-maintained scripts handle manual switchovers.
Figure 5: Master‑Slave Replication + VIP
3.2 Master-Slave Replication + MHA – MHA (Master High Availability) is a third-party tool that, when the master fails, transfers the master's binary logs to the slaves and rebuilds the master-slave topology, keeping data loss to a minimum.
Figure 6: Master‑Slave Replication + MHA
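One step in this kind of failover is deciding which slave has received the most data from the failed master. The sketch below shows one way to make that choice by comparing binary log coordinates. It is not MHA's actual code: the `ReplicaStatus` type and the host names are hypothetical, and the two coordinate fields are assumed to mirror the `Master_Log_File` and `Read_Master_Log_Pos` values reported by `SHOW SLAVE STATUS`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReplicaStatus:
    host: str
    master_log_file: str      # e.g. "mysql-bin.000012"
    read_master_log_pos: int  # byte offset read so far in that file


def binlog_key(replica: ReplicaStatus) -> Tuple[int, int]:
    """Order replicas by binlog file sequence number, then by position."""
    file_index = int(replica.master_log_file.rsplit(".", 1)[1])
    return (file_index, replica.read_master_log_pos)


def pick_most_advanced(replicas: List[ReplicaStatus]) -> ReplicaStatus:
    """Return the replica that has read furthest into the master's binary log."""
    if not replicas:
        raise ValueError("no replicas available for promotion")
    return max(replicas, key=binlog_key)


# Hypothetical replica states at the moment the master fails.
replicas = [
    ReplicaStatus("db-2", "mysql-bin.000012", 400),
    ReplicaStatus("db-3", "mysql-bin.000012", 950),
    ReplicaStatus("db-4", "mysql-bin.000011", 99999),
]
candidate = pick_most_advanced(replicas)
print(candidate.host)
```

The file sequence number is compared before the position because a large offset in an older file is still behind any offset in a newer file.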
3.3 MySQL Group Replication (MGR) + Proxy – MGR provides HA, strong consistency, and automatic primary election; combined with a proxy, applications can switch to the new primary without reconfiguration.
Figure 7: MGR Replication + Proxy
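To make the proxy's role concrete, here is a minimal sketch of how a routing layer could locate the current primary from MGR membership data. It assumes MySQL 8.0 in single-primary mode, where `performance_schema.replication_group_members` exposes `MEMBER_HOST`, `MEMBER_PORT`, `MEMBER_STATE`, and `MEMBER_ROLE`. The rows are hard-coded here rather than queried, and `find_primary` is a hypothetical helper, not part of any proxy product.

```python
from typing import Iterable, Mapping, Optional, Tuple


def find_primary(members: Iterable[Mapping[str, object]]) -> Optional[Tuple[str, int]]:
    """Return (host, port) of the ONLINE member with the PRIMARY role, if any."""
    for member in members:
        if member["MEMBER_STATE"] == "ONLINE" and member["MEMBER_ROLE"] == "PRIMARY":
            return (str(member["MEMBER_HOST"]), int(member["MEMBER_PORT"]))
    return None


# Hypothetical rows, shaped like the result of:
#   SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
#   FROM performance_schema.replication_group_members;
rows_after_failover = [
    {"MEMBER_HOST": "db-1", "MEMBER_PORT": 3306, "MEMBER_STATE": "UNREACHABLE", "MEMBER_ROLE": "SECONDARY"},
    {"MEMBER_HOST": "db-2", "MEMBER_PORT": 3306, "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "PRIMARY"},
    {"MEMBER_HOST": "db-3", "MEMBER_PORT": 3306, "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "SECONDARY"},
]

write_target = find_primary(rows_after_failover)
print(write_target)
```

Because the proxy re-resolves the write target from group membership, applications keep a single connection address and do not need to be reconfigured when the primary changes.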
4. AutoHome MySQL HA practice
4.1 Development stages – (1) Master‑Slave + VIP era (pre‑2016), (2) Master‑Slave + MHA era (since 2016), (3) MGR + automation platform era (since 2020), each improving fault detection, automatic failover, and data consistency.
Figure 8: AutoHome MySQL HA evolution
4.2 HA operation platform – consists of three parts: MGR replication architecture, a Prometheus‑based monitoring platform that detects master failures, and an automated operation platform that performs failover within 2‑3 minutes.
Figure 9: HA design diagram
4.3 Containerized MySQL HA – MySQL runs in Kubernetes; the MySQL‑Operator monitors master status every 10 seconds and triggers the HA module after three consecutive failures, achieving failover in 1‑2 minutes.
Figure 11: Container deployment of MySQL HA
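The detection rule described above (probe every 10 seconds, act after three consecutive failures) can be sketched as follows. This is an illustration of the rule only, not the MySQL-Operator's implementation: the class, the `watch` function, and the simulated probe results are all hypothetical, and the sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, List


class ConsecutiveFailureDetector:
    """Signals failover only after `threshold` failed probes in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when failover should start."""
        if healthy:
            self.failures = 0  # any successful probe resets the streak
            return False
        self.failures += 1
        return self.failures >= self.threshold


def watch(check: Callable[[], bool], on_failover: Callable[[], None],
          interval_s: float = 10.0, threshold: int = 3,
          sleep: Callable[[float], None] = time.sleep) -> None:
    """Probe the master every `interval_s` seconds until failover is triggered."""
    detector = ConsecutiveFailureDetector(threshold)
    while True:
        if detector.record(check()):
            on_failover()
            return
        sleep(interval_s)


# Simulated probe results: a brief blip that recovers, then a real outage.
probe_results = iter([True, False, False, True, False, False, False])
probe_log: List[bool] = []
events: List[str] = []


def fake_probe() -> bool:
    result = next(probe_results)
    probe_log.append(result)
    return result


watch(fake_probe, lambda: events.append("failover"), sleep=lambda seconds: None)
print(len(probe_log), events)
```

Requiring consecutive failures means a short blip that recovers does not trigger a switch. With a 10-second interval and a threshold of three, a real outage is confirmed roughly 20 to 30 seconds after it begins, before the failover itself starts.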
5. Future plans
AutoHome intends to further tune the cluster to avoid automatic master switches caused by network jitter or large transactions, and to explore intelligent self‑healing mechanisms for database faults.
In summary, AutoHome’s MySQL HA solution combines MGR replication, monitoring, and an automated operation platform to provide rapid, automatic failover for both physical and containerized MySQL instances, ensuring high service stability.
HomeTech
HomeTech tech sharing