Why Alipay Crashed: Lessons on Backup and Disaster Recovery

The recent Alipay outage during Double‑11 was traced to a partial failure in its system message database. Users experienced payment errors, duplicate charges, and delayed withdrawals. The incident and the company’s response highlight the importance of comprehensive backup, redundancy, disaster‑recovery planning, monitoring, and security measures for service continuity.

Efficient Ops

Alipay Outage Overview

During the Double‑11 shopping festival, many users reported payment failures, duplicate charges, and delayed withdrawals when using Alipay. The incident caused widespread frustration and raised concerns about the platform’s reliability.

Root Cause and Official Response

Alipay’s official statement attributed the issue to a localized failure in its system message store, assured users that their funds remained safe, and said the problem was resolved by 10:50 AM.

System Message Store Explained

The system message store is a database that handles the receipt, storage, transmission, and retrieval of messages between system components and users, enabling inter‑module communication.

Backup Strategies

Full Backup: Periodic complete copies of all data, suitable for small datasets or when absolute data integrity is required.

Incremental Backup: Captures only the changes since the previous backup of any type, reducing storage and backup time; a restore requires the latest full backup plus every incremental taken after it, applied in order.

Differential Backup: Saves changes since the last full backup, balancing speed and storage; restoration uses the latest full backup plus the most recent differential.
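The restore rules for the three backup types can be sketched in a few lines of Python. This is an illustration only: the `Backup` record, its `taken_at` and `kind` fields, and the `restore_chain` function are hypothetical names, and the sketch assumes each incremental is taken relative to the previous backup of any kind.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: int  # hypothetical timestamp, e.g. hours since the schedule began
    kind: str      # "full", "incremental", or "differential"


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Return the backups to apply, in order, to reach the latest state."""
    ordered = sorted(backups, key=lambda b: b.taken_at)
    full_positions = [i for i, b in enumerate(ordered) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available; nothing to restore from")

    # Start from the most recent full backup; older backups are irrelevant.
    start = full_positions[-1]
    chain = [ordered[start]]
    later = ordered[start + 1:]

    # A differential holds every change since the full backup, so the newest
    # one supersedes all earlier differentials and incrementals.
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        newest = differentials[-1]
        chain.append(newest)
        later = [b for b in later if b.taken_at > newest.taken_at]

    # Remaining incrementals must all be applied, oldest first.
    chain.extend(b for b in later if b.kind == "incremental")
    return chain
```

With one full backup followed by two incrementals, the chain contains all three. Once a differential is taken, the incrementals before it drop out of the chain, which is why differential schedules restore faster at the cost of larger backup files.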

Redundancy Strategies

Hardware Redundancy: Clustered servers and RAID arrays ensure continued operation if a single device fails.

Software Redundancy: Redundant database instances, message queues, and load balancers provide seamless failover.

Data Redundancy: Replication to multiple data centers or cloud storage, using synchronous or asynchronous methods.
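The difference between synchronous and asynchronous replication is easier to see in a toy model. The sketch below is not how Alipay or any particular database replicates data; `Primary`, `Replica`, and their methods are hypothetical names used only to show where unreplicated data can accumulate.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)


class Primary:
    def __init__(self, replica, synchronous):
        self.log = []
        self.replica = replica
        self.synchronous = synchronous
        self.pending = deque()  # records written locally but not yet shipped

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            # Synchronous: the write returns only after the replica has it.
            self.replica.apply(record)
        else:
            # Asynchronous: return immediately and ship the record later.
            self.pending.append(record)

    def ship_pending(self):
        # Drain the backlog in write order.
        while self.pending:
            self.replica.apply(self.pending.popleft())

    def unreplicated(self):
        # Records that would be lost if the primary failed right now.
        return len(self.pending)
```

In synchronous mode `unreplicated()` is always zero, at the cost of every write waiting on the replica. In asynchronous mode writes return sooner, but whatever is still in `pending` when the primary fails is lost, which bounds the achievable recovery point.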

Disaster Recovery Planning

Define detailed recovery procedures, including backup activation and system restart steps.

Set Recovery Time Objective (RTO) and Recovery Point Objective (RPO) based on business criticality.

Establish an emergency response team comprising administrators, DBAs, network engineers, and security experts.
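A quick way to sanity-check RTO and RPO targets is to compare them against the backup schedule and a measured restore time. The sketch below is a simplification under stated assumptions: the function names are hypothetical, and worst-case data loss is taken to equal the backup interval, ignoring how long the backup itself takes.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # If a failure strikes just before the next backup runs, everything
    # written since the previous backup is lost.
    return backup_interval


def meets_objectives(backup_interval: timedelta,
                     estimated_restore_time: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare a backup schedule and restore estimate against RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": estimated_restore_time <= rto,
    }


result = meets_objectives(
    backup_interval=timedelta(hours=1),
    estimated_restore_time=timedelta(minutes=30),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=1),
)
```

Here hourly backups cannot satisfy a 15-minute RPO even though the 30-minute restore fits within the one-hour RTO, so `result["rpo_met"]` is `False` and `result["rto_met"]` is `True`. Closing that gap means backing up more often or adding replication.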

Monitoring and Alerting

Real‑time monitoring of server metrics, database connections, and message queue lengths.

Configure alert thresholds to notify staff via SMS, email, or messaging apps when anomalies occur.
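Threshold-based alerting can be sketched as a comparison of sampled metrics against configured limits. The metric names, limits, and the `send` callback below are illustrative assumptions; a real deployment would plug in its own monitoring agent and its SMS, email, or chat gateway.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name, e.g. "queue_length"
    limit: float  # alert when the sampled value exceeds this


def check_metrics(sample: Dict[str, float],
                  thresholds: List[Threshold]) -> List[str]:
    """Return one alert message per metric that exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        # Metrics missing from the sample are skipped, not treated as healthy.
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeds limit {t.limit}")
    return alerts


def notify(alerts: List[str], send: Callable[[str], None]) -> None:
    # `send` stands in for whatever channel is configured.
    for message in alerts:
        send(message)
```

Passing `send` as a callback keeps the threshold logic independent of the delivery channel, so the same check can drive several notification paths. In practice a missing metric often deserves its own alert, since a silent agent can hide an outage.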

Testing and Drills

Regularly perform recovery tests to verify backup integrity and process effectiveness.

Conduct tabletop and simulated drills to keep the response team proficient.
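One small piece of a recovery test is confirming that a backup file has not changed since it was written. The sketch below assumes a SHA-256 digest was recorded at backup time; the function names are hypothetical, and a matching checksum only shows the file is intact, not that it restores correctly.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_sha256: str) -> bool:
    # Compare against the digest recorded when the backup was taken.
    return sha256_of(path) == expected_sha256
```

A full drill would go further: restore the backup into an isolated environment and run application-level checks against the restored data.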

Security Measures

Encrypt backup data and rotate encryption keys regularly.

Enforce strict access controls using authentication and ACLs.

Historical Outages

Alipay has experienced several outages in recent years: on October 21, 2024, during a Double‑11 payment flow; on April 9, 2024, affecting various services; and on August 14, 2021, during the “Valentine’s Day” promotion. Earlier failures in 2019 and 2015 were caused by network glitches and fiber cuts.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
