Alipay’s Technical Risk System: Building SRE, TRaaS, and AIOps for High Availability
The article details how Alipay's technical risk team, led by researcher Chen Liang, evolved from early scalability work into a full-stack SRE organization, created the TRaaS risk-defense platform, and integrated AIOps to reach near five-nines (99.999%) availability and automated self-healing for its financial services.
In an interview with InfoQ, Alipay technical risk researcher Chen Liang (also known as Junyi) explains that true technical differentiation emerges only when ideas are executed to the extreme, and he traces the evolution of the risk-management architecture that underpins events such as Double Eleven.
Chen joined Alipay in 2007, initially working on search and middleware. He then led the transaction-splitting project and three generations of unit-based architecture, which set the company's standards for database sharding and multi-active disaster recovery.
Early on, Alipay’s monolithic system faced scalability limits, prompting a shift to database splitting and later to an active‑active multi‑region design to handle massive traffic spikes.
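The interview does not describe Alipay's actual sharding or routing rules, so the following is only a minimal sketch of the general idea: a stable key such as a user ID maps to a logical shard, and shards map to deployment units that can each serve traffic. The shard count, the unit names, and both function names are assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 100              # assumed number of logical shards
UNITS = ["unit-a", "unit-b"]  # hypothetical deployment units (regions/cells)


def shard_for(user_id: str) -> int:
    """Map a user ID to a logical shard with a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def unit_for(shard: int, units=UNITS) -> str:
    """Assign contiguous shard ranges to the units currently in service."""
    per_unit = -(-NUM_SHARDS // len(units))  # ceiling division
    return units[shard // per_unit]


if __name__ == "__main__":
    shard = shard_for("user-12345")
    print(shard, unit_for(shard))
    # If one unit is taken out of service, the same shard is routed elsewhere.
    print(unit_for(shard, units=["unit-b"]))
```

Because the user-to-shard mapping never changes, taking a unit out of rotation only changes which unit serves a shard, which is the property a multi-active design relies on during failover.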
In 2013 the company launched a “Quality 2.0” strategy and formed a dedicated technical risk team, which was formalized as the Technical Risk Department in 2015 and focused on systematic risk mitigation beyond traditional testing.
By 2016 Chen had built Alipay's first SRE (Site Risk Engineer) team, described as China's earliest SRE group. It combined development, operations, and DBA expertise and emphasized software-driven risk control over manual operations.
The SRE team introduced automated fault localization, adaptive disaster recovery, anti-jitter mechanisms, and fine-grained high availability that can isolate risk down to individual transactions.
To validate reliability, a technical “blue team” was created in 2017 to continuously attack the defense system, while a “red team” worked with business units to improve resilience. In this usage the blue team plays the attacker and the red team the defender.
In recent years the team launched the TRaaS (Technological Risk-defense as a Service) platform, offering 99.999% availability, trillion-level real-time fund reconciliation, and self-healing within five minutes through AIOps integration.
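To put the 99.999% figure in perspective, the downtime budget it implies can be computed directly. The sketch below is plain arithmetic rather than anything from Alipay's tooling, and the function name is made up for illustration.

```python
def allowed_downtime_minutes(availability: float, days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.5f} -> {minutes:.2f} minutes per year")
```

At five nines the yearly budget is about 5.26 minutes, so a single incident that takes the full five minutes to self-heal would consume nearly all of it.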
AIOps at Alipay uses AI models trained on massive volumes of monitoring data to assist root-cause analysis, automate alert handling, and expand monitoring coverage, especially for fund-security scenarios.
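The interview does not say which models Alipay trains, so the following is not their method. It is a deliberately simple statistical baseline, a rolling z-score over a metric series, meant only to illustrate what automated alerting on monitoring data can look like. The function name, window size, and threshold are assumptions.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=30, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Synthetic latency series with one spike at index 40.
    series = [100.0, 101.0] * 20 + [500.0] + [100.0, 101.0] * 5
    print(zscore_anomalies(series))
```

A baseline like this flags the spike but knows nothing about seasonality or correlated metrics, which is the kind of gap that learned models are typically brought in to close.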
Looking forward, Chen envisions a fully cloud‑native, AI‑augmented risk platform that operates with minimal human intervention, eventually achieving fully unattended change management.
AntTech
Technology is the core driver of Ant's future creation.