Inside Alipay’s Full‑Ecosystem Availability Monitoring: Architecture and Practices

At the 2024 GOPS Global Operations Conference in Shanghai, Alipay’s monitoring lead Tang Liang presented the challenges, architecture, risk‑prevention practices, and implementation details of the company’s full‑ecosystem availability monitoring system, highlighting its role in DevOps, SRE, and AIOps initiatives.

Efficient Ops

Alipay Full‑Ecosystem Availability Monitoring – Background and Challenges

The 24th GOPS Global Operations Conference and Research‑Operation Intelligence Summit was held in Shanghai on October 18‑19, 2024. The two‑day event covered large models, DevOps, SRE, AIOps, BizDevOps, cloud‑native, and security, with dedicated tracks on large models applied to operations and testing, digital transformation in banking and securities, platform engineering, DevOps/AIOps best practices, and practices at leading internet companies. Tang Liang, Head of Alipay Ecosystem Monitoring Assurance, delivered a talk titled “Technical System and Application of Alipay’s Full‑Ecosystem Availability Monitoring Assurance” (unauthorized reproduction prohibited).

Alipay Full‑Ecosystem Monitoring Assurance Architecture

The presentation described the overall technical architecture that enables end‑to‑end availability monitoring across Alipay’s entire ecosystem, integrating metrics collection, real‑time alerting, and automated remediation pipelines. The design emphasizes scalability, fault tolerance, and seamless integration with existing DevOps and SRE workflows.
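To make the collect, alert, remediate flow concrete, here is a minimal sketch in Python. It is not Alipay's implementation: the class name `AvailabilityMonitor`, the sliding-window size, the threshold, and the `remediate` callback are all assumptions introduced for illustration.

```python
from collections import deque
from typing import Callable, Deque


class AvailabilityMonitor:
    """Hypothetical sketch: sliding-window availability with an alert hook."""

    def __init__(self, window_size: int, threshold: float,
                 remediate: Callable[[float], None]) -> None:
        # Keep only the most recent `window_size` request outcomes.
        self.samples: Deque[bool] = deque(maxlen=window_size)
        self.threshold = threshold      # minimum acceptable availability
        self.remediate = remediate      # called once when an alert opens
        self.alerting = False           # avoids re-firing on every bad sample

    def record(self, success: bool) -> None:
        """Ingest one request outcome and evaluate the alert condition."""
        self.samples.append(success)
        availability = sum(self.samples) / len(self.samples)
        if availability < self.threshold:
            if not self.alerting:
                self.alerting = True
                self.remediate(availability)
        else:
            # Availability recovered, so the next breach alerts again.
            self.alerting = False
```

The `alerting` flag is a deliberate design choice in this sketch: remediation fires once per incident rather than once per failed request, which keeps an automated action from being triggered repeatedly while the window is still below the threshold.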

Pre‑Risk Assurance Practices in the Alipay Ecosystem

Key risk‑prevention measures were highlighted, including proactive health checks, synthetic transaction monitoring, and predictive anomaly detection powered by AIOps. These practices aim to identify potential service degradations before they impact users, thereby maintaining high availability standards.
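As a rough illustration of synthetic transaction monitoring combined with anomaly detection, the sketch below times a scripted transaction and flags latencies that sit far above recent history. The z-score test is a simple statistical stand-in chosen here for brevity, not the AIOps model described in the talk, and the function names and the `z_limit` default are hypothetical.

```python
import statistics
import time
from typing import Callable, List, Tuple


def run_probe(transaction: Callable[[], None]) -> Tuple[bool, float]:
    """Run one synthetic transaction; return (succeeded, latency in seconds)."""
    start = time.perf_counter()
    try:
        transaction()
        ok = True
    except Exception:
        # Any exception from the scripted transaction counts as a failed probe.
        ok = False
    return ok, time.perf_counter() - start


def is_anomalous(history: List[float], latest: float,
                 z_limit: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `z_limit` sample standard
    deviations above the mean of `history`."""
    if len(history) < 2:
        return False                    # not enough data to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest > mean            # flat history: any increase stands out
    return (latest - mean) / stdev > z_limit
```

In practice the probe would exercise a real user path, such as a test payment, on a schedule, and each latency would be checked against the recent history before being appended to it.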

Monitoring System Construction and Practice

The talk concluded with concrete implementation details, such as the deployment of distributed tracing, centralized logging, and automated incident response playbooks. Real‑world case studies demonstrated how these components work together to achieve rapid detection and resolution of issues across the Alipay ecosystem.
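An automated incident response playbook can be thought of as an ordered list of actions keyed by alert type. The following is a hedged sketch of that idea only; the alert kinds, step functions, and the `PLAYBOOKS` table are invented for illustration and do not come from the talk.

```python
from typing import Callable, Dict, List

# A step takes the alert payload and returns a log line describing what it did.
Step = Callable[[dict], str]


def shift_traffic(alert: dict) -> str:
    return f"shifted traffic away from {alert['service']}"


def restart_service(alert: dict) -> str:
    return f"restarted {alert['service']}"


def page_on_call(alert: dict) -> str:
    return f"paged on-call for {alert['service']}"


# Hypothetical mapping from alert kind to the ordered steps to run.
PLAYBOOKS: Dict[str, List[Step]] = {
    "high_error_rate": [shift_traffic, restart_service],
    "probe_failure": [restart_service, page_on_call],
}


def respond(alert: dict) -> List[str]:
    """Run the playbook for this alert; unknown kinds escalate to a human."""
    steps = PLAYBOOKS.get(alert["kind"], [page_on_call])
    return [step(alert) for step in steps]
```

Falling back to `page_on_call` for unrecognized alert kinds reflects a common safety default: automation handles the cases it was written for, and everything else goes to a person.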

For further details, the full PPT is available at: https://pan.baidu.com/s/1hpb2zy7qO-JNeDWjdLwa_Q?pwd=ih8b

Tags: Monitoring, cloud-native, DevOps, SRE, AIOps, availability
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
