Anomaly Detection and Attribution Diagnosis Practices at Ant Financial
This article presents Ant Financial's practical approaches to anomaly detection and attribution diagnosis, detailing the underlying concepts, four methodological categories, specific algorithms such as VBEM, AnoSVGD and Autoformer, multi‑dimensional factor analysis, real‑world challenges, and operational benefits for KPI monitoring and incident response.
Introduction – The article shares Ant Financial's practice of anomaly detection and attribution diagnosis, focusing on three aspects: attribution diagnosis, anomaly detection, and problems & challenges.
Attribution Diagnosis – Attribution diagnosis explains a KPI change by comparing factor contributions between a baseline period and the current period: the KPI is broken down into a factor tree, and each leaf's contribution to the overall delta is calculated. The system supports multi‑granularity comparison, single‑metric and multi‑factor attribution, dimensional combinations, and sub‑second intelligent responses on massive data.
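Example: the payment success rate drops from 80% to 60%. Simply summing each city's change in success rate does not reproduce the overall 20‑point drop, because cities carry different shares of the traffic. A refined logic scales each city's numerator and denominator against the overall totals, so that the contributions add up to the overall change and the true causes become visible.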
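To make the ratio example concrete, here is a minimal sketch of one additive decomposition for a ratio KPI. It is not necessarily the formula used at Ant Financial: the function name `ratio_contributions`, the city names, and the counts are all hypothetical. Each segment is credited with its excess over the overall baseline rate, scaled by total traffic, in the current period minus the same quantity in the baseline period, so the contributions sum exactly to the overall change.

```python
def ratio_contributions(baseline, current):
    """Attribute the change of an overall ratio KPI to each segment.

    baseline / current map a segment name to (numerator, denominator),
    e.g. city -> (successful payments, payment attempts).
    """
    n0 = sum(n for n, _ in baseline.values())
    d0 = sum(d for _, d in baseline.values())
    d1 = sum(d for _, d in current.values())
    r0 = n0 / d0  # overall baseline rate

    contributions = {}
    for segment in sorted(set(baseline) | set(current)):
        nb, db = baseline.get(segment, (0, 0))
        nc, dc = current.get(segment, (0, 0))
        # Segment excess over the baseline rate, scaled by total traffic,
        # in the current period minus the same quantity in the baseline period.
        contributions[segment] = (nc - r0 * dc) / d1 - (nb - r0 * db) / d0
    return contributions


# Hypothetical data: the overall rate falls from 800/1000 = 80% to 600/1000 = 60%.
baseline = {"city_a": (500, 600), "city_b": (300, 400)}
current = {"city_a": (500, 600), "city_b": (100, 400)}

contrib = ratio_contributions(baseline, current)

# Naive approach: add up each city's own change in success rate.
naive_sum = sum(
    current[c][0] / current[c][1] - baseline[c][0] / baseline[c][1]
    for c in baseline
)

print(contrib)    # city_a contributes about 0, city_b about -0.20
print(naive_sum)  # about -0.50, which does not match the overall -0.20
```

In this toy data only city_b changed, and the decomposition assigns it the full 20‑point drop, whereas the naive per‑city sum overstates the change.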
Four methodological categories: control‑variable method (simple arithmetic), chain‑rule method (complex arithmetic), Shapley value method (cooperative game for multiplicative scenarios), and ratio‑type method.
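As an illustration of the Shapley value method in a multiplicative scenario, the sketch below attributes the change of a KPI defined as a product of factors. The KPI formula, factor names, and numbers are hypothetical, and the exact enumeration of all orderings is only practical for a handful of factors. Each factor's contribution is its average marginal effect when factors are switched from baseline to current values in every possible order.

```python
from itertools import permutations
from math import factorial


def shapley_attribution(baseline, current, kpi):
    """Exact Shapley attribution of kpi(current) - kpi(baseline) to each factor."""
    names = list(baseline)
    totals = {name: 0.0 for name in names}
    for order in permutations(names):
        state = dict(baseline)
        previous = kpi(state)
        for name in order:
            state[name] = current[name]  # switch this factor to its current value
            value = kpi(state)
            totals[name] += value - previous
            previous = value
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in totals.items()}


# Hypothetical multiplicative KPI: revenue = users * conversion * order_value.
def revenue(f):
    return f["users"] * f["conversion"] * f["order_value"]


baseline_factors = {"users": 1000, "conversion": 0.10, "order_value": 50.0}
current_factors = {"users": 1200, "conversion": 0.08, "order_value": 50.0}

shapley = shapley_attribution(baseline_factors, current_factors, revenue)
print(shapley)  # users about +900, conversion about -1100, order_value 0
```

The contributions sum to the total change (4800 - 5000 = -200), which is the efficiency property that makes Shapley values attractive when factor effects interact.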
Anomaly Detection
1. Single‑Metric Anomaly Detection – Detects when a metric deviates from its normal fluctuation range, using STL‑style decomposition, LOWESS trend extraction, period identification (FFT + ACF), and adaptive baseline adjustment. It supports multiple sensitivity levels, online feedback, unsupervised incremental metric ingestion, and millisecond‑level performance.
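The period identification step (FFT + ACF) can be sketched as follows. This is a simplified illustration rather than the production algorithm: it assumes the trend has already been removed, uses a plain O(n²) discrete Fourier transform from the standard library in place of an FFT, and the names `detect_period`, `min_acf`, and `search_radius` are hypothetical. The spectrum proposes a candidate period, and the autocorrelation function confirms or rejects it.

```python
import cmath
import math


def dominant_frequency(x):
    """Return the DFT bin k (1..n//2) with the largest power, after removing the mean."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coef = sum(
            v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(centered)
        )
        power = abs(coef) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (0.0 for a constant series)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        return 0.0
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var


def detect_period(x, min_acf=0.5, search_radius=2):
    """Propose a period from the spectrum, then validate it with the ACF."""
    n = len(x)
    candidate = round(n / dominant_frequency(x))
    lags = [
        lag
        for lag in range(candidate - search_radius, candidate + search_radius + 1)
        if 2 <= lag <= n // 2
    ]
    if not lags:
        return None
    best = max(lags, key=lambda lag: autocorr(x, lag))
    return best if autocorr(x, best) >= min_acf else None


# Hypothetical hourly metric with a daily (24-step) cycle, ten days long.
series = [10 + math.sin(2 * math.pi * t / 24) for t in range(240)]
print(detect_period(series))       # 24
print(detect_period([5.0] * 48))   # None: a flat series has no period
```

Searching a few lags around the spectral candidate guards against the candidate being off by a step or two when the series length is not an exact multiple of the period.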
2. Multi‑Metric Anomaly Detection – Scores each server by aggregating multiple metrics. Defines data matrices X^j for each server and outlines three algorithms:
VBEM – Variational Bayesian EM with hidden state inference, predicting next‑step values and using chi‑square testing.
AnoSVGD – Density estimation via Stein variational gradient descent, iteratively refining the estimated probability density function (PDF) so that low‑probability points can be flagged as anomalies.
Autoformer – Time‑series decomposition using auto‑correlation, extracting periodic and trend components, and forecasting with iterative decoding.
Each algorithm combines model predictions with business‑defined sensitivity to flag anomalies.
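A minimal sketch of how a model prediction might be combined with a business‑defined sensitivity through a chi‑square test, in the spirit of the VBEM description above. Several things here are assumptions: the prediction errors are treated as independent Gaussians with known variances, the mapping `SENSITIVITY_ALPHA` from sensitivity to significance level is hypothetical, and the p‑value uses the Wilson-Hilferty normal approximation rather than an exact chi‑square distribution.

```python
import math

# Hypothetical mapping from a business-defined sensitivity to a significance level.
SENSITIVITY_ALPHA = {"low": 1e-4, "medium": 1e-3, "high": 1e-2}


def chi_square_statistic(observed, predicted, variance):
    """Sum of squared prediction errors, each scaled by its predictive variance."""
    return sum((o - p) ** 2 / v for o, p, v in zip(observed, predicted, variance))


def chi_square_p_value(stat, dof):
    """Approximate upper-tail probability (Wilson-Hilferty normal approximation)."""
    mean = 1 - 2 / (9 * dof)
    std = math.sqrt(2 / (9 * dof))
    z = ((stat / dof) ** (1 / 3) - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))


def is_anomalous(observed, predicted, variance, sensitivity="medium"):
    """Flag the observation if its p-value falls below the sensitivity's threshold."""
    stat = chi_square_statistic(observed, predicted, variance)
    p_value = chi_square_p_value(stat, dof=len(observed))
    return p_value < SENSITIVITY_ALPHA[sensitivity]


# Hypothetical server metrics: response time (ms), failure rate (%), CPU (%).
predicted = [100.0, 2.0, 50.0]
variance = [4.0, 0.25, 9.0]
observed = [106.0, 3.0, 50.0]  # chi-square statistic = 9 + 4 + 0 = 13

print(is_anomalous(observed, predicted, variance, "high"))  # True
print(is_anomalous(observed, predicted, variance, "low"))   # False
```

The same deviation is flagged at high sensitivity but tolerated at low sensitivity, which is the kind of business‑level control the text describes.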
Integration of Detection and Attribution – After detecting anomalies, the system computes each metric’s contribution using the attribution diagnosis framework, enabling precise root‑cause analysis (e.g., high response time and failure rate indicating a timeout issue requiring a rollback).
System Capabilities – Supports multi‑sensitivity control, online real‑time tuning, unsupervised incremental metric onboarding, multi‑granularity time data (minute/hour), real‑time attribution, and sub‑second performance.
Challenges – Attribution faces Simpson’s paradox when comparison groups are heterogeneous; anomaly detection must handle spectral (frequency) leakage, non‑stationary series, and trends and periods that evolve over time.
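Simpson's paradox is easy to reproduce with a small, entirely hypothetical data set: every segment's success rate improves, yet the overall rate falls because traffic shifts toward the weaker segment. The sketch below separates the overall change into a mix effect (change in traffic shares) and a rate effect (change in per‑segment rates). It assumes the same segments, with non‑zero traffic, appear in both periods.

```python
def mix_and_rate_effects(baseline, current):
    """Split the change of an overall rate into a mix effect and a rate effect.

    baseline / current map a segment to (numerator, denominator).
    """
    d0 = sum(d for _, d in baseline.values())
    d1 = sum(d for _, d in current.values())
    mix_effect = 0.0
    rate_effect = 0.0
    for segment in baseline:
        n_base, d_base = baseline[segment]
        n_cur, d_cur = current[segment]
        w0, w1 = d_base / d0, d_cur / d1          # traffic shares
        r0, r1 = n_base / d_base, n_cur / d_cur   # per-segment rates
        mix_effect += (w1 - w0) * r0    # share moved, rate held at baseline
        rate_effect += w1 * (r1 - r0)   # rate moved, share held at current
    return mix_effect, rate_effect


# Hypothetical segments: both rates improve (90% -> 92%, 50% -> 55%).
seg_baseline = {"segment_a": (900, 1000), "segment_b": (50, 100)}
seg_current = {"segment_a": (92, 100), "segment_b": (550, 1000)}

mix_effect, rate_effect = mix_and_rate_effects(seg_baseline, seg_current)
overall_change = 642 / 1100 - 950 / 1100  # about -0.28

print(mix_effect)    # about -0.327: traffic moved to the weaker segment
print(rate_effect)   # about +0.047: every segment actually improved
```

Reporting the two effects separately is one way to keep an attribution from blaming segments whose own performance did not deteriorate.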
Conclusion – The presented methods and algorithms provide a comprehensive, scalable solution for KPI monitoring, anomaly detection, and attribution diagnosis across individual services and entire clusters.
DataFunTalk