
Bilibili Alert Monitoring System: Design, Optimization, and Root‑Cause Analysis

Bilibili revamped its alert monitoring platform to meet rapid growth, focusing on effectiveness, timeliness, and coverage; it introduced a closed‑loop design and governance that cut weekly alerts by 90%, built a knowledge‑graph root‑cause system achieving 87.9% accuracy with sub‑minute latency, and integrated AIOps for ongoing refinement.

Bilibili Tech

Bilibili’s rapid growth in user base and services has raised the demand for highly reliable and available systems. To meet this need, the company has undertaken a major iteration of its alert‑monitoring platform, aiming to detect and locate problems quickly and accurately.

Background: The alert platform is critical for all Bilibili services—video playback, live streaming, comment systems, content review, data statistics, etc. It must monitor the health of these services in real time and trigger alerts that enable operators to locate and resolve incidents promptly.

Core Business Demands: Effectiveness (every alert corresponds to a real anomaly), timeliness (high‑priority alerts must be delivered instantly), and coverage/follow‑up (all responsible services should be covered and their alert status visible).

Key Design Goals: Reduce per‑person alerts by 90%, achieve 87.9% root‑cause analysis accuracy, and bring 99th‑percentile end‑to‑end detection latency under one minute.

1. Alert Platform Design Highlights

Core business demands are distilled into three objectives: effectiveness, timeliness, and coverage.

Closed‑Loop Model: Emphasizes alert definition and governance as the two critical loops that drive noise reduction and recall performance.

Alert Access: Supports three scenarios—platform‑wide templates for multi‑tenant rules, custom business‑defined rules, and third‑party event integration.

Alert Calculation Engine: A distributed engine performs multi‑level scheduling across zones and clusters, periodically evaluating data to fire alerts.
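The periodic evaluate-then-fire loop of such an engine can be sketched as follows. This is a minimal single-node illustration, not Bilibili's implementation; the rule fields, the `http_error_rate` rule, and its 5% threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AlertRule:
    """A hypothetical alert rule: a metric query plus a threshold predicate."""
    name: str
    query: Callable[[], float]          # fetches the latest metric value
    predicate: Callable[[float], bool]  # True when the value is anomalous
    interval_s: int = 30                # evaluation period in seconds

def evaluate_rules(rules: List[AlertRule]) -> List[str]:
    """One scheduling tick: evaluate every rule, return names of fired alerts."""
    fired = []
    for rule in rules:
        value = rule.query()
        if rule.predicate(value):
            fired.append(rule.name)
    return fired

# Illustrative rule: fire when the error rate exceeds 5%.
rules = [AlertRule("http_error_rate", query=lambda: 0.08,
                   predicate=lambda v: v > 0.05)]
```

In a distributed deployment, the article's multi-level scheduling would shard `rules` across zones and clusters, with each shard running this tick on its own `interval_s`.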

Alert Channel: Performs noise reduction (window, frequency, silence, suppression, aggregation) followed by rendering and delivery to users.
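Three of the listed noise-reduction strategies (silence, suppression, aggregation) can be sketched as a simple filtering pipeline. This is an illustrative reduction, assuming hypothetical rule names and a suppression model where a child alert is dropped while its parent alert is active:

```python
from collections import defaultdict
from typing import Dict, List, Set

def reduce_noise(alerts: List[Dict], silenced: Set[str],
                 suppressed_by: Dict[str, str],
                 active: Set[str]) -> Dict[str, List[Dict]]:
    """Apply silence and suppression, then aggregate survivors by service.

    - silence: drop alerts whose rule is explicitly muted
    - suppression: drop an alert if its parent alert is already active
    - aggregation: group survivors by service so one notification covers many
    """
    grouped = defaultdict(list)
    for a in alerts:
        if a["rule"] in silenced:
            continue                      # silenced rule: never notify
        parent = suppressed_by.get(a["rule"])
        if parent is not None and parent in active:
            continue                      # parent alert explains this one
        grouped[a["service"]].append(a)
    return dict(grouped)

alerts = [{"rule": "disk_full", "service": "comment"},
          {"rule": "pod_down", "service": "comment"},
          {"rule": "latency_high", "service": "playback"}]
out = reduce_noise(alerts, silenced={"latency_high"},
                   suppressed_by={"pod_down": "node_down"},
                   active={"node_down"})
```

Here `pod_down` is suppressed because `node_down` is active, `latency_high` is silenced, and only `disk_full` survives, grouped under its service. Window- and frequency-based reduction would add time-bucketing on top of this.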

2. Alert Governance Practices

Stage 1 – Goal Setting: Reduce weekly alerts from >1000 to ~80 per person.

Stage 2 – Data Analysis: Integrate alert data into a warehouse; compute impact factor = alert count × recipients × noise‑reduction coefficient.
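The impact factor above is a simple product, usable as a ranking key for deciding which rules to govern first. A minimal sketch (rule names and figures are illustrative, not Bilibili's data):

```python
def impact_factor(alert_count: int, recipients: int, noise_coeff: float) -> float:
    """Impact factor from the article: count × recipients × noise‑reduction coefficient."""
    return alert_count * recipients * noise_coeff

# (rule, weekly alert count, recipients, noise-reduction coefficient)
rules = [("rule_a", 1200, 15, 0.9),
         ("rule_b", 300, 40, 0.2),
         ("rule_c", 50, 5, 1.0)]

# Govern the highest-impact rules first.
ranked = sorted(rules, key=lambda r: impact_factor(*r[1:]), reverse=True)
```

Ranking by this product surfaces rules that are both noisy and widely delivered, which is where closing or recalibrating a single rule removes the most notifications.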

Stage 3 – Governance Actions: Close ineffective rules, calibrate service‑tree ownership, shrink notification scope, and enrich channel noise‑reduction strategies.

Results after six months of intensive governance:

Median alerts per week dropped from 1000 to 74 (≈7.4% of the original volume).

Total alert notifications fell from >3 million to 220 k (≈7%).

Per‑person alerts reduced from >1600 to 140 (≈8.8%).

3. Root‑Cause Analysis Design

3.1 Background: With an SLO‑driven system, Bilibili defined precise anomaly triggers but still faced a 5‑minute localization bottleneck due to reliance on manual expertise.

3.2 Design 1.0: Triggered by SLO alerts, a three‑phase approach maps errors to dimensions (cluster, instance, upstream request), traces linked service calls, and maps abnormal nodes to a knowledge graph for a top‑ranked root‑cause recommendation. Accuracy was acceptable but limited.

3.3 Design 2.0: Built an “exception knowledge graph” that captures expert experience, defines abnormal entities (applications, databases, caches, etc.), and models propagation relationships. This knowledge‑driven approach improves accuracy and enables systematic knowledge reuse.
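One way such a propagation graph can drive a recommendation is to walk the edges backwards from the alerted entity and prefer abnormal entities found further upstream. A minimal sketch, assuming a simple `propagates_to` edge map and hypothetical entity names (`redis`, `svc_a`, `gateway`); the article does not detail the actual ranking model:

```python
from collections import deque
from typing import Dict, List, Set

def rank_root_causes(propagates_to: Dict[str, List[str]], alerted: str,
                     abnormal: Set[str]) -> List[str]:
    """BFS backwards along propagation edges from the alerted entity;
    abnormal entities deeper upstream are ranked as likelier root causes."""
    reverse: Dict[str, List[str]] = {}
    for src, dsts in propagates_to.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    seen, candidates = {alerted}, []
    queue = deque([(alerted, 0)])
    while queue:
        node, depth = queue.popleft()
        if node != alerted and node in abnormal:
            candidates.append((depth, node))
        for up in reverse.get(node, []):
            if up not in seen:
                seen.add(up)
                queue.append((up, depth + 1))
    return [n for _, n in sorted(candidates, reverse=True)]  # deepest first

# Illustrative graph: an anomaly in redis propagates to svc_a, then the gateway.
causes = rank_root_causes({"redis": ["svc_a"], "svc_a": ["gateway"]},
                          alerted="gateway", abnormal={"svc_a", "redis"})
```

With both `svc_a` and `redis` abnormal, the deeper entity (`redis`) is recommended first, matching the intuition that the origin of a propagation chain is the root cause.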

3.4 AIOps Support: Implemented algorithms for time‑series prediction, anomaly detection, log clustering, event clustering, and graph‑based correlation. Added GBDT models for enhanced association analysis.
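Of the listed algorithms, time‑series anomaly detection can be illustrated with a trailing‑window 3‑sigma rule, a common baseline; the article does not specify which detector Bilibili actually uses, so treat this purely as a sketch:

```python
import statistics
from typing import List

def three_sigma_anomalies(series: List[float], window: int = 10,
                          k: float = 3.0) -> List[int]:
    """Flag indices whose value deviates more than k standard deviations
    from the mean of the trailing window."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mean = statistics.fmean(hist)
        std = statistics.pstdev(hist)
        if std > 0 and abs(series[i] - mean) > k * std:
            anomalies.append(i)
    return anomalies

# A mildly noisy metric with one spike at the end.
series = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 100]
```

Detectors like this feed the alert engine with dynamic thresholds instead of fixed ones; the prediction, clustering, and GBDT components mentioned above build further on such baselines.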

3.5 Challenges & Focus: Evaluating recommendation quality, collecting user feedback, and continuously enriching expert knowledge.

3.6 Case Studies

Upstream traffic surge causing service throttling—identified via SLO‑alert linkage and downstream impact analysis.

Downstream deployment causing gateway‑interface degradation—root cause pinpointed to a specific AI service change.

Redis request anomalies affecting service endpoints—traced through exception propagation paths.

3.7 Effect Evaluation: Root‑cause analysis now runs 20,652 times per day with 87.9% accuracy, 77.5% recall, 95th‑percentile latency of 10 s, and average latency of 4 s.

4. Summary & Outlook: A well‑designed alert platform must address diverse business needs through a closed‑loop model, while continuous governance is essential to keep alert volume manageable. Future work will deepen AI integration to further boost detection precision and reduce MTTR.

Q&A (selected): How much of the governance relies on manual rule filtering? Can business‑side alert noise reduction be exemplified? Are there automated post‑alert analysis scenarios? How is notification delivery status managed? How are high‑priority alerts prevented from being drowned out? How to improve root‑cause analysis efficiency and storage of massive logs?

Tags: SRE, incident management, AIOps, root cause analysis, Bilibili, alert monitoring