How Bilibili Revamped Its Monitoring Architecture: From Zabbix to Dapper
An in‑depth look at Bilibili’s multi‑layer monitoring overhaul, detailing the shift from a monolithic Zabbix setup to micro‑service‑based ELK, Dapper, Misaka, Traceon and Lancer systems, and how layered observability improves fault detection across business, application, and infrastructure levels.
1. Monitoring System Layers
Business Layer: Monitoring focuses on business metrics such as hotel order volume on Ctrip, product purchase metrics on Dianping and Vipshop, real‑time business, and Bilibili’s registration success rate.
Application Layer:
Endpoint monitoring – e.g., when the app cannot be opened in Hebei, client-side data is collected to find the cause.
Link‑level (APM) monitoring – e.g., tracing the full order flow on Vipshop to locate anomalies.
Log monitoring – reviewing historical logs, TLF, etc., to detect issues.
System Layer: Covers network, AOC, CDN quality, middleware, database problems, etc.
Initially Bilibili relied on Zabbix alone, a bottom-up approach that was inefficient at locating faults. The layered model above guided the improvements that followed.
2. Evolution of Bilibili’s Monitoring System
Improvement steps:
Developed an ELK log analysis platform so developers can view logs without repeated logins.
Migrated the monolithic ("giant stone") architecture to microservices, which exposed how difficult it had become to pinpoint root causes.
Implemented a Google‑inspired Dapper system for rapid issue location.
Built the Misaka system to collect link reports from PC and mobile clients.
Created the Traceon system to monitor business metrics and exceptions. It delivers reports to the content and product teams and sends alerts via SMS, email, etc., to operations, developers, product and support staff.
3. How Monitoring Entry Points Help Developers Find Issues
Internal metrics show that core‑business failures are detected within 5 minutes, while non‑core failures take up to 20 minutes.
Monitoring entry categories:
Dashboard: Shows changes, current alerts, failure rates, and full‑link views.
Frontend: Monitors service quality from the client side.
Exception: Aggregates failure rates and exception statistics.
Business: Tracks metrics such as submission success rates.
Link: Enables querying specific business call chains.
System: Monitors core network, CDN, IDC, etc.
Personnel responsible for each entry can focus on their domain to detect and resolve problems.
4. Dapper System
Bilibili's environment spans multiple languages and data centers, which makes fault diagnosis difficult. Inspired by Google's Dapper, the system records each request as a trace tree that captures the work done at every service the request passes through.
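To make the trace-tree idea concrete, here is a minimal sketch of rebuilding a tree from flat span records. The record layout (span ID, parent ID, service name) and the service names are assumptions made for illustration; the article does not describe Bilibili's actual storage format.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (span_id, parent_id, name): a hypothetical flat record, one per unit of work.
Record = Tuple[str, Optional[str], str]


def build_tree(records: List[Record]) -> Dict:
    """Turn flat span records into a nested trace tree rooted at the span with no parent."""
    children = defaultdict(list)
    root = None
    for span_id, parent_id, name in records:
        if parent_id is None:
            root = (span_id, name)
        else:
            children[parent_id].append((span_id, name))
    if root is None:
        raise ValueError("trace has no root span")

    def expand(node: Tuple[str, str]) -> Dict:
        span_id, name = node
        return {"name": name, "children": [expand(c) for c in children[span_id]]}

    return expand(root)


# Hypothetical request: gateway -> account, gateway -> archive -> mysql
records = [
    ("a", None, "gateway"),
    ("b", "a", "account"),
    ("c", "a", "archive"),
    ("d", "c", "mysql"),
]
tree = build_tree(records)
```

Walking such a tree shows which downstream call a slow or failed request was waiting on, which is the property that makes cross-service diagnosis tractable.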
Dapper samples traces rather than recording every request. Sampling can be fixed (1/1024), variable, or controllable, which reduces storage pressure while still keeping a high hit rate for fault detection.
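A fixed-rate sampler can be sketched as follows. Deciding by a hash of the trace ID is an assumption for illustration (the article does not say how the decision is made); it has the useful property that every service reaches the same keep-or-drop decision for a given request.

```python
import hashlib

FIXED_RATE = 1024  # keep about 1 of every 1024 traces, matching the fixed rate above


def should_sample(trace_id: str, rate: int = FIXED_RATE) -> bool:
    """Deterministically keep roughly 1/rate of traces, keyed on the trace ID."""
    if rate <= 1:
        return True  # a rate of 1 means "record everything"
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0


# Count how many of 100,000 made-up trace IDs would be kept at the fixed rate.
kept = sum(should_sample(f"trace-{i}") for i in range(100_000))
```

A variable or controllable rate would amount to passing a different `rate` per service or per endpoint, for example a lower value for low-traffic services so that their failures are still captured.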
Example code snippet (illustrative):
HTTP request interception code …
By integrating tracing into services, Bilibili can generate dependency graphs, identify bottlenecks during high-traffic events such as Double 11, and streamline cross-department troubleshooting.
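The original interception code is not reproduced in the article. As a stand-in, the following is a minimal sketch of how a request wrapper might propagate trace context and how a dependency graph could be derived from the collected spans. The header names, the `traced` helper, and the two toy services are all hypothetical.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple

TRACE_HEADER = "X-Trace-Id"          # hypothetical header names
PARENT_HEADER = "X-Parent-Span-Id"

Handler = Callable[[Dict[str, str]], str]


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    duration: float = 0.0


COLLECTED: List[Span] = []  # stand-in for a real span reporter


def traced(name: str, handler: Handler) -> Handler:
    """Wrap an HTTP-style handler so that each call records one span."""
    def wrapper(headers: Dict[str, str]) -> str:
        span = Span(
            trace_id=headers.get(TRACE_HEADER) or uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=headers.get(PARENT_HEADER),
            name=name,
        )
        # Headers the handler should forward on any downstream call.
        outgoing = {TRACE_HEADER: span.trace_id, PARENT_HEADER: span.span_id}
        started = time.perf_counter()
        try:
            return handler(outgoing)
        finally:
            span.duration = time.perf_counter() - started
            COLLECTED.append(span)
    return wrapper


def dependency_edges(spans: List[Span]) -> Set[Tuple[str, str]]:
    """Derive caller -> callee service edges from parent/child span links."""
    by_id = {s.span_id: s for s in spans}
    return {(by_id[s.parent_id].name, s.name) for s in spans if s.parent_id in by_id}


# Two toy services: "gateway" calls "account".
account = traced("account", lambda headers: "ok")
gateway = traced("gateway", lambda headers: account(headers))

gateway({})  # a request arriving without trace headers starts a new trace
edges = dependency_edges(COLLECTED)
```

Because the wrapper sits at the request boundary, service code needs no changes beyond forwarding the outgoing headers, which is what makes this style of instrumentation practical across many teams and languages.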
5. Lancer System
Early deployments collected logs on Docker, VMs, and physical machines, leading to uncontrolled small‑packet transmission and loss. Lancer aggregates small packets into larger ones, reports via Log Docker, and stores them in Elasticsearch.
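The small-packet aggregation can be sketched as a size-bounded batcher. This is an illustrative sketch only: the class name, the byte limit, and the `send` callback are assumptions, not Lancer's actual interface.

```python
from typing import Callable, List


class LogBatcher:
    """Buffer log lines and hand them to `send` as larger payloads."""

    def __init__(self, send: Callable[[bytes], None], max_bytes: int = 64 * 1024):
        self._send = send
        self._max_bytes = max_bytes
        self._buf: List[bytes] = []
        self._size = 0

    def add(self, line: str) -> None:
        data = line.encode("utf-8") + b"\n"
        # Flush first if this line would push the current batch over the limit.
        if self._buf and self._size + len(data) > self._max_bytes:
            self.flush()
        self._buf.append(data)
        self._size += len(data)

    def flush(self) -> None:
        if self._buf:
            self._send(b"".join(self._buf))
            self._buf, self._size = [], 0


sent: List[bytes] = []  # stand-in for the network transport
batcher = LogBatcher(sent.append, max_bytes=32)
for i in range(10):
    batcher.add(f"log line {i}")
batcher.flush()  # push out whatever is still buffered
```

A production batcher would normally also flush on a timer so that a quiet service does not hold log lines back indefinitely.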
The web UI lets users select a service and trigger automatic log collection. The collected logs are then buffered by syslog, log-agent, sys-agent and Kafka before being indexed.
6. Misaka System
Misaka provides endpoint monitoring for mobile and PC clients. It embeds front‑end instrumentation to report data, enabling detection of issues such as DNS or traffic hijacking.
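The article does not say how Misaka detects hijacking. One plausible check, shown here purely as an assumption, is for the client to report the IP address it resolved for a domain and for the server to compare that address against the networks the service is known to use. The function name is hypothetical and the addresses come from the reserved documentation ranges.

```python
import ipaddress
from typing import Iterable


def looks_hijacked(resolved_ip: str, expected_networks: Iterable[str]) -> bool:
    """Return True when a client-reported IP falls outside every expected network."""
    ip = ipaddress.ip_address(resolved_ip)
    return not any(ip in ipaddress.ip_network(net) for net in expected_networks)


# Hypothetical set of networks that legitimately serve the domain.
EXPECTED = ["203.0.113.0/24"]

normal = looks_hijacked("203.0.113.10", EXPECTED)
suspicious = looks_hijacked("198.51.100.7", EXPECTED)
```

Aggregating such reports by region and carrier would indicate whether a problem is isolated to one client or affects a whole network.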
7. Traceon System
Traceon collects business metrics and exceptions via instrumentation. It can automatically create entities (e.g., orders) if they are missing, and aggregates high‑volume counters in Redis before syncing to MySQL.
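The aggregate-then-sync pattern can be sketched as follows. To keep the example self-contained, an in-memory `Counter` stands in for Redis and SQLite stands in for MySQL; the table and metric names are hypothetical. The `INSERT OR IGNORE` step mirrors the idea of creating a missing entity automatically.

```python
import sqlite3
from collections import Counter

hot_counters: Counter = Counter()  # stand-in for Redis counters


def incr(metric: str, amount: int = 1) -> None:
    """Cheap, high-frequency increment that never touches the database."""
    hot_counters[metric] += amount


def sync(db: sqlite3.Connection) -> None:
    """Drain the accumulated deltas into the durable table in one transaction."""
    pending = dict(hot_counters)
    hot_counters.clear()
    with db:  # commits on success, rolls back on error
        for metric, delta in pending.items():
            # Create the row if it does not exist yet, then apply the delta.
            db.execute(
                "INSERT OR IGNORE INTO metric_totals(metric, total) VALUES (?, 0)",
                (metric,),
            )
            db.execute(
                "UPDATE metric_totals SET total = total + ? WHERE metric = ?",
                (delta, metric),
            )


db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.execute("CREATE TABLE metric_totals (metric TEXT PRIMARY KEY, total INTEGER NOT NULL)")

for _ in range(1000):
    incr("submission.success")
incr("submission.failure", 3)
sync(db)

incr("submission.success", 5)
sync(db)

totals = dict(db.execute("SELECT metric, total FROM metric_totals"))
```

The database sees two small write batches instead of more than a thousand individual updates, which is the point of aggregating hot counters before persisting them.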
Alert rules forward anomalies to the alarm system, notifying the responsible teams.
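A threshold-style alert rule might look like the sketch below. The rule fields, the metric name, and the `notify` callback are assumptions for illustration; the article does not describe Traceon's rule format.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AlertRule:
    metric: str
    max_failure_rate: float  # alert when the observed rate exceeds this value
    owners: List[str]        # teams to notify


def evaluate(rule: AlertRule, failures: int, total: int,
             notify: Callable[[str, str], None]) -> bool:
    """Forward an alert to every owner when the failure rate breaches the rule."""
    if total == 0:
        return False  # no traffic, nothing to judge
    rate = failures / total
    if rate <= rule.max_failure_rate:
        return False
    message = f"{rule.metric}: failure rate {rate:.1%} exceeds {rule.max_failure_rate:.1%}"
    for owner in rule.owners:
        notify(owner, message)
    return True


outbox: List[Tuple[str, str]] = []  # stand-in for the SMS/email alarm system
rule = AlertRule("submission", max_failure_rate=0.05, owners=["ops", "dev"])

fired_low = evaluate(rule, failures=2, total=100, notify=lambda o, m: outbox.append((o, m)))
fired_high = evaluate(rule, failures=10, total=100, notify=lambda o, m: outbox.append((o, m)))
```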
8. Outlook
Each sub‑system becomes more independent and accountable.
Dashboards and alert mechanisms become more accurate.
Monitoring processes and on‑call practices are further refined.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.