
How Bilibili Revamped Its Monitoring Architecture: From Zabbix to Dapper

An in‑depth look at Bilibili’s multi‑layer monitoring overhaul, detailing the shift from a monolithic Zabbix setup to micro‑service‑based ELK, Dapper, Misaka, Traceon and Lancer systems, and how layered observability improves fault detection across business, application, and infrastructure levels.


1. Monitoring System Layers

Business Layer: Monitoring focuses on business-level metrics, such as hotel order volume at Ctrip, product purchase metrics at Dianping and Vipshop, and real-time business indicators like Bilibili's registration success rate.

Application Layer:

Endpoint monitoring – e.g., when the app fails to open for users in Hebei, client-side data is collected to find the cause.

Link‑level (APM) monitoring – e.g., tracing the full order flow on Vipshop to locate anomalies.

Log monitoring – reviewing historical logs, TLF, etc., to detect issues.

System Layer: Covers network, AOC, CDN quality, middleware, database problems, etc.

Initially, Bilibili relied solely on Zabbix, a bottom-up approach that proved inefficient; the layered model above guided the improvements described below.

2. Evolution of Bilibili’s Monitoring System

Improvement steps:

Developed an ELK log analysis platform so developers can view logs without repeatedly logging in to individual servers.

Migrated from the monolithic ("giant stone") architecture to microservices, which made root causes much harder to pinpoint across service boundaries.

Implemented a Google‑inspired Dapper system for rapid issue location.

Built the Misaka system to collect link reports from PC and mobile clients.

Created the Traceon system to monitor business metrics and exceptions, delivering reports to content and product teams and sending alerts via SMS, email, etc. to operations, development, product, and support staff.

3. How Monitoring Entry Points Help Developers Find Issues

Internal metrics show that core‑business failures are detected within 5 minutes, while non‑core failures take up to 20 minutes.

Monitoring entry categories:

Dashboard: Shows changes, current alerts, failure rates, and full‑link views.

Frontend: Monitors service quality from the client side.

Exception: Aggregates failure rates and exception statistics.

Business: Tracks metrics such as submission success rates.

Link: Enables querying specific business call chains.

System: Monitors core network, CDN, IDC, etc.

Personnel responsible for each entry can focus on their domain to detect and resolve problems.

4. Dapper System

Bilibili’s environment spans multiple languages and data centers, making fault diagnosis difficult. Inspired by Google Dapper, each request is recorded as a trace tree, capturing all work information.

Dapper uses sampling to reduce storage pressure while keeping the hit rate for fault detection high; the sampling rate can be fixed (e.g., 1/1024), variable, or externally controlled.
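One common way to implement a fixed 1/1024 rate is deterministic head-based sampling keyed on the trace ID, so every service in a call chain makes the same keep/drop decision and sampled traces stay complete. The article does not describe Bilibili's exact mechanism, so the following is only a sketch of the general technique:

```python
import hashlib

def should_sample(trace_id: str, rate: int = 1024) -> bool:
    """Deterministic head-based sampling: hashing the trace ID means every
    service in a call chain makes the same keep/drop decision, so sampled
    traces stay complete. rate=1024 mirrors the fixed 1/1024 rate above."""
    digest = hashlib.md5(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0
```

Because the decision depends only on the trace ID, the rate can be lowered or raised at runtime (the "variable, controllable" part) without any coordination between services.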

Example code snippet (illustrative):

HTTP request interception code …
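The interception idea can be sketched as a wrapper around an outgoing HTTP call that propagates trace context and records a span. The header names (`X-Trace-Id`, `X-Span-Id`) and the `Span` shape here are assumptions for illustration; the article does not show Bilibili's actual code:

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical header names: the article does not specify the exact
# propagation headers Bilibili uses.
TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Span-Id"

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float = 0.0

collected: List[Span] = []  # stand-in for the span report pipeline

def traced_request(name: str, headers: Dict[str, str],
                   do_request: Callable[[Dict[str, str]], object]):
    """Wrap an outgoing HTTP call: reuse or mint a trace ID, record the
    caller's span ID as the parent, time the call, and report a span."""
    trace_id = headers.get(TRACE_HEADER, uuid.uuid4().hex)
    span = Span(trace_id, uuid.uuid4().hex[:16],
                headers.get(SPAN_HEADER), name)
    headers[TRACE_HEADER] = trace_id
    headers[SPAN_HEADER] = span.span_id
    start = time.perf_counter()
    try:
        return do_request(headers)
    finally:
        span.duration_ms = (time.perf_counter() - start) * 1000
        collected.append(span)
```

Each hop records which span called it, which is exactly the parent/child information needed to reassemble the trace tree described above.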

By integrating tracing into services, Bilibili can generate dependency graphs, identify bottlenecks during high‑traffic events such as Double 11, and streamline cross‑department troubleshooting.

5. Lancer System

Early deployments collected logs from Docker containers, VMs, and physical machines; uncontrolled transmission of many small packets led to loss. Lancer aggregates small packets into larger ones, reports them via Log Docker, and stores them in Elasticsearch.

The web UI lets users select a service and trigger automatic log collection, which is then buffered by syslog, log‑agent, sys‑agent and Kafka before being indexed.

6. Misaka System

Misaka provides endpoint monitoring for mobile and PC clients. It embeds front‑end instrumentation to report data, enabling detection of issues such as DNS or traffic hijacking.
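One way a backend can act on such client reports is to compare the IP the client resolved for the service domain against the address ranges the service actually serves from. The whitelist-CIDR approach below is a hedged sketch of one such check, not Misaka's documented logic:

```python
import ipaddress
from typing import Iterable

def looks_hijacked(reported_ip: str, expected_cidrs: Iterable[str]) -> bool:
    """A client beacon reports the IP it resolved for our domain; if that
    IP falls outside the ranges we actually serve from, flag possible DNS
    hijacking. expected_cidrs is a hypothetical whitelist."""
    ip = ipaddress.ip_address(reported_ip)
    return not any(ip in ipaddress.ip_network(c) for c in expected_cidrs)
```

Aggregating these flags by region or carrier is what turns individual beacons into an actionable "hijacking in province X" signal.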

7. Traceon System

Traceon collects business metrics and exceptions via instrumentation. It can automatically create entities (e.g., orders) if they are missing, and aggregates high‑volume counters in Redis before syncing to MySQL.
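The Redis-then-MySQL pattern can be sketched as per-event increments into a fast counter store, drained periodically into durable storage as one batch write. Here a `Counter` stands in for both stores so the sketch is self-contained; the real system uses Redis increments and MySQL rows:

```python
from collections import Counter

class CounterSync:
    """Per-event increments hit a fast store (Redis in the article; a
    Counter stands in here), and a periodic sync drains them into durable
    storage (MySQL in the article) as one batch write."""
    def __init__(self):
        self.hot = Counter()      # stand-in for Redis INCRBY
        self.durable = Counter()  # stand-in for MySQL rows

    def incr(self, metric: str, n: int = 1) -> None:
        self.hot[metric] += n     # cheap, per-event

    def sync(self) -> None:
        for metric, n in self.hot.items():
            self.durable[metric] += n  # batched write in the real system
        self.hot.clear()
```

The design point is that MySQL sees one write per metric per sync interval instead of one per event, which is what makes high-volume counters affordable.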

Alert rules forward anomalies to the alarm system, notifying the responsible teams.

8. Outlook

Each sub‑system becomes more independent and accountable.

Dashboards and alert mechanisms become more accurate.

Monitoring processes and on‑call practices are further refined.

Tags: Monitoring, Microservices, Operations, Observability, Distributed Tracing, Log Aggregation
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes original, widely read technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
