
Design and Implementation of a Cross‑Platform Real‑Time Troubleshooting System for Live Streaming

The team built a cross-platform real-time troubleshooting system for live streaming. It adds monitoring of critical business flows, unifies tracing around a global trace_id, simplifies the OpenTracing model, iterates the reporting component through three versions, propagates trace context across threads, and stitches telemetry into searchable event chains. Surfaced on dashboards, the system cut average diagnosis time from two hours to five minutes and achieved a 91% fault-resolution rate.

Bilibili Tech

Live streaming services have strong real‑time requirements, high complexity, long troubleshooting chains, and a large impact scope. When an online issue cannot be resolved immediately, every second degrades user experience and streamer revenue.

Often the symptoms observed on different ends (mobile, PC, web, server) are only surface manifestations. A seemingly simple video stutter may involve encoder configuration, network bandwidth allocation, server load, and more, so locating the root cause could take hours of manual effort.

To address this, a high‑efficiency cross‑platform real‑time troubleshooting system was built.

Key measures include:

Critical‑business monitoring: Real‑time instrumentation was added to key APIs, broadcast channels, and core processing logic, enriched with contextual information to improve accuracy and completeness of fault location.

Unified tracing system: A global trace_id field was introduced to link all instrumentation points across endpoints. The trace_id is stored in the data layer and visualized on dashboards, dramatically improving traceability.

The results were significant: cross‑department collaboration efficiency increased, fault‑resolution rate reached 91%, and average diagnosis time dropped from 2 hours to 5 minutes. System stability and user experience also improved.

Technical solution details

The design is based on concepts from OpenTracing (trace, span, level, type). Because the standard OpenTracing model did not fit the live‑streaming scenario, the team simplified the model, retained the trace_id and event context, and extended it with custom fields.
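The simplified model can be pictured as a flat event record that keeps the trace_id and event context but drops OpenTracing's full span tree. A minimal sketch in Python — the field names (`level`, `type`, `context`) come from the article, while `TraceEvent`, `new_trace_id`, and the example values are hypothetical:

```python
from dataclasses import dataclass, field
import time
import uuid

@dataclass
class TraceEvent:
    """Simplified tracing model: trace_id plus event context, no span tree."""
    trace_id: str
    level: str                # severity, e.g. "info", "warn", "error"
    type: str                 # business event type, e.g. "start_live"
    context: dict = field(default_factory=dict)   # custom extension fields
    ts_ms: int = field(default_factory=lambda: int(time.time() * 1000))

def new_trace_id() -> str:
    # A globally unique id that every endpoint attaches to its events.
    return uuid.uuid4().hex

event = TraceEvent(trace_id=new_trace_id(), level="info",
                   type="start_live", context={"room_id": 42})
```

Because every endpoint emits the same flat record, the backend can correlate mobile, web, and server events without reconstructing parent/child span relationships.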

Reporting component evolution

Three versions were iterated:

Version 1 – quick feasibility: basic fields (trace_id, level, type, etc.) were passed as function parameters, leading to verbose and intrusive code.

Version 2 – usability boost: an aggregation layer encapsulated parameters into an event model, provided default values, and reduced each instrumentation call to a single line.

Version 3 – robustness: a state-machine-driven directed graph tracks node transitions, handles multi-threading, and prevents trace_id loss or mis-association.
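One way to read the version-3 design: legal node transitions form edges in a directed graph, and any event whose transition is not an edge is rejected rather than attached to the wrong trace. A minimal sketch — the node names (`enter_room`, `pull_stream`, and so on) and the `TraceStateMachine` class are illustrative assumptions, not the team's actual graph:

```python
# Directed graph of legal node transitions (hypothetical live-room flow).
EDGES = {
    "enter_room":  {"pull_stream"},
    "pull_stream": {"first_frame", "pull_error"},
    "first_frame": {"exit_room"},
    "pull_error":  {"pull_stream", "exit_room"},
}

class TraceStateMachine:
    def __init__(self, start: str = "enter_room"):
        self.current = start

    def advance(self, node: str) -> bool:
        """Accept the event only if (current -> node) is an edge."""
        if node in EDGES.get(self.current, set()):
            self.current = node
            return True
        return False   # illegal transition: flag as anomaly, keep old state

sm = TraceStateMachine()
ok = sm.advance("pull_stream")        # legal edge, accepted
bad = sm.advance("exit_room")         # no edge pull_stream -> exit_room
```

Rejecting illegal transitions is what prevents an event from a stale or foreign flow from being mis-associated with the current trace_id.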

To solve multi‑threading issues, trace_id is cached and propagated across threads, with special handling for network requests and broadcasts. The directed‑graph also distinguishes cross‑endpoint and cross‑thread actions.
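In Python, the cache-and-propagate idea can be sketched with `contextvars`: the parent's context (including the active trace_id) is captured when a worker thread is spawned, so the child reports under the same trace. The variable name and `run_with_trace` helper are assumptions for illustration:

```python
import contextvars
import threading

# trace_id of the currently executing logical flow.
current_trace = contextvars.ContextVar("trace_id", default=None)

def run_with_trace(target) -> threading.Thread:
    """Capture the caller's context so the child thread inherits trace_id."""
    ctx = contextvars.copy_context()
    return threading.Thread(target=lambda: ctx.run(target))

results = []

def worker():
    # Without propagation this would read None; with it, the parent's id.
    results.append(current_trace.get())

current_trace.set("trace-001")
t = run_with_trace(worker)
t.start()
t.join()
# results now holds the parent's trace_id
```

Network requests and broadcasts need the same treatment: the trace_id is serialized into the request or broadcast payload and restored on the receiving side.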

Data processing and storage

Massive telemetry is cleaned, normalized, and linked into complete event chains using single‑trace and multi‑trace stitching algorithms. Stream‑processing pipelines filter trace‑related events, clean anomalies, and store results within a 5‑minute latency window, supporting flexible queries.
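The single-trace stitching step amounts to grouping cleaned events by trace_id and ordering each group by timestamp into one chain. A sketch under an assumed event schema (`trace_id`, `ts`, `type` keys are illustrative):

```python
from collections import defaultdict

def stitch(events):
    """Single-trace stitching: group events by trace_id, drop malformed
    records, and order each group by timestamp into one event chain."""
    chains = defaultdict(list)
    for e in events:
        if "trace_id" in e and "ts" in e:   # cleaning: skip anomalies
            chains[e["trace_id"]].append(e)
    return {tid: sorted(evts, key=lambda e: e["ts"])
            for tid, evts in chains.items()}

raw = [
    {"trace_id": "t1", "ts": 30, "type": "first_frame"},
    {"trace_id": "t1", "ts": 10, "type": "enter_room"},
    {"trace_id": "t2", "ts": 5,  "type": "enter_room"},
    {"trace_id": "t1", "ts": 20, "type": "pull_stream"},
    {"ts": 99, "type": "orphan"},            # no trace_id: filtered out
]
chains = stitch(raw)
```

Multi-trace stitching would then join related chains (for example, a client trace and the server trace it triggered) on shared correlation keys; that logic is scenario-specific and not shown here.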

Visualization

Dashboards present the stitched traces, allowing developers, testers, product managers, operations, and support staff to quickly locate anomalies across key scenarios such as app launch, live start/stop, mic‑up/down, and PK sessions.

In production, the system has resolved numerous real‑world incidents, reduced troubleshooting time, and improved overall platform reliability.

Live Streaming · Observability · Performance Monitoring · System Reliability · Distributed Tracing · Real-Time Tracing
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
