Design and High‑Availability Practices of Bilibili's Video Submission System
Bilibili’s video submission platform uses a layered micro‑service architecture with a DAG‑based scheduler and extensive observability. High‑availability tactics include storage sharding, a 32‑ to 64‑bit ID migration, full‑link stress testing, chaos engineering, and multi‑active data‑center deployment, while tooling such as trace correlation and automated alerting keeps the system stable and informs a future hybrid‑cloud migration.
Bilibili's video platform relies on a complex submission (稿件, literally "manuscript") production system that supports multiple upload channels (mobile, web, PC, etc.) and content types (UGC, PGC, commercial). The system has evolved over 15 years into a large micro‑service ecosystem with millions of lines of code and billions of daily data records.
Rapid business growth introduced technical debt: outdated components, services without clear owners, dirty code, and performance degradation. Key challenges include maintaining stability across many tightly coupled services, handling large‑scale data ingestion, and meeting strict business KPIs such as first‑open latency at the 90th/95th/99th percentiles (M90/M95/M99).
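The M90/M95/M99 KPIs are latency percentiles. As a minimal sketch (the sample data and function names are illustrative, not Bilibili's code), they can be computed from raw samples with the nearest‑rank method:

```python
# Sketch: computing M90/M95/M99 first-open latency percentiles from raw
# samples (milliseconds). All names and numbers here are illustrative.

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * N)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(len * p / 100) via negative floor division, clamped to >= 1.
    k = max(1, -(-len(ordered) * p // 100))
    return ordered[k - 1]

latencies_ms = [120, 95, 310, 150, 88, 270, 430, 160, 110, 205]
m90 = percentile(latencies_ms, 90)  # 9th of 10 sorted samples -> 310
m95 = percentile(latencies_ms, 95)  # 10th of 10 sorted samples -> 430
m99 = percentile(latencies_ms, 99)
```

In practice these figures would come from a metrics backend rather than raw lists, but the percentile semantics are the same.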
The architecture is layered:
Front‑end: various client entry points and internal submission portals.
Gateway (BFF): API aggregation, rate‑limiting, and security.
Business services: micro‑services split into B‑side (production) and C‑side (consumption) domains.
Content production platform: a DAG‑based scheduler that orchestrates atomic tasks (transcoding, review, AI models, copyright, etc.) with synchronous and asynchronous callbacks.
Metadata storage: a mix of relational databases, KV stores, and distributed storage.
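The content production platform's core idea, orchestrating atomic tasks as a DAG, can be sketched with a topological scheduler (Kahn's algorithm). The task names, dependency shape, and `execute` callback below are assumptions for illustration, not Bilibili's actual API:

```python
# Minimal DAG task scheduler sketch: run atomic production tasks
# (transcode, review, copyright, ...) in dependency order.
from collections import deque

def run_dag(tasks, deps, execute):
    """tasks: list of task names; deps: {task: [prerequisites]};
    execute: callback invoked once per task when its prerequisites finish."""
    indegree = {t: len(deps.get(t, [])) for t in tasks}
    children = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            children[p].append(t)
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        execute(t)
        order.append(t)
        for c in children[t]:          # unblock downstream tasks
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(tasks):
        raise ValueError("cycle detected in task graph")
    return order

# Example: transcoding, AI tagging, and the copyright check all fan out
# from upload; review waits on transcode + AI; publish waits on review
# and copyright.
order = run_dag(
    ["upload", "transcode", "ai_tag", "copyright", "review", "publish"],
    {"transcode": ["upload"], "ai_tag": ["upload"],
     "copyright": ["upload"], "review": ["transcode", "ai_tag"],
     "publish": ["review", "copyright"]},
    execute=lambda task: None,
)
```

A real scheduler would also handle the synchronous and asynchronous callbacks mentioned above (e.g. a transcode worker reporting completion later), retries, and persistence, but the ordering constraint is the same.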
Observability is built on logs, metrics, and OpenTelemetry. Core metrics include upload volume, success rate, QPS, latency, and SLA composition across dependent services. A one‑page dashboard aggregates these signals, enabling rapid detection of anomalies.
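One reason SLA composition across dependent services matters: availabilities of serial dependencies multiply, so the upload path's SLA is lower than any single dependency's. A tiny sketch (the dependency list and figures are made-up):

```python
# Sketch: composite SLA of a serial dependency chain, as a one-page
# dashboard might display it. Dependency names and numbers are illustrative.

def composite_sla(dependency_slas):
    """The chain is only up when every serial dependency is up,
    so availabilities multiply."""
    result = 1.0
    for sla in dependency_slas.values():
        result *= sla
    return result

deps = {"gateway": 0.9999, "metadata_db": 0.9995, "transcode": 0.999}
overall = composite_sla(deps)  # strictly below the weakest dependency
```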
High‑availability measures cover:
Storage optimization (MySQL/TiDB/ES/TiKV) with sharding, capacity planning, and DTS‑based data pipelines.
ID migration from INT32 to INT64 to avoid future overflow.
Full‑link write stress testing using shadow resources to verify system capacity.
Chaos engineering experiments to expose hidden dependencies and verify recovery procedures.
Multi‑active deployment across two data‑center zones to ensure upload continuity even during site failures.
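The INT32-to-INT64 migration above is motivated by simple headroom arithmetic: a monotonically increasing ID column overflows a signed 32-bit integer at 2,147,483,647. A back-of-the-envelope sketch (the current ID and growth rate are invented for illustration):

```python
# Sketch: estimating remaining headroom before an auto-increment ID
# overflows its column type. Figures are illustrative, not Bilibili's.
INT32_MAX = 2**31 - 1   # 2_147_483_647
INT64_MAX = 2**63 - 1

def days_until_overflow(current_id, ids_per_day, limit):
    """Whole days of headroom left before the column overflows."""
    remaining = limit - current_id
    return remaining // ids_per_day

days32 = days_until_overflow(1_900_000_000, 2_000_000, INT32_MAX)
days64 = days_until_overflow(1_900_000_000, 2_000_000, INT64_MAX)
```

With these assumed numbers the 32-bit column has only a few months left, while the 64-bit column's headroom is effectively unlimited, which is why the migration has to land well before the projection date.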
Additional tooling includes a full‑link trace system that correlates client‑side origin_trace_id with server‑side trace_id, a Lego‑style backend UI for case‑by‑case investigation, and automated alerting (SRE bots, enterprise WeChat, SMS, phone). The article concludes that while many improvements have been made, further work on hybrid‑cloud migration, complete multi‑active rollout, and automated end‑to‑end testing remains.
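The trace-correlation idea can be sketched as a join between client events carrying origin_trace_id and server spans that propagate it. The log record shape below is an assumption; a real deployment would query a trace backend (e.g. via OpenTelemetry) instead of joining raw logs:

```python
# Sketch: reconstructing a full-link trace for one upload by grouping
# server-side trace_ids under the client-side origin_trace_id.
# Record shapes and IDs are illustrative.

def correlate(client_logs, server_logs):
    """Map each client origin_trace_id to the server trace_ids it spawned."""
    by_origin = {}
    for span in server_logs:
        by_origin.setdefault(span["origin_trace_id"], []).append(span["trace_id"])
    return {
        log["origin_trace_id"]: by_origin.get(log["origin_trace_id"], [])
        for log in client_logs
    }

client_logs = [{"origin_trace_id": "c-001", "event": "upload_start"}]
server_logs = [
    {"origin_trace_id": "c-001", "trace_id": "s-gw-17", "service": "gateway"},
    {"origin_trace_id": "c-001", "trace_id": "s-tc-42", "service": "transcode"},
]
links = correlate(client_logs, server_logs)
```

This is the view a case-by-case investigation UI would present: one client action fanned out into its gateway and transcode spans.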
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.