Design and High‑Availability Practices of Bilibili's Video Submission System
Bilibili’s video submission platform uses a layered micro‑service architecture with a DAG‑based scheduler and extensive observability. High‑availability tactics include storage sharding, a 32‑ to 64‑bit ID migration, full‑link stress testing, chaos engineering, and multi‑active data‑center deployment, while tooling such as trace correlation and automated alerting keeps the system stable and informs a future hybrid‑cloud migration.
Bilibili's video platform relies on a complex submission (稿件, literally "manuscript") production system that supports multiple upload channels (mobile, web, PC, etc.) and content types (UGC, PGC, commercial). The system has evolved over 15 years into a large micro‑service ecosystem with millions of lines of code and billions of daily data records.
Rapid business growth introduced technical debt: outdated components, services without clear owners, dirty code, and performance degradation. Key challenges include maintaining stability across many tightly coupled services, handling large‑scale data ingestion, and meeting strict business KPIs such as first‑open latency at the 90th/95th/99th percentiles (M90/M95/M99).
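The M90/M95/M99 KPIs are latency percentiles. As a minimal sketch (the sample data and function names are illustrative, not Bilibili's code), they can be computed from raw samples with the nearest‑rank method:

```python
# Sketch: computing M90/M95/M99 first-open latency percentiles from raw
# samples (milliseconds). All names and numbers here are illustrative.

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * N)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(len * p / 100) via negative floor division, clamped to >= 1.
    k = max(1, -(-len(ordered) * p // 100))
    return ordered[k - 1]

latencies_ms = [120, 95, 310, 150, 88, 270, 430, 160, 110, 205]
m90 = percentile(latencies_ms, 90)  # 9th of 10 sorted samples -> 310
m95 = percentile(latencies_ms, 95)  # 10th of 10 sorted samples -> 430
m99 = percentile(latencies_ms, 99)
```

In practice these figures would come from a metrics backend rather than raw lists, but the percentile semantics are the same.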
The architecture is layered:
Front‑end: various client entry points and internal submission portals.
Gateway (BFF): API aggregation, rate‑limiting, and security.
Business services: micro‑services split into B‑side (production) and C‑side (consumption) domains.
Content production platform: a DAG‑based scheduler that orchestrates atomic tasks (transcoding, review, AI models, copyright, etc.) with synchronous and asynchronous callbacks.
Metadata storage: a mix of relational databases, KV stores, and distributed storage.
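The content production platform's core idea, orchestrating atomic tasks as a DAG, can be sketched with a topological scheduler (Kahn's algorithm). The task names, dependency shape, and `execute` callback below are assumptions for illustration, not Bilibili's actual API:

```python
# Minimal DAG task scheduler sketch: run atomic production tasks
# (transcode, review, copyright, ...) in dependency order.
from collections import deque

def run_dag(tasks, deps, execute):
    """tasks: list of task names; deps: {task: [prerequisites]};
    execute: callback invoked once per task when its prerequisites finish."""
    indegree = {t: len(deps.get(t, [])) for t in tasks}
    children = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            children[p].append(t)
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        execute(t)
        order.append(t)
        for c in children[t]:          # unblock downstream tasks
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(tasks):
        raise ValueError("cycle detected in task graph")
    return order

# Example: transcoding, AI tagging, and the copyright check all fan out
# from upload; review waits on transcode + AI; publish waits on review
# and copyright.
order = run_dag(
    ["upload", "transcode", "ai_tag", "copyright", "review", "publish"],
    {"transcode": ["upload"], "ai_tag": ["upload"],
     "copyright": ["upload"], "review": ["transcode", "ai_tag"],
     "publish": ["review", "copyright"]},
    execute=lambda task: None,
)
```

A real scheduler would also handle the synchronous and asynchronous callbacks mentioned above (e.g. a transcode worker reporting completion later), retries, and persistence, but the ordering constraint is the same.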
Observability is built on logs, metrics, and OpenTelemetry. Core metrics include upload volume, success rate, QPS, latency, and SLA composition across dependent services. A one‑page dashboard aggregates these signals, enabling rapid detection of anomalies.
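One reason SLA composition across dependent services matters: availabilities of serial dependencies multiply, so the upload path's SLA is lower than any single dependency's. A tiny sketch (the dependency list and figures are made-up):

```python
# Sketch: composite SLA of a serial dependency chain, as a one-page
# dashboard might display it. Dependency names and numbers are illustrative.

def composite_sla(dependency_slas):
    """The chain is only up when every serial dependency is up,
    so availabilities multiply."""
    result = 1.0
    for sla in dependency_slas.values():
        result *= sla
    return result

deps = {"gateway": 0.9999, "metadata_db": 0.9995, "transcode": 0.999}
overall = composite_sla(deps)  # strictly below the weakest dependency
```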
High‑availability measures cover:
Storage optimization (MySQL/TiDB/ES/TiKV) with sharding, capacity planning, and DTS‑based data pipelines.
ID migration from INT32 to INT64 to avoid future overflow.
Full‑link write stress testing using shadow resources to verify system capacity.
Chaos engineering experiments to expose hidden dependencies and verify recovery procedures.
Multi‑active deployment across two data‑center zones to ensure upload continuity even during site failures.
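The INT32-to-INT64 migration above is motivated by simple headroom arithmetic: a monotonically increasing ID column overflows a signed 32-bit integer at 2,147,483,647. A back-of-the-envelope sketch (the current ID and growth rate are invented for illustration):

```python
# Sketch: estimating remaining headroom before an auto-increment ID
# overflows its column type. Figures are illustrative, not Bilibili's.
INT32_MAX = 2**31 - 1   # 2_147_483_647
INT64_MAX = 2**63 - 1

def days_until_overflow(current_id, ids_per_day, limit):
    """Whole days of headroom left before the column overflows."""
    remaining = limit - current_id
    return remaining // ids_per_day

days32 = days_until_overflow(1_900_000_000, 2_000_000, INT32_MAX)
days64 = days_until_overflow(1_900_000_000, 2_000_000, INT64_MAX)
```

With these assumed numbers the 32-bit column has only a few months left, while the 64-bit column's headroom is effectively unlimited, which is why the migration has to land well before the projection date.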
Additional tooling includes a full‑link trace system that correlates client‑side origin_trace_id with server‑side trace_id, a Lego‑style backend UI for case‑by‑case investigation, and automated alerting (SRE bots, enterprise WeChat, SMS, phone). The article concludes that while many improvements have been made, further work on hybrid‑cloud migration, complete multi‑active rollout, and automated end‑to‑end testing remains.
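The trace-correlation idea can be sketched as a join between client events carrying origin_trace_id and server spans that propagate it. The log record shape below is an assumption; a real deployment would query a trace backend (e.g. via OpenTelemetry) instead of joining raw logs:

```python
# Sketch: reconstructing a full-link trace for one upload by grouping
# server-side trace_ids under the client-side origin_trace_id.
# Record shapes and IDs are illustrative.

def correlate(client_logs, server_logs):
    """Map each client origin_trace_id to the server trace_ids it spawned."""
    by_origin = {}
    for span in server_logs:
        by_origin.setdefault(span["origin_trace_id"], []).append(span["trace_id"])
    return {
        log["origin_trace_id"]: by_origin.get(log["origin_trace_id"], [])
        for log in client_logs
    }

client_logs = [{"origin_trace_id": "c-001", "event": "upload_start"}]
server_logs = [
    {"origin_trace_id": "c-001", "trace_id": "s-gw-17", "service": "gateway"},
    {"origin_trace_id": "c-001", "trace_id": "s-tc-42", "service": "transcode"},
]
links = correlate(client_logs, server_logs)
```

This is the view a case-by-case investigation UI would present: one client action fanned out into its gateway and transcode spans.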
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.