
Bilibili One‑Stop Big Data Cluster Management Platform (BMR): Architecture, Modules, and Future Outlook

Bilibili's One‑Stop Big Data Cluster Management Platform (BMR) unifies cluster management, a metadata warehouse, intelligent operations, and customized managers to oversee more than 50 service components, over 10,000 machines, exabyte‑level storage, and millions of CPU cores. It relies on cloud‑native containers, fault prediction, and resource sharing to improve efficiency and stability and to cut costs.

Bilibili Tech

The rapid growth of Bilibili's business has led to a massive increase in the scale and complexity of its big‑data workloads. To address this challenge, Bilibili built the Bilibili One‑Stop Big Data Cluster Management Platform (BMR), which integrates cluster management, metadata warehouse (元仓), intelligent operations, and customized managers.

Background : Launched in 2021, BMR now manages more than 50 service components, over 10,000 machines, exabyte‑level storage, and millions of CPU cores. Traditional CI/CD pipelines could not keep up with the fast‑growing demand, prompting the development of a dedicated big‑data component management system.

Development Stages :

Stage 1 – Standardized environment and service configuration, eliminating technical debt from ad‑hoc growth.

Stage 2 – Built a metadata warehouse to unify business, metric, and fault data, and expanded component coverage.

Stage 3 – Adopted cloud‑native containerization (K8s) for Spark/Flink migration, introduced capacity management, SLO monitoring, and early intelligent‑operation features.

Stage 4 – Enhanced intelligent operations with fault prediction, large‑model‑based Q&A, and customized managers for fine‑grained control.

Cluster Management : Provides core capabilities such as cluster, service, configuration, and package management. Supports over 50 daily change events affecting tens of thousands of machines. Features include visual DAG‑based workflow editing, gray‑release, fast rollback, configuration checks, change‑defense mechanisms, and health‑check metrics. It also handles heterogeneous environments, lifecycle management, and cross‑component coordination.
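The gray‑release and fast‑rollback features described above can be illustrated with a minimal sketch. It is not BMR's implementation: the batch fractions, the function names, and the `apply_change`, `rollback`, and `is_healthy` callbacks are all hypothetical stand‑ins for whatever the platform actually invokes. The idea is only that a change reaches a small batch first, a health check gates each next batch, and any failure reverts every host touched so far.

```python
from typing import Callable, List, Sequence


def split_batches(hosts: Sequence[str], fractions: Sequence[float]) -> List[List[str]]:
    """Split hosts into batches; each fraction is a cumulative share of the fleet."""
    batches: List[List[str]] = []
    start = 0
    for frac in fractions:
        # Always advance by at least one host, never past the end of the list.
        end = min(len(hosts), max(start + 1, round(len(hosts) * frac)))
        if start < end:
            batches.append(list(hosts[start:end]))
        start = end
    return batches


def gray_release(
    hosts: Sequence[str],
    apply_change: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    fractions: Sequence[float] = (0.1, 0.5, 1.0),  # assumed rollout steps
) -> bool:
    """Roll a change out batch by batch; revert everything on a failed health check."""
    changed: List[str] = []
    for batch in split_batches(hosts, fractions):
        for host in batch:
            apply_change(host)
            changed.append(host)
        if not all(is_healthy(host) for host in batch):
            # Fast rollback: undo in reverse order of application.
            for host in reversed(changed):
                rollback(host)
            return False
    return True
```

With ten hosts and the default fractions, the batches contain 1, 4, and 5 hosts. Hosts in later batches are never touched if an earlier batch fails its check.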

Cost‑Saving Strategies : Implements tidal deployment (day‑night workload shifting) and elastic scaling, saving >1,000 machines and >60,000 CPU cores by sharing resources between big‑data and online services.
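As a rough illustration of tidal deployment, the sketch below decides how many online‑service nodes could be lent to big‑data workloads at a given hour. The article does not give the real schedule or reserve policy, so the low‑traffic window, the reserve percentage, and the `TidalPolicy` name are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TidalPolicy:
    """Hypothetical policy: lend online nodes to big data during a low-traffic window."""

    night_start: int = 1        # assumed hour (0-23) when online traffic drops
    night_end: int = 7          # assumed hour when online traffic picks up again
    reserve_percent: int = 30   # assumed share of nodes always kept for online services

    def in_low_tide(self, hour: int) -> bool:
        if self.night_start <= self.night_end:
            return self.night_start <= hour < self.night_end
        # The window wraps past midnight, e.g. 23:00 to 06:00.
        return hour >= self.night_start or hour < self.night_end

    def lendable_nodes(self, hour: int, online_nodes: int) -> int:
        """Number of online nodes that may run big-data tasks at this hour."""
        if not self.in_low_tide(hour):
            return 0
        return online_nodes * (100 - self.reserve_percent) // 100
```

A scheduler could call `lendable_nodes` periodically and label or unlabel nodes accordingly. Integer percentages are used here to avoid floating‑point rounding in the node count.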

Metadata Warehouse (元仓) : Stores unified data (business metadata, golden metrics, fault logs). Enables cross‑host/component data exchange, ensures metadata consistency, and supports historical analysis. Applications include capacity/quota management, SLO definition, host health diagnostics, and task‑level performance analysis.
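One way to picture the SLO application is an error‑budget calculation over request counts that a metadata warehouse could supply. The article does not describe how BMR defines or evaluates SLOs, so the `SloWindow` record, its fields, and the availability‑style SLO below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Hypothetical per-service counters for one evaluation window."""

    service: str
    total_requests: int
    failed_requests: int


def availability(window: SloWindow) -> float:
    """Fraction of requests that succeeded in the window."""
    if window.total_requests == 0:
        return 1.0
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, target: float) -> float:
    """Share of the error budget still unspent; negative means the SLO is breached."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_failures = window.total_requests * (1 - target)
    if allowed_failures == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    return 1 - window.failed_requests / allowed_failures
```

For a 99.9% target, a window of 100,000 requests allows roughly 100 failures, so 25 failures leave about three quarters of the budget.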

Intelligent Operations :

Inspection platform – proactive risk detection and on‑demand inspection tasks.

Fault self‑healing – automated detection, diagnosis, and remediation for disk failures, IO hangs, process crashes, etc.

Smart Q&A – large‑model‑driven assistant that answers questions about host health and task diagnostics and provides direct links to scheduling platforms.
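The detect, diagnose, and remediate loop behind fault self‑healing can be sketched as a small rule table plus a playbook. Everything here is hypothetical: the `HostSignal` fields, the thresholds, the rule order, and the remediation strings are invented for illustration and are not taken from BMR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class HostSignal:
    """Hypothetical health snapshot collected from one host."""

    host: str
    disk_errors: int
    io_wait_pct: float
    process_alive: bool


def diagnose(signal: HostSignal) -> Optional[str]:
    """Map raw signals to a fault type; the rule order and thresholds are assumptions."""
    if not signal.process_alive:
        return "process_crash"
    if signal.disk_errors > 0:
        return "disk_failure"
    if signal.io_wait_pct > 90:
        return "io_hang"
    return None


# Playbook: fault type -> remediation. Real actions would call operational tooling.
PLAYBOOK: Dict[str, Callable[[str], str]] = {
    "process_crash": lambda host: f"restart service on {host}",
    "disk_failure": lambda host: f"take bad disk offline on {host}",
    "io_hang": lambda host: f"drain tasks from {host}",
}


def self_heal(
    signals: List[HostSignal], playbook: Dict[str, Callable[[str], str]]
) -> List[Tuple[str, str, str]]:
    """Return (host, fault, action) for every host that needs attention."""
    actions = []
    for signal in signals:
        fault = diagnose(signal)
        if fault is None:
            continue
        handler = playbook.get(fault)
        # Faults without an automated remedy are escalated to a human.
        action = handler(signal.host) if handler else f"escalate {fault} on {signal.host}"
        actions.append((signal.host, fault, action))
    return actions
```

Keeping diagnosis separate from remediation lets new fault types be added to the playbook without touching the detection rules.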

Customized Managers :

Flink Manager – manages >7,000 Flink jobs and >3,000 hosts, and supports templates, gray releases, and job migration during data‑center moves.

Kafka Manager – oversees >40 Kafka clusters (≈500 machines, >10,000 topics), offering cluster and topic management, rate‑limiting, and partition migration.

Spark Manager (in development) – will handle >200,000 daily Spark jobs, provide multi‑version management via OneClient, and integrate with a big‑data testing platform.
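For the partition migration that Kafka Manager offers, a plausible building block is generating a reassignment plan when a broker is retired. The sketch below emits JSON in the format accepted by Kafka's `kafka-reassign-partitions` tool (`version` plus a `partitions` list of topic, partition, and replicas). The `plan_migration` function, its round‑robin replacement strategy, and the sample broker IDs are assumptions; the article does not say how BMR computes its plans.

```python
import json
from typing import Dict, List, Tuple


def plan_migration(
    assignment: Dict[Tuple[str, int], List[int]],
    retiring_broker: int,
    candidates: List[int],
) -> dict:
    """Replace a retiring broker in every replica list, rotating through candidates."""
    partitions = []
    next_idx = 0
    for (topic, partition), replicas in sorted(assignment.items()):
        if retiring_broker not in replicas:
            continue
        # Find the next candidate that does not already hold a replica.
        for offset in range(len(candidates)):
            candidate = candidates[(next_idx + offset) % len(candidates)]
            if candidate not in replicas:
                next_idx = (next_idx + offset + 1) % len(candidates)
                break
        else:
            raise ValueError(f"no free candidate broker for {topic}-{partition}")
        new_replicas = [candidate if b == retiring_broker else b for b in replicas]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    current = {("logs", 0): [1, 2, 3], ("logs", 1): [2, 3, 4], ("logs", 2): [3, 4, 1]}
    print(json.dumps(plan_migration(current, retiring_broker=1, candidates=[4, 5]), indent=2))
```

The resulting file could be passed to the tool's `--reassignment-json-file` option. Rate limiting during the move, which the article also mentions, is out of scope for this sketch.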

Future Outlook : Continue building a comprehensive big‑data testing platform, strengthen change‑control with more defense points, enhance capacity management, risk prediction, and self‑healing, and explore broader large‑model applications in vertical domains to further improve stability and efficiency.

Overall, BMR demonstrates how a unified platform can improve change efficiency, ensure safe operations, and optimize resource utilization for large‑scale big‑data services.
