
Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili's one-stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka and other services into a unified system. It evolved through four stages (standardization, metadata-driven construction, containerization, and observability) to address node consistency, scaling, fault self-healing, and resource optimization. It now delivers elastic scaling and automated start/stop, with further cost savings and stability enhancements planned.

Bilibili Tech

This article introduces Bilibili's one-stop big data cluster management platform (BMR), which was developed to address the rapid growth and increasing complexity of the company's big data services. The platform has evolved through four main stages: survival (standardization and rapid iteration), subsistence (metadata management and scenario-based construction), prosperity (containerization and capacity management), and common prosperity (observability and service quality).

The article details the challenges faced during platform development, including node consistency issues, standardization implementation, large-scale management, iteration efficiency, and peak shaving. It then presents the platform's technical architecture, which consists of cluster management, component management, change control, and resource management modules.

The platform supports various big data components including HDFS, Spark, Flink, ClickHouse, Kafka, and others. It provides capabilities such as application iteration, configuration updates, smooth start/stop, elastic scaling, fault self-healing, containerization, service mixing, and tidal retreat. The article includes detailed tables showing the current status of support for different components and their capabilities.
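The component-by-capability tables mentioned above can be pictured as a simple support matrix. The following is a minimal sketch, not BMR's actual data model: the `Capability` enum only mirrors the capability names listed in this summary, and `SupportMatrix`, its methods, and the component name `"ExampleComponent"` are hypothetical names introduced for illustration. No real support status is encoded here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Capability(Enum):
    """Capability names taken from the summary; the enum itself is an assumption."""
    APPLICATION_ITERATION = "application iteration"
    CONFIGURATION_UPDATE = "configuration updates"
    SMOOTH_START_STOP = "smooth start/stop"
    ELASTIC_SCALING = "elastic scaling"
    FAULT_SELF_HEALING = "fault self-healing"
    CONTAINERIZATION = "containerization"
    SERVICE_MIXING = "service mixing"
    TIDAL_RETREAT = "tidal retreat"


@dataclass
class SupportMatrix:
    """Hypothetical registry: component name -> set of supported capabilities."""
    rows: dict[str, set[Capability]] = field(default_factory=dict)

    def mark_supported(self, component: str, capability: Capability) -> None:
        # Create the component's row on first use, then record the capability.
        self.rows.setdefault(component, set()).add(capability)

    def supports(self, component: str, capability: Capability) -> bool:
        # Unknown components are treated as supporting nothing.
        return capability in self.rows.get(component, set())

    def missing(self, component: str) -> list[Capability]:
        # Capabilities not yet recorded for the component, in enum order.
        have = self.rows.get(component, set())
        return [c for c in Capability if c not in have]


# Illustrative usage with a placeholder component (not real BMR status).
matrix = SupportMatrix()
matrix.mark_supported("ExampleComponent", Capability.ELASTIC_SCALING)
matrix.mark_supported("ExampleComponent", Capability.FAULT_SELF_HEALING)
gaps = matrix.missing("ExampleComponent")
```

Keeping the matrix as plain data makes it easy to answer coverage questions (for example, which capabilities a component still lacks) without hard-coding per-component logic.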

Future plans include further cost reduction through resource optimization, efficiency improvement through expanded fault self-healing and prediction, enhanced stability through increased coverage and standardization, and improved service quality management.
