Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili's one-stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka, and other services into a unified system. It evolved through four stages: standardization, metadata-driven construction, containerization, and observability. Along the way it addressed node consistency, scaling, fault self-healing, and resource optimization, and it now provides elastic scaling and automated start/stop. Further cost savings and stability enhancements are planned.

Bilibili Tech

Jul 19, 2024

This article introduces Bilibili's one-stop big data cluster management platform (BMR), which was developed to address the rapid growth and increasing complexity of the company's big data services. The platform has evolved through four main stages: survival (standardization and rapid iteration), subsistence (metadata management and scenario-based construction), prosperity (containerization and capacity management), and common prosperity (observability and service quality).

The article details the challenges faced during platform development, including node consistency issues, standardization implementation, large-scale management, iteration efficiency, and peak shaving. It then presents the platform's technical architecture, which consists of cluster management, component management, change control, and resource management modules.
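The node consistency challenge can be illustrated with a minimal sketch. It assumes, purely for illustration, that the platform keeps a desired state per host in its metadata and compares it with what is actually observed on each node. The `NodeState` record, its fields, and the `find_drift` function are hypothetical names and are not taken from BMR itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """Hypothetical record of what is deployed on one node."""
    component: str        # e.g. "HDFS" or "Kafka"
    version: str          # package version installed on the node
    config_revision: int  # revision number of the applied configuration


def find_drift(desired: dict[str, NodeState],
               observed: dict[str, NodeState]) -> dict[str, list[str]]:
    """Compare desired metadata with observed node state, keyed by hostname.

    Returns a mapping from hostname to a list of differences.
    Hosts that match their desired state are omitted from the result.
    """
    drift: dict[str, list[str]] = {}
    for host, want in desired.items():
        have = observed.get(host)
        if have is None:
            drift[host] = ["node missing from observed state"]
            continue
        issues: list[str] = []
        if have.component != want.component:
            issues.append(f"component {have.component} != {want.component}")
        if have.version != want.version:
            issues.append(f"version {have.version} != {want.version}")
        if have.config_revision != want.config_revision:
            issues.append(
                f"config revision {have.config_revision} != {want.config_revision}"
            )
        if issues:
            drift[host] = issues
    return drift


# Illustrative data only: hostnames and versions are made up.
desired = {
    "node-1": NodeState("HDFS", "3.3.1", 7),
    "node-2": NodeState("HDFS", "3.3.1", 7),
    "node-3": NodeState("HDFS", "3.3.1", 7),
}
observed = {
    "node-1": NodeState("HDFS", "3.3.1", 7),
    "node-2": NodeState("HDFS", "3.2.0", 7),
}
drift = find_drift(desired, observed)
```

In a design like this, the drift report would feed the change control step, which decides whether to redeploy, reconfigure, or flag a node for manual review. How BMR actually performs this comparison is not described in the summary above.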

The platform supports various big data components including HDFS, Spark, Flink, ClickHouse, Kafka, and others. It provides capabilities such as application iteration, configuration updates, smooth start/stop, elastic scaling, fault self-healing, containerization, service mixing, and tidal retreat. The article includes detailed tables showing the current status of support for different components and their capabilities.

Future plans include further cost reduction through resource optimization, efficiency improvement through expanded fault self-healing and prediction, enhanced stability through increased coverage and standardization, and improved service quality management.
