
Overview of the Berserker Big Data Platform and Its Data Development Architecture

The Berserker big‑data platform provides a one‑stop data development and governance solution built on over 40 micro‑services, featuring the Archer scheduler with CN and EN nodes, Raft‑based state management, Docker‑isolated task execution, smart routing, and plans to make EN stateless, migrate to Kubernetes, and unify batch and streaming services.

Bilibili Tech

This document introduces the Berserker (狂战士) big data platform, its overall positioning, development history, product functions, and future work.

Platform Overview

The platform is a one‑stop data development and governance solution built on big‑data ecosystem components. It supports data collection, transmission, storage, query, development, analysis, mining, testing, execution, and operations management for various internal roles.

Key user roles and daily tasks include:

Data analysts / product / operations: data map lookup, ad‑hoc queries, simple ETL for temporary monitoring tables, report generation.

Data developers: data integration, routine ETL tasks, data exploration, data‑warehouse model management, data quality and asset governance, data service publishing.

The platform currently runs more than 40 micro‑services. The micro‑service framework uses Bilibili’s Kratos library.

Data Development

Core functionalities cover offline batch scheduling, real‑time stream computing, ETL development, ad‑hoc queries, user development APIs, and an operations center.

Scale: more than 150k offline tasks, more than 250k routine task instances per day, more than 10k task chains (the longest chain exceeding 40 tasks), and more than 4k streaming tasks.

The internal scheduling system, code‑named Archer (弓兵), handles task timing, dependencies, resource dispatch, and node management. Its main components are:

CN (Control Node): scheduling control layer, handling timing, dependencies, throttling, routing, submission, and cluster management.

EN (Execute Node): execution layer that receives tasks from CN and reports status back.

API: web‑level services and external API gateway.

SqlScan: SQL parsing and compilation service.

DataManager: task IDC management and cross‑data‑center replication.

Blackhole: Kerberos unified authentication.

Admin: console for throttling, routing, and EN management.

Architecture & Core Components

The platform consists of several core modules that cooperate to provide scheduling, execution, and monitoring capabilities.

Key Issues and Solutions

State Management Problems: CN and EN both maintain state, so crashes or restarts caused schedule misfires and duplicate executions. The original design used ZooKeeper for CN high availability and Redis for state storage, which led to stability and consistency issues.

Current Solution: Replace ZooKeeper with a Raft library for strong consistency and leader election. CN state now lives in the Raft state machine, which removes the network-partition failure modes of the old design and simplifies the architecture.
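The key property this buys is determinism: any CN replica that replays the same committed log arrives at the same task states. The following is a minimal sketch with hypothetical names (the article does not show Archer's internals); log replication and leader election would come from the Raft library itself and are not modeled here.

```python
# Sketch: scheduler state kept in a deterministic state machine, so every
# CN replica that applies the same committed Raft log converges to the
# same task states. Entry names (SCHEDULE, START_DISPATCH, ...) are
# illustrative, not Archer's actual protocol.

class SchedulerStateMachine:
    def __init__(self):
        self.task_state = {}  # task_id -> state string

    def apply(self, entry):
        """Apply one committed log entry; must be deterministic."""
        op, task_id = entry
        if op == "SCHEDULE":
            self.task_state[task_id] = "PENDING"
        elif op == "START_DISPATCH":
            self.task_state[task_id] = "DISPATCHING"
        elif op == "END_DISPATCH":
            self.task_state[task_id] = "RUNNING"
        elif op == "FINISH":
            self.task_state[task_id] = "SUCCESS"

# Two replicas replaying the same log end up identical.
log = [("SCHEDULE", "t1"), ("START_DISPATCH", "t1"), ("END_DISPATCH", "t1")]
a, b = SchedulerStateMachine(), SchedulerStateMachine()
for e in log:
    a.apply(e)
for e in log:
    b.apply(e)
assert a.task_state == b.task_state == {"t1": "RUNNING"}
```

Because all state transitions go through the replicated log, a CN failover no longer depends on an external store (Redis) staying consistent with the scheduler's in-memory view.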

EN Release Problems: Previously, releasing a new EN version killed running tasks, wasting resources and delaying schedules. A graceful-release process was introduced: CN stops dispatching tasks to the EN, the EN finishes its in-flight tasks, upgrades, and then resumes accepting tasks.
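The drain handshake above can be sketched as a tiny state machine; the class and method names here are hypothetical, invented for illustration:

```python
# Sketch of the graceful-release handshake: CN marks the EN as draining,
# stops routing new tasks to it, and only proceeds with the upgrade once
# the EN reports zero in-flight tasks.

class ExecuteNode:
    def __init__(self, name):
        self.name = name
        self.draining = False
        self.running = set()   # in-flight task IDs

    def accept(self, task_id):
        if self.draining:
            # CN should have stopped routing here; reject defensively.
            raise RuntimeError("EN is draining; route to another EN")
        self.running.add(task_id)

    def finish(self, task_id):
        self.running.discard(task_id)

    def ready_for_upgrade(self):
        return self.draining and not self.running

en = ExecuteNode("en-01")
en.accept("t1")
en.draining = True               # step 1: CN stops sending new tasks
assert not en.ready_for_upgrade()
en.finish("t1")                  # step 2: in-flight tasks complete
assert en.ready_for_upgrade()    # step 3: safe to upgrade, then resume
```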

Two-Phase Commit: To avoid losing tasks during a leader switch, CN first records START_DISPATCH in the Raft log, then submits the task to EN via RPC, and finally records END_DISPATCH.
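The point of the two records is recovery: a new leader can scan the log for tasks that have a START_DISPATCH but no matching END_DISPATCH, which are exactly the submissions that may have been lost mid-flight. A minimal sketch (entry names are illustrative):

```python
# Sketch: after a leader switch, find dispatches that were started but
# never confirmed, so the new leader can re-verify them with the EN.

def unfinished_dispatches(log):
    started, ended = set(), set()
    for op, task_id in log:
        if op == "START_DISPATCH":
            started.add(task_id)
        elif op == "END_DISPATCH":
            ended.add(task_id)
    return started - ended

log = [
    ("START_DISPATCH", "t1"), ("END_DISPATCH", "t1"),
    ("START_DISPATCH", "t2"),  # leader crashed before END_DISPATCH
]
assert unfinished_dispatches(log) == {"t2"}
```

For t2, the new leader would query the EN: if the task is already running, it records END_DISPATCH; if not, it resubmits.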

RPC Duplicate Submissions: Implemented an EN ACK mechanism and timeout-based status checks to make task submission idempotent.
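Idempotency matters because a lost ACK makes CN retry an RPC that the EN may already have processed. One common way to achieve it, sketched here with hypothetical names, is for the EN to deduplicate by task ID:

```python
# Sketch of idempotent submission on the EN side: the EN remembers task
# IDs it has already accepted, so a retried RPC (after a lost ACK or a
# CN timeout) re-acknowledges instead of launching the task twice.

class ENReceiver:
    def __init__(self):
        self.accepted = set()
        self.starts = 0      # how many times a task was actually launched

    def submit(self, task_id):
        if task_id in self.accepted:
            return "ACK"           # duplicate: just re-acknowledge
        self.accepted.add(task_id)
        self.starts += 1           # launch the task exactly once
        return "ACK"

en = ENReceiver()
assert en.submit("t1") == "ACK"
assert en.submit("t1") == "ACK"    # CN retried after a timeout
assert en.starts == 1              # the task still ran only once
```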

Task Routing & Gray Release: Added a rule-engine-based router supporting combinations of 50+ task attributes, EN tagging for machine/cluster selection, execution-image selection, and gray-release fallback.
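A rule engine of this kind typically evaluates ordered rules over task attributes and maps the first match to an EN tag, falling back to a default. The rules, attribute names, and tags below are invented for illustration; Archer's actual rule syntax is not shown in the article.

```python
# Sketch of attribute-based routing: first matching rule wins; the
# default tag serves as the fallback (e.g. for gray-release rollback).

RULES = [
    # (predicate over task attributes, target EN tag)
    (lambda t: t.get("project") == "realtime-etl", "en-tag:ssd"),
    (lambda t: t.get("memory_gb", 0) >= 64,        "en-tag:bigmem"),
]
DEFAULT_TAG = "en-tag:default"

def route(task):
    for predicate, tag in RULES:
        if predicate(task):
            return tag
    return DEFAULT_TAG

assert route({"project": "realtime-etl"}) == "en-tag:ssd"
assert route({"memory_gb": 128}) == "en-tag:bigmem"
assert route({"project": "ads"}) == "en-tag:default"
```

Gray release then becomes a routing change: point a small slice of tasks at ENs carrying the new tag (or image), and fall back to the default tag if problems appear.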

Message Flood from EN: Designed a SmartQueue with high/low watermarks and mergeable message types to reduce backlog and improve consumption latency.
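The two ideas compose naturally: mergeable messages (e.g. repeated status updates for the same task) collapse to the latest value so the backlog stays small, and the watermarks signal back-pressure when it does not. This is a hypothetical sketch of such a queue, not Archer's actual SmartQueue implementation:

```python
# Sketch of a SmartQueue-like buffer: messages with a merge key collapse
# to the newest value; high/low watermarks drive back-pressure decisions.

class SmartQueue:
    def __init__(self, high=1000, low=200):
        self.high, self.low = high, low   # watermarks for back-pressure
        self.merged = {}                  # merge_key -> latest message
        self.plain = []                   # non-mergeable messages, in order

    def put(self, message, merge_key=None):
        if merge_key is not None:
            self.merged[merge_key] = message   # collapse repeated updates
        else:
            self.plain.append(message)

    def __len__(self):
        return len(self.merged) + len(self.plain)

    def over_high_watermark(self):
        return len(self) >= self.high

q = SmartQueue(high=3)
for status in ("DISPATCHED", "RUNNING", "SUCCESS"):
    q.put(status, merge_key="task-1")   # three updates merge into one entry
assert len(q) == 1
assert q.merged["task-1"] == "SUCCESS"
assert not q.over_high_watermark()
```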

Execution Management: Integrated Docker via the dockerd API to isolate task environments and control resources, with logs and metrics collected through LogAgent and cAdvisor. Kerberos tickets are passed into containers for security.

Dependency Model Refactor: Shifted from project-level dependencies to task-level dependencies using virtual root and end nodes, enabling zero-risk migration.
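The trick behind root/end nodes is that a downstream consumer can depend on a single virtual end node rather than on every task in the upstream project, so project-level semantics are preserved while the graph itself becomes task-level. A sketch with invented task names:

```python
# Sketch of the task-level dependency model: a project gets a virtual
# root and end node; downstream tasks depend on the end node, so the
# old project-level dependency maps cleanly onto task-level edges.

def ready_tasks(deps, done):
    """Tasks not yet done whose dependencies are all satisfied."""
    return {t for t, parents in deps.items()
            if t not in done and parents <= done}

deps = {
    "proj_a.root": set(),
    "a1": {"proj_a.root"},
    "a2": {"proj_a.root"},
    "proj_a.end": {"a1", "a2"},
    "b1": {"proj_a.end"},   # downstream depends only on the end node
}

done = set()
assert ready_tasks(deps, done) == {"proj_a.root"}
done |= {"proj_a.root", "a1", "a2"}
assert ready_tasks(deps, done) == {"proj_a.end"}
```

Migration is low-risk because existing project-level dependencies can be rewritten as edges to end nodes without changing when any downstream task becomes runnable.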

Big Data Operations: Built tools for quickly handling data quality issues, supporting both real-time handling (overnight, while batch jobs run) and after-the-fact handling (during the day), with query and one-click operation capabilities.

Future Work

Planned improvements include:

Making EN stateless by moving task state to DockerD, reducing release time.

Porting CN/EN functionalities to Kubernetes for better resource utilization.

Integrating offline batch and real‑time stream platforms under a unified web entry, shared services (scheduling, SQL parsing, resource management), and a common execution protocol.

These efforts aim to enhance scalability, reliability, and operational efficiency of the big data platform.

Tags: Docker, data-platform, scheduling, Big Data, K8s, Archer, Raft
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.