
Bilibili Data Center Migration: Planning, Execution, and Lessons Learned

This article details Bilibili’s 18‑month, multi‑region data‑center migration, covering background, project challenges, comprehensive planning, execution steps, risk management, automation, and post‑migration benefits, offering practical insights for large‑scale infrastructure relocation and operational optimization.

DevOps Operations Practice

Oct 31, 2024


Over an 18‑month period, Bilibili migrated tens of thousands of servers and switches across multiple regions, establishing a new data center with advanced infrastructure and comprehensive technical support to improve resource utilization, operational stability, and user experience.

The migration was driven by rapid business growth, which left the original, fragmented data centers aging, saturated, costly, and hard to scale. A high-frequency rolling migration approach was selected to maintain service continuity.

The project faced significant difficulty due to its large scale, long duration, complex scheduling, and coordination among numerous teams including system, resource operations, infrastructure, procurement, business units, and external vendors.

The overall migration scheme comprises project evaluation, an integrated plan, advance preparation, batch planning, business migration planning, and emergency response measures.

Project evaluation examined the current state, performed cost analysis showing substantial savings, and assessed risks (business, commercial, and manpower), concluding that migration would yield notable cost benefits and a rapid ROI.

The detailed overall plan outlines procurement, new‑site preparation, design, implementation, and acceptance phases, presented in a structured table covering tasks such as network preparation, rack layout, automation design, and rollout.

Advance preparation includes compiling a complete equipment inventory, selecting a professional relocation vendor, designing the new-site layout, establishing dedicated inter-site links, and procuring redundant resources.

The migration proceeds in rolling batches, moving roughly 500 devices per week (up to 1,700 in a batch) within a two-week cycle; certain steps run in parallel to boost efficiency.
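The rolling schedule can be illustrated with a minimal Python sketch. It assumes one new batch starts each week and each batch has a two-week window, so consecutive batches overlap; the `Batch` type, the `plan_batches` function, the device names, and the start date are hypothetical and not taken from the original article.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Batch:
    number: int
    devices: list
    start: date  # preparation and shutdown begin
    end: date    # delivery at the new site is due


def plan_batches(devices, start, batch_size=500, cycle_weeks=2):
    """Split devices into weekly batches, each with a cycle_weeks-long window.

    Assumption: a new batch starts every week, so with a two-week cycle
    two batches are in flight at any time.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    for offset in range(0, len(devices), batch_size):
        index = offset // batch_size
        batch_start = start + timedelta(weeks=index)
        batches.append(
            Batch(
                number=index + 1,
                devices=devices[offset:offset + batch_size],
                start=batch_start,
                end=batch_start + timedelta(weeks=cycle_weeks),
            )
        )
    return batches


if __name__ == "__main__":
    inventory = [f"srv-{n:05d}" for n in range(1200)]  # made-up inventory
    for batch in plan_batches(inventory, start=date(2024, 1, 1)):
        print(batch.number, len(batch.devices), batch.start, batch.end)
```

Because each window is longer than the weekly cadence, the second batch begins before the first is delivered, which is one way to read the parallel execution described above.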

Business migration treats online and offline services separately. Online workloads use a single AZ with container-driven drift and automatic rollback, while offline workloads rely on automated data-migration tools. Data stores such as MySQL, TiDB, and Redis are migrated under standard SOPs, and traffic switching is handled via CDN and SLB.

Project management follows a matrix organization with clear role division (procurement, system, resource operations) and strict process control, including approval workflows for each migration stage.

Emergency plans provide per‑batch rollback procedures, reserve redundant batches for unexpected delays, and address equipment failures, transport issues, and other contingencies.

Execution includes detailed device rack planning that respects power, space, and network constraints, using an automated algorithm to maximize rack utilization and generate placement tables.
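The article does not say which placement algorithm was used. As a rough illustration only, the sketch below applies first-fit decreasing, a common bin-packing heuristic, under three assumed per-rack limits: free rack units, a power budget, and free switch ports. The `Device` and `Rack` types and all figures are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    height_u: int  # rack units occupied
    power_w: int   # rated power draw in watts


@dataclass
class Rack:
    name: str
    free_u: int
    free_power_w: int
    free_ports: int
    devices: list = field(default_factory=list)

    def fits(self, device):
        return (
            device.height_u <= self.free_u
            and device.power_w <= self.free_power_w
            and self.free_ports >= 1
        )

    def place(self, device):
        self.free_u -= device.height_u
        self.free_power_w -= device.power_w
        self.free_ports -= 1
        self.devices.append(device.name)


def place_devices(devices, racks):
    """First-fit decreasing: largest power consumers are placed first."""
    placement = {}
    unplaced = []
    ordered = sorted(devices, key=lambda d: (d.power_w, d.height_u), reverse=True)
    for device in ordered:
        rack = next((r for r in racks if r.fits(device)), None)
        if rack is None:
            unplaced.append(device.name)  # needs another rack or manual review
            continue
        rack.place(device)
        placement[device.name] = rack.name
    return placement, unplaced


if __name__ == "__main__":
    racks = [Rack("R1", 4, 1000, 2), Rack("R2", 4, 1000, 2)]
    devices = [Device("a", 2, 600), Device("b", 2, 600),
               Device("c", 2, 300), Device("d", 2, 900)]
    table, leftover = place_devices(devices, racks)
    print(table, leftover)
```

The returned mapping plays the role of a placement table, and the list of unplaced devices flags where capacity at the new site falls short. A production planner would likely add further constraints, such as spreading a service across racks.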

The automated migration workflow adds silent shutdown periods, pre‑configures IPs for the new site, updates host status to avoid false alarms, optionally reinstalls the OS, performs system initialization, and conducts baseline verification to ensure delivery quality.
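The sequence of steps can be pictured as an ordered pipeline that stops at the first failure. The sketch below is an assumption-laden illustration, not Bilibili's tooling: hosts are plain dictionaries, and every step function and field name (`new_site_ip`, `alerts_silenced`, and so on) is hypothetical.

```python
def silence_alerts(host):
    host["alerts_silenced"] = True


def preconfigure_ip(host):
    host["ip"] = host["new_site_ip"]


def mark_migrating(host):
    # Monitoring is assumed to ignore hosts in this state.
    host["status"] = "migrating"


def reinstall_os(host):
    if host.get("reinstall"):  # optional step
        host["os_installed_fresh"] = True


def initialize_system(host):
    host["initialized"] = True


def verify_baseline(host):
    if not host.get("initialized"):
        raise RuntimeError("host was not initialized")
    host["status"] = "delivered"
    host["alerts_silenced"] = False


STEPS = [
    ("silence_alerts", silence_alerts),
    ("preconfigure_ip", preconfigure_ip),
    ("mark_migrating", mark_migrating),
    ("reinstall_os", reinstall_os),
    ("initialize_system", initialize_system),
    ("verify_baseline", verify_baseline),
]


def run_workflow(host, steps=STEPS):
    """Run steps in order; stop at the first failure and report where."""
    completed = []
    for name, step in steps:
        try:
            step(host)
        except Exception as exc:
            return {"ok": False, "failed_step": name,
                    "completed": completed, "error": str(exc)}
        completed.append(name)
    return {"ok": True, "failed_step": None,
            "completed": completed, "error": None}


if __name__ == "__main__":
    host = {"name": "srv-00001", "new_site_ip": "10.0.0.5", "reinstall": False}
    print(run_workflow(host))
    print(host)
```

Recording which steps completed makes it clear where to resume or roll back when a host fails partway through.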

Physical relocation emphasizes logistics preparation, site surveys, proper labeling, insurance for high‑value equipment, and post‑move verification to guarantee safety and accuracy.

Delivery consistency checks BIOS/BMC settings, OS services, and custom configurations to ensure uniformity across all handed‑over devices.
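A consistency check of this kind amounts to comparing each host's collected settings with an agreed baseline. The sketch below shows the idea under stated assumptions: settings are already gathered into dictionaries, and the keys and values (`bios.boot_mode`, `svc.chronyd`, and the rest) are invented for illustration.

```python
def diff_against_baseline(actual, baseline):
    """Return {key: (expected, found)} for each baseline key that differs."""
    return {
        key: (expected, actual.get(key))
        for key, expected in baseline.items()
        if actual.get(key) != expected
    }


def audit_fleet(hosts, baseline):
    """Map each non-conforming host name to its drifted settings."""
    report = {}
    for name, settings in hosts.items():
        drift = diff_against_baseline(settings, baseline)
        if drift:
            report[name] = drift
    return report


if __name__ == "__main__":
    baseline = {
        "bios.boot_mode": "UEFI",
        "bmc.ipmi_over_lan": "enabled",
        "svc.chronyd": "active",
    }
    hosts = {
        "srv-00001": dict(baseline),
        "srv-00002": {"bios.boot_mode": "Legacy",
                      "bmc.ipmi_over_lan": "enabled"},
    }
    print(audit_fleet(hosts, baseline))
```

A missing setting is reported with `None` as the found value, so absent services are caught along with misconfigured ones.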

A comprehensive checklist enumerates every step, from demand confirmation through vendor selection, planning, execution, and acceptance to final handover.

In the concluding outlook, the new data center aligns with China’s "dual‑carbon" goals, achieving lower PUE, reduced energy consumption, and annual cost savings of nearly 100 million RMB, while boosting CPU utilization, network efficiency, and overall platform stability for future growth.

