Evolution of Bilibili's Server Provisioning System: From Traditional PXE to BiliOS and iPXE
To cope with rapid growth, Bilibili replaced its inflexible PXE workflow with a hybrid system built on the in-memory BiliOS and iPXE. The new system adds out-of-band management, declarative configuration, and multi-scenario support, which together raised provisioning automation, reliability, and efficiency across its data-center and edge servers.
Background
With the rapid growth of Bilibili's user base and services, the scale and complexity of its data centers have increased dramatically. The early provisioning workflow relied on a traditional PXE-based installation system, which proved inflexible and inefficient as the number of IDC and edge servers grew and business scenarios diversified.
The team therefore explored a new provisioning system capable of handling large‑scale new deliveries, re‑installations, data‑center migrations, CDN server deployments, and other complex scenarios.
Scenario 1: New Delivery Provisioning
The process consists of three steps: rack mounting, OS installation, and environment configuration (see diagram).
Challenges include:
Improving rack‑mount efficiency and reducing manual configuration for massive server deliveries.
Accurately maintaining server information (e.g., MAC addresses) to support customized OS and kernel installations.
Providing a reliable, rollback‑capable initialization workflow.
Scenario 2: Complex Network Provisioning
Because network setups differ across IDC rooms, PXE boot can fail unless a switch is manually reconfigured. Large broadcast domains, VLAN segmentation, and diverse switch vendors further complicate the process, especially for CDN edge nodes, where PXE cannot reach the internal network.
Goals
Support multiple provisioning scenarios (new delivery, re‑install, CDN, data‑center migration).
Enable customized installations (different OS versions, kernel parameters, packages).
Increase automation coverage and success rate while maintaining quality.
Evolution of the Provisioning System
Starting from traditional PXE, the solution introduces new components to meet the above goals.
Traditional PXE Provisioning
PXE lets a machine without a local OS boot over the network: the firmware obtains an IP address via DHCP, downloads a boot image via TFTP, and launches an installer. While suitable for large-scale deployments, PXE suffers from limited flexibility, complex configuration management, and reliability issues tied to network stability.
Overall Architecture
The new architecture adds several blue‑highlighted modules (see diagram) that address flexibility, automation, and out‑of‑band management.
BiliOS In-Memory System
BiliOS is an in-memory OS that boots via PXE, runs entirely in RAM, and does not depend on local disks. Its built-in Agent collects hardware information (MAC address, serial number, DHCP-assigned IP, etc.) and communicates with the provisioning management platform to apply BIOS/RAID updates and configure out-of-band parameters. BiliOS also provides a minimal environment for diagnostics.
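As a rough illustration of the reporting step, the sketch below builds the kind of JSON payload an in-memory agent could send to a management platform. The field names, the `HardwareReport` class, and the `build_report` helper are hypothetical; the article does not describe the actual BiliOS schema or API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HardwareReport:
    """Hypothetical report payload; not the real BiliOS schema."""
    serial_number: str
    mac: str       # MAC of the NIC that PXE-booted, normalized
    dhcp_ip: str   # address leased during the PXE boot
    bmc_ip: str    # out-of-band (BMC) address


def normalize_mac(mac: str) -> str:
    """Return the MAC as lowercase, colon-separated hex (aa:bb:cc:dd:ee:ff)."""
    digits = mac.replace(":", "").replace("-", "").replace(".", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def build_report(serial_number: str, mac: str, dhcp_ip: str, bmc_ip: str) -> str:
    """Serialize one server's hardware facts as a JSON document."""
    report = HardwareReport(serial_number, normalize_mac(mac), dhcp_ip, bmc_ip)
    return json.dumps(asdict(report), sort_keys=True)
```

Normalizing the MAC before it leaves the agent means every downstream consumer can compare addresses as plain strings, whatever format the hardware or vendor tool printed them in.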
During the first PXE boot, BiliOS gathers out-of-band data (e.g., BMC IP, gateway) automatically, eliminating manual configuration. The reported MAC address is then used to generate a PXE configuration for the target OS installation. In re-install scenarios, the Server Agent can query the MAC directly, so the BiliOS discovery boot is unnecessary and a single PXE boot completes the installation.
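One way to turn a reported MAC into a per-host PXE configuration is sketched below. It assumes a PXELINUX-style TFTP layout, in which the bootloader looks for a file named `01-` followed by the client's MAC in lowercase, dash-separated form. The article does not say which bootloader or layout Bilibili uses, and the function names and kernel/initrd file names here are illustrative only.

```python
import string
from pathlib import PurePosixPath


def pxelinux_config_path(mac: str, root: str = "pxelinux.cfg") -> PurePosixPath:
    """Per-host config path in PXELINUX convention: '01-' + dash-separated MAC."""
    octets = mac.lower().replace("-", ":").split(":")
    valid = len(octets) == 6 and all(
        len(o) == 2 and all(c in string.hexdigits for c in o) for o in octets
    )
    if not valid:
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return PurePosixPath(root) / ("01-" + "-".join(octets))


def render_pxelinux_config(kernel: str, initrd: str, extra_args: str = "") -> str:
    """Render a one-entry config that boots straight into the installer."""
    append = f"initrd={initrd} {extra_args}".strip()
    return "\n".join([
        "DEFAULT install",
        "LABEL install",
        f"  KERNEL {kernel}",
        f"  APPEND {append}",
        "",
    ])
```

Because the file name is derived from the MAC, each server can receive its own OS version and kernel arguments without touching the shared default configuration.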
Replacing PXE with iPXE
iPXE is an open‑source PXE implementation that supports additional protocols (HTTP/HTTPS), scripting, and error recovery. By loading the boot image over HTTP, iPXE mitigates the unreliability of UDP‑based TFTP, especially for large initrd files.
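The sketch below renders a small iPXE script that fetches the kernel and initrd over HTTP and retries after any failure, combining the scripting and error-recovery features mentioned above. The script uses standard iPXE commands (`dhcp`, `kernel`, `initrd`, `boot`, `goto`, `sleep`), but the `render_ipxe_script` helper, the URL, and the file names are assumptions made for illustration, not Bilibili's actual boot script.

```python
from urllib.parse import urlsplit


def render_ipxe_script(base_url: str, kernel: str, initrd: str,
                       kernel_args: str = "", retry_delay: int = 5) -> str:
    """Render an iPXE script that boots over HTTP(S) and retries on failure."""
    if urlsplit(base_url).scheme not in ("http", "https"):
        raise ValueError("base_url must use http or https")
    base = base_url.rstrip("/")
    kernel_line = " ".join(
        part for part in ("kernel", f"{base}/{kernel}", kernel_args) if part
    )
    lines = [
        "#!ipxe",
        ":retry",
        "dhcp || goto failed",                       # acquire an address
        f"{kernel_line} || goto failed",             # fetch kernel over HTTP
        f"initrd {base}/{initrd} || goto failed",    # fetch initrd over HTTP
        "boot || goto failed",
        ":failed",
        f"echo Boot failed, retrying in {retry_delay} seconds",
        f"sleep {retry_delay}",
        "goto retry",
    ]
    return "\n".join(lines) + "\n"
```

A call such as `render_ipxe_script("http://boot.example.internal/images", "vmlinuz", "initrd.img")` yields a script in which every step falls through to the `:failed` label on error, so a transient network fault delays the installation instead of aborting it.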
Out‑of‑Band Provisioning
For networks where PXE cannot succeed, an out-of-band provisioning path is used. It leverages dedicated management interfaces (IPMI, iLO, DRAC) to install the OS without relying on DHCP or PXE. Custom images containing the server name, IP address, gateway, and netmask are mounted through the management card as virtual media, and the server boots from them as if from a CD/DVD drive.
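Since these images carry static network settings instead of relying on DHCP, a wrong gateway or netmask leaves the freshly installed host unreachable. The sketch below shows one way to validate the parameters before an image is built. The `StaticNetworkConfig` class and `validate_network` function are hypothetical, and the addresses in the usage note come from a documentation-only range.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticNetworkConfig:
    """Parameters baked into a custom image (hypothetical structure)."""
    hostname: str
    ip: str
    netmask: str
    gateway: str


def validate_network(cfg: StaticNetworkConfig) -> ipaddress.IPv4Interface:
    """Check that the host address and gateway are consistent with the netmask."""
    # Raises ValueError if the address or netmask is malformed.
    iface = ipaddress.IPv4Interface(f"{cfg.ip}/{cfg.netmask}")
    gateway = ipaddress.IPv4Address(cfg.gateway)
    network = iface.network
    if gateway not in network:
        raise ValueError(f"gateway {gateway} is outside {network}")
    if iface.ip in (network.network_address, network.broadcast_address):
        raise ValueError(f"{iface.ip} is not a usable host address in {network}")
    if gateway == iface.ip:
        raise ValueError("gateway and host address must differ")
    return iface
```

For example, `validate_network(StaticNetworkConfig("cdn-edge-01", "192.0.2.10", "255.255.255.0", "192.0.2.1"))` returns the interface `192.0.2.10/24`, while a gateway of `198.51.100.1` would be rejected before any image is generated.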
Customized Delivery
Post-installation configuration is split into in-band (kernel parameters, services, packages) and out-of-band (BIOS, BMC) parts. Declarative baseline management replaces ad-hoc shell scripts: configurations are version-controlled descriptions of the target state, rolled out through gray (staged) releases and monitored in real time.
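The difference between a declarative baseline and an ad-hoc script can be sketched in a few lines: the baseline states the target values, and a planner works out which settings actually need to change, so re-running it on a compliant host does nothing. The example also shows one simple way to pick a stable subset of hosts for a gray release. Both functions and the sample settings are illustrative assumptions, not Bilibili's implementation.

```python
import hashlib


def plan_changes(desired: dict, actual: dict) -> dict:
    """Map each drifted setting to a (current, target) pair.

    Settings missing from `actual` are reported with a current value of None.
    """
    changes = {}
    for key, target in sorted(desired.items()):
        current = actual.get(key)
        if current != target:
            changes[key] = (current, target)
    return changes


def in_gray_batch(hostname: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of hosts in the gray batch."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

With a baseline of `{"vm.swappiness": "0", "net.core.somaxconn": "4096"}` and a host that reports `vm.swappiness` as `"60"`, the plan contains only the `vm.swappiness` entry. Hashing the hostname keeps a host in the same batch across runs, which makes a staged rollout repeatable.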
Overall Provisioning Workflow
The combined flow for new delivery and complex network scenarios is illustrated below (blue steps indicate differences).
Conclusion and Outlook
Introducing BiliOS and iPXE has dramatically improved automation coverage, success rate, and overall provisioning efficiency for Bilibili’s data centers. Future work includes migrating legacy BIOS boot to UEFI with LinuxBoot, further enhancing reliability and performance, and extending support to edge scenarios.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.