Design and Implementation of the Bilibili AI Compute Network: Topology, Hardware Selection, Load Balancing, and Monitoring
Bilibili designed and deployed an AI compute network for large language model training. The work covered four areas: choosing a Fat-Tree topology; selecting high-speed switches, optical modules, and fibers; implementing fixed-path load balancing; and building a sub-second telemetry monitoring platform. The network is planned to scale to ten thousand GPUs.
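To give a feel for what a ten-thousand-GPU target implies for a Fat-Tree, the sketch below applies the standard capacity formula for a non-blocking fat-tree built from identical k-port switches: 2 × (k/2)^n endpoints for n tiers. The summary does not state the switch radix, tier count, or oversubscription ratio, so the one-port-per-GPU mapping, the 1:1 non-blocking design, and the function names `fat_tree_hosts` and `min_radix` are illustrative assumptions rather than details of Bilibili's deployment.

```python
def fat_tree_hosts(k: int, tiers: int = 3) -> int:
    """Maximum endpoints of a non-blocking fat-tree made of k-port switches.

    Assumption: every switch has the same radix k and each tier below the
    top splits its ports evenly between downlinks and uplinks (1:1).
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("switch radix k must be an even integer >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    # 2 tiers -> k^2 / 2 endpoints, 3 tiers -> k^3 / 4 endpoints.
    return 2 * (k // 2) ** tiers


def min_radix(endpoints: int, tiers: int = 3) -> int:
    """Smallest even switch radix whose fat-tree can attach `endpoints`."""
    k = 2
    while fat_tree_hosts(k, tiers) < endpoints:
        k += 2
    return k


if __name__ == "__main__":
    target = 10_000  # assumed: one network port per GPU
    for tiers in (2, 3):
        k = min_radix(target, tiers)
        print(f"{tiers} tiers: radix >= {k}, capacity {fat_tree_hosts(k, tiers)}")
    # 2 tiers: radix >= 142, capacity 10082
    # 3 tiers: radix >= 36, capacity 11664
```

Under these assumptions, a two-tier design would need switches with at least 142 ports, while a three-tier design fits with 36-port switches. Real deployments often attach several ports per GPU server or accept some oversubscription, which changes these numbers.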