Design and Implementation of Bilibili AI Compute Network: Topology, Hardware Selection, Load Balancing, and Monitoring
Bilibili designed and deployed an AI compute network for large language model training, choosing a Fat-Tree topology, selecting high‑speed switches, optical modules, and fibers, implementing fixed‑path load balancing, and building a sub‑second telemetry monitoring platform, with plans to scale to ten‑thousand GPUs.
1. Background
Since the rise of generative AI represented by ChatGPT, large language models (LLMs) have become a focal point for enterprises. Training trillion‑parameter models requires thousands to tens of thousands of GPUs, which in turn demand a zero‑packet‑loss, low‑latency, high‑throughput compute network to interconnect them.
Bilibili's network team designed and deployed an AI compute network based on business requirements and industry best practices. This article summarizes the key design factors and choices.
2. AI Network Topology
2.1 Fat‑Tree – A non‑blocking Clos architecture derived from the binary fat‑tree structure. Uplink and downlink bandwidth are provisioned 1:1 (no oversubscription), providing low latency and maximal throughput across traffic patterns. Typical 2‑tier and 3‑tier configurations are characterized by switch port count: a 2‑tier Fat‑Tree built from P‑port switches needs 3·P/2 switches and connects P²/2 servers.
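The port‑count arithmetic above can be sketched as a small capacity calculator (a minimal illustration; the 3‑tier formula is the classic k‑ary fat‑tree, and the function names are ours, not Bilibili's):

```python
def two_tier_fat_tree(p):
    """Capacity of a non-blocking 2-tier Fat-Tree built from p-port switches.

    Each LEAF splits its ports 1:1 (p/2 down to servers, p/2 up to spines),
    so bandwidth is non-oversubscribed end to end.
    """
    leaves = p             # each SPINE needs one port per LEAF
    spines = p // 2        # each LEAF needs p/2 uplinks
    hosts = leaves * (p // 2)
    return {"switches": leaves + spines, "hosts": hosts}


def three_tier_fat_tree(p):
    """Capacity of a 3-tier k-ary Fat-Tree: p pods, (p/2)^2 core switches."""
    per_pod = (p // 2) + (p // 2)      # edge + aggregation switches per pod
    core = (p // 2) ** 2
    return {"switches": p * per_pod + core, "hosts": (p ** 3) // 4}


# A 64-port switch yields 3*64/2 = 96 switches and 2048 servers at 2 tiers.
print(two_tier_fat_tree(64))
```

Adding the third tier multiplies host capacity from P²/2 to P³/4 at the cost of one extra hop and many more switches.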
2.2 Dragonfly – A low‑diameter, cost‑effective direct‑connect topology consisting of Switch, Group, and System layers. It reduces the number of network elements by 20%‑40% compared with Fat‑Tree but has lower software maturity and higher operational complexity.
2.3 InfiniBand – A high‑performance, low‑latency technology that uses link‑level flow control and adaptive routing. It typically adopts a Fat‑Tree topology for AI clusters, offering superior throughput at higher cost and a more closed ecosystem.
After comparing the three approaches, Bilibili chose the Fat‑Tree design for its AI compute cluster.
3. Hardware Selection
3.1 Switches – Two AI clusters were built for the two GPU server types. The overseas‑type servers carry 8×200 Gbps compute NICs plus 2×25 Gbps storage NICs and connect to a Broadcom Tomahawk 4 based chassis switch (64×400 Gbps ports). The domestic‑type servers carry 8×400 Gbps NICs; since they cannot share a fabric with the overseas cluster, a separate box‑switch topology built on commercial switching chips was adopted.
3.2 Optical Modules – Critical for converting electrical signals to optical. The number of modules is 4‑6 times the number of GPU servers. A table of common module types and transmission distances is provided in the original article.
3.3 Fibers – Multimode (OM1‑OM5) and single‑mode (OS1/OS2) fibers are used according to distance, power, and cost requirements. Multimode OM4 is common for AI clusters; single‑mode OS2 is used for longer links.
4. Overall Network Design
The AI cluster consists of two independent networks: a Frontend Network (data ingest, checkpoints, logs) using two 25 Gbps uplinks per server, and a Backend Network for large‑model training, built as a two‑layer Fat‑Tree with SPINE and LEAF switches. For domestic GPUs, a POD‑aware affinity design connects NICs with the same index on different servers to the same LEAF switch, so same‑index NICs communicate in a single hop within a POD.
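The POD affinity rule can be sketched as a wiring function (a minimal illustration; the POD/LEAF naming scheme here is hypothetical, not Bilibili's actual one):

```python
def leaf_for_nic(pod, server_id, nic_index):
    """Rail-aligned wiring: the LEAF a NIC lands on depends only on the
    POD and the NIC index, never on the server. Same-index NICs across
    servers in a POD therefore share a LEAF and communicate in one hop.
    """
    return f"POD{pod}-LEAF{nic_index}"


# NIC 3 of server 12 and NIC 3 of server 47 meet on the same LEAF.
same_rail = leaf_for_nic(0, 12, 3) == leaf_for_nic(0, 47, 3)
```

Because collective operations in training frameworks pair same‑index NICs across servers, this rule keeps the dominant traffic off the SPINE layer entirely.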
5. Load Balancing
AI training traffic exhibits low entropy, strong burstiness, and a small number of elephant flows, so conventional hash‑based per‑flow load balancing creates hotspots. Bilibili instead adopted a fixed‑path strategy (NSLB): the 32 logical 200 Gbps downlink ports on each LEAF are tagged with indexes and statically spread across the 16×400 Gbps uplink ports, achieving up to 98% link utilization under normal conditions.
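The fixed‑path idea can be sketched as a static port mapping (the modulo index assignment is our assumption for illustration, not the actual NSLB table):

```python
from collections import Counter


def nslb_uplink(logical_port_index, num_uplinks=16):
    """Pin each tagged 200G logical downlink port of a LEAF to one of
    its 16x400G uplinks deterministically, instead of hashing packet
    headers. A modulo spread is used here purely for illustration.
    """
    return logical_port_index % num_uplinks


# 32 x 200G logical ports over 16 uplinks: exactly two 200G ports
# (400G of traffic) land on each 400G uplink -- a perfect 1:1 fit.
load = Counter(nslb_uplink(i) for i in range(32))
```

Since the mapping is deterministic and exactly fills each uplink, no hash collision can overload a link, which is what per‑flow ECMP cannot guarantee with elephant flows.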
Research into Distributed Disaggregated Chassis (DDC) networking is ongoing; it replaces per‑flow hashing with cell‑based load balancing backed by virtual output queues (VOQ) and credit‑based flow control.
6. Monitoring Platform
Traditional SNMP polling operates on intervals of tens of seconds to minutes, far too coarse for AI traffic whose microbursts play out in milliseconds. Bilibili therefore built a telemetry‑based monitoring system in which devices push data over gRPC/UDP using YANG data models and GPB encoding, giving sub‑second visibility into port queue depths, PFC counters, and congestion hotspots.
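As a sketch of what sub‑second visibility enables, the consumer side can diff consecutive samples and flag PFC counter growth (payloads are shown as plain dicts for illustration; the real pipeline carries YANG‑modelled GPB over gRPC/UDP):

```python
def pfc_alerts(prev_sample, cur_sample, threshold=0):
    """Compare two consecutive telemetry samples (dicts of per-port
    counters) and report ports whose PFC pause-frame counters grew
    by more than `threshold` within one sub-second interval.
    """
    alerts = []
    for port, counters in cur_sample.items():
        delta = counters["pfc_rx"] - prev_sample.get(port, {}).get("pfc_rx", 0)
        if delta > threshold:
            alerts.append((port, delta))
    return alerts


# One LEAF port saw 5 new pause frames between samples -> congestion signal.
prev = {"Eth1/1": {"pfc_rx": 10}, "Eth1/2": {"pfc_rx": 0}}
cur = {"Eth1/1": {"pfc_rx": 15}, "Eth1/2": {"pfc_rx": 0}}
```

At SNMP timescales the two samples would be minutes apart and the burst would be averaged away; sub‑second sampling is what makes the delta meaningful.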
7. Future Outlook
Bilibili plans to scale from thousand‑GPU to ten‑thousand‑GPU clusters, continue evaluating new network architectures, enhance the monitoring platform, and implement controller‑driven traffic engineering to proactively avoid and quickly resolve congestion.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.