How Bilibili Scaled Its Log System to 10TB Daily with Elastic Stack
This article details Bilibili's Billions log platform: its fragmented origins, design goals, the Elastic Stack-based architecture, shard management, log sampling, the custom Go splitter, and monitoring enhancements, along with the challenges faced and the roadmap for future improvements.
Original Log System
Before Billions, Bilibili had no unified logging platform; each product line used its own solution, ranging from ELK‑based setups to simple tail/grep scripts, leading to high maintenance cost, instability, log loss, and poor usability.
- Inconsistent solutions with high maintenance overhead.
- No unified log format, making parsing and analysis costly.
- Poor PaaS support for containerized applications.
- Low utilization of logs beyond simple search.
Design goals for the new system were:
- Seamless business log onboarding: simple configuration without exposing business details.
- Diversity support: physical/virtual machines, containers, various sources and formats (plain, JSON).
- Log mining: fast queries, monitoring, statistical analysis.
- System availability: real-time data, controllable loss rate.
Billions Evolution
System Initialization
Log Specification
All business logs must be JSON with four mandatory fields:
- `time` – ISO 8601 timestamp
- `level` – FATAL, ERROR, WARN, INFO, DEBUG
- `app_id` – globally unique application identifier
- `instance_id` – instance identifier defined by the business
Additional details are stored in the `log` field, and extra fields may be added as needed. The mapping must remain stable.

```json
{"log": "hello billions, write more", "level": "INFO", "app_id": "testapp111", "instance_id": "instance1", "time": "2017-08-04T15:59:01.607483", "id": 0}
```

Log Collection
Two paths are used:
- Business logs are emitted via a Go-based log agent that writes to a Unix-domain socket; the agent runs on physical machines and in containers, exposing the socket inside containers.
- Non-business logs (middleware, system, ingress) are collected with Filebeat, which supports multi-line concatenation and tags each entry with an `app_id`.

The log agent consists of a collector (receives logs) and a sender (pushes logs). They share a local file cache to survive transport failures.
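The collector half of the agent can be sketched as a Unix-domain socket listener that validates each line against the four mandatory fields before handing it to the sender. The socket path, the validation rules, and the channel hand-off below are illustrative assumptions, not the production agent.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net"
)

// entry mirrors the four mandatory fields of the log specification.
type entry struct {
	Time       string `json:"time"`
	Level      string `json:"level"`
	AppID      string `json:"app_id"`
	InstanceID string `json:"instance_id"`
}

// levels holds the five levels allowed by the specification.
var levels = map[string]bool{"FATAL": true, "ERROR": true, "WARN": true, "INFO": true, "DEBUG": true}

// validate checks that a raw log line is JSON carrying the mandatory fields.
func validate(raw []byte) error {
	var e entry
	if err := json.Unmarshal(raw, &e); err != nil {
		return err
	}
	if e.Time == "" || e.AppID == "" || e.InstanceID == "" {
		return fmt.Errorf("missing mandatory field")
	}
	if !levels[e.Level] {
		return fmt.Errorf("unknown level %q", e.Level)
	}
	return nil
}

// serve accepts connections on a Unix-domain socket and forwards each
// valid line to the sender via out. The path is an assumed example.
func serve(path string, out chan<- []byte) error {
	l, err := net.Listen("unix", path)
	if err != nil {
		return err
	}
	for {
		conn, err := l.Accept()
		if err != nil {
			return err
		}
		go func(c net.Conn) {
			defer c.Close()
			sc := bufio.NewScanner(c)
			for sc.Scan() {
				line := append([]byte(nil), sc.Bytes()...)
				if validate(line) == nil {
					out <- line
				}
			}
		}(conn)
	}
}

func main() {
	raw := []byte(`{"log":"hello billions","level":"INFO","app_id":"testapp111","instance_id":"instance1","time":"2017-08-04T15:59:01.607483"}`)
	fmt.Println(validate(raw) == nil) // prints "true"
}
```

In a real deployment the out channel would feed the sender, which drains to the transport layer and falls back to the shared file cache on failure.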
Transport Layer
Bilibili uses an internal data transport platform called Lancer, built on Flume + Kafka, providing load balancing, horizontal scalability, reliable protocols, and 24/7 support. Logs from the agent are sent to Kafka topics with a unified prefix.
Log Splitting
Logstash consumes the Kafka topics, extracts fields, transforms formats, and indexes the data into Elasticsearch. Topics are matched with a `topics_pattern`. Partition assignment uses the RoundRobinAssignor to avoid uneven distribution.
For non‑standard logs, separate Logstash instances are required because of single‑event pipeline limitations.
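A minimal Logstash pipeline for the standard path might look like the following. The topic prefix, group id, broker and Elasticsearch addresses, and index naming are placeholder assumptions, and the exact value accepted by `partition_assignment_strategy` depends on the Kafka input plugin version.

```
input {
  kafka {
    bootstrap_servers => "lancer-kafka:9092"
    topics_pattern    => "billions-.*"
    group_id          => "billions-logstash"
    codec             => "json"
    partition_assignment_strategy => "org.apache.kafka.clients.consumer.RoundRobinAssignor"
  }
}

filter {
  date {
    match => ["time", "ISO8601"]
  }
}

output {
  elasticsearch {
    hosts => ["es-client-1:9200", "es-client-2:9200"]
    index => "%{app_id}-%{+YYYY.MM.dd}"
  }
}
```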
Search and Storage
The Elasticsearch cluster consists of 3 master nodes, 20 hot nodes (SSD), 20 stale nodes (HDD), and 2 client nodes. Hot nodes store real‑time logs; stale nodes hold historical data. Indexes are created a day in advance, and Curator manages lifecycle (default 7‑day retention). Templates enforce mapping consistency.
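The 7-day retention could be expressed as a Curator action file along these lines; the `billions-` index prefix and the daily timestring are assumptions for illustration.

```yaml
actions:
  1:
    action: delete_indices
    description: "Drop daily indexes older than the 7-day retention window"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: billions-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 7
```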
Monitoring uses a custom exporter that feeds metrics to Prometheus; alerts cover cluster health, thread rejections, node counts, and unassigned shards.
System Iterations
Shard Management
Initially each index used 5 primary + 5 replica shards, causing excess overhead and imbalance for large indexes. A shard‑management module now creates daily indexes with a target size of 30 GB per shard, capping at 15 shards, reducing total shard count by over 70 %.
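Under the stated 30 GB target and 15-shard cap, the sizing rule reduces to a small function. This sketch assumes the shard count for tomorrow's index is derived from the previous day's observed volume.

```go
package main

import "fmt"

const (
	targetShardSize = 30 << 30 // 30 GiB target per primary shard
	maxShards       = 15       // cap on primary shards per index
)

// shardsFor derives a daily index's primary-shard count from its
// observed daily volume: ceil(volume / 30 GiB), clamped to [1, 15].
func shardsFor(dailyBytes int64) int {
	n := int((dailyBytes + targetShardSize - 1) / targetShardSize)
	if n < 1 {
		n = 1
	}
	if n > maxShards {
		n = maxShards
	}
	return n
}

func main() {
	fmt.Println(shardsFor(10 << 30))  // small index: 1
	fmt.Println(shardsFor(100 << 30)) // 100 GiB: 4
	fmt.Println(shardsFor(1 << 40))   // 1 TiB: capped at 15
}
```

Sizing shards to the data instead of a fixed 5+5 layout is what drives the reported 70% reduction in total shard count.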
Log Sampling
For high-volume services (>500 GB/day), INFO-level logs are randomly sampled per `app_id`, while WARN and above are kept in full. Sampling ratios are stored in a central config service and can be changed dynamically.
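The keep/drop decision can be sketched as below; the in-memory ratio map stands in for the central config service, and the per-app values are illustrative assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// severity ranks the specification's five levels; WARN and above bypass sampling.
var severity = map[string]int{"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}

// ratios stands in for the central config service; in production these
// values would be fetched and refreshed dynamically.
var ratios = map[string]float64{"testapp111": 0.1}

// keep reports whether an event should be forwarded: WARN and above are
// always kept, lower levels are randomly sampled at the app's ratio.
func keep(appID, level string) bool {
	if severity[level] >= severity["WARN"] {
		return true
	}
	ratio, ok := ratios[appID]
	if !ok {
		return true // apps without a configured ratio are not sampled
	}
	return rand.Float64() < ratio
}

func main() {
	fmt.Println(keep("testapp111", "ERROR")) // prints "true": WARN+ is never dropped
	fmt.Println(keep("otherapp", "INFO"))    // prints "true": no ratio configured
}
```

Because the decision is per event and stateless, a ratio change in the config service takes effect immediately without restarting the pipeline.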
Data‑Node Hardware Bottleneck
During peak hours (20:00‑24:00) bulk request rejections and SSD I/O saturation were observed. Adding additional PCIe SSDs to each data node and moving part of the write load to stale nodes alleviated the issue.
Logstash Performance
Logstash struggled to keep up with the peak load of 400 k qps; a Go-based splitter (Billions Index) was developed, with a single instance achieving >50 k qps in only 150 MB of memory, cutting resource usage by half.
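The splitter's core loop amounts to fanning raw lines out to parallel workers that decode the JSON and route each event to its index. This is a simplified sketch: the worker count, routing rule, and channel wiring are assumptions, and the production splitter additionally batches events for bulk indexing.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
	"sync"
)

// split fans raw log lines out to one worker per CPU; each worker decodes
// the JSON and emits a routing key (app plus day) for the event.
func split(lines <-chan []byte, indexed chan<- string) {
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range lines {
				var ev map[string]any
				if json.Unmarshal(raw, &ev) != nil {
					continue // drop malformed events
				}
				appID, _ := ev["app_id"].(string)
				// Route to a per-app daily index; the naming scheme is assumed.
				indexed <- fmt.Sprintf("%s-%v", appID, ev["time"])
			}
		}()
	}
	wg.Wait()
	close(indexed)
}

func main() {
	lines := make(chan []byte, 1)
	lines <- []byte(`{"app_id":"testapp111","time":"2017-08-04"}`)
	close(lines)

	indexed := make(chan string, 1)
	split(lines, indexed)
	for name := range indexed {
		fmt.Println(name) // prints "testapp111-2017-08-04"
	}
}
```

Skipping Logstash's generic filter pipeline in favor of a fixed-schema decode is where most of the per-event cost disappears.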
Log Monitoring
ElastAlert was extended: rules are stored in Elasticsearch for reliability and distributed execution; a RESTful API manages rules; integration with Bili‑Moni provides alert routing; a global lock ensures high‑availability failover.
Current Issues and Future Work
- Lack of permission control: need unified authentication, index-level ACLs, and audit trails.
- Missing full-link monitoring: end-to-end visibility of log loss and backlog.
- Monitoring performance bottleneck: single-node rule execution; plan to move to distributed, parallel rule processing.
- Complex log-splitting configuration: per-format Logstash instances; aim for a platform-wide, pluggable solution.
Beyond these improvements, deeper log value mining using Elasticsearch’s aggregation capabilities is a key direction.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.