
How Bilibili Scaled Its Log System to 10TB Daily with Elastic Stack

This article details Bilibili's Billions log platform—from its fragmented origins and design goals to the elastic‑stack‑based architecture, shard management, log sampling, custom Go splitters, and monitoring enhancements—highlighting the challenges faced and the roadmap for future improvements.

Efficient Ops

Original Log System

Before Billions, Bilibili had no unified logging platform; each product line used its own solution, ranging from ELK‑based setups to simple tail/grep scripts, leading to high maintenance cost, instability, log loss, and poor usability.

The main pain points were:

Inconsistent solutions with high maintenance overhead.

No unified log format, making parsing and analysis costly.

Poor PaaS support for containerized applications.

Low utilization of logs beyond simple search.

Design goals for the new system were:

Seamless business log onboarding: simple configuration without exposing business details.

Diversity support: physical/virtual machines, containers, various sources and formats (plain text, JSON).

Log mining: fast queries, monitoring, statistical analysis.

System availability: real-time data, controllable loss rate.

Billions Evolution

System Initialization

Log Specification

All business logs must be JSON with four mandatory fields:

time – ISO8601 timestamp

level – FATAL, ERROR, WARN, INFO, DEBUG

app_id – globally unique application identifier

instance_id – instance identifier defined by the business

Additional details are stored in the <code>log</code> field, and extra fields may be added as needed. The mapping must remain stable.

<code>{"log": "hello billions, write more", "level": "INFO", "app_id": "testapp111", "instance_id": "instance1", "time": "2017-08-04T15:59:01.607483", "id": 0}</code>
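The specification reduces to a small schema; a minimal Go sketch that parses and validates an entry (the struct name and the validation rule are illustrative, only the four mandatory field names come from the spec):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LogEntry models the mandatory Billions log fields; extra fields
// such as "id" ride along in the raw JSON and are ignored here.
type LogEntry struct {
	Time       string `json:"time"`        // ISO8601 timestamp
	Level      string `json:"level"`       // FATAL/ERROR/WARN/INFO/DEBUG
	AppID      string `json:"app_id"`      // globally unique application id
	InstanceID string `json:"instance_id"` // business-defined instance id
	Log        string `json:"log"`         // free-form message body
}

// Validate checks that every mandatory field is present.
func (e LogEntry) Validate() error {
	if e.Time == "" || e.Level == "" || e.AppID == "" || e.InstanceID == "" {
		return fmt.Errorf("missing mandatory field")
	}
	return nil
}

func main() {
	raw := `{"log": "hello billions, write more", "level": "INFO", "app_id": "testapp111", "instance_id": "instance1", "time": "2017-08-04T15:59:01.607483", "id": 0}`
	var e LogEntry
	if err := json.Unmarshal([]byte(raw), &e); err != nil {
		panic(err)
	}
	fmt.Println(e.AppID, e.Validate() == nil) // testapp111 true
}
```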

Log Collection

Two paths are used:

Business logs are emitted via a Go-based log agent that writes to a Unix-domain socket; the agent runs on physical machines and in containers, with the socket exposed inside each container.

Non-business logs (middleware, system, ingress) are collected with Filebeat, which supports multi-line concatenation and tags each entry with an <code>app_id</code>.
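A Filebeat input for such a log might look like the following sketch (paths, the <code>app_id</code> value, and broker addresses are placeholders; option names vary slightly across Filebeat versions). The multiline settings join continuation lines, such as stack traces, into one event:

```yaml
filebeat.prospectors:         # "inputs" in newer Filebeat versions
  - input_type: log
    paths:
      - /var/log/nginx/error.log
    fields:
      app_id: ingress.nginx   # tag every event with its app_id
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'  # a new event starts with a date
    multiline.negate: true
    multiline.match: after    # non-matching lines are appended to the previous event
output.kafka:
  hosts: ["lancer-kafka:9092"]      # placeholder broker address
  topic: "billions-ingress.nginx"   # illustrative unified topic prefix
```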

The log agent consists of a collector (receives logs) and a sender (pushes logs); they share a local file cache to survive transport failures.
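The socket handoff between a business process and the collector can be sketched end to end in a few lines of Go. This is a self-contained demo, not the agent's actual code: the socket path, one-connection collector, and line-delimited framing are all assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// roundTrip starts a throwaway collector on a Unix-domain socket,
// sends one log line to it the way a business process would, and
// returns what the collector received.
func roundTrip(line string) (string, error) {
	sock := filepath.Join(os.TempDir(), "billions-demo.sock")
	os.Remove(sock) // ignore error; path may not exist yet

	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer ln.Close()

	got := make(chan string, 1)
	go func() { // collector side: accept one connection, read one line
		conn, err := ln.Accept()
		if err != nil {
			got <- ""
			return
		}
		defer conn.Close()
		s, _ := bufio.NewReader(conn).ReadString('\n')
		got <- strings.TrimRight(s, "\n")
	}()

	conn, err := net.Dial("unix", sock) // business side: connect and emit
	if err != nil {
		return "", err
	}
	fmt.Fprintln(conn, line)
	conn.Close()

	return <-got, nil
}

func main() {
	msg := `{"log":"hello billions","level":"INFO","app_id":"testapp111","instance_id":"instance1","time":"2017-08-04T15:59:01"}`
	out, err := roundTrip(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(out == msg)
}
```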

Transport Layer

Bilibili uses an internal data transport platform called Lancer, built on Flume + Kafka, providing load balancing, horizontal scalability, reliable protocols, and 24/7 support. Logs from the agent are sent to Kafka topics with a unified prefix.

Log Splitting

Logstash consumes the Kafka topics, extracts fields, transforms formats, and indexes data into Elasticsearch. Topics are matched using a <code>topics_pattern</code>, and partition assignment uses the RoundRobinAssignor to avoid uneven distribution.

For non-standard log formats, separate Logstash instances are required, since a single Logstash pipeline cannot cleanly handle multiple event formats at once.
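The consuming pipeline described above can be sketched as a Logstash config. Broker addresses, the topic prefix, and the index naming pattern are placeholders; option names may differ across versions of the Kafka input plugin:

```conf
input {
  kafka {
    bootstrap_servers => "lancer-kafka:9092"   # placeholder broker list
    topics_pattern    => "billions-.*"         # match all topics with the unified prefix
    partition_assignment_strategy => "org.apache.kafka.clients.consumer.RoundRobinAssignor"
    codec             => "json"
  }
}
output {
  elasticsearch {
    hosts => ["es-client:9200"]                     # placeholder client node
    index => "billions-%{app_id}-%{+YYYY.MM.dd}"    # one daily index per app_id (illustrative)
  }
}
```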

Search and Storage

The Elasticsearch cluster consists of 3 master nodes, 20 hot nodes (SSD), 20 stale nodes (HDD), and 2 client nodes. Hot nodes store real‑time logs; stale nodes hold historical data. Indexes are created a day in advance, and Curator manages lifecycle (default 7‑day retention). Templates enforce mapping consistency.
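The 7-day retention can be expressed as a Curator action file; a sketch, assuming an illustrative <code>billions-</code> index prefix and daily date-stamped index names:

```yaml
actions:
  1:
    action: delete_indices
    description: "Drop Billions indexes older than the 7-day retention window"
    options:
      ignore_empty_list: true
    filters:
      - filtertype: pattern
        kind: prefix
        value: billions-          # placeholder index prefix
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'    # date stamp embedded in the index name
        unit: days
        unit_count: 7
```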

Monitoring uses a custom exporter that feeds metrics to Prometheus; alerts cover cluster health, thread rejections, node counts, and unassigned shards.

System Iterations

Shard Management

Initially each index used 5 primary + 5 replica shards, causing excess overhead and imbalance for large indexes. A shard-management module now creates daily indexes with a target size of 30 GB per shard, capped at 15 shards per index, reducing the total shard count by over 70%.
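The sizing rule reduces to a small function; a sketch under the stated 30 GB target and 15-shard cap (in practice the size of the previous day's index would be read from Elasticsearch stats, which is not shown here):

```go
package main

import "fmt"

const (
	gib            = int64(1) << 30
	targetShardGiB = 30 // target size per shard
	maxShards      = 15 // hard cap per index
)

// shardsFor returns the primary-shard count for the next daily index,
// given the observed size of the previous day's index in bytes.
func shardsFor(indexBytes int64) int {
	n := int((indexBytes + targetShardGiB*gib - 1) / (targetShardGiB * gib)) // ceil division
	if n < 1 {
		n = 1 // every index gets at least one shard
	}
	if n > maxShards {
		n = maxShards
	}
	return n
}

func main() {
	fmt.Println(shardsFor(10 * gib))   // small index -> 1
	fmt.Println(shardsFor(100 * gib))  // 100 GiB / 30 GiB, rounded up -> 4
	fmt.Println(shardsFor(1000 * gib)) // huge index -> capped at 15
}
```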

Log Sampling

For high-volume services (>500 GB/day), INFO-level logs are randomly sampled per <code>app_id</code>, while WARN and above are kept in full. Sampling ratios are stored in a central config service and can be changed dynamically.
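The sampling decision itself is simple; a sketch, assuming an in-memory ratio table standing in for the central config service (the level ordering and the "keep when unconfigured" default are assumptions):

```go
package main

import (
	"fmt"
	"math/rand"
)

// severity orders log levels so "WARN and above" is a simple comparison.
var severity = map[string]int{"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}

// sampleRatio holds per-app_id INFO sampling ratios; in the real system
// these come from the central config service and can change dynamically.
// The value here is illustrative.
var sampleRatio = map[string]float64{"bigapp": 0.1}

// keep decides whether a log event is retained. The random draw r in
// [0,1) is injected as a parameter so the decision is testable.
func keep(appID, level string, r float64) bool {
	if severity[level] >= severity["WARN"] {
		return true // WARN and above are always kept in full
	}
	ratio, ok := sampleRatio[appID]
	if !ok {
		return true // apps without a configured ratio are not sampled
	}
	return r < ratio
}

func main() {
	fmt.Println(keep("bigapp", "ERROR", rand.Float64())) // always true
	fmt.Println(keep("bigapp", "INFO", 0.05))            // kept: 0.05 < 0.1
	fmt.Println(keep("bigapp", "INFO", 0.5))             // dropped: 0.5 >= 0.1
}
```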

Data‑Node Hardware Bottleneck

During peak hours (20:00–24:00), bulk request rejections and SSD I/O saturation were observed. Installing additional PCIe SSDs in each data node and moving part of the write load to stale nodes alleviated the issue.

Logstash Performance

Logstash could not keep up with peak loads of roughly 400k qps; a custom Go-based splitter (Billions Index) was developed, with each instance handling over 50k qps in only 150 MB of memory, cutting overall resource usage by half.
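The core job of such a splitter is small: parse each event and route it to its daily index. A sketch of that routing step, not the actual Billions Index code; the index naming convention is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// event carries only the fields the splitter needs for routing.
type event struct {
	AppID string `json:"app_id"`
	Time  string `json:"time"` // ISO8601, e.g. 2017-08-04T15:59:01.607483
}

// indexFor parses one raw JSON log line and returns the daily
// Elasticsearch index it should be bulk-written to.
func indexFor(raw []byte) (string, error) {
	var e event
	if err := json.Unmarshal(raw, &e); err != nil {
		return "", err
	}
	day := e.Time // keep only the date part of the ISO8601 timestamp
	if i := strings.IndexByte(day, 'T'); i >= 0 {
		day = day[:i]
	}
	return fmt.Sprintf("billions-%s-%s", e.AppID, day), nil
}

func main() {
	idx, err := indexFor([]byte(`{"app_id":"testapp111","time":"2017-08-04T15:59:01.607483","level":"INFO","log":"hi"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(idx) // billions-testapp111-2017-08-04
}
```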

Log Monitoring

ElastAlert was extended: rules are stored in Elasticsearch for reliability and distributed execution; a RESTful API manages rules; integration with Bili‑Moni provides alert routing; a global lock ensures high‑availability failover.
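For context, a stock ElastAlert rule looks like the sketch below (names, thresholds, and the endpoint are placeholders); in Billions the rule body is stored in Elasticsearch and managed through the RESTful API rather than as a file on disk:

```yaml
name: testapp111-error-burst
type: frequency               # fire when too many matches occur in a window
index: billions-testapp111-*  # illustrative index pattern
num_events: 50
timeframe:
  minutes: 5
filter:
  - term:
      level: "ERROR"
alert:
  - post                      # HTTP POST alerter, routed on to Bili-Moni
http_post_url: "http://bili-moni.example/alert"   # placeholder endpoint
```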

Current Issues and Future Work

Lack of permission control: need unified authentication, index-level ACLs, and audit trails.

Missing full-link monitoring: no end-to-end visibility into log loss and backlog.

Monitoring performance bottleneck: rules currently run on a single node; the plan is distributed, parallel rule processing.

Complex log-splitting configuration: per-format Logstash instances; the aim is a platform-wide, pluggable solution.

Beyond these improvements, deeper log value mining using Elasticsearch’s aggregation capabilities is a key direction.

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
