Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry
Bilibili's Log Service 2.0 replaces its Elastic Stack pipeline with an OpenTelemetry-driven architecture: high-performance Go and Java SDKs report logs into ClickHouse, backed by a custom query gateway, a self-developed visualization UI, and advanced alerting. The move delivered roughly 10× write throughput, 2× query speed, and storage cost cut to one third.
Logging is a critical means for online troubleshooting and observability. Bilibili has been operating a log system (Billions) based on the Elastic Stack since 2017, serving over 500 machines and ingesting more than 700 TB of logs per day. While ELK provides flexible JSON transport, full‑text search, and a user‑friendly UI (Kibana), rapid business growth exposed several limitations such as high cost, stability issues, and scalability bottlenecks.
Problems identified
Elasticsearch’s tokenization causes CPU‑intensive write bottlenecks and high storage cost, leading to latency and instability.
Low compression ratio and memory pressure force frequent sampling and throttling.
Warm‑stage indices must be closed to free memory, reducing usability.
Dynamic mapping had to be disabled, complicating user queries.
Before Elasticsearch 7, index lifecycle management relied on separate components that required high maintenance.
Kibana’s code complexity and upgrade coupling increase migration cost.
The internal JSON-based SDKs for Java and Golang offer only mediocre serialization performance and raise compatibility concerns.
New architecture (Log Service 2.0)
The redesign replaces Elasticsearch with ClickHouse for storage, introduces a self‑developed visualization platform, and adopts OpenTelemetry as a unified log reporting protocol. The end‑to‑end pipeline consists of:
OTEL Logging SDK : High‑performance SDKs for Golang and Java implementing the OpenTelemetry logging model.
Log‑Agent : Deployed on physical machines, receives OTEL logs over a Unix domain socket, performs low‑latency file collection, and supports various log formats.
Log‑Ingester : Subscribes to Kafka, partitions logs by time and metadata, and batches writes into ClickHouse.
ClickHouse : Columnar storage with high compression (ZSTD) and implicit columns for dynamic schema.
Log‑Query : Handles routing, load balancing, caching, rate limiting, and simplifies query syntax.
BLS‑Discovery : One‑stop visual analysis platform offering zero‑learning‑cost log search.
ClickHouse‑based log storage
Switching to ClickHouse yields a 10× increase in write throughput and reduces storage cost to one‑third of the previous system. Query performance for structured fields improves by 2×, with 99 % of queries completing within 3 seconds.
Data model. Before structuring:
log.Info("report id=32 created by user 4253")
After structuring:
log.Infov(log.KVString("log_type", "report_created"), log.KVInt("report_id", 32), log.KVInt("user_id", 4253))
ClickHouse tables contain public fields (OTEL resources, trace_id, span_id) plus three implicit‑column maps (string_map, number_map, bool_map) that store dynamic fields efficiently.
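The routing of dynamic fields into the three maps can be sketched as below. This is an assumption-laden illustration (the Row type and AddField are invented for clarity), showing only the type dispatch: strings, numbers, and booleans land in their respective maps.

```go
package main

import "fmt"

// Row mirrors the per-record dynamic fields stored in the three
// implicit-column maps described in the text.
type Row struct {
	StringMap map[string]string
	NumberMap map[string]float64
	BoolMap   map[string]bool
}

func NewRow() *Row {
	return &Row{map[string]string{}, map[string]float64{}, map[string]bool{}}
}

// AddField routes a dynamic field into the map matching its runtime type,
// so each key can later be queried from the right typed column.
func (r *Row) AddField(key string, val interface{}) {
	switch v := val.(type) {
	case string:
		r.StringMap[key] = v
	case int:
		r.NumberMap[key] = float64(v)
	case float64:
		r.NumberMap[key] = v
	case bool:
		r.BoolMap[key] = v
	}
}

func main() {
	r := NewRow()
	r.AddField("log_type", "report_created")
	r.AddField("report_id", 32)
	r.AddField("user_id", 4253)
	fmt.Println(r.StringMap["log_type"], r.NumberMap["report_id"]) // prints "report_created 32"
}
```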
Query gateway
The gateway abstracts ClickHouse’s local/distributed tables, implicit columns, and enforces limits (forced LIMIT, time range). It also parses Lucene‑style queries and translates them to SQL, enabling seamless API migration.
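The Lucene-to-SQL translation can be sketched for the narrowest case, a conjunction of `field:value` terms rewritten as predicates over the implicit string_map column. The real gateway parses a far richer grammar; the function name and output shape here are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// LuceneToWhere translates a query of the shape "k:v AND k2:v2" into a
// SQL WHERE clause over string_map. It handles only this narrow shape;
// quoting, ranges, OR/NOT, and escaping are omitted.
func LuceneToWhere(q string) string {
	terms := strings.Split(q, " AND ")
	preds := make([]string, 0, len(terms))
	for _, t := range terms {
		kv := strings.SplitN(strings.TrimSpace(t), ":", 2)
		if len(kv) != 2 {
			continue // skip malformed terms
		}
		preds = append(preds, fmt.Sprintf("string_map['%s'] = '%s'", kv[0], kv[1]))
	}
	return strings.Join(preds, " AND ")
}

func main() {
	fmt.Println(LuceneToWhere("level:ERROR AND app:bls-discovery"))
	// prints "string_map['level'] = 'ERROR' AND string_map['app'] = 'bls-discovery'"
}
```

On top of this rewrite, the gateway would append the forced LIMIT and time-range predicates mentioned above before dispatching the SQL to ClickHouse.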
Self‑developed visualization platform
The platform mimics Kibana’s UI while providing SQL‑based aggregation, auto‑completion via CodeMirror2, and fast analysis capabilities.
Log alerting
Alert rules are defined with attributes such as data source, time window, calculation interval, functions (count, sum, max, distinct), filter expressions, trigger conditions, channels, and suppression policies. Over 5,000 alert rules have been migrated from ES to the new system.
OpenTelemetry Logging
OpenTelemetry’s stable logging protocol defines a standard model for logs, metrics, and traces. Bilibili implements the OTEL Logging SDK for Golang and Java and integrates an OTEL‑compatible collector in Log‑Agent.
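The core fields of the OpenTelemetry log data model (Timestamp, SeverityText, Body, Attributes, TraceId, SpanId) can be sketched as a plain struct. These Go types are illustrative only, not the official SDK API.

```go
package main

import (
	"fmt"
	"time"
)

// LogRecord carries the main fields of the OTEL log data model; the
// trace_id/span_id fields are what link a log line to its trace context.
type LogRecord struct {
	Timestamp    time.Time
	SeverityText string
	Body         string
	Attributes   map[string]string
	TraceID      string
	SpanID       string
}

func main() {
	rec := LogRecord{
		Timestamp:    time.Now(),
		SeverityText: "INFO",
		Body:         "report created",
		Attributes:   map[string]string{"report_id": "32", "user_id": "4253"},
		TraceID:      "4bf92f3577b34da6a3ce929d0e0e4736", // example W3C trace id
		SpanID:       "00f067aa0ba902b7",
	}
	fmt.Println(rec.SeverityText, rec.Body, rec.Attributes["report_id"])
}
```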
Search optimization
For large-scale logs, tokenbf_v1 secondary indexes and the ~ operator are used to narrow the search range. Structured logging is encouraged to improve searchability.
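Why a token bloom filter helps: the index splits each string into alphanumeric tokens at write time, so a token search can skip granules whose filter lacks the token. The tokenizer below mimics that splitting; it is an illustration of the idea, not ClickHouse's implementation.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Tokenize splits a log line on non-alphanumeric runes, approximating
// the token extraction a token bloom-filter index performs per granule.
func Tokenize(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Println(Tokenize("report id=32 created by user 4253"))
	// prints "[report id 32 created by user 4253]"
}
```

Because whole tokens are what get indexed, searching for a full token (e.g. a request id) prunes well, while substring or leading-wildcard searches cannot use the index.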
ClickHouse enhancements
Configuration optimizations address “Too many parts” and Zookeeper overload by adjusting merge parameters (min_bytes_for_wide_part, max_bytes_to_merge_*, background_pool_size) and employing auxiliary Zookeeper clusters.
Map type and implicit columns
Native Map suffers from lack of indexing and read amplification. Bilibili adds tokenbf_v1 indexes on map keys and introduces MapV2, which materializes each key as an implicit column, dramatically reducing I/O for key‑specific queries.
CREATE TABLE bloom_filter_map (
    `id` UInt32,
    `map` Map(String, String),
    INDEX map_index mapKeys(map) TYPE tokenbf_v1(128, 3, 0) GRANULARITY 1
) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 2;
INSERT INTO bloom_filter_map VALUES (1, {'k1':'v1','k2':'v2'});
INSERT INTO bloom_filter_map VALUES (2, {'k1':'v1_2','k3':'v3'});
INSERT INTO bloom_filter_map VALUES (3, {'k4':'v4','k5':'v5'});
SELECT map['k1'] FROM bloom_filter_map;
Tests show that implicit columns enable index-driven pruning, yielding significant performance gains over the native Map type.
Future work
Log pattern extraction for unstructured logs, enabling compression, post‑processing, and anomaly detection.
Integration with lake‑house architectures for long‑term storage and advanced analytics (ML, BI).
Further ClickHouse full‑text search research to close the gap with Elasticsearch.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.