Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry
This article describes how Bilibili redesigned its log service, replacing Elasticsearch with ClickHouse, introducing OpenTelemetry-based logging, and rebuilding the storage, query, and alerting components. It also covers ClickHouse enhancements such as configuration tuning, Map types, and implicit columns, which together deliver higher performance, lower cost, and better observability.
Logs are a critical tool for online troubleshooting and observability, and a log system must balance stability, cost, usability, and scalability. Bilibili's original Elastic Stack‑based log system (Billions) has been in production since 2017, now running on over 500 machines and ingesting more than 700 TB of logs per day.
Several issues emerged as the system grew: high write‑throughput bottlenecks in Elasticsearch, expensive storage due to low compression, memory pressure, the need for frequent sampling and rate‑limiting, costly dynamic mapping, lack of lifecycle management before ES 7, complex Kibana upgrades, and a custom JSON‑based SDK with limited performance.
To address these problems, Bilibili designed Log Service 2.0, moving log storage to ClickHouse, building a custom visualization platform, and adopting OpenTelemetry as a unified log reporting protocol.
The new pipeline consists of four stages: collection → ingestion → storage → analysis. Key components include:
OTEL Logging SDK: a high-performance structured logging SDK for Golang and Java implementing the OpenTelemetry logging model.
Log-Agent: a daemon deployed on physical hosts that receives OTEL logs via a domain socket and performs low-latency file collection, supporting multiple formats and basic processing.
Log-Ingester: consumes logs from Kafka, partitions them by time and metadata, and batch-writes them into ClickHouse.
ClickHouse: columnar storage with high compression and implicit columns for dynamic schema, delivering 10× write throughput and 2× query speed compared to Elasticsearch at one third the cost.
Log-Query: provides routing, load balancing, caching, rate limiting, and a simplified query syntax.
BLS-Discovery: a self-developed visual analysis platform offering a Kibana-like UI with zero learning curve.
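The batching behavior of the ingestion stage can be sketched roughly as follows. This is an illustrative toy, not Bilibili's actual implementation; all type and function names here are hypothetical. The idea is that ClickHouse should receive few large inserts rather than many small ones, which also helps avoid the "Too many parts" error discussed later:

```go
package main

import (
	"fmt"
	"time"
)

// LogRecord is a simplified log entry consumed from Kafka
// (hypothetical shape; a real OTEL record carries more fields).
type LogRecord struct {
	Timestamp time.Time
	AppID     string
	Body      string
}

// Batcher groups records so the sink sees few large batches
// instead of many small writes.
type Batcher struct {
	maxSize int
	buf     []LogRecord
	flush   func([]LogRecord) // e.g. one batched INSERT into ClickHouse
}

func NewBatcher(maxSize int, flush func([]LogRecord)) *Batcher {
	return &Batcher{maxSize: maxSize, flush: flush}
}

// Add buffers a record and flushes when the batch is full.
func (b *Batcher) Add(r LogRecord) {
	b.buf = append(b.buf, r)
	if len(b.buf) >= b.maxSize {
		b.Flush()
	}
}

// Flush writes out the current batch; a production ingester would
// also flush on a timer so low-volume topics do not stall.
func (b *Batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil
}

func main() {
	var batches [][]LogRecord
	b := NewBatcher(2, func(recs []LogRecord) {
		batches = append(batches, recs)
	})
	for i := 0; i < 5; i++ {
		b.Add(LogRecord{Timestamp: time.Now(), AppID: "demo", Body: fmt.Sprintf("line %d", i)})
	}
	b.Flush() // flush the trailing partial batch
	fmt.Println(len(batches)) // three batches: 2 + 2 + 1
}
```

A real ingester would additionally partition records by time and metadata before flushing, as described above.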
Key design details:
3.1 ClickHouse‑based Log Storage
Using ClickHouse's high-compression columnar format and implicit columns, the system achieved 10× write throughput and cut storage cost to one third of the previous system's. Structured fields see a 2× query speedup, with 99% of queries completing within three seconds.
3.2 Query Gateway
The gateway abstracts the underlying ClickHouse tables, providing SQL-style queries without exposing hidden columns or cluster details, and integrates a Lucene-to-SQL parser for seamless API migration.
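To make the Lucene-to-SQL idea concrete, here is a deliberately minimal translator for a tiny subset of Lucene query syntax ("field:value" terms joined by AND). This is a toy sketch, not Bilibili's actual parser: it skips escaping, OR/NOT, ranges, and wildcards, and the `body` column name is a hypothetical choice:

```go
package main

import (
	"fmt"
	"strings"
)

// luceneToWhere converts "field:value" terms joined by AND into a
// SQL WHERE clause. Bare terms fall back to a full-text LIKE match
// against a hypothetical "body" column.
func luceneToWhere(q string) string {
	var conds []string
	for _, t := range strings.Fields(q) {
		if strings.EqualFold(t, "AND") {
			continue // terms are implicitly AND-ed
		}
		kv := strings.SplitN(t, ":", 2)
		if len(kv) == 2 {
			conds = append(conds, fmt.Sprintf("%s = '%s'", kv[0], kv[1]))
		} else {
			conds = append(conds, fmt.Sprintf("body LIKE '%%%s%%'", t))
		}
	}
	return strings.Join(conds, " AND ")
}

func main() {
	fmt.Println(luceneToWhere("level:ERROR AND app:bls timeout"))
	// level = 'ERROR' AND app = 'bls' AND body LIKE '%timeout%'
}
```

A production parser would build a proper AST and parameterize values to avoid injection; the point here is only the shape of the translation.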
3.3 Visual Analysis Platform
A custom UI mimics Kibana’s ergonomics while adding features such as query highlighting, field distribution analysis, time‑series previews, and instant SQL aggregation for rapid log investigation.
3.4 Log Alerting
Alert rules are defined with attributes such as data source, time window, calculation interval, functions (count, sum, distinct), filter expressions, trigger conditions, channels, and storm suppression. Over 5,000 alerts have been migrated from the ES-based system.
3.5 OpenTelemetry Logging
OpenTelemetry provides a unified API for logs, metrics, and traces. Bilibili implemented stable OTEL logging SDKs for Golang and Java and integrated an OTEL‑compatible collector into Log‑Agent.
3.6 Solving Log Search Challenges
For large‑scale logs, secondary indexes (tokenbf_v1) and token‑based operators enable fast ID‑based lookups, while encouraging users to filter by logger name or source line to limit scan ranges.
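The ID-lookup pattern can be illustrated with a tokenbf_v1 skip index plus ClickHouse's hasToken function, which matches whole tokens and so can use the bloom filter to prune granules. Table, column, and parameter values below are hypothetical choices for illustration, not Bilibili's schema:

```sql
CREATE TABLE app_log (
    `timestamp` DateTime,
    `logger` String,
    `body` String,
    INDEX body_index body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
) ENGINE = MergeTree ORDER BY timestamp;

-- hasToken matches a whole token, so the bloom-filter index can skip
-- granules that cannot contain the ID; a bare LIKE '%...%' cannot
-- prune as effectively.
SELECT timestamp, body
FROM app_log
WHERE hasToken(body, '4253')       -- e.g. a user or trace ID
  AND logger = 'report_service'    -- narrow the scan range, as advised above
  AND timestamp >= now() - INTERVAL 1 HOUR;
```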
Example code (original unstructured log):
log.Info("report id=32 created by user 4253")
After structuring:
log.Infov(log.KVString("log_type","report_created"), log.KVInt("report_id",32), log.KVInt("user_id",4253))
4 ClickHouse Enhancements and Optimizations
4.1 Configuration Tuning
Addressed the "Too many parts" error by tuning batch sizes and merge parameters (min_bytes_for_wide_part, max_bytes_to_merge_at_min_space_in_pool, background_pool_size), and relieved ZooKeeper load with auxiliary clusters.
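The merge-related knobs above can be adjusted per table or at the server level; the values below are placeholders for illustration (and `app_log` a hypothetical table), not Bilibili's production numbers:

```sql
-- Per-table MergeTree settings: larger thresholds mean fewer, bigger
-- parts, which helps avoid "Too many parts" under heavy ingest.
ALTER TABLE app_log
    MODIFY SETTING min_bytes_for_wide_part = 10485760,
                   max_bytes_to_merge_at_min_space_in_pool = 31457280;

-- background_pool_size is a server-level setting (config.xml), e.g.:
--   <background_pool_size>32</background_pool_size>
```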
4.2 Dynamic Map Type
Introduced Map(String, String) to store dynamic schema fields, but native Map lacks indexing and incurs read amplification.
4.3 Map Implementation
Native Map stores data as Array(Tuple(key, value)), causing unnecessary reads for unrelated keys.
4.4 Map Index Support
Added tokenbf_v1 indexes on each map key to prune granules during queries.
4.5 Implicit Columns for Map
Each map key is materialized as a separate column (implicit column), enabling column‑level reads and index support. Implemented as a new MapV2 type.
4.6 Implicit Column Write Path
During map deserialization, each key is written to its dedicated column; missing keys receive default values to keep row counts consistent.
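The default-filling behavior can be shown in miniature. This Go sketch only illustrates the idea of flattening per-row maps into per-key columns; ClickHouse's actual write path is implemented in C++ inside the server:

```go
package main

import (
	"fmt"
	"sort"
)

// flattenMaps turns per-row maps into per-key column slices.
// Rows missing a key receive the default value, so every implicit
// column has the same number of rows as the table.
func flattenMaps(rows []map[string]string, def string) map[string][]string {
	// collect the full key set across all rows first
	keys := map[string]bool{}
	for _, r := range rows {
		for k := range r {
			keys[k] = true
		}
	}
	cols := map[string][]string{}
	for k := range keys {
		col := make([]string, len(rows))
		for i, r := range rows {
			if v, ok := r[k]; ok {
				col[i] = v
			} else {
				col[i] = def // keep row counts consistent
			}
		}
		cols[k] = col
	}
	return cols
}

func main() {
	rows := []map[string]string{
		{"k1": "v1", "k2": "v2"},
		{"k1": "v1_2", "k3": "v3"},
	}
	cols := flattenMaps(rows, "")
	var names []string
	for k := range cols {
		names = append(names, k)
	}
	sort.Strings(names)
	for _, k := range names {
		fmt.Println(k, cols[k])
	}
}
```

Each resulting slice corresponds to one implicit column, which is what makes column-level reads and per-key indexes possible.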
4.7 Query Tests
Created test tables:
CREATE TABLE bloom_filter_map (
`id` UInt32,
`map` Map(String, String),
INDEX map_index map TYPE tokenbf_v1(128, 3, 0) GRANULARITY 1
) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 2;
-- insert data
insert into bloom_filter_map values (1, {'k1':'v1','k2':'v2'});
insert into bloom_filter_map values (2, {'k1':'v1_2','k3':'v3'});
insert into bloom_filter_map values (3, {'k4':'v4','k5':'v5'});
-- query
select map['k1'] from bloom_filter_map;
Implicit columns dramatically reduced I/O for key-specific queries, and the approach has been contributed upstream (PR #28511).
5 Future Work
5.1 Log Pattern Extraction
Develop mechanisms to extract patterns from unstructured logs for compression, post‑processing, and anomaly detection.
5.2 Lakehouse Integration
Leverage low‑cost lake storage for long‑term retention (e.g., compliance logs) and enable downstream analytics such as machine learning and BI.
5.3 ClickHouse Full‑Text Search
Explore new data structures and indexing strategies to close the gap with Elasticsearch in full‑text scenarios.
Reference links:
OpenTracing
OpenCensus
OpenTelemetry
OpenTelemetry Collector
ClickHouse PR 28511