
Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry

This article describes how Bilibili redesigned its log service by replacing Elasticsearch with ClickHouse, introducing OpenTelemetry‑based logging, optimizing the storage, query, and alerting components, and enhancing ClickHouse itself through configuration tuning, Map types, and implicit columns, achieving higher performance, lower cost, and better observability.


Logs are a critical tool for online troubleshooting and observability, and a log system must balance stability, cost, usability, and scalability. Bilibili's original Elastic Stack‑based log system (Billions) has been in production since 2017, now running on over 500 machines and ingesting more than 700 TB of logs per day.

Several issues emerged as the system grew: high write‑throughput bottlenecks in Elasticsearch, expensive storage due to low compression, memory pressure, the need for frequent sampling and rate‑limiting, costly dynamic mapping, lack of lifecycle management before ES 7, complex Kibana upgrades, and a custom JSON‑based SDK with limited performance.

To address these problems, Bilibili designed Log Service 2.0, moving log storage to ClickHouse, building a custom visualization platform, and adopting OpenTelemetry as a unified log reporting protocol.

The new pipeline consists of four stages: collection → ingestion → storage → analysis. Key components include:

OTEL Logging SDK : high‑performance structured logging SDK for Golang and Java implementing the OpenTelemetry logging model.

Log‑Agent : a daemon deployed on physical hosts that receives OTEL logs via a domain socket and performs low‑latency file collection, supporting multiple formats and basic processing.

Log‑Ingester : consumes logs from Kafka, partitions them by time and metadata, and batches writes into ClickHouse.

ClickHouse : columnar storage with high compression and implicit columns for dynamic schema, delivering 10× write throughput and 2× query speed compared to Elasticsearch at one‑third the cost.

Log‑Query : provides routing, load‑balancing, caching, rate‑limiting, and a simplified query syntax.

BLS‑Discovery : a self‑developed visual analysis platform offering Kibana‑like UI with zero learning curve.

Key design details:

3.1 ClickHouse‑based Log Storage

Using ClickHouse’s high‑compression columnar format and implicit columns, the system achieved 10× write throughput and reduced storage cost to 1/3 of the previous system. Structured fields see a 2× query speed improvement, with 99 % of queries completing within three seconds.
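As a rough sketch of what such a table could look like (the table name, codecs, and TTL below are assumptions for illustration, not Bilibili's actual DDL), a ClickHouse log table typically combines low-cardinality dimensions, a compressed message body, and a Map column for dynamic fields:

```sql
CREATE TABLE logs.app_log (
    `time`    DateTime CODEC(DoubleDelta, LZ4),
    `app_id`  LowCardinality(String),
    `level`   LowCardinality(String),
    `message` String CODEC(ZSTD(1)),
    `fields`  Map(String, String)
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(time)
ORDER BY (app_id, time)
TTL time + INTERVAL 30 DAY;
```

The ORDER BY key makes per-application time-range scans cheap, and the TTL clause gives the lifecycle management that required extra tooling before Elasticsearch 7.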

3.2 Query Gateway

The gateway abstracts the underlying ClickHouse tables, providing SQL‑style queries without exposing hidden columns or cluster details, and integrates a Lucene‑to‑SQL parser for seamless API migration.
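To show the idea behind such a parser, here is a deliberately tiny sketch that handles only "field:value" terms joined by AND/OR; the real gateway parses the full Lucene grammar and handles escaping, ranges, and quoting, none of which appear here.

```go
package main

import (
	"fmt"
	"strings"
)

// luceneToSQL translates a tiny subset of Lucene query syntax
// ("field:value" terms joined by AND/OR) into a SQL WHERE clause.
// Illustrative only: no quoting, escaping, ranges, or grouping.
func luceneToSQL(q string) string {
	var out []string
	for _, tok := range strings.Fields(q) {
		switch strings.ToUpper(tok) {
		case "AND", "OR":
			out = append(out, strings.ToUpper(tok))
		default:
			if parts := strings.SplitN(tok, ":", 2); len(parts) == 2 {
				out = append(out, fmt.Sprintf("%s = '%s'", parts[0], parts[1]))
			} else {
				// bare term: fall back to a substring match on the body
				out = append(out, fmt.Sprintf("message LIKE '%%%s%%'", tok))
			}
		}
	}
	return "WHERE " + strings.Join(out, " ")
}

func main() {
	fmt.Println(luceneToSQL("level:error AND app:web"))
}
```

Keeping this translation in the gateway means clients written against the old Elasticsearch query API can move over without rewriting their queries.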

3.3 Visual Analysis Platform

A custom UI mimics Kibana’s ergonomics while adding features such as query highlighting, field distribution analysis, time‑series previews, and instant SQL aggregation for rapid log investigation.

3.4 Log Alerting

Alert rules are defined with attributes such as data source, time window, calculation interval, functions (count, sum, distinct), filter expressions, trigger conditions, channels, and storm suppression. Over 5,000 alerts have been migrated from the ES‑based system.
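A rule with those attributes reduces naturally to a periodic ClickHouse query. The struct and query shape below are an illustrative sketch, not the platform's actual schema:

```go
package main

import "fmt"

// AlertRule captures the attributes listed above; field names are
// illustrative, not the alerting platform's actual model.
type AlertRule struct {
	DataSource string // ClickHouse table to evaluate against
	WindowSec  int    // time window the function aggregates over
	Function   string // count, sum, distinct, ...
	Filter     string // SQL filter expression
	Threshold  int    // trigger when the result exceeds this value
}

// toQuery renders the rule as the query an evaluator would run once
// per calculation interval; the result is compared to Threshold.
func (r AlertRule) toQuery() string {
	return fmt.Sprintf(
		"SELECT %s(*) FROM %s WHERE %s AND time > now() - %d",
		r.Function, r.DataSource, r.Filter, r.WindowSec)
}

func main() {
	r := AlertRule{
		DataSource: "logs.app_log",
		WindowSec:  300,
		Function:   "count",
		Filter:     "level = 'error'",
		Threshold:  100,
	}
	fmt.Println(r.toQuery())
}
```

Storm suppression and channel routing then operate on the stream of trigger events, not on the query itself.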

3.5 OpenTelemetry Logging

OpenTelemetry provides a unified API for logs, metrics, and traces. Bilibili implemented stable OTEL logging SDKs for Golang and Java and integrated an OTEL‑compatible collector into Log‑Agent.

3.6 Solving Log Search Challenges

For large‑scale logs, secondary indexes (tokenbf_v1) and token‑based operators enable fast ID‑based lookups, while encouraging users to filter by logger name or source line to limit scan ranges.
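In plain ClickHouse terms (table and column names are illustrative), this combination looks roughly like a tokenbf_v1 skip index on the message column plus a hasToken predicate, so the bloom filter can prune granules that cannot contain the ID:

```sql
-- illustrative: token bloom-filter index on the log body
ALTER TABLE logs.app_log
    ADD INDEX message_token message TYPE tokenbf_v1(30720, 2, 0) GRANULARITY 1;

-- hasToken lets the index prune granules before scanning
SELECT *
FROM logs.app_log
WHERE hasToken(message, 'a2f3e4b5')       -- e.g. a trace or request ID
  AND app_id = 'web'
  AND time > now() - 3600;
```

The extra equality filters on app_id and time narrow the scan range further, which is exactly the usage pattern users are encouraged to adopt.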

Example code (original unstructured log):

log.Info("report id=32 created by user 4253")

After structuring:

log.Infov(log.KVString("log_type","report_created"), log.KVInt("report_id",32), log.KVInt("user_id",4253))

4 ClickHouse Enhancements and Optimizations

4.1 Configuration Tuning

Addressed "Too many parts" by adjusting batch sizes, merge parameters (min_bytes_for_wide_part, max_bytes_to_merge_at_min_space_in_pool, background_pool_size) and handling Zookeeper load with auxiliary clusters.
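For instance (values purely illustrative; the right numbers depend on hardware and ingest volume), the merge-related settings can be adjusted per table, while background_pool_size is raised at the server level:

```sql
-- per-table MergeTree settings (illustrative values only)
ALTER TABLE logs.app_log MODIFY SETTING
    min_bytes_for_wide_part = 10485760,
    max_bytes_to_merge_at_min_space_in_pool = 1073741824;
-- background_pool_size is a server-level setting, adjusted in the
-- server configuration rather than per table
```

Larger ingester batches attack the same "Too many parts" problem from the write side; these settings attack it from the merge side.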

4.2 Dynamic Map Type

Introduced Map(String, String) to store dynamic schema fields, but native Map lacks indexing and incurs read amplification.

4.3 Map Implementation

Native Map stores data as Array(Tuple(key, value)), causing unnecessary reads for unrelated keys.

4.4 Map Index Support

Added tokenbf_v1 indexes on each map key to prune granules during queries.

4.5 Implicit Columns for Map

Each map key is materialized as a separate column (implicit column), enabling column‑level reads and index support. Implemented as a new MapV2 type.

4.6 Implicit Column Write Path

During map deserialization, each key is written to its dedicated column; missing keys receive default values to keep row counts consistent.
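The core of that write path can be sketched in a few lines: turn map-typed rows into one dense column per key, backfilling defaults wherever a key is absent so every column keeps the same row count. This is a simplification of what the MapV2 serialization does inside ClickHouse, with illustrative names:

```go
package main

import "fmt"

// explode turns map-typed rows into one column per key, filling rows
// that lack a key with the default value "" so all columns stay the
// same length -- the essence of the implicit-column write path.
func explode(rows []map[string]string) map[string][]string {
	cols := map[string][]string{}
	for i, row := range rows {
		for k, v := range row {
			if _, ok := cols[k]; !ok {
				// backfill defaults for rows written before this key appeared
				cols[k] = make([]string, i)
			}
			cols[k] = append(cols[k], v)
		}
		// pad columns whose key is missing from the current row
		for k, c := range cols {
			if len(c) == i {
				cols[k] = append(c, "")
			}
		}
	}
	return cols
}

func main() {
	cols := explode([]map[string]string{
		{"k1": "v1", "k2": "v2"},
		{"k1": "v1_2", "k3": "v3"},
	})
	fmt.Println(cols["k1"]) // both rows contain k1
	fmt.Println(cols["k2"]) // second row padded with the default ""
}
```

Because each key becomes an ordinary column, reads of a single key touch only that column's data, and skip indexes can be attached per key.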

4.7 Query Tests

Created test tables:

CREATE TABLE bloom_filter_map (
    `id` UInt32,
    `map` Map(String, String),
    INDEX map_index map TYPE tokenbf_v1(128, 3, 0) GRANULARITY 1
) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 2;

-- insert data
insert into bloom_filter_map values (1, {'k1':'v1','k2':'v2'});
insert into bloom_filter_map values (2, {'k1':'v1_2','k3':'v3'});
insert into bloom_filter_map values (3, {'k4':'v4','k5':'v5'});

-- query
select map['k1'] from bloom_filter_map;

Implicit columns dramatically reduced I/O for key‑specific queries, and the approach has been contributed upstream (PR #28511).

5 Future Work

5.1 Log Pattern Extraction

Develop mechanisms to extract patterns from unstructured logs for compression, post‑processing, and anomaly detection.
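The simplest form of this idea is masking the variable parts of a line so that lines sharing a template group together. Real pattern miners (Drain-style algorithms, for example) are far more sophisticated; the sketch below only masks numbers, purely to illustrate the direction:

```go
package main

import (
	"fmt"
	"regexp"
)

// numRe matches the numeric fragments that vary between otherwise
// identical log lines.
var numRe = regexp.MustCompile(`\d+`)

// pattern collapses variable parts of a raw log line into a
// placeholder, producing a template key for grouping and counting.
func pattern(line string) string {
	return numRe.ReplaceAllString(line, "<NUM>")
}

func main() {
	fmt.Println(pattern("report id=32 created by user 4253"))
}
```

Grouping by such templates is what would enable the compression, post-processing, and anomaly detection mentioned above.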

5.2 Lakehouse Integration

Leverage low‑cost lake storage for long‑term retention (e.g., compliance logs) and enable downstream analytics such as machine learning and BI.

5.3 ClickHouse Full‑Text Search

Explore new data structures and indexing strategies to close the gap with Elasticsearch in full‑text scenarios.

Reference links:

OpenTracing

OpenCensus

OpenTelemetry

OpenTelemetry Collector

ClickHouse PR 28511

Tags: Observability, OpenTelemetry, ClickHouse, Log Storage, Database Optimization, Log Infrastructure
Written by Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies