
Design and Implementation of Bilibili's Lancer Log Collection System

The article presents the architecture, component design, optimizations, and reliability guarantees of Bilibili's Lancer log collection system, a Flume‑based distributed pipeline that handles both real‑time and offline data streams at a scale of hundreds of billions of events per day.


Bilibili (commonly referred to as "B‑Station") operates a log collection system named Lancer, which gathers and transmits logs from all business services and provides both offline and real‑time data to satisfy computation and subscription needs. Built on Flume, the system addresses growing data volume and increasing demands for timeliness, stability, and reliability.

Problems with the previous big‑data collection service:

Insufficient system capacity: native Flume performance issues, heterogeneous system support difficulties, and lack of a unified protocol layer.

Chaotic instrumentation integration: missing or incorrect data points, data loss, no automated onboarding, and insufficient monitoring.

Incomplete data coverage: low terminal coverage and insufficient business scenario coverage.

Overall Architecture

The new design targets high throughput, low latency, data safety, timeliness, and high availability with disaster recovery to ensure zero data loss. The system consists of two data flows: a real‑time stream for event‑level reporting and an offline stream for batch synchronization (e.g., via Sqoop).

In the real‑time flow, data sources (servers and clients) send logs through a unified SDK (over TCP/UDP/LogStream or HTTP(S)) to the Lancer‑Gateway, which writes events to a Kafka buffer. The collector layer then pulls data from Kafka and stores it in HDFS, Hive, Elasticsearch, HBase, etc., for downstream consumption.

The offline flow uses Sqoop to batch‑sync database data, but its details are omitted.

Flume‑based Gateway and Distribution Layers

Flume, an Apache top‑level project, provides a distributed, reliable, and highly available log collection framework composed of Sources, Channels, and Sinks. Events flow from Sources to Channels (buffers) and finally to Sinks (destinations).
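The Source → Channel → Sink flow can be illustrated with a minimal plain‑Java sketch (this models the concept only; it is not the real Flume API, and the class and method names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal model of Flume's three components: a Source produces events,
// a Channel buffers them, and a Sink drains them to a destination.
public class MiniFlume {
    // Channel: a bounded in-memory buffer between Source and Sink.
    static class Channel {
        private final Queue<String> buffer = new ArrayDeque<>();
        private final int capacity;
        Channel(int capacity) { this.capacity = capacity; }
        synchronized boolean put(String event) {
            if (buffer.size() >= capacity) return false; // back-pressure on overflow
            return buffer.add(event);
        }
        synchronized String take() { return buffer.poll(); } // null when empty
    }

    // Source: pushes raw log lines into the channel.
    static void source(Channel ch, String... lines) {
        for (String line : lines) ch.put(line);
    }

    // Sink: drains the channel into a destination (here, a StringBuilder).
    static String sink(Channel ch) {
        StringBuilder dest = new StringBuilder();
        String event;
        while ((event = ch.take()) != null) dest.append(event).append('\n');
        return dest.toString();
    }

    public static void main(String[] args) {
        Channel ch = new Channel(100);
        source(ch, "login uid=1", "play av=2");
        System.out.print(sink(ch));
    }
}
```

The bounded channel is the key design point: it decouples the rates of producer and consumer, which is exactly the role the Kafka buffer plays between Lancer's gateway and collector layers.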

1) Gateway Layer – Lancer‑Gateway

The gateway supports multiple protocol sources (LogStreamSource, SysLogUdpSource, SysLogTcpSource, NetSource). It uses a Reactor‑based NIO thread model: an Acceptor thread from the main pool binds the ports and creates SocketChannels, registers them with Reactor threads for authentication, and then hands them to Sub threads for I/O processing, which decode the data and write it to a KafkaSink.
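The accept → authenticate → decode hand‑off can be sketched with async stages (names are illustrative, not the real Lancer classes; each stage stands in for a thread pool in the Reactor model):

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the gateway's Reactor-style hand-off: an accepted connection
// is authenticated on one async stage (the Reactor thread), then decoded
// on another (the Sub thread), mirroring accept -> auth -> I/O.
public class ReactorSketch {
    // Stand-in for an accepted SocketChannel.
    static class Conn {
        final int id;
        Conn(int id) { this.id = id; }
    }

    static CompletableFuture<String> handle(Conn c) {
        return CompletableFuture
            .supplyAsync(() -> authenticate(c))                 // Reactor stage: auth
            .thenApplyAsync(ok -> ok ? decode(c) : "rejected"); // Sub stage: decode
    }

    static boolean authenticate(Conn c) { return c.id >= 0; }
    static String decode(Conn c) { return "event-from-" + c.id; }

    public static void main(String[] args) {
        System.out.println(handle(new Conn(42)).join()); // prints event-from-42
    }
}
```

Separating acceptance, authentication, and I/O onto different thread pools keeps a slow or malicious client from stalling the accept loop, which is the main motivation for this model.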

Optimizations include upgrading from Netty 3 to Netty 4 for its improved thread model, and implementing a private LogStream protocol over TCP for higher transmission efficiency.

2) Distribution Layer – Lancer‑Collector

The collector is another Flume agent with a KafkaSource that pulls events from the buffer and routes them to different Channels and Sinks based on the target storage (HDFS, Kafka, MySQL, etc.). Physical isolation of channels prevents a slow sink from throttling the entire pipeline.

Key optimizations: physical isolation of different business data streams to avoid the "slowest link" bottleneck; load balancing across Channels; enlarging MemoryChannel capacity; and increasing the HdfsSink batch size to reduce flush frequency.
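The routing idea can be sketched in plain Java (not the Flume API; header and channel names are hypothetical): each event pulled from the Kafka buffer carries a header naming its target storage, and the selector appends it to the isolated channel for that target, so a slow HDFS flush cannot block events bound for Kafka.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of header-based channel selection: one isolated channel
// (event list) per target storage system.
public class RoutingSketch {
    final Map<String, List<String>> channels = new HashMap<>();

    // Route an event to the channel for its target storage.
    void route(String targetHeader, String body) {
        channels.computeIfAbsent(targetHeader, k -> new ArrayList<>()).add(body);
    }

    public static void main(String[] args) {
        RoutingSketch r = new RoutingSketch();
        r.route("hdfs", "play av=1");
        r.route("kafka", "login uid=2");
        r.route("hdfs", "play av=3");
        System.out.println(r.channels.get("hdfs").size()); // prints 2
    }
}
```

In real Flume this corresponds to a multiplexing channel selector keyed on an event header; the sketch keeps only the routing decision itself.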

Data Reliability Guarantees

SDKs (e.g., GoAgent) persist data locally before upload, preventing loss on network failure.
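A spool‑before‑send agent can be sketched as follows (a hypothetical design in Java for illustration; the actual GoAgent implementation differs): events are appended to a local file first, and the file is deleted only after the upload is acknowledged, so a network failure loses nothing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of the SDK-side safety net: persist locally, upload, delete on ack.
public class SpoolSketch {
    // Persist a batch of events to a local spool file before any network send.
    static Path spool(List<String> events) throws IOException {
        Path file = Files.createTempFile("lancer-spool", ".log");
        Files.write(file, events);
        return file;
    }

    // In the real agent this would be a network send; here it always acks.
    static boolean uploadAndAck(Path file) throws IOException {
        List<String> sent = Files.readAllLines(file);
        boolean acked = !sent.isEmpty();
        if (acked) Files.delete(file); // safe to delete only after the ack
        return acked;
    }

    public static void main(String[] args) throws IOException {
        Path f = spool(List.of("event-1", "event-2"));
        System.out.println(uploadAndAck(f)); // prints true
    }
}
```

On restart, an agent built this way simply re‑uploads any spool files still on disk, which is what makes the scheme crash‑safe.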

Kafka as the buffering layer ensures no data loss if downstream consumers fail.

Flume’s transactional Channel design prevents data loss: each transaction uses separate take/put queues and commits atomically, so a failed delivery is rolled back and retried (at‑least‑once semantics; duplicates are possible, loss is not).
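The transactional take can be modeled in a few lines of plain Java (a minimal sketch, not the real Flume classes): taken events are parked in a per‑transaction queue, returned to the head of the channel on rollback, and discarded only on commit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal model of a Flume-style transactional channel: a failed sink
// write is rolled back and retried instead of losing data.
public class TxChannel {
    private final Deque<String> channel = new ArrayDeque<>();
    private final Deque<String> takeList = new ArrayDeque<>();

    void put(String event) { channel.addLast(event); }

    String take() {
        String event = channel.pollFirst();
        if (event != null) takeList.addLast(event); // parked until commit/rollback
        return event;
    }

    void commit() { takeList.clear(); }  // delivery confirmed: drop parked events

    void rollback() {                    // delivery failed: restore original order
        while (!takeList.isEmpty()) channel.addFirst(takeList.pollLast());
    }

    public static void main(String[] args) {
        TxChannel ch = new TxChannel();
        ch.put("a"); ch.put("b");
        ch.take();       // sink reads "a" ...
        ch.rollback();   // ... but the write fails, so "a" goes back
        System.out.println(ch.take()); // prints a
    }
}
```

Because a rolled‑back event is re‑taken later, the same event can reach the sink twice; that is why this design yields at‑least‑once rather than exactly‑once delivery.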

Data Quality Assurance

Fine‑grained monitoring of loss rate, latency, volume, and processing time, visualized in a dedicated dashboard.

Alerting mechanisms notify developers of anomalies for rapid response.

Daily statistics and trend reports (day‑over‑day, month‑over‑month) for business data.

Real‑time data sample inspection for business teams.

[Figure: monitoring subsystem architecture]

Future Development

To date, Lancer processes over 400 billion events per day (≈20 TB, 3 million events per second) across more than 200 data‑collection tasks. Future work includes further system optimization, a graphical log‑management console, and open‑source contributions to data‑integration solutions.

Tags: distributed systems, big data, data pipeline, kafka, log collection, flume
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
