Design and Implementation of Vivo's Bees Log Collection Agent
This article presents the design principles, core techniques, and practical solutions of Vivo's self‑developed Bees log collection agent, covering file discovery, unique identification, real‑time and offline ingestion, checkpointing, resource control, platform management, and a comparison with open‑source alternatives.
In the construction of enterprise big‑data systems, data acquisition is the first and most critical step; existing open‑source collectors often cannot meet large‑scale, governed ingestion needs, prompting many companies to develop custom agents. This article shares Vivo's experience designing the Bees log collection agent.
Overview
Data acquisition gathers logs from various sources (application, server, database, IoT) into big‑data storage. Log files are the most common source, and the article outlines a typical architecture consisting of a log‑collection agent, transport/storage (Kafka, HDFS), and a management platform.
Key Features & Capabilities
Real‑time and offline log collection
Non‑intrusive file monitoring
Custom filtering, matching, formatting, and rate‑limiting
Second‑level latency, checkpoint‑based resume, and visual task management
Rich monitoring metrics and low resource consumption
Design Principles
Simplicity and elegance
Robustness and stability
Key Design Details
4.1 Log file discovery & listening
Static file lists are insufficient for rotating logs; Bees uses directory paths with wildcard or regex patterns. For efficient change detection, it combines Linux inotify with a fallback polling mechanism. The Java implementation uses WatchService:
/**
 * Subscribe to change events on a file or directory.
 */
public synchronized BeesWatchKey watchDir(File dir, WatchEvent.Kind<?>... watchEvents) throws IOException {
    if (!dir.exists() || dir.isFile()) {
        throw new IllegalArgumentException("watchDir requires an existing directory, param: " + dir);
    }
    Path path = dir.toPath().toAbsolutePath();
    BeesWatchKey beesWatchKey = registeredDirs.get(path);
    if (beesWatchKey == null) {
        beesWatchKey = new BeesWatchKey(subscriber, dir, this, watchEvents);
        registeredDirs.put(path, beesWatchKey);
        logger.info("successfully watch dir: {}", dir);
    }
    return beesWatchKey;
}
public synchronized BeesWatchKey watchDir(File dir) throws IOException {
    WatchEvent.Kind<?>[] events = {
        StandardWatchEventKinds.ENTRY_CREATE,
        StandardWatchEventKinds.ENTRY_DELETE,
        StandardWatchEventKinds.ENTRY_MODIFY
    };
    return watchDir(dir, events);
}

4.2 Unique file identification
File names are unreliable for uniqueness because logs are renamed and rotated; Bees combines the inode number with a hash of the file's first 128 bytes to form a stable identifier. Obtaining the inode:
ls -i access.log
62651787 access.log

Signature generation:
// SIGN_SIZE = 128: hash the first 128 bytes of the file
public static String signFile(File file) throws IOException {
    String sign = null;
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        if (raf.length() >= SIGN_SIZE) {
            byte[] tbyte = new byte[SIGN_SIZE];
            raf.seek(0);
            raf.readFully(tbyte);
            sign = Hashing.sha256().hashBytes(tbyte).toString();
        }
    }
    return sign;
}

4.3 Log content reading
To read continuously appended logs, Bees uses RandomAccessFile with a movable pointer, enabling efficient tail‑reading and checkpointing:
RandomAccessFile raf = new RandomAccessFile(file, "r");
byte[] buffer;

private void readFile() throws IOException {
    long remaining = raf.length() - raf.getFilePointer();
    if (remaining < BUFFER_SIZE) {
        buffer = new byte[(int) remaining];
    } else {
        buffer = new byte[BUFFER_SIZE];
    }
    raf.readFully(buffer);
}

4.4 Checkpoint & resume
Agents record the current file pointer after each successful Kafka transmission and persist it to a local JSON file (example below) every few seconds. On restart, the agent loads the checkpoint and seeks to the saved position. This yields at-least-once delivery: no data is lost, and any duplication is confined to the short window since the last persisted checkpoint.
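The restart-recovery step can be sketched as follows. `CheckPoint` is a hypothetical record whose fields mirror the persisted checkpoint format; the path and offsets are illustrative, not Bees' actual classes:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResumeDemo {
    // Hypothetical checkpoint record mirroring the JSON fields (file, inode, pos, sign)
    record CheckPoint(String file, long inode, long pos, String sign) {}

    // Resume reading from the offset saved in the checkpoint
    static String readFrom(CheckPoint cp) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(cp.file(), "r")) {
            raf.seek(cp.pos());  // skip everything already shipped to Kafka
            byte[] rest = new byte[(int) (raf.length() - cp.pos())];
            raf.readFully(rest);
            return new String(rest, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("bees-agent", ".log");
        Files.writeString(log, "already-sent\nnew-line\n");
        // Pretend the agent checkpointed right after "already-sent\n" (13 bytes)
        CheckPoint cp = new CheckPoint(log.toString(), 0L, 13L, "demo-sign");
        System.out.println(readFrom(cp));  // only the unsent tail is re-read
    }
}
```

Seeking past the persisted `pos` is what prevents re-sending data that was already acknowledged downstream.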
[
  {
    "file": "/home/sample/logs/bees-agent.log",
    "inode": 2235528,
    "pos": 621,
    "sign": "cb8730c1d4a71adc4e5b48931db528e30a5b5c1e99a900ee13e1fe5f935664f1"
  }
]

4.5 Real‑time data sending
Data is sent directly to Kafka via a Netty‑based RPC client, with an intermediate bees‑bus component that aggregates traffic, provides cross‑region failover, and buffers spikes.
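The ordering of send and checkpoint can be sketched with an in-memory queue standing in for the Netty RPC client and the bees‑bus buffer (both stand-ins here, not the real components):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class SendDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the bees-bus buffer between agent and Kafka (illustrative only)
        BlockingQueue<String> bus = new ArrayBlockingQueue<>(1024);
        AtomicLong checkpoint = new AtomicLong(0);  // byte offset acknowledged so far

        String[] lines = {"line-1\n", "line-2\n"};
        for (String line : lines) {
            bus.put(line);  // blocks when the bus is full, giving natural backpressure
        }
        // Downstream drains the bus; the checkpoint advances only after each ack
        while (!bus.isEmpty()) {
            String sent = bus.take();
            checkpoint.addAndGet(sent.length());  // advance only after a successful send
        }
        System.out.println("checkpoint=" + checkpoint.get());
    }
}
```

The key invariant, per the checkpointing section above, is that the offset is advanced only after the downstream acknowledges the data.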
4.6 Offline collection
For batch analytics, Bees writes completed hourly logs to HDFS using FSDataOutputStream. A rate‑limiting mechanism smooths the burst at the top of each hour to protect network bandwidth and disk I/O.
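The smoothing idea can be sketched as a simple average-rate throttle; the 3 MB/s cap and the function itself are illustrative, not Bees' actual limiter:

```java
public class ThrottleDemo {
    static final long BYTES_PER_SEC = 3L * 1024 * 1024;  // assumed 3 MB/s upload cap

    // Given total bytes sent and ms elapsed since the window opened,
    // return how long to pause so the average rate stays under the cap.
    static long pauseMillis(long totalBytes, long elapsedMs) {
        long earliestMs = totalBytes * 1000 / BYTES_PER_SEC;  // when this count is "due"
        return Math.max(0, earliestMs - elapsedMs);
    }

    public static void main(String[] args) {
        // 6 MB sent in the first second: 1 s ahead of a 3 MB/s budget -> pause 1000 ms
        System.out.println(pauseMillis(6L * 1024 * 1024, 1000));
        // 1 MB sent in 2 s: under budget -> no pause
        System.out.println(pauseMillis(1L * 1024 * 1024, 2000));
    }
}
```

Spreading the hourly burst over time this way keeps disk and network utilization flat instead of spiking at the top of each hour.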
4.7 Log cleanup strategy
A shell script deletes logs older than six hours. Even after unlinking, the agent can continue reading an open file handle until the data is fully consumed.
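A cleanup of this kind can be expressed as a small find-based script; the directory, file pattern, and retention are placeholders, not Bees' actual configuration:

```shell
#!/bin/sh
# Delete rotated logs older than six hours (360 minutes).
# Directory and pattern are illustrative placeholders.
cleanup_logs() {
    find "$1" -name '*.log' -mmin +360 -delete
}

# Example invocation:
# cleanup_logs /home/sample/logs
```

Because the agent keeps its own file descriptor open, an unlinked log remains readable until the agent finishes consuming it, as noted above.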
4.8 Resource consumption & control
CPU isolation via taskset
JVM heap limits (default 512 MB, minimum 64 MB)
Disk I/O throttling (e.g., 3‑5 MB/s) and monitoring
Network bandwidth monitoring and adaptive rate control
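As a concrete illustration of the first two controls, an agent launch might look like the following; the core index, heap sizes, and jar name are placeholders, not Bees' actual launch command:

```shell
# Pin the agent to CPU core 0 and bound the JVM heap between 64 MB and 512 MB.
# Core index, heap sizes, and jar name are illustrative.
taskset -c 0 java -Xms64m -Xmx512m -jar bees-agent.jar
```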
4.9 Self‑monitoring
Agents also collect their own logs, sending them through the same pipeline to Kafka → Elasticsearch → Kibana for unified visibility.
4.10 Platform management
A centralized UI provides version view, heartbeat, task dispatch, start/stop, and rate‑limit control via periodic HTTP heartbeats.
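A heartbeat payload of the kind described might be assembled as below; the field names and values are hypothetical, not Bees' real protocol:

```java
import java.time.Instant;

public class HeartbeatDemo {
    // Hypothetical heartbeat payload; field names are illustrative only
    static String heartbeat(String agentId, String version) {
        return "{\"agent\":\"" + agentId + "\",\"version\":\"" + version
             + "\",\"ts\":" + Instant.now().toEpochMilli() + "}";
    }

    public static void main(String[] args) {
        // In production this JSON would be POSTed to the platform every few seconds
        System.out.println(heartbeat("host-01", "1.0.0"));
    }
}
```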
Comparison with Open‑Source Agents
Much lower memory footprint (no Flume‑style Channel component; JVM heap as low as 64 MB)
Sub‑second latency using inotify vs. polling in Flume
Accurate file identity via inode + signature
Thread‑level isolation per topic
Graceful shutdown without data loss
Richer metrics (rate, progress, JVM stats, GC count)
Extensible filtering, formatting, and platform integration
Conclusion
The Bees agent has been in production since 2019, serving tens of thousands of instances and ingesting petabytes of logs daily. Its design—covering discovery, unique identification, real‑time/offline ingestion, checkpointing, resource control, and platform management—demonstrates a robust, scalable solution for enterprise‑level log collection.