Design and Implementation of Vivo's Bees Log Collection Agent
This article presents the design principles, core features, and implementation details of Vivo's self‑developed Bees log collection agent, covering file discovery, unique identification, real‑time and offline ingestion, resource control, platform management, and comparisons with open‑source solutions.
In enterprise big‑data systems, data collection is the first and most critical step. Traditional open‑source collectors often cannot meet the requirements of large‑scale, governed environments, which prompted Vivo to develop its own log collection service, Bees.
Key Features include real‑time and offline log file collection, non‑intrusive file monitoring, custom filtering, rate limiting, collection latency on the order of seconds, breakpoint‑resume after restarts, centralized task management, rich metrics, and low resource overhead.
Design Principles are simplicity, elegance, robustness, and stability.
File Discovery & Listening: Bees combines Linux inotify events (via java.nio.file.WatchService) with a fallback polling mechanism to detect new log files matching wildcard patterns, avoiding the latency and CPU waste of pure polling.
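The combination of event-driven discovery and a polling fallback can be sketched as follows. This is a minimal illustration, not the Bees implementation: the class name LogFileDiscovery, its method names, and the decision to rescan on a timeout or on an OVERFLOW event are all assumptions made for the example. Only the java.nio.file calls are real JDK APIs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: watches one directory for new files matching a glob.
final class LogFileDiscovery {
    private final Path dir;
    private final String glob;
    private final Set<Path> known = new HashSet<>(); // files already reported

    LogFileDiscovery(Path dir, String glob) {
        this.dir = dir;
        this.glob = glob;
    }

    /** True if fileName matches the wildcard pattern, e.g. "app-*.log". */
    static boolean matches(String glob, String fileName) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return m.matches(Paths.get(fileName));
    }

    /** Registers the directory for creation events. */
    WatchService open() throws IOException {
        WatchService ws = dir.getFileSystem().newWatchService();
        dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE);
        return ws;
    }

    /** Polling fallback: lists the directory and returns matching files not seen before. */
    List<Path> scan() throws IOException {
        List<Path> fresh = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (matches(glob, p.getFileName().toString()) && known.add(p)) {
                    fresh.add(p);
                }
            }
        }
        return fresh;
    }

    /** Event path: waits for creation events, falling back to scan() when needed. */
    List<Path> awaitNew(WatchService ws, long timeoutMs)
            throws IOException, InterruptedException {
        WatchKey key = ws.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (key == null) {
            return scan(); // no events within the timeout: do one polling pass
        }
        List<Path> fresh = new ArrayList<>();
        boolean overflow = false;
        for (WatchEvent<?> ev : key.pollEvents()) {
            if (ev.kind() == StandardWatchEventKinds.OVERFLOW) {
                overflow = true; // events were dropped, so a rescan is required
                continue;
            }
            Path name = (Path) ev.context(); // path relative to the watched directory
            Path full = dir.resolve(name);
            if (matches(glob, name.toString()) && known.add(full)) {
                fresh.add(full);
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        if (overflow) {
            fresh.addAll(scan());
        }
        return fresh;
    }
}
```

A caller would create the object, call open() once, and then call awaitNew in a loop. The timeout in that loop doubles as the polling interval, so files missed by the event path are still picked up within one interval.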
Unique File Identification: Inode numbers are combined with a SHA‑256 hash of the first 128 bytes of the file to form a unique identifier, preventing duplicate or missed collections even when file names are reused.
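A rough sketch of such an identifier is shown below. The class name FileIdentity, the "inode-hash" string layout, and the use of the "unix:ino" file attribute are assumptions for illustration. That attribute is available on Unix-like JDK builds but is not guaranteed on every platform.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: builds an identifier from the inode and the file head.
final class FileIdentity {
    static final int HEAD_BYTES = 128; // prefix length named in the text

    /** Hex-encoded SHA-256 of the first len bytes of data. */
    static String sha256Hex(byte[] data, int len) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(data, 0, len);
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Returns "<inode>-<sha256 of up to the first 128 bytes>". */
    static String identify(Path file) throws IOException {
        // Assumption: a Unix-like file system exposing the "unix:ino" attribute.
        long inode = ((Number) Files.getAttribute(file, "unix:ino")).longValue();
        byte[] head = new byte[HEAD_BYTES];
        int n;
        try (InputStream in = Files.newInputStream(file)) {
            n = in.readNBytes(head, 0, HEAD_BYTES); // Java 9+; may read fewer bytes
        }
        return inode + "-" + sha256Hex(head, n);
    }
}
```

One design consequence worth noting: a file shorter than 128 bytes hashes differently once it grows, so a real agent would need a rule for when the signature becomes stable. The source does not say how Bees handles this.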
Log Reading: Real‑time log lines are read using RandomAccessFile, which allows the pointer to resume from the last processed offset after restarts or failures.
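Offset-based reading with RandomAccessFile might look like the sketch below. The names OffsetReader, Batch, and readFrom are hypothetical, and the rule of only consuming newline-terminated lines is an assumption chosen so that a half-written line is never emitted.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads complete lines starting at a saved offset.
final class OffsetReader {
    /** Result of one read pass: complete lines plus the offset to resume from. */
    static final class Batch {
        final List<String> lines = new ArrayList<>();
        long nextOffset;
    }

    /** Seeks to offset, reads up to maxBytes, and returns the complete lines found. */
    static Batch readFrom(String path, long offset, int maxBytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            byte[] buf = new byte[maxBytes];
            int n = raf.read(buf); // returns -1 when offset is at end of file
            return splitCompleteLines(buf, Math.max(n, 0), offset);
        }
    }

    /**
     * Splits buf[0..len) on '\n'. Bytes after the last newline are left
     * unconsumed, so nextOffset always points at the start of a line.
     * Caveat: a single line longer than the buffer never completes here.
     */
    static Batch splitCompleteLines(byte[] buf, int len, long baseOffset) {
        Batch batch = new Batch();
        batch.nextOffset = baseOffset;
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == '\n') {
                batch.lines.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
                batch.nextOffset = baseOffset + start;
            }
        }
        return batch;
    }
}
```

After a restart, the caller passes the last saved nextOffset back into readFrom and continues from the first unprocessed line.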
Breakpoint‑Resume: The agent records the current read position and file signature to a local JSON checkpoint file every few seconds; on restart it restores the pointer to continue without data loss.
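A checkpoint of this kind could be persisted roughly as follows. The field names fileId and offset, the class name Checkpoint, and the write-to-temp-then-rename step are assumptions; the source only states that position and signature are written to a local JSON file. The hand-rolled JSON handling assumes fileId contains no quotes or backslashes, and a real agent would more likely use a JSON library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checkpoint record: which file, and how far it has been read.
final class Checkpoint {
    private static final Pattern FORMAT =
            Pattern.compile("\\{\"fileId\":\"([^\"]*)\",\"offset\":(\\d+)\\}");

    final String fileId; // e.g. inode plus signature; assumed free of quotes
    final long offset;   // next byte to read

    Checkpoint(String fileId, long offset) {
        this.fileId = fileId;
        this.offset = offset;
    }

    String toJson() {
        return "{\"fileId\":\"" + fileId + "\",\"offset\":" + offset + "}";
    }

    static Checkpoint fromJson(String json) {
        Matcher m = FORMAT.matcher(json.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized checkpoint: " + json);
        }
        return new Checkpoint(m.group(1), Long.parseLong(m.group(2)));
    }

    /** Writes to a sibling temp file, then renames it over the target. */
    void save(Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, toJson().getBytes(StandardCharsets.UTF_8));
        // With ATOMIC_MOVE, replacing an existing target is implementation
        // specific; on typical Unix file systems the rename replaces it.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    static Checkpoint load(Path target) throws IOException {
        return fromJson(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

Renaming a fully written temp file means a crash during save leaves either the old checkpoint or the new one, never a truncated file. Because the checkpoint is written only every few seconds, a restart may re-read lines collected since the last save unless the downstream handles duplicates.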
Data Transmission: Collected logs are sent directly to Kafka via a Netty‑based client, optionally passing through a Bees‑bus component for aggregation, load balancing, and cross‑region failover.
Offline Collection: For batch ingestion, the agent writes log files to HDFS using FSDataOutputStream, with rate‑limiting to avoid network spikes during peak hours.
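The source does not say which rate-limiting algorithm Bees uses. One common choice is a token bucket, sketched here under that assumption. The class name TokenBucket and its parameters are hypothetical, and the clock is passed in explicitly so the behavior is easy to reason about.

```java
// Hypothetical byte-rate limiter using the token-bucket technique.
final class TokenBucket {
    private final long bytesPerSecond; // sustained rate
    private final long capacity;       // maximum burst size in bytes
    private double tokens;             // bytes currently available
    private long lastNanos;            // time of the last refill

    TokenBucket(long bytesPerSecond, long capacity, long nowNanos) {
        this.bytesPerSecond = bytesPerSecond;
        this.capacity = capacity;
        this.tokens = capacity; // start full so the first burst is allowed
        this.lastNanos = nowNanos;
    }

    /** Returns true and deducts tokens if `bytes` may be sent at nowNanos. */
    synchronized boolean tryAcquire(long bytes, long nowNanos) {
        double refill = (nowNanos - lastNanos) / 1e9 * bytesPerSecond;
        tokens = Math.min(capacity, tokens + refill);
        lastNanos = nowNanos;
        if (tokens >= bytes) {
            tokens -= bytes;
            return true;
        }
        return false; // caller should wait and retry before writing
    }
}
```

In an upload loop, the writer would call tryAcquire(chunk.length, System.nanoTime()) before each write to the output stream and sleep briefly whenever it returns false.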
Resource Management: CPU affinity (via taskset), JVM heap tuning, disk I/O throttling, and network bandwidth monitoring ensure the agent coexists peacefully with business workloads.
Self‑Monitoring: The agent also collects its own log4j output, forwarding it through the same pipeline to Elasticsearch/Kibana for visibility.
Platform Management: A centralized web console provides heartbeat monitoring, task deployment, start/stop control, and rate‑limit configuration for tens of thousands of agent instances.
Comparison with Open‑Source Agents: Compared with Flume, Bees uses less memory (no channel, with a JVM as small as 64 MB), achieves lower latency (inotify instead of polling), identifies files precisely (inode + signature), isolates threads per topic, shuts down gracefully, exposes richer metrics, and supports extensive customization.
Overall, the Bees agent has been in production since 2019, serving millions of log files and petabytes of data daily, demonstrating the effectiveness of its design in large‑scale big‑data environments.
High Availability Architecture
Official account for High Availability Architecture.