Design and Evaluation of Log Collection Agents: Flume vs Filebeat
This article analyses the shortcomings of traditional log-collection agents, compares Flume and Filebeat against four criteria (low cost, stability, efficiency and a light footprint), and presents practical solutions for file discovery, offset tracking, multi-line handling and performance tuning in modern logging pipelines.
Log collection agents are often a black box for users of logging platforms, with hidden behaviours and design flaws that can affect reliability and performance.
Background
Log agents run on host servers to continuously gather log data and forward it downstream. Because they are the sole source of data for the logging platform, any failure can impact alerts, queries, or even the host applications themselves.
Agent Selection Criteria
The priorities for an agent, from high to low, are: low resource consumption > stability > efficiency > lightweight. These principles guided the evolution of the agent solution.
Initial Solution: Flume
The first version used Apache Flume as the agent due to MVP considerations: quick rollout, reuse of an existing Flume‑based system, and compatibility with historical pipelines (e.g., writing to a Beijing Kafka cluster).
Switching to Filebeat
After benchmarking, Filebeat was preferred because it met the low‑cost, stable, and efficient requirements while offering better extensibility for file‑based log sources.
In the performance test, Filebeat used less than 3% of memory and its CPU usage peaked below 70%, whereas Flume averaged 145% CPU, which confirmed Filebeat's advantage.
How Agents Discover Log Files
Three common discovery methods are used:
User‑provided configuration (simple but cannot handle rotating logs).
Regular‑expression matching (flexible but potentially CPU‑intensive).
Placeholder pattern matching (efficient for predictable naming schemes).
The platform currently uses placeholder matching, with legacy regex support slated for removal.
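As a rough illustration of placeholder matching, the sketch below expands date placeholders into the one concrete file name expected for a given day, so the agent can check a single path instead of running a regular expression over every directory entry. The `{yyyy}`, `{MM}` and `{dd}` tokens and the `expand_placeholders` helper are assumptions made for this example, not the platform's actual syntax.

```python
from datetime import date

# Hypothetical placeholder tokens mapped to strftime directives.
PLACEHOLDERS = {
    "{yyyy}": "%Y",
    "{MM}": "%m",
    "{dd}": "%d",
}


def expand_placeholders(pattern: str, day: date) -> str:
    """Return the concrete path that a placeholder pattern resolves to on `day`."""
    for token, directive in PLACEHOLDERS.items():
        pattern = pattern.replace(token, day.strftime(directive))
    return pattern


# Prints /var/log/app/access.2024-03-07.log
print(expand_placeholders("/var/log/app/access.{yyyy}-{MM}-{dd}.log", date(2024, 3, 7)))
```

Because the result is a plain string, checking for today's file is a single existence test rather than a directory-wide pattern match, which is where the efficiency gain over regex matching comes from.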
Detecting New Files
Polling directories is simple, but a long interval detects new files too slowly and a short interval burns CPU. Filebeat improves this by leveraging OS-level notifications:
+ Linux: inotify
+ macOS: FSEvents
+ Windows: ReadDirectoryChangesW
These notifications are supplemented by a longer-interval poll to work around kernel bugs, achieving low cost and high efficiency.
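The fallback poll can be sketched as a snapshot comparison: list the directory, diff it against the previous listing, and report whatever appeared in between. This is a minimal sketch of the polling half only; the notification half needs platform-specific bindings and is not shown. The function names `scan` and `poll_once` are hypothetical, and this is not a description of how Filebeat implements it.

```python
import os
from typing import Set, Tuple


def scan(directory: str) -> Set[str]:
    """Return the paths of all regular files currently in `directory`."""
    with os.scandir(directory) as entries:
        return {entry.path for entry in entries if entry.is_file()}


def poll_once(directory: str, known: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Run one poll cycle.

    Returns (files that appeared since the previous cycle, new snapshot).
    The caller passes the returned snapshot back in on the next cycle.
    """
    current = scan(directory)
    return current - known, current
```

An agent would call `poll_once` on a long timer (for example every few tens of seconds) purely as a safety net, so that a missed notification delays a file rather than losing it.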
Identifying Files Uniquely
Using only file paths is unreliable because files can be renamed. A more robust identifier combines device ID and inode. Flume adds the MD5 of the first line, while Filebeat relies solely on device+inode for faster checks, accepting a small risk of misidentification.
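A minimal sketch of the device+inode identifier, using only the standard `os.stat` call. The `FileId` type and `file_id` helper are names introduced for this example.

```python
import os
from typing import NamedTuple


class FileId(NamedTuple):
    """Identity of a file that survives renames within one filesystem."""
    device: int
    inode: int


def file_id(path: str) -> FileId:
    """Build the identifier from the device ID and inode reported by stat."""
    st = os.stat(path)
    return FileId(st.st_dev, st.st_ino)
```

Renaming `app.log` to `app.log.1` leaves `file_id` unchanged, which is exactly what log rotation needs. The residual risk mentioned above comes from inode reuse: after a file is deleted, the filesystem may hand the same inode to a new file, and a content fingerprint such as a hash of the first line is one way to tell the two apart.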
Tracking Collection Offsets
A checkpoint file records each log file's path (or identifier) and the last read offset, enabling recovery after crashes. However, if the checkpoint is keyed by path and the file is renamed, the agent can no longer match the file to its saved offset, which is why a more stable identifier is needed.
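The sketch below shows one way such a checkpoint could look: a JSON file keyed by a `device:inode` string, written atomically so that a crash mid-write cannot leave a half-written checkpoint. The file layout and the helper names (`identity_key`, `load_checkpoint`, `save_checkpoint`) are assumptions for illustration, not the format used by Flume or Filebeat.

```python
import json
import os
import tempfile


def identity_key(log_path: str) -> str:
    """Key a log file by device and inode so the entry survives renames."""
    st = os.stat(log_path)
    return f"{st.st_dev}:{st.st_ino}"


def load_checkpoint(path: str) -> dict:
    """Load saved offsets; a missing checkpoint means nothing was read yet."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temporary file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
```

A typical entry would be `{identity_key(log): {"path": log, "offset": 6}}`. The path is stored only for diagnostics; lookups after a restart go through the key, so a rotated file resumes from its saved offset.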
Detecting Log Completion
Agents consider a file finished when EOF is reached, but if new data arrives after EOF, the agent must continue reading. Both Flume and Filebeat sort files by modification time and poll their status. Filebeat additionally compares the file's stat times before opening it, so unchanged files are skipped without being opened.
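The "stat first, open only if something changed" idea can be sketched as follows. This example compares file size against the saved offset rather than comparing timestamps, which is a simplification chosen for the sketch; the truncation handling is likewise an assumption and not taken from either agent. `read_new_data` is a hypothetical helper.

```python
import os
from typing import Tuple


def read_new_data(path: str, offset: int) -> Tuple[bytes, int]:
    """Return (bytes appended since `offset`, new offset).

    The file is only opened when stat shows there is something to read.
    """
    size = os.stat(path).st_size
    if size < offset:
        # Assumption: a shrinking file was truncated, so start over.
        offset = 0
    if size == offset:
        # Still at EOF: nothing new, and the file is never opened.
        return b"", offset
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data, offset + len(data)
```

Calling this in a loop treats EOF as "finished for now" rather than "finished for good": each cycle costs one `stat` per idle file, and reading resumes as soon as the writer appends more data.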
Handling Multi‑Line Logs
Single‑line collection fails for stack traces, JSON, or SQL statements. Flume uses a custom plugin that treats lines not starting with a configured prefix as a continuation. Filebeat offers a similar feature with the negate option and additional settings for maximum lines and timeout to avoid memory issues.
Both agents face challenges with the final line of a multi‑line event: Flume may send incomplete data at EOF, while Filebeat may hold the file handle indefinitely, potentially never emitting the last line.
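The prefix-based grouping and the last-line problem can both be seen in a small sketch. Lines that match a "start of event" pattern open a new event, everything else is appended to the current one, and a line cap bounds memory. This is an illustration in Python, not either agent's code; `group_multiline` and its parameters are hypothetical, although the behaviour loosely mirrors Filebeat's `multiline.pattern`, `multiline.negate`, `multiline.match` and `multiline.max_lines` settings.

```python
import re
from typing import Iterable, Iterator, List


def group_multiline(
    lines: Iterable[str], start_pattern: str, max_lines: int = 500
) -> Iterator[str]:
    """Group raw lines into events.

    A line matching `start_pattern` begins a new event; any other line is
    a continuation of the current event. Lines beyond `max_lines` in one
    event are dropped to bound memory use.
    """
    start = re.compile(start_pattern)
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if start.match(line) and buffer:
            # The next event has started, so the buffered one is complete.
            yield "\n".join(buffer)
            buffer = []
        if len(buffer) < max_lines:
            buffer.append(line)
    if buffer:
        # End of input: the last event is emitted without a following
        # start line to confirm that it is complete.
        yield "\n".join(buffer)


log = [
    "2024-01-01 ERROR boom",
    "  at a.b",
    "  at c.d",
    "2024-01-01 INFO ok",
]
for event in group_multiline(log, r"\d{4}-\d{2}-\d{2}"):
    print(repr(event))
```

The final `if buffer` branch is where the trade-off lives. An event is only known to be complete when the next start line arrives, so a live agent must either flush at EOF and risk sending a partial event, or wait and risk holding the last event back. A flush timeout, such as Filebeat's `multiline.timeout`, is the usual compromise between the two.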
Overall, the article highlights the trade‑offs in agent design, emphasizing low‑cost, stable, and efficient solutions, and demonstrates why Filebeat is often a better fit for modern log collection needs.
Architect