
Design and Evaluation of Log Collection Agents: Flume vs Filebeat

This article analyses the shortcomings of traditional log‑collection agents, compares Flume and Filebeat against four criteria (low cost, stability, efficiency and light weight), and presents practical solutions for file discovery, offset tracking, multi‑line handling and performance tuning in modern logging pipelines.

Architect

Dec 23, 2020

Log collection agents are often a black box for users of logging platforms, with hidden behaviours and design flaws that can affect reliability and performance.

Background

Log agents run on host servers to continuously gather log data and forward it downstream. Because they are the sole source of data for the logging platform, any failure can impact alerts, queries, or even the host applications themselves.

Agent Selection Criteria

The priorities for an agent, from high to low, are: low resource consumption > stability > efficiency > lightweight. These principles guided the evolution of the agent solution.

Initial Solution: Flume

The first version used Apache Flume as the agent due to MVP considerations: quick rollout, reuse of an existing Flume‑based system, and compatibility with historical pipelines (e.g., writing to a Beijing Kafka cluster).

Switching to Filebeat

After benchmarking, Filebeat was preferred because it met the low‑cost, stable, and efficient requirements while offering better extensibility for file‑based log sources.

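In performance tests, Filebeat used less than 3% of memory and its CPU usage peaked below 70%, whereas Flume averaged 145% CPU. Assuming the usual per‑core convention, a figure above 100% means Flume was using more than one full core, which confirms Filebeat's advantage.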

How Agents Discover Log Files

Three common discovery methods are used:

User‑provided configuration (simple but cannot handle rotating logs).

Regular‑expression matching (flexible but potentially CPU‑intensive).

Placeholder pattern matching (efficient for predictable naming schemes).

The platform currently uses placeholder matching, with legacy regex support slated for removal.
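Placeholder matching can be illustrated with a minimal sketch. The `${yyyy}`, `${MM}` and `${dd}` tokens, the function names and the one‑day lookback are assumptions made for illustration; the article does not specify the platform's actual placeholder syntax.

```python
from datetime import date, timedelta
from typing import List

# Hypothetical placeholder tokens mapped to strftime directives.
# The real syntax used by the platform is not given in the article.
PLACEHOLDERS = {
    "${yyyy}": "%Y",
    "${MM}": "%m",
    "${dd}": "%d",
}


def expand_pattern(pattern: str, day: date) -> str:
    """Replace each placeholder with the matching component of `day`."""
    for token, fmt in PLACEHOLDERS.items():
        pattern = pattern.replace(token, day.strftime(fmt))
    return pattern


def candidate_paths(pattern: str, today: date, lookback_days: int = 1) -> List[str]:
    """Return the concrete paths for today and the previous `lookback_days` days."""
    return [
        expand_pattern(pattern, today - timedelta(days=n))
        for n in range(lookback_days + 1)
    ]
```

Because the pattern expands to a handful of exact paths, the agent only has to check whether those paths exist instead of testing a regular expression against every entry in a directory.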

Detecting New Files

Polling directories is simple but either too slow or too CPU‑heavy. Filebeat improves this by leveraging OS‑level notifications:

Linux: inotify

macOS: FSEvents

Windows: ReadDirectoryChangesW

These notifications are supplemented by a longer‑interval poll to work around kernel bugs, achieving low cost and high efficiency.
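The fallback poll mentioned above can be sketched as a snapshot comparison. This is only the polling half; the event‑driven half depends on platform‑specific APIs and is not shown. The function names are hypothetical.

```python
import os
from typing import List, Set


def snapshot(directory: str) -> Set[str]:
    """Return the names of the regular files currently in `directory`."""
    with os.scandir(directory) as entries:
        return {entry.name for entry in entries if entry.is_file()}


def new_files(previous: Set[str], current: Set[str]) -> List[str]:
    """Return the names present in `current` but absent from `previous`."""
    return sorted(current - previous)
```

Run at a long interval, a comparison like this catches any file whose creation event was dropped, while the notifications handle the common case with low latency.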

Identifying Files Uniquely

Using only file paths is unreliable because files can be renamed. A more robust identifier combines device ID and inode. Flume adds the MD5 of the first line, while Filebeat relies solely on device+inode for faster checks, accepting a small risk of misidentification (for example, when an inode is reused after a file is deleted).
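A minimal sketch of both identifiers on a POSIX system, using only standard `os.stat` fields. The function names are hypothetical and this is not the code of either agent.

```python
import hashlib
import os
from typing import Tuple


def file_identity(path: str) -> Tuple[int, int]:
    """Return (device ID, inode), which stays the same when the file is renamed."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def first_line_md5(path: str) -> str:
    """Return the MD5 hex digest of the file's first line, as an extra check."""
    with open(path, "rb") as f:
        return hashlib.md5(f.readline()).hexdigest()
```

The device+inode pair costs a single `stat` call, whereas the first‑line hash requires opening and reading the file, which is the speed trade‑off described above.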

Tracking Collection Offsets

A checkpoint file records each log file’s path (or identifier) and the last read offset, enabling recovery after crashes. However, if the checkpoint is keyed by path and a file is renamed (for example, during rotation), the recorded offset can no longer be matched to the file, which is why a more stable identifier is needed.
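One way to sketch such a checkpoint is a JSON file that is replaced atomically on every save. The JSON layout, the `"dev:inode"` key format and the function names are assumptions for illustration, not the format used by Flume or Filebeat.

```python
import json
import os
import tempfile
from typing import Dict


def save_checkpoint(checkpoint_path: str, offsets: Dict[str, dict]) -> None:
    """Write the offsets to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(offsets, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before the swap
    os.replace(tmp_path, checkpoint_path)


def load_checkpoint(checkpoint_path: str) -> Dict[str, dict]:
    """Return the saved offsets, or an empty dict on first start."""
    try:
        with open(checkpoint_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

Writing to a temporary file and renaming it means a crash in the middle of a save leaves the previous checkpoint intact rather than a half‑written one.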

Detecting Log Completion

Agents consider a file finished when EOF is reached, but if new data arrives after EOF, the agent must continue reading. Both Flume and Filebeat sort files by modification time and poll their status, with Filebeat adding a stat‑time comparison before opening the file.
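The "check with stat, then read" idea can be sketched as follows. The function name is hypothetical, the sketch compares file size rather than modification time, and restarting from offset 0 after a truncation is an assumption rather than documented behaviour of either agent.

```python
import os
from typing import List, Tuple


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Return the complete lines written after `offset` and the new offset."""
    size = os.stat(path).st_size
    if size < offset:
        offset = 0  # assumption: the file was truncated, so start over
    if size == offset:
        return [], offset  # nothing new, so the file is never opened
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size - offset)
    end = data.rfind(b"\n") + 1  # stop after the last complete line
    lines = data[:end].split(b"\n")[:-1]
    return [line.decode("utf-8", errors="replace") for line in lines], offset + end
```

A trailing partial line is left unread and the offset is not advanced past it, so it is picked up on a later call once the writer finishes the line.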

Handling Multi‑Line Logs

Single‑line collection fails for stack traces, JSON, or SQL statements. Flume uses a custom plugin that treats lines not starting with a configured prefix as a continuation of the previous line. Filebeat offers a similar feature through its multiline settings (pattern, negate and match), plus max_lines and timeout limits that bound how much is buffered and so avoid memory issues.

Both agents face challenges with the final line of a multi‑line event: Flume may send incomplete data at EOF, while Filebeat may hold the file handle indefinitely, potentially never emitting the last line.
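The continuation rule and the line cap described above can be sketched as a small aggregator. The class and method names are hypothetical and the timeout is left out for brevity; this is an illustration of the technique, not either agent's implementation.

```python
import re
from typing import List, Optional


class MultilineAggregator:
    """Group lines into events; a line matching `start_pattern` begins a new event."""

    def __init__(self, start_pattern: str, max_lines: int = 500):
        self.start = re.compile(start_pattern)
        self.max_lines = max_lines  # extra lines beyond this cap are dropped
        self.buffer: List[str] = []

    def feed(self, line: str) -> Optional[str]:
        """Add one line; return the previous event if this line starts a new one."""
        completed = None
        if self.start.match(line) and self.buffer:
            completed = self.flush()
        if len(self.buffer) < self.max_lines:
            self.buffer.append(line)
        return completed

    def flush(self) -> Optional[str]:
        """Emit whatever is buffered, for example at shutdown or after a timeout."""
        if not self.buffer:
            return None
        event = "\n".join(self.buffer)
        self.buffer = []
        return event
```

The last event is only emitted when `flush` is called, which is exactly the final‑line problem: flushing at EOF risks sending an incomplete event, while waiting for the next start line may mean waiting forever. A timeout‑driven flush is the usual compromise.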

Overall, the article highlights the trade‑offs in agent design, emphasizing low‑cost, stable, and efficient solutions, and demonstrates why Filebeat is often a better fit for modern log collection needs.

