Introduction to Apache Flume: Architecture, Core Concepts, Configuration and Usage
This article provides a comprehensive overview of Apache Flume, covering its design goals, core components, deployment architecture, and configuration patterns, with step-by-step instructions for integrating Flume with ZooKeeper and Kafka to collect and forward large volumes of log data.
Flume is a distributed, reliable, and highly available system for massive log aggregation that allows users to define custom data sources, perform simple processing, and write events to various configurable sinks.
Design goals: (1) Reliability – three levels of guarantee (end‑to‑end, store‑on‑failure, best‑effort); (2) Scalability – three‑tier architecture (agent, collector, storage) with horizontal scaling and ZooKeeper‑managed masters; (3) Manageability – unified master control, web and shell management interfaces; (4) Extensibility – pluggable agents, collectors, and storages with many built‑in components.
Core concepts: Agent (a JVM process that may host multiple sources, channels, and sinks), Client (produces data and sends it to an agent), Source (receives data from clients or other agents), Sink (drains events from a channel and forwards them onward), Channel (buffers events between sources and sinks), and Event (the basic unit of data: a payload with optional headers).
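These components are wired together in a Java-properties configuration file. The sketch below shows the minimal wiring of one source, one channel, and one sink; the agent name `a1`, the `netcat` source, and the port are illustrative choices, not taken from this article:

```properties
# Name the components of agent a1 (illustrative names).
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: where events enter the agent.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: where events leave the agent (here, just logged).
a1.sinks.k1.type = logger

# Wire source and sink to the channel.
# Note: sources take a (plural) channels list; a sink binds to one channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

The asymmetry in the last two lines reflects Flume's fan-out model: a source can replicate events to several channels, but each sink drains exactly one.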
Typical deployment involves downloading a Flume release (e.g., 1.6.0) from flume.apache.org, extracting it, and preparing configuration files.
Common configuration patterns include:
Pattern 1 – tailing a specific file (typically an exec source running tail -F).
Pattern 2 – (original illustration omitted; details not recoverable).
Pattern 3 – scanning a directory for newly arrived files (the spooldir source).
For this tutorial the first pattern is used to integrate Flume with Kafka.
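Under the first pattern, the hw.conf referenced in the startup command later in this tutorial might look like the following sketch. The agent name `agent` and the topic `testKJ1` match the commands in this article; the log path, broker address, and capacity values are assumptions:

```properties
# hw.conf – sketch for Flume 1.6, assuming a Kafka broker on localhost:9092.
agent.sources = s1
agent.channels = c1
agent.sinks = k1

# Pattern 1: tail a specific file with an exec source.
agent.sources.s1.type = exec
agent.sources.s1.command = tail -F /path/to/abc.log
agent.sources.s1.channels = c1

# In-memory channel between the tail source and the Kafka sink.
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000

# Kafka sink (built into Flume 1.6) publishing to the tutorial's topic.
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.topic = testKJ1
agent.sinks.k1.brokerList = localhost:9092
agent.sinks.k1.channel = c1
```

Note that an exec source with tail -F offers no delivery guarantee if the agent dies mid-stream; for stricter reliability the spooldir source (pattern 3) or a file channel is the usual substitute.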
Before starting Flume, ensure ZooKeeper and Kafka are running:
./zkServer.sh start

Start Kafka (example command):

./kafka-server-start.sh -daemon ../config/server.properties

Consume messages with the default Kafka console consumer:

./kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic testKJ1

Generate test log data with a simple shell script (output.sh):
for ((i=0; i<=50000; i++)); do
  echo "test-$i" >> abc.log
done

Run the script, then start Flume with the following command:

./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

A final log line such as "Component type: SINK, name: k1 started" indicates the agent started successfully.
The complete workflow, end to end: output.sh appends test lines to abc.log, the Flume agent tails the file, and the Kafka sink publishes each line to the testKJ1 topic, where the console consumer prints it (the diagram from the original post is omitted here).
In summary, this guide introduces the basics of Flume, its architecture, configuration, and integration with Kafka, laying the groundwork for deeper source‑code analysis and advanced tuning.