
Introduction to Apache Flume: Architecture, Core Concepts, Configuration and Usage

This article provides a comprehensive overview of Apache Flume, covering its design goals, core components, deployment architecture, configuration patterns, and step‑by‑step instructions for integrating Flume with ZooKeeper and Kafka to collect and forward massive log data.


Flume is a distributed, reliable, and highly available system for massive log aggregation that allows users to define custom data sources, perform simple processing, and write events to various configurable sinks.

Design goals:

1. Reliability – three levels of guarantee: end‑to‑end, store‑on‑failure, and best‑effort.
2. Scalability – a three‑tier architecture (agent, collector, storage) with horizontal scaling and ZooKeeper‑managed masters.
3. Manageability – unified master control, with web and shell management interfaces.
4. Extensibility – pluggable agents, collectors, and storages, with many built‑in components.

Core concepts: Agent (runs on a JVM, may contain multiple sources and sinks), Client (produces data), Source (collects data from clients), Sink (consumes data from channels), Channel (queues events between sources and sinks), and Event (basic data payload).
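These concepts map directly onto Flume's properties‑style configuration. A minimal naming sketch (the agent and component names a1, r1, c1, k1 are illustrative, not from the original article):

```properties
# Declare the components owned by an agent named a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Wire them together: a source can write to several channels,
# but a sink drains exactly one channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note the asymmetry in the last two lines: sources take the plural `channels` property, sinks the singular `channel`, reflecting that fan‑out happens at the source side.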

Typical deployment involves downloading a Flume binary package (e.g., 1.6.0) from flume.apache.org, extracting it, and preparing configuration files.

Common configuration patterns include:

Pattern 1 – scanning a specific file.

Pattern 2 – (image omitted for brevity).

Pattern 3 – scanning a directory for new files.
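Pattern 3 is typically implemented with Flume's spooling‑directory source, which ingests each file dropped into a watched directory exactly once. A brief sketch (the directory path is an assumption):

```properties
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.channels = c1
# By default, fully ingested files are renamed with a .COMPLETED suffix
```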

For this tutorial the first pattern is used to integrate Flume with Kafka.
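Since the original configuration listing survives only as an omitted image, here is a sketch of what conf/hw.conf could look like for pattern 1 with Flume 1.6's built‑in Kafka sink. The agent name "agent" matches the -n flag used later; the log file path, channel capacity, and broker address are assumptions:

```properties
agent.sources = s1
agent.channels = c1
agent.sinks = k1

# Pattern 1: tail a single file via the exec source
agent.sources.s1.type = exec
agent.sources.s1.command = tail -F /home/hadoop/abc.log
agent.sources.s1.channels = c1

# Buffer events in memory between source and sink
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000

# Forward events to the testKJ1 Kafka topic
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.topic = testKJ1
agent.sinks.k1.brokerList = localhost:9092
agent.sinks.k1.channel = c1
```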

Before starting Flume, ensure ZooKeeper and Kafka are running:

./zkServer.sh start

Start Kafka (example command):

./kafka-server-start.sh -daemon ../config/server.properties

Consume messages with the default Kafka console consumer:

./kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic testKJ1

Generate test log data with a simple shell script (output.sh):

#!/bin/bash
# output.sh – append lines test-0 … test-50000 to abc.log
for ((i=0; i<=50000; i++)); do
  echo "test-$i" >> abc.log
done
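The loop runs from 0 through 50000 inclusive, so a successful run appends 50,001 lines. A quick self‑contained sanity check (file name abc.log as in the script):

```shell
# Generate the test data and verify the expected line count
rm -f abc.log
for ((i=0; i<=50000; i++)); do
  echo "test-$i" >> abc.log
done
wc -l < abc.log   # prints 50001
```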

Run the script, then start Flume with the following command (-n names the agent defined in the configuration file, -c points at the configuration directory, -f selects the agent configuration file, and the -D option sends INFO‑level logs to the console):

./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

The final log line "Component type:SINK,name:k1 started" indicates successful startup.

The complete workflow is illustrated in a diagram in the original article (image omitted for brevity).

In summary, this guide introduces the basics of Flume, its architecture, configuration, and integration with Kafka, laying the groundwork for deeper source‑code analysis and advanced tuning.

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
