
Introduction to Apache Flume: Architecture, Core Concepts, Configuration and Usage

This article provides a comprehensive overview of Apache Flume, covering its design goals, core components, deployment architecture, configuration patterns, and step‑by‑step instructions for integrating Flume with ZooKeeper and Kafka to collect and forward massive log data.


Flume is a distributed, reliable, and highly available system for massive log aggregation that allows users to define custom data sources, perform simple processing, and write events to various configurable sinks.

Design goals:

1. Reliability – three levels of guarantee: end‑to‑end, store‑on‑failure, and best‑effort.
2. Scalability – a three‑tier architecture (agent, collector, storage) with horizontal scaling and ZooKeeper‑managed masters.
3. Manageability – unified master control, with web and shell management interfaces.
4. Extensibility – pluggable agents, collectors, and storages, with many built‑in components.

Core concepts: Agent (runs on a JVM, may contain multiple sources and sinks), Client (produces data), Source (collects data from clients), Sink (consumes data from channels), Channel (queues events between sources and sinks), and Event (basic data payload).
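These concepts map directly onto Flume's properties‑style configuration. A minimal naming sketch (the agent and component names a1, r1, c1, k1 are illustrative, not from the original article):

```properties
# Declare the components owned by an agent named a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Wire them together: a source can write to several channels,
# but a sink drains exactly one channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note the asymmetry in the last two lines: sources take the plural `channels` property, sinks the singular `channel`, reflecting that fan‑out happens at the source side.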

Typical deployment involves downloading a Flume binary package (e.g., 1.6.0) from flume.apache.org, extracting it, and preparing configuration files.

Common configuration patterns include:

Pattern 1 – scanning a specific file.

Pattern 2 – (image omitted for brevity).

Pattern 3 – scanning a directory for new files.
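Pattern 3 is typically implemented with Flume's spooling‑directory source, which ingests each file dropped into a watched directory exactly once. A brief sketch (the directory path is an assumption):

```properties
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.channels = c1
# By default, fully ingested files are renamed with a .COMPLETED suffix
```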

For this tutorial the first pattern is used to integrate Flume with Kafka.
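Since the original configuration listing survives only as an omitted image, here is a sketch of what conf/hw.conf could look like for pattern 1 with Flume 1.6's built‑in Kafka sink. The agent name "agent" matches the -n flag used later; the log file path, channel capacity, and broker address are assumptions:

```properties
agent.sources = s1
agent.channels = c1
agent.sinks = k1

# Pattern 1: tail a single file via the exec source
agent.sources.s1.type = exec
agent.sources.s1.command = tail -F /home/hadoop/abc.log
agent.sources.s1.channels = c1

# Buffer events in memory between source and sink
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000

# Forward events to the testKJ1 Kafka topic
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.topic = testKJ1
agent.sinks.k1.brokerList = localhost:9092
agent.sinks.k1.channel = c1
```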

Before starting Flume, ensure ZooKeeper and Kafka are running:

./zkServer.sh start

Start Kafka (example command):

./kafka-server-start.sh -daemon ../config/server.properties

Consume messages with the default Kafka console consumer:

./kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic testKJ1

Generate test log data with a simple shell script (output.sh):

#!/bin/bash
# output.sh – append lines test-0 … test-50000 to abc.log
for ((i=0; i<=50000; i++)); do
  echo "test-$i" >> abc.log
done
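The loop runs from 0 through 50000 inclusive, so a successful run appends 50,001 lines. A quick self‑contained sanity check (file name abc.log as in the script):

```shell
# Generate the test data and verify the expected line count
rm -f abc.log
for ((i=0; i<=50000; i++)); do
  echo "test-$i" >> abc.log
done
wc -l < abc.log   # prints 50001
```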

Run the script, then start Flume with the following command (-n names the agent defined in the configuration file, -c points at the configuration directory, -f selects the agent configuration file, and the -D option sends INFO‑level logs to the console):

./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

The final log line "Component type:SINK,name:k1 started" indicates successful startup.

The complete workflow is illustrated in a diagram in the original article (image omitted for brevity).

In summary, this guide introduces the basics of Flume, its architecture, configuration, and integration with Kafka, laying the groundwork for deeper source‑code analysis and advanced tuning.

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
