Apache Flume NG Architecture, Core Concepts, and Practical Configuration Guide
This article introduces Apache Flume NG, a distributed and reliable log collection system, explains its core architecture components such as Event, Flow, Agent, Source, Channel, and Sink, and provides detailed configuration examples for various pipelines, including load‑balancing, failover, and integration with HDFS.
Flume NG is a distributed, reliable, and highly available system for efficiently collecting, aggregating, and moving massive log data from diverse sources to a centralized storage system. The NG version is a lightweight tool that supports failover and load balancing.
Key Architectural Concepts
Event : the basic unit of data, consisting of a byte payload and an optional set of headers.
Flow : an abstract representation of an Event’s migration from source to destination.
Client : operates at the source side to send Events to a Flume Agent.
Agent : an independent Flume process (a JVM) hosting one or more Sources, Channels, and Sinks.
Source : consumes Events generated by external systems and puts them into one or more Channels.
Channel : a temporary store that holds Events passed from the Source until a Sink consumes them.
Sink : reads Events from a Channel and forwards them to the next Agent or final storage (e.g., HDFS).
The typical data flow is: external system → Source → Channel → Sink → storage (e.g., HDFS).
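The flow above can be sketched as a single-agent properties file. This is a minimal sketch rather than the article's exact configuration: the agent name a1, the component names r1, c1, and k1, the port 41414, and the capacity values are all assumptions chosen for illustration.

```properties
# Hypothetical agent "a1": Avro Source -> Memory Channel -> Logger Sink.
# All names, ports, and sizes below are illustrative assumptions.

# Declare the components that belong to this agent.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for Avro RPC clients and writes Events into channel c1.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# Channel: buffers Events in memory between the Source and the Sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: takes Events from c1 and logs them (useful for a first smoke test).
# Note the singular "channel": a Sink reads from exactly one Channel.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

# Start the agent (the file path is an assumption):
#   bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties \
#       --name a1 -Dflume.root.logger=INFO,console
# Send a file to it with the bundled Avro client:
#   bin/flume-ng avro-client --conf conf -H localhost -p 41414 -F /path/to/file
```

The wiring is asymmetric by design: a Source lists its Channels in the plural `channels` property, while a Sink names a single `channel`.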
Typical Flow Configurations
Multiple agents connected sequentially.
Multiple agents aggregating into a single downstream agent.
Multiplexing agents using a selector for replication or routing based on header values.
Load‑balancing Sink Processor that distributes Events from a Channel to several Sinks.
Failover Sink Processor that maintains a priority list of Sinks and switches when a Sink becomes unavailable.
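The two Sink Processor setups in this list can be sketched as a sink group. This is a fragment under stated assumptions: it presumes an agent a1 with a Channel c1 defined elsewhere, and the host names collector1 and collector2, the port, and the priority values are hypothetical.

```properties
# Fragment for a hypothetical agent "a1"; sources and channel c1 are
# assumed to be defined elsewhere in the same file.

# Two Avro Sinks drain the same Channel and point at two downstream agents.
a1.sinks = k1 k2

a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1
a1.sinks.k1.port = 41414

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2
a1.sinks.k2.port = 41414

# Group the Sinks so that a Sink Processor decides which one is used.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Option 1: failover. The Sink with the highest priority is used while it
# is healthy; a failed Sink is penalized for up to maxpenalty milliseconds.
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Option 2: load balancing. To use it, replace the four failover lines
# above with the following three lines.
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin
# a1.sinkgroups.g1.processor.backoff = true
```

A sink group uses one processor type at a time, which is why the load-balancing variant is shown commented out.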
Basic Functionalities
Flume NG supports a wide range of Source, Channel, and Sink types.
Source Types
Avro Source : Built-in support for Avro RPC.
Thrift Source : Built-in support for Thrift protocol.
Exec Source : Executes a Unix command and reads its standard output as Events.
JMS Source : Reads messages from a JMS broker (e.g., ActiveMQ).
Spooling Directory Source : Monitors a directory for new files.
Twitter 1% Firehose Source : Streams a sample of Twitter data via API.
Netcat Source : Listens on a port and treats each line as an Event.
Sequence Generator Source : Generates sequential data.
Syslog Source : Consumes syslog data over UDP/TCP.
HTTP Source : Accepts HTTP POST/GET requests (JSON, BLOB).
Legacy Sources : Compatibility with Flume OG sources.
Channel Types
Memory Channel : Stores Events in memory.
JDBC Channel : Persists Events in a relational database (Derby supported).
File Channel : Persists Events to disk files.
Spillable Memory Channel : Hybrid memory-disk storage; experimental.
Pseudo Transaction Channel : Used for testing.
Custom Channel : User-defined implementation.
Sink Types
HDFS Sink : Writes data to HDFS.
Logger Sink : Logs Events at INFO level through the agent's logger; mainly useful for testing and debugging.
Avro Sink : Converts Events to Avro and sends via RPC.
Thrift Sink : Converts Events to Thrift and sends via RPC.
IRC Sink : Relays data to an IRC destination.
File Roll Sink : Writes data to local file system with rolling.
Null Sink : Discards all data.
HBase Sink : Writes data to HBase.
Morphline Solr Sink : Sends data to Solr clusters.
ElasticSearch Sink : Sends data to Elasticsearch clusters.
Kite Dataset Sink : Writes data to Kite Dataset (experimental).
Custom Sink : User-defined implementation.
Additional components such as Channel Selectors, Sink Processors, Event Serializers, and Interceptors are also available.
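As an illustration of two of these components, the fragment below attaches Interceptors and a multiplexing Channel Selector to one Source. It is a sketch under assumptions: the agent a1, the channels c1 and c2, the header name logType, and its values access and error are hypothetical, and the Source type and the Channel definitions are omitted.

```properties
# Fragment for a hypothetical agent "a1"; the type of source r1 and the
# definitions of channels c1 and c2 are assumed to exist elsewhere.
a1.sources = r1
a1.channels = c1 c2
a1.sources.r1.channels = c1 c2

# Interceptors run in the listed order and can add or modify Event headers.
a1.sources.r1.interceptors = i1 i2
# Adds a "timestamp" header with the processing time in milliseconds.
a1.sources.r1.interceptors.i1.type = timestamp
# Adds the agent's host to the header named by hostHeader.
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname

# Multiplexing selector: route each Event by the value of one header.
# "logType" is a hypothetical header that the sending client must set.
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = logType
a1.sources.r1.selector.mapping.access = c1
a1.sources.r1.selector.mapping.error = c2
# Events whose header is missing or has any other value go to c1.
a1.sources.r1.selector.default = c1

# For replication instead of routing, use the default selector, which
# copies every Event to all listed Channels:
# a1.sources.r1.selector.type = replicating
```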
Practical Application
Installation is straightforward; the examples use Flume NG version 1.5.0.1. All of them use a Memory Channel for simplicity:
Avro Source + Memory Channel + Logger Sink
Avro Source + Memory Channel + HDFS Sink
Spooling Directory Source + Memory Channel + HDFS Sink
Exec Source + Memory Channel + File Roll Sink
Each example shows how to edit the corresponding flume‑conf*.properties file, start the Agent, and send data using an Avro client or command‑line tool. The results are verified by checking logs, HDFS directories, or local file system paths.
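Two of the pipelines listed above might look roughly like this in a single properties file. This is a hedged sketch, not the article's actual files: the agent names a2 and a3, every path, the NameNode address, and the roll and capacity settings are assumptions made for illustration.

```properties
# One file may define several agents; --name selects which one is started.
# All paths, hosts, and sizes below are illustrative assumptions.

# ---- Agent a2: Spooling Directory Source -> Memory Channel -> HDFS Sink ----
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Files dropped into spoolDir are ingested; they must not be modified
# after being placed there.
a2.sources.r1.type = spooldir
a2.sources.r1.spoolDir = /var/log/flume-spool
a2.sources.r1.fileHeader = true
a2.sources.r1.channels = c1

a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 1000

a2.sinks.k1.type = hdfs
a2.sinks.k1.channel = c1
# %Y-%m-%d needs a timestamp; useLocalTimeStamp takes it from the agent's
# clock instead of requiring a "timestamp" header on each Event.
a2.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.sinks.k1.hdfs.filePrefix = events
# Write plain text instead of the default SequenceFile.
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.hdfs.writeFormat = Text
# Roll a new file every 300 seconds; 0 disables size- and count-based rolls.
a2.sinks.k1.hdfs.rollInterval = 300
a2.sinks.k1.hdfs.rollSize = 0
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.batchSize = 100

# ---- Agent a3: Exec Source -> Memory Channel -> File Roll Sink ----
a3.sources = r1
a3.channels = c1
a3.sinks = k1

# Each line the command prints to standard output becomes one Event.
a3.sources.r1.type = exec
a3.sources.r1.command = tail -F /var/log/app.log
a3.sources.r1.channels = c1

a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# The output directory must already exist; a new file starts every 30 s.
a3.sinks.k1.type = file_roll
a3.sinks.k1.channel = c1
a3.sinks.k1.sink.directory = /var/flume/out
a3.sinks.k1.sink.rollInterval = 30

# Start one agent per process, for example:
#   bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name a2
```

A Memory Channel keeps the examples short, but Events still buffered in it are lost if the agent stops; a File Channel trades some throughput for durability.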
The article concludes by encouraging readers to consult the official Flume user manual for more detailed configuration options.