Introduction to Flume NG: Architecture, Components, Configuration, and Best Practices
This article provides a comprehensive overview of Flume NG, covering its architecture, core components (source, channel, sink), reliability mechanisms, common deployment scenarios, installation steps, configuration examples, compilation instructions, and practical best‑practice recommendations for building robust log‑collection pipelines.
Flume NG is a distributed, reliable, and highly available system, originally developed at Cloudera (and now an Apache project), that efficiently collects, aggregates, and transports massive volumes of log data from diverse sources to centralized storage; it requires Java 1.6 or later.
The architecture revolves around three core concepts: Event (a data unit with optional headers), Flow (the abstract movement of events), and the Agent, which contains a Source, a Channel, and a Sink. Sources ingest data (e.g., Avro, log4j, syslog, HTTP), Channels temporarily store events (Memory, MemoryRecover, File), and Sinks deliver events to destinations such as HDFS, HBase, or files.
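The source/channel/sink wiring described above can be sketched in a minimal agent definition; the names (`a1`, `r1`, `c1`, `k1`), the port, and the capacity below are illustrative, not taken from the article:

```properties
# Minimal single-agent sketch: one Avro source, one memory channel, one sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source listening for incoming events
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Logger sink, useful for smoke-testing a new pipeline
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```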
Flume guarantees reliability through transactional semantics: an event is removed from a channel only after it has been successfully persisted by a sink, ensuring end‑to‑end data integrity even across multiple agents.
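The transactional guarantee is strongest with a durable channel. A file channel persists events to disk and removes them only after the sink's transaction commits, so they survive an agent crash; the directory paths below are placeholders:

```properties
# File channel: events are checkpointed and persisted to disk,
# and deleted only after the downstream sink commits its transaction.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```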
Typical deployment patterns include sequentially chained agents, fan‑in aggregation of many agents into a single collector, and multiplexing or replication agents for load‑balancing and failover. Configuration snippets illustrate how to define sources, channels, sinks, and selector types (replicating or multiplexing) within the .conf file.
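As one hedged example of a selector definition, a multiplexing selector routes each event to a channel based on a header value (here a hypothetical `state` header feeding two channels `c1` and `c2`):

```properties
# Multiplexing selector: route events by the value of the "state" header.
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c1
```

Replacing `multiplexing` with `replicating` (the default) instead copies every event to all listed channels, which is the basis of the replication pattern mentioned above.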
Installation is straightforward via RPM; once installed, an agent is started with a command such as:

$ flume-ng agent -c /etc/flume-ng/conf -f /etc/flume-ng/conf/f1.conf -Dflume.root.logger=DEBUG,console -n agent-1

Example configurations demonstrate using an Avro source, a SpoolDir source, and writing to HDFS or HBase. Each example includes the full agent definition, the channel type (memory or file), source settings, and sink parameters, with the command-line options explained (e.g., -n for the agent name, -c for the configuration directory).
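A fuller sketch along the lines of those examples, combining a SpoolDir source, a file channel, and an HDFS sink; the agent name, directories, and NameNode address are assumptions to adapt:

```properties
agent-1.sources = src-1
agent-1.channels = ch-1
agent-1.sinks = sink-1

# SpoolDir source: ingest files dropped into a spool directory
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.spoolDir = /var/log/inbound
agent-1.sources.src-1.channels = ch-1

# Durable file channel
agent-1.channels.ch-1.type = file
agent-1.channels.ch-1.checkpointDir = /var/flume/checkpoint
agent-1.channels.ch-1.dataDirs = /var/flume/data

# HDFS sink: write plain text, roll files every 5 minutes
agent-1.sinks.sink-1.type = hdfs
agent-1.sinks.sink-1.channel = ch-1
agent-1.sinks.sink-1.hdfs.path = hdfs://namenode:8020/flume/%Y-%m-%d
agent-1.sinks.sink-1.hdfs.fileType = DataStream
agent-1.sinks.sink-1.hdfs.rollInterval = 300
agent-1.sinks.sink-1.hdfs.rollSize = 0
agent-1.sinks.sink-1.hdfs.rollCount = 0
agent-1.sinks.sink-1.hdfs.useLocalTimeStamp = true
```

Note the src-/ch-/sink- naming convention, which keeps multi-component configurations readable.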
For developers, the source code can be compiled from GitHub using Maven, with notes on handling missing dependencies and adding custom repositories. The article also lists best-practice recommendations: naming conventions (src-, ch-, sink- prefixes), using Avro for inter-module communication, a three-tier deployment (Agent, Collector, Store), extending channels (e.g., a dual memory/file channel) for higher throughput, and monitoring channel congestion and HDFS write performance.
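The build itself follows the usual Maven workflow; this is a hypothetical sketch (repository URL and flags are assumptions, and skipping tests shortens the build):

```shell
# Clone the Flume sources and build all modules without running tests
git clone https://github.com/apache/flume.git
cd flume
mvn clean install -DskipTests
```

If a dependency fails to resolve, the article's advice is to add the missing artifact's repository to the Maven settings or the project POM.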
Additional resources include links to official Flume documentation, example projects, and a GitHub repository containing enhancements made by Meituan.