Introduction to Flume NG: Architecture, Components, Configuration, and Best Practices
This article provides a comprehensive overview of Flume NG, covering its architecture, core components (source, channel, sink), reliability mechanisms, common deployment scenarios, installation steps, configuration examples, compilation instructions, and practical best‑practice recommendations for building robust log‑collection pipelines.
Flume NG is a distributed, reliable, and highly available system, originally developed at Cloudera (and now an Apache project), that efficiently collects, aggregates, and transports massive volumes of log data from diverse sources to centralized storage; it requires Java 1.6 or later.
The architecture revolves around three core concepts: Event (a data unit with optional headers), Flow (the abstract movement of events), and the Agent, which contains a Source, a Channel, and a Sink. Sources ingest data (e.g., Avro, log4j, syslog, HTTP), Channels temporarily store events (Memory, MemoryRecover, File), and Sinks deliver events to destinations such as HDFS, HBase, or files.
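The source/channel/sink wiring described above can be sketched in a minimal agent definition; the names (`a1`, `r1`, `c1`, `k1`), the port, and the capacity below are illustrative, not taken from the article:

```properties
# Minimal single-agent sketch: one Avro source, one memory channel, one sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source listening for incoming events
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Logger sink, useful for smoke-testing a new pipeline
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```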
Flume guarantees reliability through transactional semantics: an event is removed from a channel only after it has been successfully persisted by a sink, ensuring end‑to‑end data integrity even across multiple agents.
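The transactional guarantee is strongest with a durable channel. A file channel persists events to disk and removes them only after the sink's transaction commits, so they survive an agent crash; the directory paths below are placeholders:

```properties
# File channel: events are checkpointed and persisted to disk,
# and deleted only after the downstream sink commits its transaction.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```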
Typical deployment patterns include sequentially chained agents, fan‑in aggregation of many agents into a single collector, and multiplexing or replication agents for load‑balancing and failover. Configuration snippets illustrate how to define sources, channels, sinks, and selector types (replicating or multiplexing) within the .conf file.
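As one hedged example of a selector definition, a multiplexing selector routes each event to a channel based on a header value (here a hypothetical `state` header feeding two channels `c1` and `c2`):

```properties
# Multiplexing selector: route events by the value of the "state" header.
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c1
```

Replacing `multiplexing` with `replicating` (the default) instead copies every event to all listed channels, which is the basis of the replication pattern mentioned above.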
Installation is straightforward via RPM; once installed, an agent is started with a command such as:

$ flume-ng agent -c /etc/flume-ng/conf -f /etc/flume-ng/conf/f1.conf -Dflume.root.logger=DEBUG,console -n agent-1

Example configurations demonstrate using an Avro source, a SpoolDir source, and writing to HDFS or HBase. Each example includes the full agent definition, the channel type (memory or file), source settings, and sink parameters, with the command-line options explained (e.g., -n for the agent name, -c for the configuration directory).
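A fuller sketch along the lines of those examples, combining a SpoolDir source, a file channel, and an HDFS sink; the agent name, directories, and NameNode address are assumptions to adapt:

```properties
agent-1.sources = src-1
agent-1.channels = ch-1
agent-1.sinks = sink-1

# SpoolDir source: ingest files dropped into a spool directory
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.spoolDir = /var/log/inbound
agent-1.sources.src-1.channels = ch-1

# Durable file channel
agent-1.channels.ch-1.type = file
agent-1.channels.ch-1.checkpointDir = /var/flume/checkpoint
agent-1.channels.ch-1.dataDirs = /var/flume/data

# HDFS sink: write plain text, roll files every 5 minutes
agent-1.sinks.sink-1.type = hdfs
agent-1.sinks.sink-1.channel = ch-1
agent-1.sinks.sink-1.hdfs.path = hdfs://namenode:8020/flume/%Y-%m-%d
agent-1.sinks.sink-1.hdfs.fileType = DataStream
agent-1.sinks.sink-1.hdfs.rollInterval = 300
agent-1.sinks.sink-1.hdfs.rollSize = 0
agent-1.sinks.sink-1.hdfs.rollCount = 0
agent-1.sinks.sink-1.hdfs.useLocalTimeStamp = true
```

Note the src-/ch-/sink- naming convention, which keeps multi-component configurations readable.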
For developers, the source code can be compiled from GitHub using Maven, with notes on handling missing dependencies and adding custom repositories. The article also lists best-practice recommendations: naming conventions (src-, ch-, sink- prefixes), using Avro for inter-module communication, a three-tier deployment (Agent, Collector, Store), extending channels (e.g., a dual memory/file channel) for higher throughput, and monitoring channel congestion and HDFS write performance.
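The build itself follows the usual Maven workflow; this is a hypothetical sketch (repository URL and flags are assumptions, and skipping tests shortens the build):

```shell
# Clone the Flume sources and build all modules without running tests
git clone https://github.com/apache/flume.git
cd flume
mvn clean install -DskipTests
```

If a dependency fails to resolve, the article's advice is to add the missing artifact's repository to the Maven settings or the project POM.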
Additional resources include links to official Flume documentation, example projects, and a GitHub repository containing enhancements made by Meituan.