
Centralized Log Collection for Distributed Docker Services Using Flume and Kafka

This article presents a practical architecture for centrally collecting dispersed logs from Docker‑based services in a distributed environment by leveraging Flume NG as a non‑intrusive log agent, Kafka as a high‑throughput message bus, and custom sinks to partition logs by service, module, and day.


1 Background and Problem

With the widespread adoption of cloud computing, PaaS platforms, virtualization, and containerization (e.g., Docker), many services run in the cloud where traditional SSH or FTP access to logs is no longer feasible. Engineers still need logs for monitoring, analysis, and troubleshooting, especially during deployments that are performed through GUI‑driven PaaS consoles. This article proposes a method to aggregate scattered logs from containerized services in a distributed environment.

2 Design Constraints and Requirements

2.1 Application Scenario

The solution must handle logs generated by hundreds of servers. A typical log entry is smaller than 1 KB (the largest entries reach up to 50 KB), and the total daily volume stays under 500 GB.
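As a sanity check on these figures, the 500 GB/day ceiling averages out to only a few MB/s of sustained ingest across the whole cluster (peaks will of course be higher). A back-of-envelope calculation:

```shell
# Back-of-envelope: convert the stated 500 GB/day ceiling into an
# average sustained ingest rate in MB/s (86400 seconds per day).
awk 'BEGIN {
  gb_per_day = 500
  mb_per_sec = gb_per_day * 1024 / 86400
  printf "avg ingest: %.1f MB/s\n", mb_per_sec
}'
```

At roughly 6 MB/s on average, a modest Kafka cluster has ample headroom for this workload.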

2.2 Functional Requirements

Collect all service logs centrally.

Distinguish logs by service, module, and day granularity.

2.3 Non‑Functional Requirements

Non‑intrusive: the collector runs as an independent process with controllable resource usage.

Low latency: end‑to‑end delay from log generation to central storage must be < 4 seconds.

Persistence: retain logs for the most recent N days.

Acceptable loss: occasional loss is tolerable as long as the loss rate stays below a defined threshold (e.g., 0.01%).

Ordering: strict ordering is not required.

Availability: the collector is an offline function with a target of three‑nines (99.9%) yearly uptime.

3 Implementation Architecture

The architecture consists of three layers: Producer, Broker, and Consumer.

3.1 Producer Layer

Each Docker container runs a Flume NG agent that reads logs via tail -F (the source) and forwards them to a Kafka sink. This design keeps the collector separate from the application process, satisfying the non‑intrusive requirement.

3.2 Broker Layer

Multiple Flume NG agents publish logs to a Kafka cluster. Kafka provides high throughput, partitioned storage, and replication (configured with a replication factor of 2 and four partitions) to meet performance and durability goals.

3.3 Consumer Layer

A second Flume NG agent consumes the Kafka topics. It uses a custom RollingByTypeAndDayFileSink (available on GitHub) to write logs to files, partitioned by service/module name and date.

4 Practical Implementation

4.1 Container Configuration

Dockerfile

The Docker image includes the base runtime and Flume binaries. The entry point uses supervisord to keep both the application and Flume processes alive.

FROM ${BASE_IMAGE}
MAINTAINER ${MAINTAINER}
ENV REFRESH_AT ${REFRESH_AT}
RUN mkdir -p /opt/${MODULE_NAME}
ADD ${PACKAGE_NAME} /opt/${MODULE_NAME}/
COPY service.supervisord.conf /etc/supervisord.conf
COPY supervisor-msoa-wrapper.sh /opt/${MODULE_NAME}/supervisor-msoa-wrapper.sh
RUN chmod +x /opt/${MODULE_NAME}/supervisor-msoa-wrapper.sh
RUN chmod +x /opt/${MODULE_NAME}/*.sh
ENTRYPOINT ["/usr/bin/supervisord", "-c", "/etc/supervisord.conf"]

supervisord Configuration

[program:${MODULE_NAME}]
command=/opt/${MODULE_NAME}/supervisor-msoa-wrapper.sh

supervisor-msoa-wrapper.sh

The script starts the application, launches Flume NG, and uses wait to keep the container alive.

#!/bin/bash
function shutdown() {
  date
  echo "Shutting down Service"
  unset SERVICE_PID
  cd /opt/${MODULE_NAME}
  source stop.sh
}

# Stop process
cd /opt/${MODULE_NAME}
echo "Stopping Service"
source stop.sh

# Start process
echo "Starting Service"
source start.sh                 # start.sh is expected to launch the service in the background
export SERVICE_PID=$!           # $! captures the PID of that background process

# Start Flume NG agent after logs are generated
sleep 4
nohup /opt/apache-flume-1.6.0-bin/bin/flume-ng agent \
  --conf /opt/apache-flume-1.6.0-bin/conf \
  --conf-file /opt/apache-flume-1.6.0-bin/conf/logback-to-kafka.conf \
  --name a1 -Dflume.root.logger=INFO,console &

# Note: SIGKILL cannot be trapped, so it is not listed here
trap shutdown HUP INT QUIT ABRT ALRM TERM TSTP

echo "Waiting for $SERVICE_PID"
wait $SERVICE_PID

Flume Configuration

A custom StaticLinePrefixExecSource adds a fixed prefix (service and module name) to each log line before it reaches the sink.

Example transformed log line:

service1##$$##m1-ocean-1004.cp  [INFO] 2016-03-18 12:59:31,080 [main] fountain.runner.CustomConsumerFactoryPostProcessor (CustomConsumerFactoryPostProcessor.java:91) -Start to init IoC container by loading XML bean definitions from classpath:fountain-consumer-stdout.xml

The channel uses an in‑memory queue; the sink is a Kafka sink with batch size of 5 events (adjustable for higher concurrency).
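Putting these pieces together, the producer agent's configuration might look roughly like the following sketch. The class package, log path, and prefix value are assumptions based on the article's description, not the exact published config:

```
# logback-to-kafka.conf (sketch; names and paths are illustrative)
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Custom exec source that prepends "service##$$##module" to each line
a1.sources.s1.type = com.example.flume.source.StaticLinePrefixExecSource
a1.sources.s1.command = tail -F /opt/${MODULE_NAME}/logs/service.log
a1.sources.s1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Flume 1.6 Kafka sink with the small batch size described above
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = keplerlog
a1.sinks.k1.brokerList = kafka-host:9092
a1.sinks.k1.batchSize = 5
a1.sinks.k1.channel = c1
```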

4.2 Broker Configuration

Create a Kafka topic named keplerlog with replication factor 2 and 4 partitions:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 4 --topic keplerlog

Produce test messages:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic keplerlog

Consume to verify:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic keplerlog --from-beginning

4.3 Centralized Log Reception

The consumer Flume agent uses KafkaSource to pull logs, an in‑memory channel, and the custom RollingByTypeAndDayFileSink to write files named {module}.{YYYYMMDD} (e.g., portal.20160512).

~/data/kepler-log$ ls
authorization.20160512
default.20160513
default.20160505
portal.20160512
portal.20160505
portal.20160514
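The consumer agent's configuration can be sketched along the same lines. The custom sink's package name and its directory property are assumptions; the KafkaSource properties follow Flume 1.6 conventions:

```
# Consumer agent config (sketch; class and property names for the
# custom sink are illustrative assumptions)
a2.sources = r1
a2.channels = c1
a2.sinks = k1

a2.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a2.sources.r1.zookeeperConnect = localhost:2181
a2.sources.r1.topic = keplerlog
a2.sources.r1.channels = c1

a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000

# Custom sink that rolls files by service/module name and day
a2.sinks.k1.type = com.example.flume.sink.RollingByTypeAndDayFileSink
a2.sinks.k1.sink.directory = /home/user/data/kepler-log
a2.sinks.k1.channel = c1
```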

Two Notable Pitfalls

Pitfall 1

When using the custom StaticLinePrefixExecSource, the default KafkaSource discards custom event headers, so the prefix must be embedded in the event body (using the ##$$## delimiter) for the sink to extract the module name.
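For illustration, recovering the prefix from the event body by splitting on the ##$$## delimiter can be sketched in shell (the log line here is a shortened, hypothetical example):

```shell
# A shortened example line in the article's prefix##$$##body format
line='service1##$$##[INFO] 2016-03-18 12:59:31,080 [main] ...'

# Split on the ##$$## delimiter; field 1 is the service/module
# prefix the custom sink keys its output files on.
module=$(printf '%s' "$line" | awk -F'##[$][$]##' '{print $1}')
echo "$module"   # prints: service1
```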

Pitfall 2

Wrapping tail -F in a script that pipes through other commands (e.g., awk) can cause the exec source to drop or delay recent lines, typically because the intermediate commands buffer their output when not attached to a terminal. Invoking tail -F directly, without extra piping, avoids the issue.
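In Flume config terms, the fix is to point the exec source's command directly at tail. The paths and the wrapper-script name below are hypothetical:

```
# Correct: invoke tail -F directly as the exec source command
a1.sources.s1.command = tail -F /opt/${MODULE_NAME}/logs/service.log

# Problematic: a wrapper that pipes through awk can buffer and
# delay recent lines before Flume ever sees them
# a1.sources.s1.command = /opt/scripts/tail-and-filter.sh
```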

5 Conclusion

The presented solution demonstrates how open‑source components such as Flume and Kafka can be combined to build a reliable, low‑latency log collection pipeline for distributed Docker services. Beyond collection, the aggregated logs enable downstream analytics like Spark Streaming for traffic control, awk/Scala for ad‑hoc statistics, Hadoop/Spark for big‑data processing, and ELK for searchable dashboards.

Tags: distributed systems, Docker, Kafka, log collection, Flume
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.