
Redesigning Snowball's Log Collection Architecture During Hadoop Cluster Expansion

The article details Snowball's challenges with a saturated CDH Hadoop cluster, outlines the limitations of the original Kafka‑based log pipeline, and explains how a comprehensive redesign using FlumeNG, Spillable Memory Channels, and custom HDFS sinks resolves latency, data loss, and high‑load issues while supporting future growth.

Snowball Engineer Team

Snowball's existing CDH Hadoop cluster was nearing storage and compute saturation, prompting a migration to a new CDH cluster and a simultaneous upgrade of the log collection system.

The original log pipeline, deployed in 2015, relied on Kafka to transport data directly to HDFS. While simple, and able to lean on Kafka's ISR (in-sync replica) mechanism for availability, it suffered from client-version compatibility risks, a lack of data-recovery tooling, high costs for achieving availability, limited output targets, and insufficient monitoring.

Rapid data growth (approximately three times the 2015 volume by 2017) made the existing architecture untenable, leading to a complete redesign.

The new pipeline replaces Kafka with FlumeNG for cost‑effective transport, routes offline data straight to HDFS, Hive, and HBase, and filters unnecessary logs before forwarding real‑time data to Kafka for Flink and Spark processing.
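The split described above can be expressed directly in a FlumeNG agent's properties file. The sketch below is illustrative only (agent name, ports, paths, and topic are hypothetical, not Snowball's actual configuration): one Avro source fans out to two channels, with bulk data landing on HDFS and only the real-time subset forwarded to Kafka.

```properties
# Hypothetical agent "a1": one source, two channels, two sinks.
a1.sources = r1
a1.channels = c_offline c_realtime
a1.sinks = k_hdfs k_kafka

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c_offline c_realtime

a1.channels.c_offline.type = file
a1.channels.c_realtime.type = memory

# Offline bulk data goes straight to HDFS (Hive/HBase sinks follow the same pattern).
a1.sinks.k_hdfs.type = hdfs
a1.sinks.k_hdfs.channel = c_offline
a1.sinks.k_hdfs.hdfs.path = hdfs://namenode/logs/%Y%m%d

# Only the filtered real-time subset reaches Kafka for Flink/Spark consumption.
a1.sinks.k_kafka.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k_kafka.channel = c_realtime
a1.sinks.k_kafka.kafka.bootstrap.servers = kafka1:9092
a1.sinks.k_kafka.kafka.topic = realtime-logs
```

Because Kafka now sees only real-time traffic, the cluster needed to sustain the pipeline shrinks accordingly, which is where the cost savings come from.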

Technical choices behind selecting FlumeNG for the redesign:

- A rich plugin ecosystem and comprehensive built-in monitoring.
- Java-based development, enabling easy customization and debugging.
- Support for hot configuration reloads and ZooKeeper-managed configuration.
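For context on the last point: FlumeNG's standard launcher can poll its local properties file for changes, and since Flume 1.5 an agent can instead fetch its configuration from ZooKeeper and react to updates. A sketch of the latter invocation (host names and the base path are placeholders, not Snowball's values):

```
# Start an agent whose configuration lives in ZooKeeper under /flume/<agent-name>.
bin/flume-ng agent --name a1 --conf conf \
    -z zk1:2181,zk2:2181 \
    -p /flume
```

With either mechanism, configuration changes roll out without redeploying the agent binary, which is what makes hot deployment practical across a fleet of stateless collector nodes.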

Key improvements that address the earlier pipeline's issues:

- The Spillable Memory Channel provides high availability with disk-backed buffering.
- Segregating offline and real-time data reduces Kafka load and directs bulk data to HDFS/Hive/HBase.
- Interceptors split business-specific logs at the source, saving network I/O and CPU.
- Support for multiple source types (e.g., Avro, Syslog) allows a gradual migration off Kafka.
- Stateless FlumeNG nodes are simpler to scale and restart than Kafka brokers.
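Several of these improvements are plain Flume configuration rather than custom code. The fragment below sketches how they compose on one agent (all names, ports, paths, header values, and the regex are hypothetical illustrations, not Snowball's actual settings): a spillable memory channel, Syslog and Avro sources side by side, a regex interceptor dropping unwanted logs at the source, and a multiplexing selector routing offline vs. real-time events by header.

```properties
a1.sources = r_avro r_syslog
a1.channels = c_offline c_realtime

# Spillable memory channel: an in-memory queue that overflows to disk under load,
# so a downstream stall does not immediately drop events.
a1.channels.c_offline.type = SPILLABLEMEMORY
a1.channels.c_offline.memoryCapacity = 10000
a1.channels.c_offline.overflowCapacity = 1000000
a1.channels.c_offline.checkpointDir = /data/flume/checkpoint
a1.channels.c_offline.dataDirs = /data/flume/data

a1.channels.c_realtime.type = memory

# Syslog and Avro sources let clients migrate off Kafka gradually.
a1.sources.r_syslog.type = syslogtcp
a1.sources.r_syslog.host = 0.0.0.0
a1.sources.r_syslog.port = 5140
a1.sources.r_syslog.channels = c_offline

a1.sources.r_avro.type = avro
a1.sources.r_avro.bind = 0.0.0.0
a1.sources.r_avro.port = 4141
a1.sources.r_avro.channels = c_offline c_realtime

# Drop unneeded logs at the source instead of shipping them downstream.
a1.sources.r_avro.interceptors = i1
a1.sources.r_avro.interceptors.i1.type = regex_filter
a1.sources.r_avro.interceptors.i1.regex = ^DEBUG
a1.sources.r_avro.interceptors.i1.excludeEvents = true

# Route events to the offline or real-time channel based on a header.
a1.sources.r_avro.selector.type = multiplexing
a1.sources.r_avro.selector.header = log_type
a1.sources.r_avro.selector.mapping.offline = c_offline
a1.sources.r_avro.selector.mapping.realtime = c_realtime
a1.sources.r_avro.selector.default = c_offline
```

Filtering and routing at the source is what yields the network-I/O and CPU savings: events that would previously traverse Kafka only to be discarded now never leave the collecting agent.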

Additional custom development includes a rewritten HDFS sink that removes idle-time checks, reduces lock granularity, and batches writes grouped by file descriptor for better CPU and network utilization.
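The batching idea can be illustrated with a simplified, self-contained sketch (not Snowball's actual sink code; the class and method names are invented, and the HDFS write is replaced by an in-memory record): events destined for the same file are buffered under one key and flushed as a single write.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified sketch of per-file-descriptor write batching. */
public class BatchingSinkSketch {
    private final Map<String, List<String>> buffers = new HashMap<>();
    private final int batchSize;
    private final List<String> flushed = new ArrayList<>(); // stands in for HDFS writes

    public BatchingSinkSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Buffer an event under its target-file key; flush only when the batch fills. */
    public void append(String fileKey, String event) {
        List<String> buf = buffers.computeIfAbsent(fileKey, k -> new ArrayList<>());
        buf.add(event);
        if (buf.size() >= batchSize) {
            flush(fileKey);
        }
    }

    /** One write per batch instead of one per event. */
    public void flush(String fileKey) {
        List<String> buf = buffers.remove(fileKey);
        if (buf != null && !buf.isEmpty()) {
            flushed.add(fileKey + ":" + buf.size()); // record "file:eventsWritten"
        }
    }

    public List<String> flushedBatches() {
        return flushed;
    }
}
```

Because the lock protecting each file is taken once per batch rather than once per event, lock contention and per-write overhead both drop, which is where the CPU and network gains come from.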

A Logback appender was also built to send logs to FlumeNG, replacing the previous Log4j-only solution and improving availability by buffering logs in local files, watched via Java NIO's WatchService, rather than in Berkeley DB (BDB).
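The fallback behavior can be sketched as follows (a simplified illustration, not the actual appender: the Avro RPC call to FlumeNG and the WatchService-monitored spool files are replaced by in-memory stand-ins, and all names are invented): when the remote send fails, events are spooled locally instead of dropped, then replayed once FlumeNG is reachable again.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

/** Simplified sketch of a spool-and-replay log forwarder. */
public class SpoolingForwarder {
    private final Predicate<String> remoteSend; // stands in for the RPC call to FlumeNG
    private final Deque<String> spool = new ArrayDeque<>(); // stands in for local spool files

    public SpoolingForwarder(Predicate<String> remoteSend) {
        this.remoteSend = remoteSend;
    }

    /** Try the remote first; on failure, buffer locally instead of dropping the event. */
    public void append(String event) {
        if (!remoteSend.test(event)) {
            spool.addLast(event);
        }
    }

    /** Replay spooled events in order; in the real appender a WatchService
     *  noticing spool files would trigger this. Returns how many were sent. */
    public int replay() {
        int sent = 0;
        while (!spool.isEmpty() && remoteSend.test(spool.peekFirst())) {
            spool.removeFirst();
            sent++;
        }
        return sent;
    }

    public int spooled() {
        return spool.size();
    }
}
```

Keeping the buffer as plain watched files avoids the operational weight of an embedded BDB store inside every application process.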

The article concludes that systematic problem analysis, leveraging existing open‑source solutions, and targeted customizations are essential for effective infrastructure upgrades, and invites interested engineers to apply for Snowball's open positions.

Big Data · Data Pipeline · Kafka · Log Collection · Cluster Migration · Hadoop · FlumeNG
Written by Snowball Engineer Team

Proactivity, efficiency, professionalism, and empathy are the core values of the Snowball Engineer Team; curiosity, passion, and sharing of technology drive their continuous progress.
