
Choosing the Right Monitoring Stack: From Nagios to Prometheus & Grafana

This article reviews common open-source monitoring combinations, compares their strengths and weaknesses, and shares practical guidance on selecting collectors, storage back-ends, and visualization tools (Telegraf, InfluxDB, Prometheus, Grafana, and Alertmanager) for large-scale data platform operations.


Popular Monitoring Tool Choices

Common monitoring stacks include:

Nagios + Ganglia

Zabbix

Telegraf (or collectd) + InfluxDB / Prometheus / Elasticsearch + Grafana + Alertmanager

Nagios, Ganglia, and Zabbix are older-generation open-source solutions, while Prometheus and Grafana are newer and more extensible.

Nagios + Ganglia

Nagios, originally released in 1999 as “NetSaint”, monitors network services and host resources on Linux/Unix. Checks and alerts are driven by custom plugin scripts, but it lacks auto-discovery, has a cumbersome configuration model, offers weak time-series storage, and retains little historical data.

Ganglia, started at UC Berkeley, scales to thousands of nodes through its gmond agents, gmetad aggregators, and a web front-end. It tracks CPU, memory, disk, I/O, and network metrics, but its monitoring scope is limited and custom metrics can be complex to configure.

Zabbix

Zabbix is easy to start with and provides basic monitoring, but deep customization requires extensive development. It generates many alerts unless filtered, and setting up custom item alerts can be tedious.

Telegraf/collectd + InfluxDB / Prometheus / Elasticsearch + Grafana + Alertmanager

This modular stack leverages the strengths of each component: flexible data collection, scalable storage, rich visualization, and robust alerting. The trade‑off is higher integration effort and the need for strong operational expertise to choose the right combination.

Practical Experience

1. Collector Selection

After evaluating collectd, Telegraf, and jmxtrans, Telegraf was chosen for its stability, active community, and Go-based plugin system. It can write metrics to InfluxDB, Prometheus, Elasticsearch, and other back-ends.
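As an illustration, a minimal Telegraf setup pairs a few host-level input plugins with an InfluxDB output. This is only a sketch: the plugin names are Telegraf's standard ones, but the endpoint URL and database name below are placeholder assumptions.

```toml
# telegraf.conf — minimal sketch; tune intervals and tags for your fleet
[agent]
  interval = "10s"            # collection interval

[[inputs.cpu]]                # per-CPU and total CPU usage
  percpu = true
  totalcpu = true

[[inputs.mem]]                # memory usage

[[inputs.disk]]               # filesystem usage

[[outputs.influxdb]]          # ship metrics to InfluxDB (v1 API)
  urls = ["http://influxdb.example.com:8086"]   # placeholder endpoint
  database = "telegraf"                         # placeholder database name
```

Swapping the output section is how the same collectors feed Prometheus or Elasticsearch instead.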

2. Database Selection

Initially InfluxDB was used, but scaling to thousands of servers caused read/write timeouts. Adjusting retention policies can help:

<code>ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 72h REPLICATION 1 SHARD DURATION 24h DEFAULT</code>

Key parameters:

duration – data retention period (0 means unlimited)

shardGroupDuration – storage interval for shards; longer intervals reduce query efficiency

replication – number of replicas

default – whether this policy is the default
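Beyond shortening retention, downsampling relieves pressure on InfluxDB: keep raw samples for 72 hours while a continuous query rolls 5-minute means into a longer-lived policy. The sketch below assumes Telegraf's default `cpu` measurement and `usage_user` field; the policy and target measurement names are made up for illustration.

```sql
-- Keep 5-minute rollups for a year under a separate retention policy
CREATE RETENTION POLICY "one_year" ON "telegraf" DURATION 52w REPLICATION 1

-- Continuously downsample raw CPU samples into the long-lived policy
CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "telegraf" BEGIN
  SELECT mean("usage_user") AS "usage_user"
  INTO "telegraf"."one_year"."cpu_5m"
  FROM "cpu"
  GROUP BY time(5m), *
END
```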

Because the open-source edition of InfluxDB lacks stable clustering support, the author later moved large-scale monitoring to Elasticsearch, and in some setups to a Prometheus federation.
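For the Prometheus-federation route, a global Prometheus scrapes selected series from per-cluster instances through the standard `/federate` endpoint. A hedged sketch follows; the target addresses and the `job="telegraf"` matcher are placeholder assumptions.

```yaml
# prometheus.yml on the global (federating) server — minimal sketch
scrape_configs:
  - job_name: "federate"
    metrics_path: "/federate"
    honor_labels: true                  # keep labels from the source Prometheus
    params:
      "match[]":
        - '{job="telegraf"}'            # placeholder selector: pull only these series
    static_configs:
      - targets:
          - "prom-cluster-a.example.com:9090"   # placeholder per-cluster servers
          - "prom-cluster-b.example.com:9090"
```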

3. Grafana Visualization Tips

Grafana integrates with many databases and offers ready‑made JSON dashboards for rapid data presentation.

Host Monitoring Items: kernel, memory, load, disk I/O, network, inode usage, and process and thread counts.

Host classification separates cluster nodes from interface machines, simplifying fault isolation.

Top-10 Host Resource Usage: ranks hosts by CPU, memory, load, and thread metrics across defined host groups.

Top-10 Process Resource Usage: lists processes with the highest CPU and memory consumption, providing start time, PID, and usage percentages to aid root-cause analysis.
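When the data sits in InfluxDB, a Top-10 panel can be backed by InfluxQL's TOP() function. The sketch below assumes Telegraf's default `cpu` measurement, `usage_user` field, and `host` tag:

```sql
-- Ten hosts with the highest user-CPU usage over the last 5 minutes
SELECT TOP("usage_user", "host", 10)
FROM "cpu"
WHERE time > now() - 5m
```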

Platform Monitoring Items

Key platform services (HDFS, YARN, Zookeeper, Kafka, Storm, Spark, HBase) are monitored at service, role, and instance levels, generating tens of thousands of metrics.

YARN queue resource usage visualizations

Zeppelin operation logs for auditing

HDFS directory file counts and storage profiling

Cluster NameNode RPC latency and request volume analysis

Analyzing NameNode and HDFS audit logs helps identify excessive RPC traffic caused by poorly designed jobs, enabling targeted optimizations.
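A first pass at finding who is hammering the NameNode can be done directly on the audit log with standard text tools. The sketch below fabricates a few sample lines in the usual `key=value` audit format (real logs live on the NameNode host under the Hadoop log directory; the users, paths, and timestamps here are hypothetical) and ranks user/command pairs by call count:

```shell
# Hypothetical sample lines in the HDFS audit-log key=value format
printf '%s\n' \
  '2024-01-01 10:00:00 INFO FSNamesystem.audit: allowed=true ugi=etl_user ip=/10.0.0.1 cmd=getfileinfo src=/data/a dst=null perm=null' \
  '2024-01-01 10:00:01 INFO FSNamesystem.audit: allowed=true ugi=etl_user ip=/10.0.0.1 cmd=getfileinfo src=/data/b dst=null perm=null' \
  '2024-01-01 10:00:02 INFO FSNamesystem.audit: allowed=true ugi=bi_user ip=/10.0.0.2 cmd=open src=/data/c dst=null perm=null' \
  > hdfs-audit.sample.log

# Extract the user (ugi=) and command (cmd=) fields from each line,
# then rank the heaviest (user, command) pairs by call count
awk '{u="";c=""; for(i=1;i<=NF;i++){if($i~/^ugi=/)u=$i; if($i~/^cmd=/)c=$i} print u, c}' \
    hdfs-audit.sample.log | sort | uniq -c | sort -rn | head
```

The same pipeline over a day of real audit logs quickly surfaces the jobs issuing excessive getfileinfo or listStatus calls.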

Daily Production Monitoring

Visual dashboards provide quick insight into daily production metrics, helping operators locate performance bottlenecks and production delays.

Conclusion: This article covered monitoring tool selection, collector and storage choices, and visualization techniques. Future posts will discuss alerting strategies, unified collection templates, and automated recovery workflows.

Tags: monitoring, operations, Prometheus, InfluxDB, Grafana, Zabbix, Nagios
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to grow with you throughout your operations career.
