
Master Prometheus Monitoring for Big Data on Kubernetes: Design & Alerting

This article explains how to design and implement a Prometheus‑based monitoring system for big‑data components running on Kubernetes, covering metric exposure methods, scrape configurations, exporter deployment, and dynamic alert rule management with Alertmanager.


Design Overview

The monitoring system for big‑data platforms must reliably scrape exposed metrics, analyze them, and generate alerts. Key questions include what to monitor, how metrics are exposed, how Prometheus scrapes them, and how alert rules are dynamically configured.

Monitoring Targets

All big‑data components run as pods in a Kubernetes cluster.

Metric Exposure Methods

Directly expose Prometheus metrics (pull).

Push metrics to a pushgateway (push).

Use a custom exporter to convert other formats to Prometheus‑compatible metrics.

Some components, such as Flink on YARN, run inside YARN containers rather than Kubernetes pods and therefore require the pushgateway approach; pushing is also recommended for short‑lived jobs whose endpoints may disappear before the next scrape.
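As a concrete sketch of the pushgateway approach, Flink ships a PrometheusPushGatewayReporter that can be enabled in flink-conf.yaml; the host, port, and job name below are illustrative, and the exact keys vary by Flink version:

```yaml
# flink-conf.yaml fragment (sketch): push Flink metrics to a pushgateway.
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.monitoring      # illustrative service name
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-on-yarn            # illustrative
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true            # clean up short-lived jobs
```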

Scrape Configuration

Prometheus always pulls metrics from targets. Common scrape jobs include:

Native Job configuration.

PodMonitor (via Prometheus Operator) for pod‑level metrics.

ServiceMonitor (via Prometheus Operator) for service‑level metrics.

When running on Kubernetes, PodMonitor is usually the simplest choice. With the native job approach, pods typically advertise their scrape settings through annotations:

<code>annotations:
  prometheus.io/scrape: "true"
  prometheus.io/scheme: "http"
  prometheus.io/path: "/metrics"
  prometheus.io/port: "19091"
</code>
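For the PodMonitor approach, a minimal sketch of the CRD follows; the selector label and port name are illustrative assumptions:

```yaml
# Sketch of a PodMonitor (Prometheus Operator CRD) selecting big-data pods.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: bigdata-pods            # illustrative
  namespace: monitoring
spec:
  selector:
    matchLabels:
      bigData.metrics.object: pod
  namespaceSelector:
    any: true                   # scrape matching pods in all namespaces
  podMetricsEndpoints:
    - port: metrics             # must be a *named* container port
      path: /metrics
```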

The main selectors in prometheus-prometheus.yaml are serviceMonitorSelector, podMonitorSelector, ruleSelector, and alertmanagers. A kubernetes_sd_config with relabeling can discover pods dynamically and rewrite labels before scraping, for example based on custom labels and annotations such as:

<code>labels:
  bigData.metrics.object: pod
annotations:
  bigData.metrics/scrape: "true"
  bigData.metrics/scheme: "https"
  bigData.metrics/path: "/jmx"
  bigData.metrics/port: "29871"
  bigData.metrics/role: "hdfs-nn,common"
</code>
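A scrape job using kubernetes_sd_config relabeling against these custom annotations might look like the following sketch. Note that Prometheus sanitizes dots and slashes in annotation names to underscores; the job name is illustrative:

```yaml
# Sketch: discover and scrape pods carrying the bigData.metrics annotations.
- job_name: bigdata-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Only keep pods that opt in to scraping.
    - source_labels: [__meta_kubernetes_pod_annotation_bigData_metrics_scrape]
      action: keep
      regex: "true"
    # Honor the scheme and path annotations.
    - source_labels: [__meta_kubernetes_pod_annotation_bigData_metrics_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_pod_annotation_bigData_metrics_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Rewrite the scrape address to the annotated port.
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_bigData_metrics_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
```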

Alert Design

Alert Flow

Service experiences an abnormal condition.

Prometheus generates an alert.

Alertmanager receives the alert.

Alertmanager processes the alert according to configured routing, grouping, and inhibition rules, then forwards it (e.g., via webhook, SMS, email).
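For the webhook case, Alertmanager POSTs a JSON payload to the configured URL; a trimmed sketch with illustrative values:

```json
{
  "version": "4",
  "status": "firing",
  "receiver": "test.web.hook",
  "groupLabels": { "groupId": "node-disk-usage" },
  "commonLabels": { "groupId": "node-disk-usage", "userIds": "super" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "node-disk-usage", "instance": "node1" },
      "annotations": { "title": "Disk warning: node node1 /data usage 85%" },
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
```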

Dynamic Alert Configuration

Alerting consists of two parts:

alertmanager

: handling strategy (receivers, routing).

alertRule

: concrete alert expressions.

Alertmanager Example

<code>global:
  resolve_timeout: 5m
receivers:
  - name: 'default'
  - name: 'test.web.hook'
    webhook_configs:
      - url: 'http://alert-url'
route:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  group_by: [groupId,instanceId]
  routes:
    - receiver: 'test.web.hook'
      continue: true
      match:
        groupId: node-disk-usage
    - receiver: 'test.web.hook'
      continue: true
      match:
        groupId: kafka-topic-highstore
</code>

AlertRule Example – Disk Usage

<code>apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disk-usage
  namespace: monitoring
spec:
  groups:
    - name: node-disk-usage
      rules:
        - alert: node-disk-usage
          expr: 100*(1-node_filesystem_avail_bytes{mountpoint="${path}"}/node_filesystem_size_bytes{mountpoint="${path}"}) > ${thresholdValue}
          for: 1m
          labels:
            groupId: node-disk-usage
            userIds: super
            receivers: SMS
          annotations:
            title: "Disk warning: node {{$labels.instance}} ${path} usage {{$value}}%"
            content: "Disk warning: node {{$labels.instance}} ${path} usage {{$value}}%"
</code>

AlertRule Example – Kafka Lag

<code>apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-topic-highstore-${uniqueName}
  namespace: monitoring
spec:
  groups:
    - name: kafka-topic-highstore
      rules:
        - alert: kafka-topic-highstore-${uniqueName}
          expr: sum(kafka_consumergroup_lag{exporterType="kafka",consumergroup="${consumergroup}"}) > ${thresholdValue}
          for: 1m
          labels:
            groupId: kafka-topic-highstore
            instanceId: ${uniqueName}
            userIds: super
            receivers: SMS
          annotations:
            title: "KAFKA warning: consumer group ${consumergroup} lag {{$value}}"
            content: "KAFKA warning: consumer group ${consumergroup} lag {{$value}}"
</code>
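The ${...} placeholders in these templates can be rendered before the rule is applied; a minimal sketch using sed, where the values are illustrative and, in practice, the fully rendered manifest would be piped to `kubectl apply -f -`:

```shell
# Sketch: render ${...} placeholders in a rule template with sed.
# Shown on one line of the disk-usage expression for illustration.
template='100*(1-node_filesystem_avail_bytes{mountpoint="${path}"}/node_filesystem_size_bytes{mountpoint="${path}"}) > ${thresholdValue}'
printf '%s\n' "$template" \
  | sed -e 's|\${path}|/data|g' -e 's|\${thresholdValue}|80|g'
```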

Alert Timing Example

Two nodes (node1, node2) are monitored for disk usage. Alerts are grouped by groupId, so repeated alerts follow the group_wait, group_interval, and repeat_interval semantics:

for: duration a metric must be abnormal before the alert fires.

group_wait: initial wait after a new group is created.

group_interval: interval between notifications when the group composition changes.

repeat_interval: interval between identical notifications when the group does not change (including recovery alerts).

Exporter Deployment

Exporters can run as sidecars (1:1 with the target pod) or as independent services (1:1 or 1:many). Sidecars bind the exporter lifecycle to the target, while independent deployments reduce coupling and are more flexible for multi‑node services such as Kafka.
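A sidecar sketch, assuming a JMX exporter running alongside an HDFS NameNode; the image names and port are illustrative assumptions:

```yaml
# Sketch: exporter as a sidecar container in the same pod as the target.
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-nn
  labels:
    bigData.metrics.object: pod
spec:
  containers:
    - name: hdfs-namenode
      image: example/hdfs-namenode:latest   # illustrative image
    - name: jmx-exporter                    # sidecar exposing /jmx metrics
      image: example/jmx-exporter:latest    # illustrative image
      ports:
        - name: metrics
          containerPort: 29871
```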

Additional Tools

Use

promtool

to validate metric formats (e.g., ensure metric names and label names contain no dots). Port‑forwarding can expose Prometheus, Grafana, and Alertmanager for external access:

<code># Prometheus UI
nohup kubectl port-forward --address 0.0.0.0 service/prometheus-k8s 19090:9090 -n monitoring &
# Grafana UI
nohup kubectl port-forward --address 0.0.0.0 service/grafana 13000:3000 -n monitoring &
# Alertmanager UI
nohup kubectl port-forward --address 0.0.0.0 service/alertmanager-main 9093:9093 -n monitoring &
</code>
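promtool (e.g. `promtool check rules`) validates rule files; as a quick shell approximation of the metric-name rule mentioned above (names must match `[a-zA-Z_:][a-zA-Z0-9_:]*`, so dotted JMX-style names are invalid):

```shell
# Check a metric name against Prometheus naming rules; dots are invalid.
name='node_filesystem.avail'
echo "$name" | grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*$' && echo valid || echo invalid
# prints "invalid"
```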
Tags: Kubernetes, Prometheus, Alertmanager, Exporter, Alert Rules, Big Data Monitoring, PodMonitor
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
