Master Prometheus Monitoring for Big Data on Kubernetes: Design & Alerting
This article explains how to design and implement a Prometheus‑based monitoring system for big‑data components running on Kubernetes, covering metric exposure methods, scrape configurations, exporter deployment, and dynamic alert rule management with Alertmanager.
Design Overview
The monitoring system for big‑data platforms must reliably scrape exposed metrics, analyze them, and generate alerts. Key questions include what to monitor, how metrics are exposed, how Prometheus scrapes them, and how alert rules are dynamically configured.
Monitoring Targets
All big‑data components run as pods in a Kubernetes cluster.
Metric Exposure Methods
Directly expose Prometheus metrics (pull).
Push metrics to a pushgateway (push).
Use a custom exporter to convert other formats to Prometheus‑compatible metrics.
Some components, such as Flink on YARN, run inside YARN containers and therefore require the pushgateway approach; pushing metrics is also recommended for short-lived jobs, which may terminate before Prometheus can scrape them.
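As a sketch, Flink's Prometheus PushGateway reporter can be enabled in flink-conf.yaml roughly as follows; the host, port, and job name are placeholders, and the exact option keys vary by Flink version, so check the docs for your release:

```yaml
# flink-conf.yaml — Prometheus PushGateway reporter (values are illustrative)
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.monitoring.svc   # hypothetical service name
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-job
# avoid stale series when jobs restart under a new attempt
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true
```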
Scrape Configuration
Prometheus always pulls metrics from targets. Common scrape jobs include:
Native Job configuration.
PodMonitor (via Prometheus Operator) for pod-level metrics.
ServiceMonitor (via Prometheus Operator) for service-level metrics.
When running on Kubernetes, PodMonitor is usually the simplest choice.
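A minimal PodMonitor sketch might look like this; the names, labels, and port name are hypothetical and must match your pods and your Prometheus selectors:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: bigdata-pods            # hypothetical name
  namespace: monitoring
  labels:
    team: bigdata               # must be matched by podMonitorSelector
spec:
  namespaceSelector:
    matchNames:
    - bigdata                   # hypothetical namespace of the monitored pods
  selector:
    matchLabels:
      app: hdfs                 # hypothetical pod label
  podMetricsEndpoints:
  - port: metrics               # container port *name*, not a number
    path: /metrics
    interval: 30s
```

The annotation-based approach shown next is an alternative: pods declare their scrape settings via annotations, and a relabeling step in the scrape job reads them at discovery time.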
<code>annotations:
  prometheus.io/scrape: "true"
  prometheus.io/scheme: "http"
  prometheus.io/path: "/metrics"
  prometheus.io/port: "19091"
</code>
The main selectors in prometheus-prometheus.yaml are serviceMonitorSelector, podMonitorSelector, ruleSelector, and alertmanagers. A kubernetes_sd_config with relabeling can discover pods dynamically and rewrite labels before scraping.
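A sketch of such a scrape job, which keeps only pods annotated with prometheus.io/scrape: "true" and rewrites the metrics path and port from the annotations shown above:

```yaml
scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only pods that opt in via annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # override the metrics path from the annotation, if present
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # rewrite the target address to use the annotated port
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
```

Custom annotation schemes such as the bigData.metrics/* annotations below map the same way: Prometheus exposes each annotation as a __meta_kubernetes_pod_annotation_* label, with dots and slashes in the annotation name replaced by underscores.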
<code>labels:
  bigData.metrics.object: pod
annotations:
  bigData.metrics/scrape: "true"
  bigData.metrics/scheme: "https"
  bigData.metrics/path: "/jmx"
  bigData.metrics/port: "29871"
  bigData.metrics/role: "hdfs-nn,common"
</code>
Alert Design
Alert Flow
Service experiences an abnormal condition.
Prometheus generates an alert.
Alertmanager receives the alert.
Alertmanager processes the alert according to configured routing, grouping, and inhibition rules, then forwards it (e.g., via webhook, SMS, email).
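On the receiving end, a webhook gets a JSON document in Alertmanager's webhook format (version "4"). A minimal sketch of parsing it, using a hand-written sample payload whose labels mirror the rules defined later in this article (the instance and mount path are illustrative):

```python
import json

# Hand-written sample in Alertmanager's webhook payload format (version "4").
payload = json.loads("""
{
  "version": "4",
  "status": "firing",
  "receiver": "test.web.hook",
  "groupLabels": {"groupId": "node-disk-usage"},
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "node-disk-usage", "instance": "node1",
                 "groupId": "node-disk-usage"},
      "annotations": {"title": "Disk warning: node node1 /data usage 91%"},
      "startsAt": "2024-01-01T00:00:00Z"
    }
  ]
}
""")

def summarize(payload: dict) -> list:
    """Return one human-readable line per firing alert in the payload."""
    lines = []
    for alert in payload.get("alerts", []):
        if alert["status"] != "firing":
            continue  # skip resolved alerts
        labels = alert["labels"]
        lines.append(f'[{labels.get("groupId", "?")}] {labels.get("instance", "?")}: '
                     f'{alert["annotations"].get("title", "")}')
    return lines

for line in summarize(payload):
    print(line)
```

A real receiver would sit behind an HTTP endpoint (the url configured in webhook_configs) and fan the summarized lines out to SMS or email.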
Dynamic Alert Configuration
Alerting consists of two parts:
alertmanager: handling strategy (receivers, routing).
alertRule: concrete alert expressions.
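One way to implement the dynamic part is to keep rule expressions as templates with ${...} placeholders (the convention used by the PrometheusRule examples that follow) and render them per instance before applying the manifest. A sketch using Python's string.Template, with illustrative values:

```python
from string import Template

# Illustrative disk-usage expression template, following the ${...}
# placeholder convention of the PrometheusRule examples in this article.
RULE_EXPR = Template(
    '100*(1-node_filesystem_avail_bytes{mountpoint="${path}"}'
    '/node_filesystem_size_bytes{mountpoint="${path}"}) > ${thresholdValue}'
)

def render_expr(path: str, threshold: int) -> str:
    """Fill the placeholders for one concrete alert instance."""
    return RULE_EXPR.substitute(path=path, thresholdValue=threshold)

print(render_expr("/data", 80))
```

The same substitution step can render the whole PrometheusRule manifest, which is then applied with kubectl (or via the Kubernetes API) so that the Prometheus Operator picks it up through ruleSelector.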
Alertmanager Example
<code>global:
  resolve_timeout: 5m
receivers:
- name: 'default'
- name: 'test.web.hook'
  webhook_configs:
  - url: 'http://alert-url'
route:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  group_by: [groupId, instanceId]
  routes:
  - receiver: 'test.web.hook'
    continue: true
    match:
      groupId: node-disk-usage
  - receiver: 'test.web.hook'
    continue: true
    match:
      groupId: kafka-topic-highstore
</code>
AlertRule Example – Disk Usage
<code>apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disk-usage
  namespace: monitoring
spec:
  groups:
  - name: node-disk-usage
    rules:
    - alert: node-disk-usage
      expr: 100*(1-node_filesystem_avail_bytes{mountpoint="${path}"}/node_filesystem_size_bytes{mountpoint="${path}"}) > ${thresholdValue}
      for: 1m
      labels:
        groupId: node-disk-usage
        userIds: super
        receivers: SMS
      annotations:
        title: "Disk warning: node {{$labels.instance}} ${path} usage {{$value}}%"
        content: "Disk warning: node {{$labels.instance}} ${path} usage {{$value}}%"
</code>
AlertRule Example – Kafka Lag
<code>apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-topic-highstore-${uniqueName}
  namespace: monitoring
spec:
  groups:
  - name: kafka-topic-highstore
    rules:
    - alert: kafka-topic-highstore-${uniqueName}
      expr: sum(kafka_consumergroup_lag{exporterType="kafka",consumergroup="${consumergroup}"}) > ${thresholdValue}
      for: 1m
      labels:
        groupId: kafka-topic-highstore
        instanceId: ${uniqueName}
        userIds: super
        receivers: SMS
      annotations:
        title: "KAFKA warning: consumer group ${consumergroup} lag {{$value}}"
        content: "KAFKA warning: consumer group ${consumergroup} lag {{$value}}"
</code>
Alert Timing Example
Two nodes (node1, node2) are monitored for disk usage. Alerts are grouped by groupId, so repeated alerts follow the group_wait, group_interval, and repeat_interval semantics:
for: duration a metric must stay abnormal before the alert fires.
group_wait: initial wait after a new group is created before the first notification is sent.
group_interval: minimum interval between notifications when the group's composition changes.
repeat_interval: interval between identical notifications when the group does not change (including recovery notifications).
With the example configuration above, the first notification for a new group goes out about 30s after the alert fires; if node2 later joins the same group, an updated notification is sent after at most 5m; an unchanged firing group is re-notified every 2h.
Exporter Deployment
Exporters can run as sidecars (1:1 with the target pod) or as independent services (1:1 or 1:many). Sidecars bind the exporter lifecycle to the target, while independent deployments reduce coupling and are more flexible for multi‑node services such as Kafka.
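The sidecar layout can be sketched as a pod with two containers; the image names are hypothetical, and the port matches the bigData.metrics/port annotation shown earlier:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-nn
  labels:
    app: hdfs                             # hypothetical; what a PodMonitor would select on
spec:
  containers:
  - name: namenode                        # the monitored big-data component
    image: example/hdfs-namenode:latest   # hypothetical image
  - name: jmx-exporter                    # sidecar converting JMX to Prometheus format
    image: example/jmx-exporter:latest    # hypothetical image
    ports:
    - name: metrics
      containerPort: 29871                # matches the bigData.metrics/port annotation
```

Because both containers share the pod's network namespace, the exporter reaches the component over localhost, and its lifecycle ends with the pod's.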
Additional Tools
Use promtool to validate metric formats (e.g., ensure metric names and label names contain no dots). Port-forwarding can expose Prometheus, Grafana, and Alertmanager for external access:
<code># Prometheus UI
nohup kubectl port-forward --address 0.0.0.0 service/prometheus-k8s 19090:9090 -n monitoring &
# Grafana UI
nohup kubectl port-forward --address 0.0.0.0 service/grafana 13000:3000 -n monitoring &
# Alertmanager UI
nohup kubectl port-forward --address 0.0.0.0 service/alertmanager-main 9093:9093 -n monitoring &
</code>
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.