
Why Prometheus Alerts Fail: Delays, False Alarms, and How to Fix Them

This article examines common Prometheus alerting problems—missed alerts, unexpected alerts, and delayed notifications—explains the underlying configuration defaults, and offers practical guidance on tuning scrape intervals, evaluation periods, and Alertmanager group settings to resolve them.

Ops Development Stories

Today I’ll discuss alerting issues I encountered while using Prometheus.

Problem Analysis

Recently, while operating Prometheus, I found that sometimes alerts that should fire do not, sometimes alerts fire when they shouldn't, and sometimes alerts are noticeably delayed. To pinpoint the causes I consulted the documentation and the official site; I hope this summary helps others.

First, here are the key default configurations for Prometheus and Alertmanager, as documented on the official site:

<code># prometheus
global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]
  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]
  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

# alertmanager
# How long to initially wait to send a notification for a group
# of alerts. Allows time for an inhibiting alert to arrive or to collect
# more initial alerts for the same group. (Usually ~0s to a few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more.)
[ repeat_interval: <duration> | default = 4h ]
</code>

With these defaults in mind, we can walk through the entire alert flow and identify where problems arise.

According to the diagram and configuration, after Prometheus scrapes data and evaluates alert rules, a true expression moves the alert to pending; once the duration defined by "for" has elapsed it transitions to firing, and the alert is pushed to Alertmanager, which sends the notification after group_wait.
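As an illustration of the pending-to-firing transition, here is a minimal alerting rule sketch (the alert name, expression, and threshold are hypothetical, not from the original article):

<code>groups:
  - name: example
    rules:
      - alert: InstanceDown        # hypothetical alert name
        expr: up == 0              # evaluated every evaluation_interval
        for: 5m                    # stays "pending" for 5m before it becomes "firing"
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
</code>

With evaluation_interval at its 1m default, this rule must evaluate true on roughly five consecutive evaluations before Alertmanager hears about it at all, and group_wait then adds up to another 30s before the first notification goes out.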

Alert delay or frequent alerts

From the flow, if group_wait is set too large, the initial notification is delayed; if set too small, notifications go out too frequently. Tune it to match the specific scenario.
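For example, a possible Alertmanager route (the receiver names and matcher values here are hypothetical) that keeps the defaults for routine alerts but shortens the wait for latency-sensitive ones:

<code>route:
  receiver: default-receiver          # hypothetical receiver
  group_by: ['alertname', 'cluster']
  group_wait: 30s                     # larger => fewer, but slower, initial notifications
  group_interval: 5m                  # gap before notifying about new alerts in an existing group
  repeat_interval: 4h                 # gap before re-sending a still-firing alert
  routes:
    - matchers:
        - severity="critical"
      receiver: pager-receiver        # hypothetical receiver
      group_wait: 10s                 # critical alerts tolerate less delay
</code>

Child routes inherit the parent's timing values unless they override them, so only the settings that actually need to differ have to be specified.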

Unexpected alerts

Prometheus scrapes targets every scrape_interval. By the time the "for" period elapses, the target may already have returned to normal, yet the alert still fires because the rule evaluation saw the earlier abnormal samples. Grafana, using a range query, may show the data as normal, so the alert looks like a false alarm.
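One way to reconcile the dashboard with the alert is to have the Grafana panel show the worst value seen over the rule's window rather than the instantaneous value. A sketch (the metric names are assumed from node_exporter, and the 10% threshold is hypothetical):

<code># Instant query as an alert rule might evaluate it:
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1

# Panel query: min_over_time via a subquery keeps the worst ratio
# observed in the last 5m, so the graph reflects the samples that
# actually triggered the alert:
min_over_time((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)[5m:1m]) < 0.1
</code>

Alternatively, lengthening the rule's "for" duration reduces alerts on brief blips, at the cost of slower detection of genuine outages.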

Written by Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.