Inside Prometheus Alerting Rules: How They’re Managed and Executed
This article explains Prometheus' rule system: the structure and components of alerting rules, how the rule manager loads and updates rule files, how groups are scheduled and evaluated, and the logic for generating, updating, and sending alerts. This background is useful for building advanced monitoring extensions.
What is a Rule
Prometheus supports user‑defined Rule configurations. Rules are of two types: Recording Rules, which pre‑compute complex PromQL queries for faster reuse, and Alerting Rules, which define conditions that trigger alerts when evaluated.
This article focuses on the analysis of alerting rules. An alerting rule lets you specify a PromQL expression as the trigger condition; Prometheus periodically evaluates the expression and sends a notification when the condition is met.
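Although this article focuses on alerting rules, a minimal recording rule is worth seeing once for contrast (the metric and group names here are illustrative, not from the article's example):

```yaml
groups:
- name: recording_example
  rules:
  # Pre-compute the per-job 5m request rate under a new metric name,
  # so dashboards can query it cheaply instead of re-running rate().
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (job)
```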
What is an Alerting Rule
Alerting is a core feature of Prometheus. Below is a typical alert rule definition:
<code>
groups:
- name: example
  rules:
  - alert: HighErrorRate
    # The metric must be > 0.5 for the last 10 minutes.
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
      description: description info
</code>
An alert rule file groups related rules under a group. Each rule consists of:
alert: the rule name.
expr: a PromQL expression that determines when the alert fires.
for: optional waiting period; the condition must hold for this duration before the alert is sent.
labels: custom labels attached to the alert.
annotations: additional information (e.g., description) sent to Alertmanager.
Rule Manager
The manager loads rule files, parses them into Group objects, and coordinates evaluation. A simplified manager struct:
<code>
type Manager struct {
	opts     *ManagerOptions   // external dependencies (storage, notify, etc.)
	groups   map[string]*Group // current rule groups
	mtx      sync.RWMutex      // protects groups
	block    chan struct{}
	done     chan struct{}
	restored bool
	logger   log.Logger
}
</code>
Key fields:
opts: holds references to storage, notification modules, etc.
groups: maps a group identifier to its Group instance.
mtx: read‑write lock for concurrent access.
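The group identifier used as the map key combines the rule file path and the group name. A minimal sketch of such a key function, modeled on the GroupKey helper used in the code below:

```go
package main

import "fmt"

// GroupKey builds the identifier used as the groups map key by joining
// the rule file path and the group name with a separator, so the same
// group name in two files yields two distinct keys.
func GroupKey(file, name string) string {
	return file + ";" + name
}

func main() {
	fmt.Println(GroupKey("rules.yml", "example")) // rules.yml;example
}
```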
Loading Rule Groups
When the Prometheus server starts, Manager.Update() is called to load and parse rule files. It:
Calls Manager.LoadGroups() to obtain a set of Group objects.
Stops old groups and starts new ones, launching a goroutine for each group to evaluate its PromQL queries.
<code>func (m *Manager) Update(interval time.Duration, files []string, externalLabels labels.Labels, externalURL string) error {
m.mtx.Lock()
defer m.mtx.Unlock()
groups, errs := m.LoadGroups(interval, externalLabels, externalURL, files...)
if errs != nil {
for _, e := range errs {
level.Error(m.logger).Log("msg", "loading groups failed", "err", e)
}
return errors.New("error loading rules, previous rule set restored")
}
m.restored = true
var wg sync.WaitGroup
for _, newg := range groups {
gn := GroupKey(newg.file, newg.name)
oldg, ok := m.groups[gn]
delete(m.groups, gn)
if ok && oldg.Equals(newg) {
groups[gn] = oldg
continue
}
wg.Add(1)
go func(newg *Group) {
if ok {
oldg.stop()
newg.CopyState(oldg)
}
wg.Done()
<-m.block
newg.run(m.opts.Context)
}(newg)
}
// stop remaining old groups
wg.Add(len(m.groups))
for n, oldg := range m.groups {
go func(n string, g *Group) {
g.markStale = true
g.stop()
if m := g.metrics; m != nil {
m.IterationsMissed.DeleteLabelValues(n)
m.IterationsScheduled.DeleteLabelValues(n)
m.EvalTotal.DeleteLabelValues(n)
m.EvalFailures.DeleteLabelValues(n)
m.GroupInterval.DeleteLabelValues(n)
m.GroupLastEvalTime.DeleteLabelValues(n)
m.GroupLastDuration.DeleteLabelValues(n)
m.GroupRules.DeleteLabelValues(n)
m.GroupSamples.DeleteLabelValues(n)
}
wg.Done()
}(n, oldg)
}
wg.Wait()
m.groups = groups
return nil
}
</code>Running a Rule Group
Each Group runs a loop with a ticker based on g.interval (default 1 minute, configurable via global.evaluation_interval). The loop calls g.Eval() to evaluate all rules in the group.
<code>
func (g *Group) run(ctx context.Context) {
	defer close(g.terminated)

	// Wait until the group's first aligned evaluation timestamp.
	evalTimestamp := g.EvalTimestamp(time.Now().UnixNano()).Add(g.interval)
	select {
	case <-time.After(time.Until(evalTimestamp)):
	case <-g.done:
		return
	}

	ctx = promql.NewOriginContext(ctx, map[string]interface{}{
		"ruleGroup": map[string]string{"file": g.File(), "name": g.Name()},
	})

	iter := func() {
		g.metrics.IterationsScheduled.WithLabelValues(GroupKey(g.file, g.name)).Inc()
		start := time.Now()
		g.Eval(ctx, evalTimestamp)
		g.metrics.IterationDuration.Observe(time.Since(start).Seconds())
		g.setEvaluationTime(time.Since(start))
		g.setLastEvaluation(start)
	}

	tick := time.NewTicker(g.interval)
	defer tick.Stop()

	// Initial evaluation.
	iter()

	for {
		select {
		case <-g.done:
			return
		case <-tick.C:
			// Handle missed intervals.
			missed := (time.Since(evalTimestamp) / g.interval) - 1
			if missed > 0 {
				g.metrics.IterationsMissed.WithLabelValues(GroupKey(g.file, g.name)).Add(float64(missed))
				g.metrics.IterationsScheduled.WithLabelValues(GroupKey(g.file, g.name)).Add(float64(missed))
			}
			evalTimestamp = evalTimestamp.Add((missed + 1) * g.interval)
			iter()
		}
	}
}
</code>
Evaluating Individual Rules
During Group.Eval(), each rule is evaluated via the provided QueryFunc. For AlertingRule instances, the resulting alerts are sent through the configured NotifyFunc. Recording rules store their results back into the TSDB.
<code>
func (g *Group) Eval(ctx context.Context, ts time.Time) {
	var samplesTotal float64
	for _, rule := range g.rules {
		select {
		case <-g.done:
			return
		default:
		}

		// Evaluate the rule's PromQL expression.
		vector, err := rule.Eval(ctx, ts, g.opts.QueryFunc, g.opts.ExternalURL)
		if err != nil {
			rule.SetHealth(HealthBad)
			rule.SetLastError(err)
			g.metrics.EvalFailures.WithLabelValues(GroupKey(g.File(), g.Name())).Inc()
			continue
		}
		samplesTotal += float64(len(vector))

		if ar, ok := rule.(*AlertingRule); ok {
			ar.sendAlerts(ctx, ts, g.opts.ResendDelay, g.interval, g.opts.NotifyFunc)
		}
		// Handling of RecordingRule results omitted for brevity.
	}
	if g.metrics != nil {
		g.metrics.GroupSamples.WithLabelValues(GroupKey(g.File(), g.Name())).Set(samplesTotal)
	}
	g.cleanupStaleSeries(ctx, ts)
}
</code>
AlertingRule Structure and Lifecycle
The AlertingRule struct holds the rule name, expression, hold duration, labels, annotations, and runtime state such as active alerts, evaluation timestamps, and health.
<code>
type AlertingRule struct {
	name                string
	vector              parser.Expr
	holdDuration        time.Duration
	labels              labels.Labels
	annotations         labels.Labels
	externalLabels      map[string]string
	restored            bool
	mtx                 sync.Mutex
	evaluationDuration  time.Duration
	evaluationTimestamp time.Time
	health              RuleHealth
	lastError           error
	active              map[uint64]*Alert // keyed by label-set hash
	logger              log.Logger
}
</code>
During evaluation, the rule hashes each result’s label set to determine whether an alert already exists. New alerts are added to active, existing alerts are updated, and alerts that disappear are either marked StateInactive or removed after a timeout.
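The dedup step relies on the label-set hash being stable regardless of label order. Prometheus uses its labels package's own hash; this illustrative stand-in uses FNV over sorted pairs to show the idea:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashLabels returns a stable hash for a label set, usable as the key
// into an active-alerts map. Illustrative stand-in for labels.Hash().
func hashLabels(ls map[string]string) uint64 {
	keys := make([]string, 0, len(ls))
	for k := range ls {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort so insertion order never changes the hash
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0xff}) // separator to avoid key/value ambiguity
		h.Write([]byte(ls[k]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	a := map[string]string{"alertname": "HighErrorRate", "job": "myjob"}
	b := map[string]string{"job": "myjob", "alertname": "HighErrorRate"}
	fmt.Println(hashLabels(a) == hashLabels(b)) // true: order-independent
}
```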
<code>
func (r *AlertingRule) Eval(ctx context.Context, ts time.Time, query QueryFunc, externalURL *url.URL) (promql.Vector, error) {
	res, err := query(ctx, r.vector.String(), ts)
	if err != nil {
		r.SetHealth(HealthBad)
		r.SetLastError(err)
		return nil, err
	}
	// Process the result vector, update the r.active map,
	// and manage state transitions (omitted for brevity).
	return res, nil
}
</code>
Sending Alerts
After evaluation, AlertingRule.sendAlerts iterates over active alerts and sends those that need to be notified based on their state, the configured ResendDelay, and the rule’s evaluation interval.
<code>
func (r *AlertingRule) sendAlerts(ctx context.Context, ts time.Time, resendDelay, interval time.Duration, notifyFunc NotifyFunc) {
	alerts := []*Alert{}
	r.ForEachActiveAlert(func(alert *Alert) {
		if alert.needsSending(ts, resendDelay) {
			alert.LastSentAt = ts
			// Keep the alert valid for several cycles so Alertmanager
			// does not resolve it between resends.
			delta := resendDelay
			if interval > resendDelay {
				delta = interval
			}
			alert.ValidUntil = ts.Add(4 * delta)
			anew := *alert
			alerts = append(alerts, &anew)
		}
	})
	notifyFunc(ctx, r.vector.String(), alerts...)
}

func (a *Alert) needsSending(ts time.Time, resendDelay time.Duration) bool {
	if a.State == StatePending {
		return false
	}
	// A resolved alert that has not been sent since resolving must go out.
	if a.ResolvedAt.After(a.LastSentAt) {
		return true
	}
	// Otherwise, resend only once resendDelay has elapsed.
	return a.LastSentAt.Add(resendDelay).Before(ts)
}
</code>
In summary, Prometheus evaluates alerting rules on a fixed interval, maintains active alert state, and dispatches notifications according to hold durations, resend delays, and state transitions. Understanding this flow enables developers to extend Prometheus, for example by loading rule groups dynamically from a database.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.