Inside Prometheus Alerting Rules: How They’re Managed and Executed
This article explains Prometheus' rule system: the structure and components of alerting rules, how the rule manager loads and updates rule files, how groups are scheduled and evaluated, and the logic for generating, updating, and sending alerts. This background is useful for building advanced monitoring extensions.
What is a Rule
Prometheus supports user‑defined Rule configurations. Rules are of two types: Recording Rules, which pre‑compute complex PromQL queries for faster reuse, and Alerting Rules, which define conditions that trigger alerts when evaluated.
This article focuses on the analysis of alerting rules. An alerting rule lets you specify a PromQL expression as the trigger condition; Prometheus periodically evaluates the expression and sends a notification when the condition is met.
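Although this article focuses on alerting rules, a minimal recording rule is worth seeing once for contrast (the metric and group names here are illustrative, not from the article's example):

```yaml
groups:
- name: recording_example
  rules:
  # Pre-compute the per-job 5m request rate under a new metric name,
  # so dashboards can query it cheaply instead of re-running rate().
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (job)
```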
What is an Alerting Rule
Alerting is a core feature of Prometheus. Below is a typical alert rule definition:
<code>
groups:
- name: example
  rules:
  - alert: HighErrorRate
    # The metric must be > 0.5 for the last 10 minutes.
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
      description: description info
</code>
An alert rule file groups related rules under a group. Each rule consists of:
alert: the rule name.
expr: a PromQL expression that determines when the alert fires.
for: optional waiting period; the condition must hold for this duration before the alert is sent.
labels: custom labels attached to the alert.
annotations: additional information (e.g., description) sent to Alertmanager.
Rule Manager
The manager loads rule files, parses them into Group objects, and coordinates evaluation. A simplified manager struct:
<code>
type Manager struct {
	opts     *ManagerOptions   // external dependencies (storage, notify, etc.)
	groups   map[string]*Group // current rule groups
	mtx      sync.RWMutex      // protects groups
	block    chan struct{}
	done     chan struct{}
	restored bool
	logger   log.Logger
}
</code>
Key fields:
opts: holds references to storage, notification modules, etc.
groups: maps a group identifier to its Group instance.
mtx: read‑write lock for concurrent access.
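The group identifier used as the map key combines the rule file path and the group name. A minimal sketch of such a key function, modeled on the GroupKey helper used in the code below:

```go
package main

import "fmt"

// GroupKey builds the identifier used as the groups map key by joining
// the rule file path and the group name with a separator, so the same
// group name in two files yields two distinct keys.
func GroupKey(file, name string) string {
	return file + ";" + name
}

func main() {
	fmt.Println(GroupKey("rules.yml", "example")) // rules.yml;example
}
```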
Loading Rule Groups
When the Prometheus server starts, Manager.Update() is called to load and parse rule files. It:
Calls Manager.LoadGroups() to obtain a set of Group objects.
Stops old groups and starts new ones, launching a goroutine for each group to evaluate its PromQL queries.
<code>func (m *Manager) Update(interval time.Duration, files []string, externalLabels labels.Labels, externalURL string) error {
m.mtx.Lock()
defer m.mtx.Unlock()
groups, errs := m.LoadGroups(interval, externalLabels, externalURL, files...)
if errs != nil {
for _, e := range errs {
level.Error(m.logger).Log("msg", "loading groups failed", "err", e)
}
return errors.New("error loading rules, previous rule set restored")
}
m.restored = true
var wg sync.WaitGroup
for _, newg := range groups {
gn := GroupKey(newg.file, newg.name)
oldg, ok := m.groups[gn]
delete(m.groups, gn)
if ok && oldg.Equals(newg) {
groups[gn] = oldg
continue
}
wg.Add(1)
go func(newg *Group) {
if ok {
oldg.stop()
newg.CopyState(oldg)
}
wg.Done()
<-m.block
newg.run(m.opts.Context)
}(newg)
}
// stop remaining old groups
wg.Add(len(m.groups))
for n, oldg := range m.groups {
go func(n string, g *Group) {
g.markStale = true
g.stop()
if m := g.metrics; m != nil {
m.IterationsMissed.DeleteLabelValues(n)
m.IterationsScheduled.DeleteLabelValues(n)
m.EvalTotal.DeleteLabelValues(n)
m.EvalFailures.DeleteLabelValues(n)
m.GroupInterval.DeleteLabelValues(n)
m.GroupLastEvalTime.DeleteLabelValues(n)
m.GroupLastDuration.DeleteLabelValues(n)
m.GroupRules.DeleteLabelValues(n)
m.GroupSamples.DeleteLabelValues(n)
}
wg.Done()
}(n, oldg)
}
wg.Wait()
m.groups = groups
return nil
}
</code>Running a Rule Group
Each Group runs a loop with a ticker based on g.interval (default 1 minute, configurable via global.evaluation_interval). The loop calls g.Eval() to evaluate all rules in the group.
<code>
func (g *Group) run(ctx context.Context) {
	defer close(g.terminated)

	// Wait until the group's first aligned evaluation timestamp.
	evalTimestamp := g.EvalTimestamp(time.Now().UnixNano()).Add(g.interval)
	select {
	case <-time.After(time.Until(evalTimestamp)):
	case <-g.done:
		return
	}

	ctx = promql.NewOriginContext(ctx, map[string]interface{}{
		"ruleGroup": map[string]string{"file": g.File(), "name": g.Name()},
	})

	iter := func() {
		g.metrics.IterationsScheduled.WithLabelValues(GroupKey(g.file, g.name)).Inc()
		start := time.Now()
		g.Eval(ctx, evalTimestamp)
		g.metrics.IterationDuration.Observe(time.Since(start).Seconds())
		g.setEvaluationTime(time.Since(start))
		g.setLastEvaluation(start)
	}

	tick := time.NewTicker(g.interval)
	defer tick.Stop()

	// Initial evaluation.
	iter()

	for {
		select {
		case <-g.done:
			return
		case <-tick.C:
			// Handle missed intervals.
			missed := (time.Since(evalTimestamp) / g.interval) - 1
			if missed > 0 {
				g.metrics.IterationsMissed.WithLabelValues(GroupKey(g.file, g.name)).Add(float64(missed))
				g.metrics.IterationsScheduled.WithLabelValues(GroupKey(g.file, g.name)).Add(float64(missed))
			}
			evalTimestamp = evalTimestamp.Add((missed + 1) * g.interval)
			iter()
		}
	}
}
</code>
Evaluating Individual Rules
During Group.Eval(), each rule is evaluated via the provided QueryFunc. For AlertingRule instances, the resulting alerts are sent through the configured NotifyFunc. Recording rules store their results back into the TSDB.
<code>
func (g *Group) Eval(ctx context.Context, ts time.Time) {
	var samplesTotal float64
	for _, rule := range g.rules {
		select {
		case <-g.done:
			return
		default:
		}

		// Evaluate the rule's PromQL expression.
		vector, err := rule.Eval(ctx, ts, g.opts.QueryFunc, g.opts.ExternalURL)
		if err != nil {
			rule.SetHealth(HealthBad)
			rule.SetLastError(err)
			g.metrics.EvalFailures.WithLabelValues(GroupKey(g.File(), g.Name())).Inc()
			continue
		}
		samplesTotal += float64(len(vector))

		if ar, ok := rule.(*AlertingRule); ok {
			ar.sendAlerts(ctx, ts, g.opts.ResendDelay, g.interval, g.opts.NotifyFunc)
		}
		// Handling of RecordingRule results omitted for brevity.
	}
	if g.metrics != nil {
		g.metrics.GroupSamples.WithLabelValues(GroupKey(g.File(), g.Name())).Set(samplesTotal)
	}
	g.cleanupStaleSeries(ctx, ts)
}
</code>
AlertingRule Structure and Lifecycle
The AlertingRule struct holds the rule name, expression, hold duration, labels, annotations, and runtime state such as active alerts, evaluation timestamps, and health.
<code>
type AlertingRule struct {
	name                string
	vector              parser.Expr
	holdDuration        time.Duration
	labels              labels.Labels
	annotations         labels.Labels
	externalLabels      map[string]string
	restored            bool
	mtx                 sync.Mutex
	evaluationDuration  time.Duration
	evaluationTimestamp time.Time
	health              RuleHealth
	lastError           error
	active              map[uint64]*Alert // keyed by label-set hash
	logger              log.Logger
}
</code>
During evaluation, the rule hashes each result’s label set to determine whether an alert already exists. New alerts are added to active, existing alerts are updated, and alerts that disappear are either marked StateInactive or removed after a timeout.
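The dedup step relies on the label-set hash being stable regardless of label order. Prometheus uses its labels package's own hash; this illustrative stand-in uses FNV over sorted pairs to show the idea:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashLabels returns a stable hash for a label set, usable as the key
// into an active-alerts map. Illustrative stand-in for labels.Hash().
func hashLabels(ls map[string]string) uint64 {
	keys := make([]string, 0, len(ls))
	for k := range ls {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort so insertion order never changes the hash
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0xff}) // separator to avoid key/value ambiguity
		h.Write([]byte(ls[k]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	a := map[string]string{"alertname": "HighErrorRate", "job": "myjob"}
	b := map[string]string{"job": "myjob", "alertname": "HighErrorRate"}
	fmt.Println(hashLabels(a) == hashLabels(b)) // true: order-independent
}
```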
<code>
func (r *AlertingRule) Eval(ctx context.Context, ts time.Time, query QueryFunc, externalURL *url.URL) (promql.Vector, error) {
	res, err := query(ctx, r.vector.String(), ts)
	if err != nil {
		r.SetHealth(HealthBad)
		r.SetLastError(err)
		return nil, err
	}
	// Process the result vector, update the r.active map,
	// and manage state transitions (omitted for brevity).
	return res, nil
}
</code>
Sending Alerts
After evaluation, AlertingRule.sendAlerts iterates over active alerts and sends those that need to be notified based on their state, the configured ResendDelay, and the rule’s evaluation interval.
<code>
func (r *AlertingRule) sendAlerts(ctx context.Context, ts time.Time, resendDelay, interval time.Duration, notifyFunc NotifyFunc) {
	alerts := []*Alert{}
	r.ForEachActiveAlert(func(alert *Alert) {
		if alert.needsSending(ts, resendDelay) {
			alert.LastSentAt = ts
			// Keep the alert valid for several cycles so Alertmanager
			// does not resolve it between resends.
			delta := resendDelay
			if interval > resendDelay {
				delta = interval
			}
			alert.ValidUntil = ts.Add(4 * delta)
			anew := *alert
			alerts = append(alerts, &anew)
		}
	})
	notifyFunc(ctx, r.vector.String(), alerts...)
}

func (a *Alert) needsSending(ts time.Time, resendDelay time.Duration) bool {
	if a.State == StatePending {
		return false
	}
	// A resolved alert that has not been sent since resolving must go out.
	if a.ResolvedAt.After(a.LastSentAt) {
		return true
	}
	// Otherwise, resend only once resendDelay has elapsed.
	return a.LastSentAt.Add(resendDelay).Before(ts)
}
</code>
In summary, Prometheus evaluates alerting rules on a fixed interval, maintains active alert state, and dispatches notifications according to hold durations, resend delays, and state transitions. Understanding this flow enables developers to extend Prometheus, for example by loading rule groups dynamically from a database.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.