How to Enrich Nightingale Alerts with Pending Count, Recovery Value, and Direct Links
This guide shows how to customize Nightingale monitoring alerts so that each notification includes the number of pending alerts, the metric value at recovery, and a link to the active-alerts page. The approach combines a small Python notification script (or, alternatively, a webhook), template modifications, a database query, and Prometheus API calls.
Expected Goal
We want the alert notification to contain the following data:
Number of unprocessed alerts in the current system
The specific value when the alert recovers
A link to a page that shows unprocessed alerts
Implementation
The Nightingale database table alert_cur_event stores the currently pending alerts, and Nightingale provides a panel for querying them. The recovery value can be obtained via a custom recovery PromQL query.
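Before wiring this into the notification script, it helps to see the shape of a Prometheus instant-query response, since the recovery value is the second element of the `value` pair in a single-series result. A minimal sketch, using a canned response in the documented JSON shape of `GET /api/v1/query` so it runs offline (the metric labels are placeholders):

```python
import json

# Canned body in the shape returned by the Prometheus instant-query API
# (GET /api/v1/query); a real call would be:
#   requests.get("http://127.0.0.1:9090/api/v1/query", params={"query": promql})
raw = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "demo"}, "value": [1700000000, "0"]}
        ]
    }
})

def extract_value(body):
    """Return the value of a single-series instant-query result, else 0."""
    data = json.loads(body)
    result = data.get("data", {}).get("result", [])
    if len(result) == 1:
        return result[0]["value"][1]  # value is [timestamp, "value as string"]
    return 0

print(extract_value(raw))  # prints: 0
```

Note that Prometheus returns sample values as strings, so the recovery value is inserted into the message as-is.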
The simplest approach is to modify the notify.py notification script. The full script:
<code>#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import sys
import json
import requests
import pymysql


# Query a metric value via the Prometheus HTTP API
def getPrometheus(url, promql):
    response = requests.get(url, params={'query': promql})
    if response.status_code == 200:
        data = json.loads(response.text)
        result = data['data']['result']
        # Expect exactly one series for the recovery query
        if len(result) == 1:
            return result[0]['value'][1]
    return 0


def count_rows_and_get_rule_names():
    conn = None
    try:
        conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='n9e',
            passwd='1234',
            db='n9e_v6',
            charset='utf8mb4'
        )
        cursor = conn.cursor()
        # Count the total number of pending alerts
        count_query = "SELECT COUNT(*) FROM alert_cur_event"
        cursor.execute(count_query)
        total_rows = cursor.fetchone()[0]
        return total_rows
    except Exception as e:
        print("Error: ", e)
        return 0
    finally:
        if conn:
            conn.close()


class Sender(object):
    @classmethod
    def send_qywx(cls, payload):
        users = payload.get('event').get("notify_users_obj")
        is_recovered = payload.get('event').get("is_recovered")
        tokens = {}
        phones = {}
        res = {}
        history_row = count_rows_and_get_rule_names()
        if is_recovered:
            # Fetch the custom recovery PromQL from the alert's annotations
            promQL = payload.get('event').get("annotations").get("recovery_promql")
            url = "http://127.0.0.1:9090/api/v1/query"
            res = getPrometheus(url, promQL)
        # Panel that lists the active alerts
        currAlert = "http://127.0.0.1:17000/alert-cur-events"
        for u in users:
            if u.get("phone"):
                phones[u.get("phone")] = 1
            contacts = u.get("contacts")
            if contacts.get("qywx_robot_token", ""):
                tokens[contacts.get("qywx_robot_token", "")] = 1
        headers = {
            "Content-Type": "application/json;charset=utf-8",
            "Host": "qyapi.weixin.qq.com"
        }
        for t in tokens:
            url = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key={}".format(t)
            content = payload.get('tpls').get("qywx", "qywx not found")
            content = "# **Total alerts in the current environment**: %s" % (history_row) + "\n" + content
            if is_recovered:
                content = content + "\n" + "> **Value at recovery**: %s" % (res)
            if history_row > 0:
                content = content + "\n" + "[Current active alerts](%s)" % (currAlert)
            body = {
                "msgtype": "markdown",
                "markdown": {
                    "content": content
                }
            }
            requests.post(url, headers=headers, data=json.dumps(body))


def main():
    payload = json.load(sys.stdin)
    with open('.payload', 'w') as f:
        f.write(json.dumps(payload, indent=4))
    for ch in payload.get('event').get('notify_channels'):
        send_func_name = "send_{}".format(ch.strip())
        if not hasattr(Sender, send_func_name):
            print("function: {} not found".format(send_func_name))
            continue
        send_func = getattr(Sender, send_func_name)
        send_func(payload)


def hello():
    print("hello nightingale")


if __name__ == "__main__":
    if len(sys.argv) == 1:
        main()
    elif sys.argv[1] == "hello":
        hello()
    else:
        print("I am confused")
</code>Install the required Python packages pymysql and requests, then place the script under Nightingale → System Settings → Notification Settings → Notification Script and enable it.
Add a notification medium named qywx and a contact method named qywx_robot_token. When an alert is sent, the medium name determines which Sender method is called: a medium named zhangsan, for example, maps to send_zhangsan. The robot token is read from each user's contact information.
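The medium-to-method dispatch, and the payload fields the script reads from stdin, can be sketched as follows. This is a minimal payload reconstructed from the script above; the real JSON Nightingale pipes in carries many more keys, and the token here is a placeholder:

```python
# Minimal payload containing only the keys notify.py actually reads;
# the real payload written by Nightingale has many more fields.
payload = {
    "event": {
        "notify_channels": ["qywx"],
        "notify_users_obj": [
            {"phone": "13800000000",
             "contacts": {"qywx_robot_token": "your-robot-key"}}  # placeholder
        ],
        "is_recovered": True,
        "annotations": {"recovery_promql": "up == 1"},
    },
    "tpls": {"qywx": "> **Severity**: S2"},
}

# The dispatcher maps each channel name to a Sender method:
for ch in payload["event"]["notify_channels"]:
    print("send_{}".format(ch.strip()))  # prints: send_qywx
```

For a dry run, save a payload like this to a file and pipe it into the script: `python3 notify.py < payload.json`.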
Notification Template
Create a template named qywx based on the built-in one. Example template:
<code>> **Level Status**: {{if .IsRecovered}}<font color="info">Alert Recovered</font>{{else}}<font color="warning">Alert Triggered</font>{{end}}
> **Severity**: S{{.Severity}}
> **Alert Type**: {{.RuleName}}{{if .RuleNote}}
> **Alert Details**: {{.RuleNote}}{{end}}{{if .TargetIdent}}
> **Target**: {{.TargetIdent}}{{end}}
> **Metric**: {{.TagsJSON}}{{if not .IsRecovered}}
> **Trigger Value**: {{.TriggerValue}}{{end}}
{{if .IsRecovered}}> **Recovery Time**: {{timeformat .LastEvalTime}}{{else}}> **First Trigger Time**: {{timeformat .FirstTriggerTime}}{{end}}
> **Duration Since First Alert**: {{humanizeDurationInterface $time_duration}}
> **Send Time**: {{timestamp}}
</code>When configuring an alert, for example “Abnormal Pod status in K8s”, fill in the rule name, notes, and PromQL, then select the notification medium and additional information.
The additional information includes the recovery PromQL, which the Python script fetches via the Prometheus API and inserts into the template.
Alternative: Webhook
Instead of a Python script, you can use a custom webhook. Extract the alert_cur_event structure, add an API that queries Prometheus, and modify notify.go to include the fields RecoveryValue and TotalAlert. Example Go code for querying Prometheus:
<code>package prometheus

import (
	"context"
	"time"

	"devops-webhook-service/src/server/config"
	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
	"github.com/toolkits/pkg/logger"
)

func GetMetricsValue(promql string) string {
	client, err := api.NewClient(api.Config{Address: config.C.Prometheus.Address})
	if err != nil {
		logger.Error("init prometheus client failed. err: ", err)
		return ""
	}
	queryAPI := v1.NewAPI(client)
	result, warnings, err := queryAPI.Query(context.TODO(), promql, time.Now())
	if err != nil {
		logger.Error("query prometheus metrics failed. err: ", err)
		return ""
	}
	if len(warnings) > 0 {
		logger.Warning("query prometheus metrics warnings: ", warnings)
	}
	vector, ok := result.(model.Vector)
	if !ok || vector.Len() == 0 {
		return ""
	}
	return vector[0].Value.String()
}
</code>Update the notification template to automatically fill the fields, e.g.:
<code>## Current environment alert total: {{ .TotalAlert }}
---
**Level Status**: {{if .IsRecovered}}<font color="info">S{{.Severity}} Recovered</font>{{else}}<font color="warning">S{{.Severity}} Triggered</font>{{end}}
**Rule Title**: {{.RuleName}}
{{if .TargetIdent}}**Target**: {{.TargetIdent}}{{end}}
{{if .IsRecovered}}**Current Value**: {{.RecoveryValue}}{{end}}
**Metric**: {{.TagsJSON}}
{{if not .IsRecovered}}**Trigger Value**: {{.TriggerValue}}{{end}}
{{if .IsRecovered}}**Recovery Time**: {{timeformat .LastEvalTime}}{{else}}**First Trigger Time**: {{timeformat .TriggerTime}}{{end}}
**Send Time**: {{timestamp}}
---
</code>Finally, add the new fields to the NoticeEvent struct in notify.go and adjust GenNotice to query Prometheus and the database:
<code>type NoticeEvent struct {
	*models.AlertCurEvent
	RecoveryValue string // value at recovery
	TotalAlert    int    // total number of alerts
}
</code>These modifications let alerts display the pending count, the value at recovery, and a direct link to active alerts, improving operational awareness.
Conclusion
The implementation can be adapted to different team needs; using webhooks offers more flexibility for features like alert claiming, suppression, and forwarding.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.