How to Enrich Nightingale Alerts with Pending Count, Recovery Value, and Direct Links
This guide shows how to customize Nightingale monitoring alerts so that each notification includes the number of pending alerts, the metric value at recovery, and a link to the active-alerts page. The approach combines a small Python notification script (or, alternatively, a webhook), template modifications, a database query, and Prometheus API calls.
Expected Goal
We want the alert notification to contain the following data:
Number of unprocessed alerts in the current system
The specific value when the alert recovers
A link to a page that shows unprocessed alerts
Implementation
The Nightingale database table alert_cur_event stores the currently pending alerts, and Nightingale provides a panel for querying them. The recovery value can be obtained via a custom recovery PromQL query.
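Before wiring this into the notification script, it helps to see the shape of a Prometheus instant-query response, since the recovery value is the second element of the `value` pair in a single-series result. A minimal sketch, using a canned response in the documented JSON shape of `GET /api/v1/query` so it runs offline (the metric labels are placeholders):

```python
import json

# Canned body in the shape returned by the Prometheus instant-query API
# (GET /api/v1/query); a real call would be:
#   requests.get("http://127.0.0.1:9090/api/v1/query", params={"query": promql})
raw = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "demo"}, "value": [1700000000, "0"]}
        ]
    }
})

def extract_value(body):
    """Return the value of a single-series instant-query result, else 0."""
    data = json.loads(body)
    result = data.get("data", {}).get("result", [])
    if len(result) == 1:
        return result[0]["value"][1]  # value is [timestamp, "value as string"]
    return 0

print(extract_value(raw))  # prints: 0
```

Note that Prometheus returns sample values as strings, so the recovery value is inserted into the message as-is.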
The simplest approach is to modify the notify.py notification script. The full script:
<code>#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import sys
import json
import requests
import pymysql


# Query a metric value via the Prometheus HTTP API
def getPrometheus(url, promql):
    response = requests.get(url, params={'query': promql})
    if response.status_code == 200:
        data = json.loads(response.text)
        result = data['data']['result']
        # Expect exactly one series for the recovery query
        if len(result) == 1:
            return result[0]['value'][1]
    return 0


def count_rows_and_get_rule_names():
    conn = None
    try:
        conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='n9e',
            passwd='1234',
            db='n9e_v6',
            charset='utf8mb4'
        )
        cursor = conn.cursor()
        # Count the total number of pending alerts
        count_query = "SELECT COUNT(*) FROM alert_cur_event"
        cursor.execute(count_query)
        total_rows = cursor.fetchone()[0]
        return total_rows
    except Exception as e:
        print("Error: ", e)
        return 0
    finally:
        if conn:
            conn.close()


class Sender(object):
    @classmethod
    def send_qywx(cls, payload):
        users = payload.get('event').get("notify_users_obj")
        is_recovered = payload.get('event').get("is_recovered")
        tokens = {}
        phones = {}
        res = {}
        history_row = count_rows_and_get_rule_names()
        if is_recovered:
            # Fetch the custom recovery PromQL from the alert's annotations
            promQL = payload.get('event').get("annotations").get("recovery_promql")
            url = "http://127.0.0.1:9090/api/v1/query"
            res = getPrometheus(url, promQL)
        # Panel that lists the active alerts
        currAlert = "http://127.0.0.1:17000/alert-cur-events"
        for u in users:
            if u.get("phone"):
                phones[u.get("phone")] = 1
            contacts = u.get("contacts")
            if contacts.get("qywx_robot_token", ""):
                tokens[contacts.get("qywx_robot_token", "")] = 1
        headers = {
            "Content-Type": "application/json;charset=utf-8",
            "Host": "qyapi.weixin.qq.com"
        }
        for t in tokens:
            url = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key={}".format(t)
            content = payload.get('tpls').get("qywx", "qywx not found")
            content = "# **Total alerts in the current environment**: %s" % (history_row) + "\n" + content
            if is_recovered:
                content = content + "\n" + "> **Value at recovery**: %s" % (res)
            if history_row > 0:
                content = content + "\n" + "[Current active alerts](%s)" % (currAlert)
            body = {
                "msgtype": "markdown",
                "markdown": {
                    "content": content
                }
            }
            requests.post(url, headers=headers, data=json.dumps(body))


def main():
    payload = json.load(sys.stdin)
    with open('.payload', 'w') as f:
        f.write(json.dumps(payload, indent=4))
    for ch in payload.get('event').get('notify_channels'):
        send_func_name = "send_{}".format(ch.strip())
        if not hasattr(Sender, send_func_name):
            print("function: {} not found".format(send_func_name))
            continue
        send_func = getattr(Sender, send_func_name)
        send_func(payload)


def hello():
    print("hello nightingale")


if __name__ == "__main__":
    if len(sys.argv) == 1:
        main()
    elif sys.argv[1] == "hello":
        hello()
    else:
        print("I am confused")
</code>Install the required Python packages pymysql and requests, then place the script under Nightingale → System Settings → Notification Settings → Notification Script and enable it.
Add a notification medium named qywx and a contact method named qywx_robot_token. When an alert is sent, the medium name determines which Sender method is called: a medium named zhangsan, for example, maps to send_zhangsan. The robot token is read from each user's contact information.
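The medium-to-method dispatch, and the payload fields the script reads from stdin, can be sketched as follows. This is a minimal payload reconstructed from the script above; the real JSON Nightingale pipes in carries many more keys, and the token here is a placeholder:

```python
# Minimal payload containing only the keys notify.py actually reads;
# the real payload written by Nightingale has many more fields.
payload = {
    "event": {
        "notify_channels": ["qywx"],
        "notify_users_obj": [
            {"phone": "13800000000",
             "contacts": {"qywx_robot_token": "your-robot-key"}}  # placeholder
        ],
        "is_recovered": True,
        "annotations": {"recovery_promql": "up == 1"},
    },
    "tpls": {"qywx": "> **Severity**: S2"},
}

# The dispatcher maps each channel name to a Sender method:
for ch in payload["event"]["notify_channels"]:
    print("send_{}".format(ch.strip()))  # prints: send_qywx
```

For a dry run, save a payload like this to a file and pipe it into the script: `python3 notify.py < payload.json`.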
Notification Template
Create a template named qywx based on the built-in one. Example template:
<code>> **Level Status**: {{if .IsRecovered}}<font color="info">Alert Recovered</font>{{else}}<font color="warning">Alert Triggered</font>{{end}}
> **Severity**: S{{.Severity}}
> **Alert Type**: {{.RuleName}}{{if .RuleNote}}
> **Alert Details**: {{.RuleNote}}{{end}}{{if .TargetIdent}}
> **Target**: {{.TargetIdent}}{{end}}
> **Metric**: {{.TagsJSON}}{{if not .IsRecovered}}
> **Trigger Value**: {{.TriggerValue}}{{end}}
{{if .IsRecovered}}> **Recovery Time**: {{timeformat .LastEvalTime}}{{else}}> **First Trigger Time**: {{timeformat .FirstTriggerTime}}{{end}}
> **Duration Since First Alert**: {{humanizeDurationInterface $time_duration}}
> **Send Time**: {{timestamp}}
</code>When configuring an alert, for example “Abnormal Pod status in K8s”, fill in the rule name, notes, and PromQL, then select the notification medium and additional information.
The additional information includes the recovery PromQL, which the Python script fetches via the Prometheus API and inserts into the template.
Alternative: Webhook
Instead of a Python script, you can use a custom webhook. Extract the alert_cur_event structure, add an API that queries Prometheus, and modify notify.go to include the fields RecoveryValue and TotalAlert. Example Go code for querying Prometheus:
<code>package prometheus

import (
	"context"
	"time"

	"devops-webhook-service/src/server/config"
	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
	"github.com/toolkits/pkg/logger"
)

func GetMetricsValue(promql string) string {
	client, err := api.NewClient(api.Config{Address: config.C.Prometheus.Address})
	if err != nil {
		logger.Error("init prometheus client failed. err: ", err)
		return ""
	}
	queryAPI := v1.NewAPI(client)
	result, warnings, err := queryAPI.Query(context.TODO(), promql, time.Now())
	if err != nil {
		logger.Error("query prometheus metrics failed. err: ", err)
		return ""
	}
	if len(warnings) > 0 {
		logger.Warning("query prometheus metrics warnings: ", warnings)
	}
	vector, ok := result.(model.Vector)
	if !ok || vector.Len() == 0 {
		return ""
	}
	return vector[0].Value.String()
}
</code>Update the notification template to automatically fill the fields, e.g.:
<code>## Current environment alert total: {{ .TotalAlert }}
---
**Level Status**: {{if .IsRecovered}}<font color="info">S{{.Severity}} Recovered</font>{{else}}<font color="warning">S{{.Severity}} Triggered</font>{{end}}
**Rule Title**: {{.RuleName}}
{{if .TargetIdent}}**Target**: {{.TargetIdent}}{{end}}
{{if .IsRecovered}}**Current Value**: {{.RecoveryValue}}{{end}}
**Metric**: {{.TagsJSON}}
{{if not .IsRecovered}}**Trigger Value**: {{.TriggerValue}}{{end}}
{{if .IsRecovered}}**Recovery Time**: {{timeformat .LastEvalTime}}{{else}}**First Trigger Time**: {{timeformat .TriggerTime}}{{end}}
**Send Time**: {{timestamp}}
---
</code>Finally, add the new fields to the NoticeEvent struct in notify.go and adjust GenNotice to query Prometheus and the database:
<code>type NoticeEvent struct {
	*models.AlertCurEvent
	RecoveryValue string // value at recovery
	TotalAlert    int    // total number of alerts
}
</code>These modifications let alerts display the pending count, the value at recovery, and a direct link to active alerts, improving operational awareness.
Conclusion
The implementation can be adapted to different team needs; using webhooks offers more flexibility for features like alert claiming, suppression, and forwarding.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.