
Alertmanager Alert System Refactoring: Issues, Solutions, and Implementation Details

This article analyzes common problems in a Prometheus‑Alertmanager monitoring setup—such as alert noise, lack of escalation, suppression and silence management—and presents a comprehensive refactor that introduces per‑cluster Alertmanager instances, custom escalation logic, suppression tables, and Python scripts to handle alert routing, silencing, and recovery.

Zhuanzhuan Tech

Introduction – Alerts are closely tied to operations; a well‑designed alert system boosts efficiency and staff comfort, while a poor one creates noise like irrelevant night‑time alerts, repeated alerts, and overload.

Preparation – The environment uses Prometheus + Alertmanager (version 0.17.0). The article shares experiences with Alertmanager and outlines a recent alert‑system redesign project.

Identified problems:

Alert interference: a single Alertmanager serves multiple clusters, causing cross‑cluster alerts (e.g., an alert labeled cluster: clusterA, instance: clusterB Node, alert_name: xxx).

No alert escalation: alerts are not promoted based on receiver, time, or medium.

No recovery notification: Alertmanager does not send a recovery message.

Limited suppression: custom suppression intervals exist but lack intelligence and time‑of‑day awareness.

Silence management is cumbersome; the UI often fails to pre‑fill silence fields.

Alertmanager lacks voice alerts.

New problems after splitting Alertmanager – Deploying one Alertmanager per cluster solves interference but raises convergence and silence‑management challenges, such as how to achieve alert aggregation and manage silences across many instances.

Refactor – Alert Interference – Deploy a dedicated Alertmanager per cluster, treat each as a database instance, and manage them centrally. This reduces cross‑cluster noise and isolates failures.
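One way to manage the many per‑cluster instances centrally is a simple registry mapping each cluster to its Alertmanager endpoint, which the management scripts iterate over instead of hard‑coding a single URL. This is a minimal sketch; the cluster names and URLs are hypothetical.

```python
# Hypothetical registry: one dedicated Alertmanager endpoint per cluster.
ALERTMANAGERS = {
    "clusterA": "http://alertmanager-a:9093",
    "clusterB": "http://alertmanager-b:9093",
}

def endpoint_for(cluster):
    """Return the Alertmanager base URL for a cluster, or None if unregistered."""
    return ALERTMANAGERS.get(cluster)
```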

Alert Escalation – Define escalation dimensions:

Medium escalation: Email → Enterprise WeChat → SMS → Phone (after three repeats).

Receiver escalation: Primary → Secondary → Leader.

Time‑based escalation: Work hours use email/WeChat, off‑hours use SMS/Phone.

Because Alertmanager cannot natively handle this, a Python script reads active alerts and sends notifications accordingly.
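A minimal sketch of reading active alerts: the v1 API returns a JSON body whose data field is a list of alerts, each carrying a labels map, and the helper below extracts the labels the escalation logic keys on. The label names (cluster, alertname, instance) follow common Prometheus conventions and are assumptions about this setup.

```python
def extract_alert_keys(payload):
    """Pull (cluster, alertname, instance) tuples out of an
    Alertmanager /api/v1/alerts response body."""
    keys = []
    for alert in payload.get("data", []):
        labels = alert.get("labels", {})
        keys.append((labels.get("cluster"),
                     labels.get("alertname"),
                     labels.get("instance")))
    return keys

# Sample response body in the v1 API shape (values are illustrative).
sample = {"status": "success",
          "data": [{"labels": {"cluster": "clusterA",
                               "alertname": "NodeDown",
                               "instance": "10.0.0.1:9100"}}]}
```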

send_mail(dba, detailed_msg)                  # every alert is always notified by email

if now_time > 8 and now_time < 22:
    send_wx(dba, simple_msg)                  # work hours: simple message via Enterprise WeChat
else:                                         # off hours: escalate the alert medium
    if alert_count > 3 and phone_count < 3:
        send_phone(dba, simple_msg)           # SMS alerting escalates to a phone call
    elif alert_count > 3 and phone_count >= 3:
        send_phone(leader, simple_msg)        # receiver escalation: notify the leader
    else:
        send_sms(dba, simple_msg)             # medium escalation: SMS
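The branching above can be packaged as a pure function that returns the receiver and medium, which makes the escalation rules unit‑testable in isolation. The thresholds mirror the pseudocode; the function and return values are illustrative names, not part of the original script.

```python
def escalate(now_hour, alert_count, phone_count):
    """Decide (receiver, medium) for one alert occurrence.
    The unconditional detailed email is not modeled here."""
    if 8 < now_hour < 22:                       # work hours: Enterprise WeChat
        return ("dba", "wx")
    if alert_count > 3 and phone_count < 3:     # SMS escalates to a phone call
        return ("dba", "phone")
    if alert_count > 3 and phone_count >= 3:    # receiver escalation to the leader
        return ("leader", "phone")
    return ("dba", "sms")                       # default off-hours medium
```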

Alert Convergence / Suppression – Create a MySQL table tb_alert_for_task to store alert key, state, count, next send time, and remarks. Example schema:

CREATE TABLE `tb_alert_for_task` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `alert_task` varchar(100) DEFAULT '' COMMENT 'alert item',
  `alert_state` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'alert state: 0 = recovered, 1 = firing',
  `alert_count` int(11) NOT NULL DEFAULT '0' COMMENT 'number of notifications sent; a single alert item sends at most 10 per day',
  `u_time` datetime NOT NULL DEFAULT '2021-12-08 00:00:00' COMMENT 'time the next notification may be sent',
  `alert_remarks` varchar(50) NOT NULL DEFAULT '' COMMENT 'alert content',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_alert_task` (`alert_task`)
) ENGINE=InnoDB AUTO_INCREMENT=7049 DEFAULT CHARSET=utf8mb4;

Suppression logic selects rows where alert_state = 1 and either u_time > now() or alert_count > 10; alerts matching those rows are skipped.

# Suppression logic: read out all currently firing alert items; if a current
# Alertmanager alert is already among them, treat it as a suppression target,
# because it does not yet meet the conditions for being sent again.

select_sql = "select alert_task from tb_tidb_alert_for_task where alert_state = 1 and (u_time > now() or alert_count > 10);"
state, skip_instance = connect_mysql(opt = "select", sql = {"sql" : select_sql})
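The same rule can be checked in Python against rows already loaded from the table: a firing item is suppressed while its next‑send time is still in the future or its daily send budget is exhausted. The row shape and the SEND_LIMIT constant are assumptions consistent with the schema comments above.

```python
from datetime import datetime

SEND_LIMIT = 10  # assumed daily cap per alert item, matching the schema comment

def is_suppressed(row, now):
    """row: dict with alert_state (int), u_time (datetime), alert_count (int).
    Firing rows are suppressed while u_time is still in the future or the
    daily send budget is exhausted."""
    if row["alert_state"] != 1:
        return False                  # recovered items are never suppression targets
    return row["u_time"] > now or row["alert_count"] > SEND_LIMIT
```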

Convergence aggregates alerts by three dimensions—cluster, alert name, and instance—stored in dictionaries global_alert_cluster , global_alert_name , and global_alert_host . The shortest dictionary determines the convergence key.

if len(global_alert_cluster.keys()) < len(global_alert_host.keys()) and len(global_alert_cluster.keys()) < len(global_alert_name.keys()):
    alert = global_alert_cluster
    info_tmp = "Alert cluster : "
elif len(global_alert_name.keys()) < len(global_alert_host.keys()):
    alert = global_alert_name
    info_tmp = "Alert name : "
else:
    alert = global_alert_host
    info_tmp = "Alert host : "
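The grouping and key selection can be sketched as one helper: bucket the firing alerts by each dimension, then converge on whichever dimension has the fewest distinct keys, since fewer groups means the broadest summary message. Label names are assumptions as before, and ties here simply fall to the first dimension, a slight simplification of the strict comparisons above.

```python
from collections import defaultdict

def converge(alerts):
    """alerts: list of dicts with 'cluster', 'alertname', 'instance' keys.
    Returns (dimension_name, grouping_dict) for the smallest dimension."""
    dims = {"cluster": defaultdict(list),
            "alertname": defaultdict(list),
            "instance": defaultdict(list)}
    for a in alerts:
        for dim, bucket in dims.items():
            bucket[a[dim]].append(a)
    # Pick the dimension with the fewest distinct keys (broadest grouping).
    name = min(dims, key=lambda d: len(dims[d]))
    return name, dict(dims[name])
```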

Alert Recovery – Periodically query the table for entries with alert_state = 1 whose u_time is older than one minute; if the instance is no longer active, update the state to recovered and reset the count.

# Read entries whose alert state is 1 and whose u_time is earlier than now
sql = """select alert_task from tb_tidb_alert_for_task where alert_state = 1 and alert_remarks = 'tidb集群告警' and u_time < date_add(now(), INTERVAL - 1 MINUTE);"""
state, alert_instance = connect_mysql(opt = "select", sql = {"sql" : sql})
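Recovery then reduces to a set difference: any task still marked as firing in the table but absent from the currently active alerts is considered recovered and can be flipped back to state 0 with its count reset. The helper below is a sketch of that comparison; the exact UPDATE statement is not shown in the source.

```python
def find_recovered(db_tasks, active_tasks):
    """db_tasks: alert_task values with alert_state = 1 and u_time older than
    one minute; active_tasks: alert keys currently firing in Alertmanager.
    Returns the tasks that should be marked recovered (state 0, count reset)."""
    return sorted(set(db_tasks) - set(active_tasks))
```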

Silence Management – Adding a silence goes through the /api/v1/silences endpoint and requires start/end timestamps in UTC; duration is capped at 24 hours. The API payload includes a single matcher (name/value) and does not support complex logical expressions.

try:
    expi_time = int(expi_time)  # hours
except Exception as err:
    return {"code": 1, "info": str(err)}
if expi_time > 24:
    return {"code": 1, "info": "The alarm cannot be silent for more than 24 hours"}
# build start_time and end_time in UTC, then POST JSON payload
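A sketch of building the v1 silence payload: startsAt/endsAt are RFC 3339 UTC timestamps and each matcher is a flat name/value pair, as the text notes. The field names follow the Alertmanager v1 API; the 24‑hour cap mirrors the check above, and the createdBy/comment defaults are illustrative.

```python
from datetime import datetime, timedelta, timezone

def build_silence(name, value, hours, created_by="dba", comment="silence"):
    """Build a JSON payload for POST /api/v1/silences."""
    if hours > 24:
        raise ValueError("The alarm cannot be silent for more than 24 hours")
    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"  # RFC 3339, UTC
    return {
        "matchers": [{"name": name, "value": value, "isRegex": False}],
        "startsAt": start.strftime(fmt),
        "endsAt": end.strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }
```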

Deletion of a silence first fetches the silence ID via /api/v1/silences?silenced=false&inhibited=false, then issues a DELETE request to /api/v1/silence/{id} after matching the desired name/value pair.

url = "http://xxx/api/v1/silence/" + item["id"]
res = json.loads(requests.delete(url).text)
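Matching the silence to delete can be done client‑side over the listing response; the helper below returns the IDs of silences carrying the requested name/value matcher. The response shape follows the v1 API (a data field holding a list of silences, each with matchers); the function name is illustrative.

```python
def silence_ids_matching(listing, name, value):
    """listing: parsed JSON body from GET /api/v1/silences.
    Returns the IDs of silences whose matchers include name == value."""
    ids = []
    for s in listing.get("data", []):
        for m in s.get("matchers", []):
            if m.get("name") == name and m.get("value") == value:
                ids.append(s["id"])
                break
    return ids
```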

Final notes – The described approach is environment‑specific and should be tested thoroughly in a non‑production setting. When a centralized platform manages alerts, manual per‑instance silencing is discouraged; instead, use the platform’s UI to apply silences by IP, instance, cluster, role, or alert name, and consider bulk silencing with appropriate expiration.

Tags: monitoring, Python, Prometheus, Alertmanager, Alert Escalation, Alert Suppression
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
