Alertmanager Alert System Refactoring: Issues, Solutions, and Implementation Details
This article analyzes common problems in a Prometheus‑Alertmanager monitoring setup—such as alert noise, lack of escalation, suppression and silence management—and presents a comprehensive refactor that introduces per‑cluster Alertmanager instances, custom escalation logic, suppression tables, and Python scripts to handle alert routing, silencing, and recovery.
Introduction – Alerts are closely tied to operations; a well‑designed alert system boosts efficiency and staff comfort, while a poor one creates noise like irrelevant night‑time alerts, repeated alerts, and overload.
Preparation – The environment uses Prometheus + Alertmanager (version 0.17.0). The article shares experiences with Alertmanager and outlines a recent alert‑system redesign project.
Identified problems:
Alert interference: a single Alertmanager serves multiple clusters, causing cross‑cluster alerts (e.g., an alert carrying cluster: clusterA, instance: clusterB node, alert_name: xxx).
No alert escalation: alerts are not promoted based on receiver, time, or medium.
No recovery notification: Alertmanager does not send a recovery message.
Limited suppression: custom suppression intervals exist but lack intelligence and time‑of‑day awareness.
Silence management is cumbersome; the UI often fails to pre‑fill silence fields.
Alertmanager lacks voice alerts.
New problems after splitting Alertmanager – Deploying one Alertmanager per cluster solves interference but raises convergence and silence‑management challenges, such as how to achieve alert aggregation and manage silences across many instances.
Refactor – Alert Interference – Deploy a dedicated Alertmanager per cluster, treat each as a database instance, and manage them centrally. This reduces cross‑cluster noise and isolates failures.
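Central management of many per‑cluster Alertmanagers can start as a simple registry mapping each cluster to its own instance. A minimal sketch; the cluster names and URLs below are placeholders, not the original environment's values:

```python
# Hypothetical registry of per-cluster Alertmanager endpoints (URLs are placeholders).
ALERTMANAGERS = {
    "clusterA": "http://alertmanager-a.internal:9093",
    "clusterB": "http://alertmanager-b.internal:9093",
}

def alertmanager_for(cluster):
    """Return the base URL of the Alertmanager that owns this cluster's alerts."""
    try:
        return ALERTMANAGERS[cluster]
    except KeyError:
        raise ValueError("no Alertmanager registered for cluster %r" % cluster)
```

Treating the mapping as data means silencing and polling scripts can iterate over every instance from one place instead of hard‑coding hosts.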
Alert Escalation – Define escalation dimensions:
Medium escalation: Email → Enterprise WeChat → SMS → Phone (after three repeats).
Receiver escalation: Primary → Secondary → Leader.
Time‑based escalation: Work hours use email/WeChat, off‑hours use SMS/Phone.
Because Alertmanager cannot natively handle this, a Python script reads active alerts and sends notifications accordingly.
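The "read active alerts" half might look like the sketch below, which pulls firing alerts from Alertmanager's v1 alerts endpoint (present in 0.17.0) and flattens them to routing keys. The label names cluster, alertname, and instance are assumptions about the local label scheme:

```python
import json
from urllib.request import urlopen

def fetch_active_alerts(base_url):
    """Pull currently active alerts from Alertmanager's v1 API."""
    with urlopen(base_url + "/api/v1/alerts", timeout=5) as resp:
        body = json.load(resp)
    return body.get("data") or []

def summarize(alerts):
    """Reduce raw alerts to (cluster, alertname, instance) tuples for routing."""
    keys = []
    for alert in alerts:
        labels = alert.get("labels", {})
        keys.append((labels.get("cluster"),
                     labels.get("alertname"),
                     labels.get("instance")))
    return keys
```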
Send a detailed message to the DBA by Mail  # every alert is always reported by email
if now_time > 8 and now_time < 22:
    Send a simple message to the DBA by WX
else:  # off hours: escalate the alert medium by time of day
    if alert_count > 3 and phone_count < 3:
        Send a simple message to the DBA by phone  # SMS alerts escalate to phone calls
    elif alert_count > 3 and phone_count >= 3:
        Send a simple message to the leader by phone  # receiver escalation
    else:
        Send a simple message to the DBA by SMS  # medium escalation
Alert Convergence / Suppression – Create a MySQL table tb_alert_for_task to store the alert key, state, count, next send time, and remarks. Example schema:
CREATE TABLE `tb_alert_for_task` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `alert_task` varchar(100) DEFAULT '' COMMENT 'alert item (key)',
  `alert_state` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'alert state: 0 = recovered, 1 = firing',
  `alert_count` int(11) NOT NULL DEFAULT '0' COMMENT 'notifications sent; each alert item sends at most 10 per day',
  `u_time` datetime NOT NULL DEFAULT '2021-12-08 00:00:00' COMMENT 'time of the next notification',
  `alert_remarks` varchar(50) NOT NULL DEFAULT '' COMMENT 'alert content',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_alert_task` (`alert_task`)
) ENGINE=InnoDB AUTO_INCREMENT=7049 DEFAULT CHARSET=utf8mb4;
The suppression logic reads alerts where alert_state = 1 and either u_time > now() or alert_count > 10, and skips those entries.
# Suppression logic: read out all currently firing alert items. If an alert coming
# from Alertmanager is already in this set, it is treated as suppressed, because it
# does not yet meet the conditions for being sent again.
select_sql = "select alert_task from tb_alert_for_task where alert_state = 1 and (u_time > now() or alert_count > 10);"
state, skip_instance = connect_mysql(opt = "select", sql = {"sql" : select_sql})
Convergence aggregates alerts along three dimensions—cluster, alert name, and instance—stored in the dictionaries global_alert_cluster, global_alert_name, and global_alert_host. The shortest dictionary determines the convergence key.
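The three dictionaries can be built in one pass over the active alerts. A sketch, assuming each alert is a dict of labels with cluster, alertname, and instance keys (the helper name group_alerts is illustrative):

```python
from collections import defaultdict

def group_alerts(alerts):
    """Bucket active alerts by the three convergence dimensions."""
    global_alert_cluster = defaultdict(list)
    global_alert_name = defaultdict(list)
    global_alert_host = defaultdict(list)
    for labels in alerts:
        global_alert_cluster[labels["cluster"]].append(labels)
        global_alert_name[labels["alertname"]].append(labels)
        global_alert_host[labels["instance"]].append(labels)
    return global_alert_cluster, global_alert_name, global_alert_host
```

For example, if three alerts fire in one cluster across three hosts, the cluster dictionary is the shortest, so the three alerts converge into a single cluster‑level notification.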
if len(global_alert_cluster) < len(global_alert_host) and len(global_alert_cluster) < len(global_alert_name):
    alert = global_alert_cluster
    info_tmp = "Alerting cluster : "
elif len(global_alert_name) < len(global_alert_host):
    alert = global_alert_name
    info_tmp = "Alert name : "
else:
    alert = global_alert_host
    info_tmp = "Alerting host : "
Alert Recovery – Periodically query the table for entries with alert_state = 1 whose u_time is more than one minute in the past; if the corresponding alert is no longer active, update the state to recovered and reset the count.
# Read entries still marked as firing (alert_state = 1) whose next-send time
# is already more than one minute in the past.
sql = """select alert_task from tb_alert_for_task where alert_state = 1 and alert_remarks = 'tidb cluster alert' and u_time < date_add(now(), INTERVAL - 1 MINUTE);"""
state, alert_instance = connect_mysql(opt = "select", sql = {"sql" : sql})
Silence Management – Adding a silence requires UTC start/end timestamps and enforces a maximum duration of 24 hours. The API payload carries a single name/value matcher and does not support complex logical expressions.
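Creating a silence through the v1 API then amounts to POSTing a JSON body with one matcher and UTC timestamps to /api/v1/silences. A minimal payload‑builder sketch; the createdBy and comment values are placeholders:

```python
from datetime import datetime, timedelta, timezone

def build_silence(name, value, hours):
    """Build a v1 silence payload with a single equality matcher."""
    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"  # RFC3339 UTC, as Alertmanager expects
    return {
        "matchers": [{"name": name, "value": value, "isRegex": False}],
        "startsAt": start.strftime(fmt),
        "endsAt": end.strftime(fmt),
        "createdBy": "dba-platform",          # placeholder author
        "comment": "silenced via platform",   # placeholder comment
    }

# POST the result as JSON to http://<alertmanager>/api/v1/silences
```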
/api/v1/silences
try:
    expi_time = int(expi_time)  # requested duration in hours
except Exception as err:
    return {"code": 1, "info": str(err)}
if expi_time > 24:
    return {"code": 1, "info": "An alert cannot be silenced for more than 24 hours"}
# build start_time and end_time in UTC, then POST the JSON payload
Deletion of a silence first fetches the silence ID via /api/v1/silences?silenced=false&inhibited=false, then issues a DELETE request to /api/v1/silence/{id} after matching the desired name/value pair.
/api/v1/silences?silenced=false&inhibited=false
/api/v1/silence/{id}
# item is the silence whose matcher name/value matched the request
url = "http://xxx/api/v1/silence/" + item["id"]
res = json.loads(requests.delete(url).text)
Final notes – The described approach is environment‑specific and should be tested thoroughly in a non‑production setting. When a centralized platform manages alerts, manual per‑instance silencing is discouraged; instead, use the platform's UI to apply silences by IP, instance, cluster, role, or alert name, and consider bulk silencing with an appropriate expiration.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.