Debugging Persistent Active Alerts in Thanos Ruler: Queue Bottleneck Analysis and maxBatchSize Tuning
The article analyzes a persistent active alert observed via Thanos Ruler's HTTP interface, identifies the buffering queue bottleneck as the root cause, and proposes adjusting the maxBatchSize parameter to prevent alert delay and automatic resolution failures.
In a test environment, an alert observed through Thanos Ruler's HTTP interface remained continuously active even though Alertmanager reported it as resolved.
According to the DMP platform design, an alert is considered resolved once its end time has passed, and an alert can be marked resolved in three ways: manual resolution; a single‑occurrence alert, which sends a resolved state on the next rule evaluation; or an automatic resolve interval, which can be reset to up to 24 hours.
Since the test environment involved no manual resolutions, the alert was not a single‑occurrence case, and the automatic resolve interval did not match the observed behavior, the investigation turned to the processing stages the alert passes through before reaching Alertmanager.
The analysis focused on the Thanos Ruler component, which sits between Prometheus (providing metric data) and Alertmanager (receiving alerts). Ruler maintains a local active queue for each rule and a buffering queue (Thanos Rule Queue) with a default capacity of 10,000 and a maxBatchSize of 100.
When an alert is placed into the buffering queue, it receives a default automatic resolve time of current time + 3 minutes. Two stages can affect processing: the buffering queue stage and the network latency stage. Network latency was ruled out, so the buffering queue was examined.
If an alert in the local queue has not been sent for more than 1 minute, it is moved to the buffering queue, which then pushes at most maxBatchSize alerts to Alertmanager per push cycle. High volumes of duplicate alerts can reduce the push frequency, causing the queue to accumulate.
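The queueing behavior described above can be sketched as a simple model: alerts accumulate in a bounded queue, each push cycle drains at most maxBatchSize of them, and anything arriving at a full queue is dropped. This is an illustrative simplification, not Thanos's actual implementation; the class and method names are invented for the sketch.

```python
from collections import deque

class AlertQueue:
    """Simplified model of Thanos Ruler's buffering queue (illustrative only)."""

    def __init__(self, capacity=10_000, max_batch_size=100):
        self.capacity = capacity
        self.max_batch_size = max_batch_size
        self.pending = deque()
        self.dropped = 0  # loosely mirrors thanos_alert_queue_alerts_dropped_total

    def enqueue(self, alert):
        # A full queue means the alert is lost entirely.
        if len(self.pending) >= self.capacity:
            self.dropped += 1
            return False
        self.pending.append(alert)
        return True

    def pop_batch(self):
        # Each push cycle sends at most max_batch_size alerts to Alertmanager.
        batch = []
        while self.pending and len(batch) < self.max_batch_size:
            batch.append(self.pending.popleft())
        return batch


q = AlertQueue(capacity=10_000, max_batch_size=100)
for i in range(250):
    q.enqueue(f"alert-{i}")
print(len(q.pop_batch()), len(q.pop_batch()), len(q.pop_batch()))  # 100 100 50
```

With the defaults, a backlog of 250 alerts needs three push cycles to drain, which is the mechanism behind the delay analyzed next.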
Log analysis showed the queue pushed roughly every 10 seconds, meaning an alert behind a full queue could wait about (capacity / maxBatchSize) × 10 s = (10 000 / 100) × 10 s ≈ 16.7 minutes, far exceeding the default 3‑minute automatic resolve time.
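The back-of-the-envelope calculation above can be checked directly, using the numbers from the article (capacity 10 000, maxBatchSize 100, one push roughly every 10 seconds):

```python
capacity = 10_000        # default Thanos Rule Queue capacity
max_batch_size = 100     # default maxBatchSize
push_interval_s = 10     # push frequency observed in the logs

# Worst case: an alert enqueued behind a full queue waits for
# capacity / max_batch_size push cycles before it is sent.
pushes_to_drain = capacity / max_batch_size          # 100 pushes
residence_s = pushes_to_drain * push_interval_s      # 1000 seconds
print(round(residence_s / 60, 1))                    # ~16.7 minutes
```

A 16.7-minute worst-case residence time dwarfs the 3-minute automatic resolve window, so resolved-state notifications can expire before they ever leave the queue.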
Three Thanos metrics (thanos_alert_queue_alerts_dropped_total, thanos_alert_queue_alerts_pushed_total, thanos_alert_queue_alerts_popped_total) confirmed queue saturation and alert loss.
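The saturation can be surfaced with ordinary rate queries over those three counters. These PromQL expressions are generic examples, not commands taken from the original investigation:

```promql
# Alerts dropped because the queue was full (should stay at 0)
rate(thanos_alert_queue_alerts_dropped_total[5m])

# Alerts entering the queue minus alerts leaving it for Alertmanager;
# a sustained positive gap means the queue is accumulating
rate(thanos_alert_queue_alerts_pushed_total[5m])
  - rate(thanos_alert_queue_alerts_popped_total[5m])
```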
The root cause was identified as the buffering queue’s low push frequency combined with a small maxBatchSize, leading to alert delay. The solution is to increase maxBatchSize based on the maximum number of monitored entities (e.g., the number of MySQL instances) using the formula:
maxBatchSize >= (x1*y1 + x2*y2 + ... + xn*yn) / (y1 + y2 + ... + yn)
In practice, setting maxBatchSize to the maximum count of a single monitored entity type (e.g., the highest number of MySQL instances) prevents queue buildup.
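The sizing formula above is a weighted average, rounded up to stay on the safe side. As a worked example, assume x_i is the number of alerts an instance of entity type i can fire per evaluation and y_i is the instance count for that type; this interpretation of the variables is an assumption, since the article does not define them, and the fleet numbers below are hypothetical:

```python
import math

def min_max_batch_size(pairs):
    """pairs: iterable of (x_i, y_i) tuples.

    Returns the weighted average (x1*y1 + ... + xn*yn) / (y1 + ... + yn),
    rounded up, as a lower bound for maxBatchSize.
    """
    numerator = sum(x * y for x, y in pairs)
    denominator = sum(y for _, y in pairs)
    return math.ceil(numerator / denominator)

# Hypothetical fleet: 120 MySQL instances firing up to 3 alerts each,
# 40 Redis instances firing up to 2 alerts each.
print(min_max_batch_size([(3, 120), (2, 40)]))  # 3
```

In line with the article's practical advice, the safer operational choice is the larger of this bound and the maximum instance count of any single entity type.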
When using the integrated Ruler component in DMP, this parameter can be adjusted directly; for upstream Thanos Ruler, source code modifications in cmd/thanos/rule.go are required.
Source file: https://github.com/thanos-io/thanos/blob/master/cmd/thanos/rule.go
References:
1. https://thanos.io/tip/thanos/design.md
2. https://thanos.io/tip/components/rule.md
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.