Debugging Persistent Active Alerts in Thanos Ruler: Queue Bottleneck Analysis and maxBatchSize Tuning
The article analyzes a persistent active alert observed via Thanos Ruler's HTTP interface, identifies the buffering queue bottleneck as the root cause, and proposes adjusting the maxBatchSize parameter to prevent alert delay and automatic resolution failures.
In a test environment, an alert observed through Thanos Ruler's HTTP interface remained continuously active even though Alertmanager reported it as resolved.
According to the DMP platform design, an alert is considered resolved once its end time has passed, and an alert can be marked resolved in three ways: manual resolution; a single‑occurrence alert, which sends a resolved state on the next rule evaluation; or an automatic resolve interval, which can be reset to up to 24 hours.
Since the test environment involved no manual resolutions, the alert was not a single‑occurrence case, and the automatic resolve interval did not match the observed behavior, the investigation turned to the processing stages the alert passes through before reaching Alertmanager.
The analysis focused on the Thanos Ruler component, which sits between Prometheus (providing metric data) and Alertmanager (receiving alerts). Ruler maintains a local active queue for each rule and a buffering queue (Thanos Rule Queue) with a default capacity of 10,000 and a maxBatchSize of 100.
When an alert is placed into the buffering queue, it receives a default automatic resolve time of current time + 3 minutes. Two stages can affect processing: the buffering queue stage and the network latency stage. Network latency was ruled out, so the buffering queue was examined.
If an alert in the local queue has not been sent for more than 1 minute, it is moved to the buffering queue, which then pushes at most maxBatchSize alerts to Alertmanager per push cycle. High volumes of duplicate alerts can reduce the push frequency, causing the queue to accumulate.
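The queueing behavior described above can be sketched as a simple model: alerts accumulate in a bounded queue, each push cycle drains at most maxBatchSize of them, and anything arriving at a full queue is dropped. This is an illustrative simplification, not Thanos's actual implementation; the class and method names are invented for the sketch.

```python
from collections import deque

class AlertQueue:
    """Simplified model of Thanos Ruler's buffering queue (illustrative only)."""

    def __init__(self, capacity=10_000, max_batch_size=100):
        self.capacity = capacity
        self.max_batch_size = max_batch_size
        self.pending = deque()
        self.dropped = 0  # loosely mirrors thanos_alert_queue_alerts_dropped_total

    def enqueue(self, alert):
        # A full queue means the alert is lost entirely.
        if len(self.pending) >= self.capacity:
            self.dropped += 1
            return False
        self.pending.append(alert)
        return True

    def pop_batch(self):
        # Each push cycle sends at most max_batch_size alerts to Alertmanager.
        batch = []
        while self.pending and len(batch) < self.max_batch_size:
            batch.append(self.pending.popleft())
        return batch


q = AlertQueue(capacity=10_000, max_batch_size=100)
for i in range(250):
    q.enqueue(f"alert-{i}")
print(len(q.pop_batch()), len(q.pop_batch()), len(q.pop_batch()))  # 100 100 50
```

With the defaults, a backlog of 250 alerts needs three push cycles to drain, which is the mechanism behind the delay analyzed next.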
Log analysis showed the queue pushed roughly every 10 seconds, meaning an alert behind a full queue could wait about (capacity / maxBatchSize) × 10 s = (10 000 / 100) × 10 s ≈ 16.7 minutes, far exceeding the default 3‑minute automatic resolve time.
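The back-of-the-envelope calculation above can be checked directly, using the numbers from the article (capacity 10 000, maxBatchSize 100, one push roughly every 10 seconds):

```python
capacity = 10_000        # default Thanos Rule Queue capacity
max_batch_size = 100     # default maxBatchSize
push_interval_s = 10     # push frequency observed in the logs

# Worst case: an alert enqueued behind a full queue waits for
# capacity / max_batch_size push cycles before it is sent.
pushes_to_drain = capacity / max_batch_size          # 100 pushes
residence_s = pushes_to_drain * push_interval_s      # 1000 seconds
print(round(residence_s / 60, 1))                    # ~16.7 minutes
```

A 16.7-minute worst-case residence time dwarfs the 3-minute automatic resolve window, so resolved-state notifications can expire before they ever leave the queue.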
Three Thanos metrics (thanos_alert_queue_alerts_dropped_total, thanos_alert_queue_alerts_pushed_total, thanos_alert_queue_alerts_popped_total) confirmed queue saturation and alert loss.
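The saturation can be surfaced with ordinary rate queries over those three counters. These PromQL expressions are generic examples, not commands taken from the original investigation:

```promql
# Alerts dropped because the queue was full (should stay at 0)
rate(thanos_alert_queue_alerts_dropped_total[5m])

# Alerts entering the queue minus alerts leaving it for Alertmanager;
# a sustained positive gap means the queue is accumulating
rate(thanos_alert_queue_alerts_pushed_total[5m])
  - rate(thanos_alert_queue_alerts_popped_total[5m])
```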
The root cause was identified as the buffering queue’s low push frequency combined with a small maxBatchSize, leading to alert delay. The solution is to increase maxBatchSize based on the maximum number of monitored entities (e.g., the number of MySQL instances) using the formula:
maxBatchSize >= (x1*y1 + x2*y2 + ... + xn*yn) / (y1 + y2 + ... + yn)
In practice, setting maxBatchSize to the maximum count of a single monitored entity type (e.g., the highest number of MySQL instances) prevents queue buildup.
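The sizing formula above is a weighted average, rounded up to stay on the safe side. As a worked example, assume x_i is the number of alerts an instance of entity type i can fire per evaluation and y_i is the instance count for that type; this interpretation of the variables is an assumption, since the article does not define them, and the fleet numbers below are hypothetical:

```python
import math

def min_max_batch_size(pairs):
    """pairs: iterable of (x_i, y_i) tuples.

    Returns the weighted average (x1*y1 + ... + xn*yn) / (y1 + ... + yn),
    rounded up, as a lower bound for maxBatchSize.
    """
    numerator = sum(x * y for x, y in pairs)
    denominator = sum(y for _, y in pairs)
    return math.ceil(numerator / denominator)

# Hypothetical fleet: 120 MySQL instances firing up to 3 alerts each,
# 40 Redis instances firing up to 2 alerts each.
print(min_max_batch_size([(3, 120), (2, 40)]))  # 3
```

In line with the article's practical advice, the safer operational choice is the larger of this bound and the maximum instance count of any single entity type.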
When using the integrated Ruler component in DMP, this parameter can be adjusted directly; for upstream Thanos Ruler, source code modifications in cmd/thanos/rule.go are required.
Source file: https://github.com/thanos-io/thanos/blob/master/cmd/thanos/rule.go
References:
1. https://thanos.io/tip/thanos/design.md
2. https://thanos.io/tip/components/rule.md
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.