How Thanos + Prometheus Solve Large‑Scale OpenStack Monitoring Challenges

This article explains how the Thanos and Prometheus combination provides long‑term, highly available monitoring for massive OpenStack and Ceph clusters, detailing its features, architecture, key components, practical deployment issues, and the operational problems it resolves.

360 Zhihui Cloud Developer

As OpenStack clusters grow, the volume of monitoring data grows rapidly, straining both storage and query performance. OpenStack's own telemetry projects, such as Ceilometer, Gnocchi, and Aodh, cannot fully address these challenges, which prompted the adoption of Thanos + Prometheus from the CNCF ecosystem for long-term, highly available monitoring.

1. What is Thanos

Thanos is an open-source project, originally developed by Improbable, a UK game-tech company, that turns Prometheus into a highly available setup with long-term storage capabilities.

2. Features of Thanos

Provides a unified query interface across multiple Prometheus instances.

Enables effectively unlimited retention of metrics in object storage such as S3, Azure, Tencent COS, Google Cloud Storage, OpenStack Swift, etc.

Maintains compatibility with existing Prometheus APIs, allowing tools like Grafana to operate unchanged.

Offers data compaction and down-sampling to accelerate queries over long time ranges.

Deduplicates and merges data collected from HA Prometheus clusters.
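
The deduplication in the last point can be pictured with a small sketch. This is a simplified illustration, not Thanos's actual algorithm: it assumes each HA replica attaches a `replica` label (the label name, types, and function names here are hypothetical), treats series that are identical apart from that label as the same series, and keeps one sample per timestamp.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is one (timestamp, value) pair; timestamps are Unix milliseconds.
type Sample struct {
	T int64
	V float64
}

// Series is a labelled stream of samples as reported by one replica.
type Series struct {
	Labels  map[string]string
	Samples []Sample
}

// keyWithout builds a stable identity for a series, ignoring replicaLabel.
func keyWithout(labels map[string]string, replicaLabel string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		if name != replicaLabel {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s=%q,", name, labels[name])
	}
	return b.String()
}

// Dedup merges series that differ only in replicaLabel. For each timestamp
// the first value seen wins, so a gap in one replica is filled by the other.
func Dedup(in []Series, replicaLabel string) map[string][]Sample {
	seen := map[string]map[int64]bool{}
	out := map[string][]Sample{}
	for _, s := range in {
		k := keyWithout(s.Labels, replicaLabel)
		if seen[k] == nil {
			seen[k] = map[int64]bool{}
		}
		for _, smp := range s.Samples {
			if !seen[k][smp.T] {
				seen[k][smp.T] = true
				out[k] = append(out[k], smp)
			}
		}
	}
	for k := range out {
		merged := out[k]
		sort.Slice(merged, func(i, j int) bool { return merged[i].T < merged[j].T })
	}
	return out
}

func main() {
	a := Series{
		Labels:  map[string]string{"__name__": "up", "replica": "a"},
		Samples: []Sample{{T: 1000, V: 1}, {T: 2000, V: 1}},
	}
	b := Series{
		Labels:  map[string]string{"__name__": "up", "replica": "b"},
		Samples: []Sample{{T: 2000, V: 1}, {T: 3000, V: 1}},
	}
	for key, samples := range Dedup([]Series{a, b}, "replica") {
		fmt.Println(key, len(samples))
	}
}
```

Here the two replicas overlap at one timestamp and each has one sample the other lacks, so the merged series ends up with three samples instead of four.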

3. Architecture

The overall architecture consists of several components that work together to store, query, and manage monitoring data.

4. Core Components

Compact

Compacts and down-samples the data stored in object storage, merging historic blocks into larger ones. Down-sampling does not reduce storage usage, because the lower-resolution data is kept alongside the raw data, but it speeds up queries over historical data. Sufficient local disk space (e.g., 300 GB) is required for intermediate processing.
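
To make the down-sampling idea concrete, here is a minimal sketch under stated assumptions: a fixed 5-minute window and a per-window summary of count, sum, min, and max. The types and function names are illustrative and are not taken from the Thanos code base.

```go
package main

import (
	"fmt"
	"math"
)

// Sample is one raw (timestamp, value) pair; T is Unix milliseconds.
type Sample struct {
	T int64
	V float64
}

// Aggregate summarises all raw samples that fall into one window.
type Aggregate struct {
	Start int64 // window start, Unix milliseconds
	Count int
	Sum   float64
	Min   float64
	Max   float64
}

// Downsample groups time-ordered samples (non-negative timestamps) into
// fixed windows of windowMs milliseconds and summarises each window.
func Downsample(samples []Sample, windowMs int64) []Aggregate {
	var out []Aggregate
	for _, s := range samples {
		start := s.T - s.T%windowMs
		if n := len(out); n == 0 || out[n-1].Start != start {
			out = append(out, Aggregate{Start: start, Min: math.Inf(1), Max: math.Inf(-1)})
		}
		a := &out[len(out)-1]
		a.Count++
		a.Sum += s.V
		a.Min = math.Min(a.Min, s.V)
		a.Max = math.Max(a.Max, s.V)
	}
	return out
}

func main() {
	// 20 raw samples at a 30-second scrape interval span two 5-minute windows.
	var raw []Sample
	for i := 0; i < 20; i++ {
		raw = append(raw, Sample{T: int64(i) * 30000, V: float64(i)})
	}
	for _, a := range Downsample(raw, 5*60*1000) {
		fmt.Printf("start=%d count=%d avg=%.1f min=%.0f max=%.0f\n",
			a.Start, a.Count, a.Sum/float64(a.Count), a.Min, a.Max)
	}
}
```

A query over a long range then reads one aggregate per window instead of every raw sample, which is where the speed-up comes from.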

Querier

Implements the Prometheus HTTP v1 API, handling incoming PromQL queries and aggregating results. It is stateless and horizontally scalable.

Sidecar

Deployed alongside each Prometheus instance, the Sidecar exposes that instance's recent data to the Querier and uploads newly generated local monitoring data to the configured object storage.

Store

Provides historical data retrieval. When Querier requests data, Store fetches the appropriate objects from storage and converts them into a format consumable by Querier.
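
One way to picture "fetches the appropriate objects" is as a time-range filter over block metadata. The sketch below assumes each block records the half-open time range it covers. The type and function names are hypothetical, and a real Store does considerably more, such as index lookups and caching.

```go
package main

import "fmt"

// BlockMeta describes one block in object storage. It covers the half-open
// range [MinTime, MaxTime), with both bounds in Unix milliseconds.
type BlockMeta struct {
	ID      string
	MinTime int64
	MaxTime int64
}

// Overlapping returns the blocks that may hold samples for the inclusive
// query range [from, to].
func Overlapping(blocks []BlockMeta, from, to int64) []BlockMeta {
	var out []BlockMeta
	for _, b := range blocks {
		if b.MinTime <= to && from < b.MaxTime {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	blocks := []BlockMeta{
		{ID: "block-1", MinTime: 0, MaxTime: 100},
		{ID: "block-2", MinTime: 100, MaxTime: 200},
		{ID: "block-3", MinTime: 200, MaxTime: 300},
	}
	for _, b := range Overlapping(blocks, 150, 250) {
		fmt.Println(b.ID)
	}
}
```

Only the blocks that overlap the query range need to be read, so the cost of a query depends on the range requested rather than on the total amount of history retained.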

Bucket

Utility for inspecting objects in the storage bucket, often used for troubleshooting via a web UI.

Check

Validates Prometheus rule files. Example implementation:

<code>// checkRules validates one rule file and returns the number of rules it
// contains, together with any errors found along the way.
func checkRules(logger log.Logger, filename string) (int, errors.MultiError) {
    level.Info(logger).Log("msg", "checking", "filename", filename)
    checkErrors := errors.MultiError{}

    // Read the whole rule file into memory.
    b, err := ioutil.ReadFile(filename)
    if err != nil {
        checkErrors.Add(err)
        return 0, checkErrors
    }

    // Strict unmarshalling rejects unknown or duplicate YAML fields.
    var rgs ThanosRuleGroups
    if err := yaml.UnmarshalStrict(b, &rgs); err != nil {
        checkErrors.Add(err)
        return 0, checkErrors
    }

    // Convert to Prometheus rule groups and reuse Prometheus's own validation.
    promRgs := thanosRuleGroupsToPromRuleGroups(rgs)
    if errs := promRgs.Validate(); errs != nil {
        for _, e := range errs {
            checkErrors.Add(e)
        }
        return 0, checkErrors
    }

    // Count the rules across all groups.
    numRules := 0
    for _, rg := range rgs.Groups {
        numRules += len(rg.Rules)
    }
    return numRules, checkErrors
}
</code>

5. Practical Issues Encountered

When the Store component loads metadata from object storage, its local disk and memory usage can grow dramatically, leading to out-of-memory (OOM) errors. In addition, early deployments ran in Kubernetes Pods, where Pod IP changes caused split-brain scenarios. Moving Store to physical machines and upgrading to newer versions with compression mitigated these problems.

6. Summary

Thanos adds long‑term storage and high availability to Prometheus without invasive changes, but it relies heavily on object storage resources. In production, the solution monitors over 40 OpenStack and 70 Ceph clusters, encompassing more than 10 000 OSD nodes and generating roughly 50 GB of monitoring data per day.
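
To give a feel for the object storage footprint, the daily volume quoted above can be projected over a retention period. This is back-of-the-envelope arithmetic only: it assumes a constant 50 GB per day and ignores compaction, down-sampled copies, and any replication done by the storage backend. The function name is illustrative.

```go
package main

import "fmt"

// RetainedGB estimates the raw volume held in object storage after the given
// number of days at a constant daily ingest.
func RetainedGB(dailyGB float64, days int) float64 {
	return dailyGB * float64(days)
}

func main() {
	for _, days := range []int{30, 365, 3 * 365} {
		fmt.Printf("%4d days: %.0f GB\n", days, RetainedGB(50, days))
	}
}
```

At that rate one year of retention comes to 50 × 365 = 18,250 GB, roughly 18 TB, before any overhead.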

Problems Solved by Thanos

Overcoming storage limits that restrict historical data retention.

Alleviating query performance degradation as the number of clusters grows.

Providing a unified interface to query and alert across numerous OpenStack and Ceph clusters.
