How Thanos + Prometheus Solve Large‑Scale OpenStack Monitoring Challenges
This article explains how the Thanos and Prometheus combination provides long‑term, highly available monitoring for massive OpenStack and Ceph clusters, detailing its features, architecture, key components, practical deployment issues, and the operational problems it resolves.
As OpenStack clusters grow, monitoring data expands exponentially, stressing storage and query performance. Traditional OpenStack projects like Ceilometer, Gnocchi, and Aodh cannot fully address these challenges, prompting the use of the CNCF solution Thanos + Prometheus for long‑term, high‑availability monitoring.
1. What is Thanos
Thanos is an open-source project from Improbable, a UK game-technology company. It turns Prometheus into a highly available monitoring setup with long-term storage capabilities.
2. Features of Thanos
Provides a unified query interface across multiple Prometheus instances.
Enables indefinite storage of metrics using object storage such as S3, Azure, Tencent COS, GCP, OpenStack Swift, etc.
Maintains compatibility with existing Prometheus APIs, allowing tools like Grafana to operate unchanged.
Offers data compaction and down-sampling to accelerate queries over long time ranges.
Deduplicates and merges data collected from HA Prometheus clusters.
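The deduplication idea can be illustrated with a minimal Go sketch. It assumes each HA Prometheus replica tags its series with a distinguishing label (called `replica` here purely for illustration) and merges series that differ only in that label. The `Series`, `Sample`, and `Deduplicate` names are hypothetical, and the "first value seen wins" rule is a simplification, not the algorithm Thanos itself uses.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a single (timestamp, value) pair; timestamps are in milliseconds.
type Sample struct {
	T int64
	V float64
}

// Series is a hypothetical, simplified time series: a label set plus samples.
type Series struct {
	Labels  map[string]string
	Samples []Sample
}

// seriesKey builds a stable identity from all labels except the replica label.
func seriesKey(labels map[string]string, replicaLabel string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		if name != replicaLabel {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	key := ""
	for _, name := range names {
		key += name + "=" + labels[name] + ";"
	}
	return key
}

// Deduplicate merges series that differ only in replicaLabel. For each
// timestamp the first value seen wins; output samples are sorted by time.
func Deduplicate(in []Series, replicaLabel string) map[string][]Sample {
	seen := map[string]map[int64]float64{}
	for _, s := range in {
		key := seriesKey(s.Labels, replicaLabel)
		if seen[key] == nil {
			seen[key] = map[int64]float64{}
		}
		for _, smp := range s.Samples {
			if _, ok := seen[key][smp.T]; !ok {
				seen[key][smp.T] = smp.V
			}
		}
	}
	out := map[string][]Sample{}
	for key, byTime := range seen {
		samples := make([]Sample, 0, len(byTime))
		for t, v := range byTime {
			samples = append(samples, Sample{T: t, V: v})
		}
		sort.Slice(samples, func(i, j int) bool { return samples[i].T < samples[j].T })
		out[key] = samples
	}
	return out
}

func main() {
	a := Series{Labels: map[string]string{"job": "ceph", "replica": "a"}, Samples: []Sample{{1000, 1}, {2000, 2}}}
	b := Series{Labels: map[string]string{"job": "ceph", "replica": "b"}, Samples: []Sample{{2000, 2}, {3000, 3}}}
	merged := Deduplicate([]Series{a, b}, "replica")
	fmt.Println(len(merged["job=ceph;"])) // prints 3
}
```

The gap that one replica leaves (for example during a restart) is filled by samples from the other, which is why the merged series is more complete than either input.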
3. Architecture
The overall architecture consists of several components that work together to store, query, and manage monitoring data.
4. Core Components
Compact
Compacts and down-samples data stored in object storage, merging older blocks into larger ones. Compaction and down-sampling do not reduce the stored size, because down-sampled copies are kept alongside the raw data, but they speed up queries over long historical ranges. The component needs sufficient local disk space (e.g., 300 GB) for intermediate processing.
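To show why down-sampling speeds up long-range queries, here is a minimal Go sketch that collapses raw samples into fixed windows and keeps a few aggregates per window. The `Aggregate` type and `Downsample` function are hypothetical names for illustration, the 5-minute window is only an example value, and the sketch assumes non-negative millisecond timestamps. It is not the Compact component's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a single (timestamp, value) pair; timestamps are in milliseconds.
type Sample struct {
	T int64
	V float64
}

// Aggregate summarizes all raw samples that fall into one window.
type Aggregate struct {
	Start int64 // window start in milliseconds
	Count int
	Sum   float64
	Min   float64
	Max   float64
}

// Downsample groups samples into windows of windowMs milliseconds and
// returns one Aggregate per non-empty window, ordered by window start.
// It assumes windowMs > 0 and non-negative timestamps.
func Downsample(samples []Sample, windowMs int64) []Aggregate {
	byWindow := map[int64]*Aggregate{}
	for _, s := range samples {
		start := s.T - s.T%windowMs
		agg, ok := byWindow[start]
		if !ok {
			byWindow[start] = &Aggregate{Start: start, Count: 1, Sum: s.V, Min: s.V, Max: s.V}
			continue
		}
		agg.Count++
		agg.Sum += s.V
		if s.V < agg.Min {
			agg.Min = s.V
		}
		if s.V > agg.Max {
			agg.Max = s.V
		}
	}
	out := make([]Aggregate, 0, len(byWindow))
	for _, agg := range byWindow {
		out = append(out, *agg)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Start < out[j].Start })
	return out
}

func main() {
	raw := []Sample{{0, 1}, {60000, 5}, {299999, 3}, {300000, 7}}
	for _, agg := range Downsample(raw, 5*60*1000) {
		fmt.Printf("%+v\n", agg)
	}
}
```

A query spanning months then reads one aggregate per window instead of every raw sample, at the cost of storing the aggregates in addition to the raw data.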
Querier
Implements the Prometheus HTTP v1 API, handling incoming PromQL queries and aggregating results. It is stateless and horizontally scalable.
Sidecar
Deployed alongside each Prometheus instance, the Sidecar serves that instance's local data to the Querier and continuously uploads newly generated monitoring data to the configured object storage.
Store
Provides historical data retrieval. When Querier requests data, Store fetches the appropriate objects from storage and converts them into a format consumable by Querier.
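One way to picture the "fetches the appropriate objects" step is as a time-range filter over block metadata. The Go sketch below is a simplification under stated assumptions: `BlockMeta` and `SelectBlocks` are hypothetical names, each block is assumed to cover the half-open interval [MinTime, MaxTime), and the real Store component does considerably more (index lookups, caching, label matching).

```go
package main

import "fmt"

// BlockMeta is a hypothetical, reduced view of a block's metadata.
// The block is assumed to hold samples with MinTime <= t < MaxTime.
type BlockMeta struct {
	ID      string
	MinTime int64
	MaxTime int64
}

// SelectBlocks returns the blocks whose time range overlaps the
// inclusive query range [mint, maxt].
func SelectBlocks(blocks []BlockMeta, mint, maxt int64) []BlockMeta {
	var out []BlockMeta
	for _, b := range blocks {
		if b.MinTime <= maxt && mint < b.MaxTime {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	blocks := []BlockMeta{
		{ID: "block-a", MinTime: 0, MaxTime: 100},
		{ID: "block-b", MinTime: 100, MaxTime: 200},
		{ID: "block-c", MinTime: 200, MaxTime: 300},
	}
	for _, b := range SelectBlocks(blocks, 100, 150) {
		fmt.Println(b.ID) // prints block-b
	}
}
```

Only the overlapping blocks need to be read from object storage, so a narrow query over old data does not have to touch the whole bucket.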
Bucket
Utility for inspecting objects in the storage bucket, often used for troubleshooting via a web UI.
Check
Validates Prometheus rule files. Example implementation:
<code>// checkRules validates one rule file and returns the number of rules it contains.
func checkRules(logger log.Logger, filename string) (int, errors.MultiError) {
	level.Info(logger).Log("msg", "checking", "filename", filename)
	checkErrors := errors.MultiError{}

	// Read the rule file from disk.
	b, err := ioutil.ReadFile(filename)
	if err != nil {
		checkErrors.Add(err)
		return 0, checkErrors
	}

	// Parse strictly so that unknown fields are reported as errors.
	var rgs ThanosRuleGroups
	if err := yaml.UnmarshalStrict(b, &rgs); err != nil {
		checkErrors.Add(err)
		return 0, checkErrors
	}

	// Convert to Prometheus rule groups and reuse the Prometheus validation.
	promRgs := thanosRuleGroupsToPromRuleGroups(rgs)
	if errs := promRgs.Validate(); errs != nil {
		for _, e := range errs {
			checkErrors.Add(e)
		}
		return 0, checkErrors
	}

	// Count the rules across all groups.
	numRules := 0
	for _, rg := range rgs.Groups {
		numRules += len(rg.Rules)
	}
	return numRules, checkErrors
}
</code>
5. Practical Issues Encountered
When the Store component loads metadata from object storage, its local disk and memory usage can grow dramatically, leading to OOM errors. Early deployments in Kubernetes Pods also suffered from Pod IP changes that caused split-brain scenarios. Moving Store to physical machines and upgrading to newer versions with compression mitigated these problems.
6. Summary
Thanos adds long‑term storage and high availability to Prometheus without invasive changes, but it relies heavily on object storage resources. In production, the solution monitors over 40 OpenStack and 70 Ceph clusters, encompassing more than 10 000 OSD nodes and generating roughly 50 GB of monitoring data per day.
Problems Solved by Thanos
Overcoming storage limits that restrict historical data retention.
Alleviating query performance degradation as the number of clusters grows.
Providing a unified interface to query and alert across numerous OpenStack and Ceph clusters.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.