Using Thanos and Prometheus for Scalable Monitoring in OpenStack and Ceph Clusters
As OpenStack and Ceph clusters grow, monitoring data expands rapidly, straining both storage capacity and query performance. The article introduces Thanos combined with Prometheus as a cloud‑native, highly available solution for long‑term metric storage and fast querying at this scale.
Key features of Thanos include cross‑Prometheus federation, indefinite metric retention on object stores (S3, Azure, GCP, OpenStack Swift, etc.), compatibility with the existing Prometheus query APIs, data compression and down‑sampling, and deduplication of data from highly available Prometheus pairs.
The architecture consists of several components:
Compact : compresses and down‑samples blocks in object storage, requiring ample local disk space for intermediate data.
Querier : stateless service exposing the Prometheus HTTP API and aggregating results from multiple stores.
Sidecar : runs alongside each Prometheus instance, uploads local TSDB blocks to object storage and proxies queries to the local Prometheus.
Store : serves historical data to Querier by fetching and reformatting blocks from object storage.
Bucket : provides command‑line tools for inspecting object‑store contents and troubleshooting.
Check : validates Prometheus rule files. The article illustrates it with this Go function from the Thanos code base:

```go
func checkRules(logger log.Logger, filename string) (int, errors.MultiError) {
	level.Info(logger).Log("msg", "checking", "filename", filename)
	checkErrors := errors.MultiError{}

	b, err := ioutil.ReadFile(filename)
	if err != nil {
		checkErrors.Add(err)
		return 0, checkErrors
	}

	var rgs ThanosRuleGroups
	if err := yaml.UnmarshalStrict(b, &rgs); err != nil {
		checkErrors.Add(err)
		return 0, checkErrors
	}

	promRgs := thanosRuleGroupsToPromRuleGroups(rgs)
	if errs := promRgs.Validate(); errs != nil {
		for _, e := range errs {
			checkErrors.Add(e)
		}
		return 0, checkErrors
	}

	numRules := 0
	for _, rg := range rgs.Groups {
		numRules += len(rg.Rules)
	}
	return numRules, checkErrors
}
```
The article also shares practical issues encountered in production: Store out‑of‑memory kills caused by loading metadata for large numbers of blocks, split‑brain behavior triggered by pod IP changes, and the resulting decision to run Store on bare metal while keeping Sidecar and Prometheus in pods. Enabling compression noticeably reduced query latency.
Deployment now monitors Ceph/CephFS, LVS, OpenStack, Etcd, Kubernetes, Istio, and OpenStack VMs, exposing an API that integrates with StackStorm for automated event handling. The current production environment runs over 40 OpenStack and 70 Ceph clusters, ~10,000 OSDs, generating about 50 GB of metrics per day.
In summary, Thanos adds long‑term storage and high availability to Prometheus without invasive changes, though it leans heavily on object‑store capacity and throughput; it successfully addresses historical data retention, query performance at scale, and unified monitoring across large OpenStack and Ceph deployments.
360 Tech Engineering