Using Thanos and Prometheus for Scalable Monitoring in OpenStack and Ceph Clusters
As OpenStack and Ceph clusters grow, monitoring data expands rapidly, straining both storage capacity and query performance. The article introduces Thanos combined with Prometheus as a cloud‑native, highly available solution for long‑term metric storage and fast querying at this scale.
Key features of Thanos include cross‑Prometheus federation, indefinite metric retention on object stores (S3, Azure, GCP, OpenStack Swift, etc.), compatibility with the existing Prometheus query APIs, data compression and down‑sampling, and deduplication of data from highly available Prometheus pairs.
The architecture consists of several components:
Compact : compresses and down‑samples blocks in object storage, requiring ample local disk space for intermediate data.
Querier : stateless service exposing the Prometheus HTTP API and aggregating results from multiple stores.
Sidecar : runs alongside each Prometheus instance, uploads local TSDB blocks to object storage and proxies queries to the local Prometheus.
Store : serves historical data to Querier by fetching and reformatting blocks from object storage.
Bucket : provides command‑line tools for inspecting object‑store contents and troubleshooting.
Check : validates Prometheus rule files. The article illustrates it with this Go function from the Thanos code base:

```go
func checkRules(logger log.Logger, filename string) (int, errors.MultiError) {
	level.Info(logger).Log("msg", "checking", "filename", filename)
	checkErrors := errors.MultiError{}

	b, err := ioutil.ReadFile(filename)
	if err != nil {
		checkErrors.Add(err)
		return 0, checkErrors
	}

	var rgs ThanosRuleGroups
	if err := yaml.UnmarshalStrict(b, &rgs); err != nil {
		checkErrors.Add(err)
		return 0, checkErrors
	}

	promRgs := thanosRuleGroupsToPromRuleGroups(rgs)
	if errs := promRgs.Validate(); errs != nil {
		for _, e := range errs {
			checkErrors.Add(e)
		}
		return 0, checkErrors
	}

	numRules := 0
	for _, rg := range rgs.Groups {
		numRules += len(rg.Rules)
	}
	return numRules, checkErrors
}
```
The article also shares practical issues encountered in production: Store out‑of‑memory kills caused by loading metadata for large numbers of blocks, split‑brain behavior triggered by pod IP changes, and the resulting decision to run Store on bare metal while keeping Sidecar and Prometheus in pods. Enabling compression noticeably reduced query latency.
Deployment now monitors Ceph/CephFS, LVS, OpenStack, Etcd, Kubernetes, Istio, and OpenStack VMs, exposing an API that integrates with StackStorm for automated event handling. The current production environment runs over 40 OpenStack and 70 Ceph clusters, ~10,000 OSDs, generating about 50 GB of metrics per day.
In summary, Thanos adds long‑term storage and high availability to Prometheus without invasive changes, though it leans heavily on object‑store capacity and throughput; it successfully addresses historical data retention, query performance at scale, and unified monitoring across large OpenStack and Ceph deployments.
360 Tech Engineering