Boost Kubernetes Monitoring: Why Switch from Prometheus to Thanos for Scalable, Cost‑Effective Metrics
This article explores the limitations of a Prometheus‑based monitoring stack and demonstrates how adopting a Thanos‑based architecture improves metric retention, enables multi‑cluster querying, and reduces overall infrastructure costs while providing a scalable, cloud‑native solution.
Introduction
In this article we examine the limitations of a Prometheus‑based monitoring stack and explain why moving to a Thanos‑based stack can improve metric retention and reduce overall infrastructure cost.
Demo resources are available at the links below.
https://github.com/particuleio/teks/tree/main/terragrunt/live/thanos
https://github.com/particuleio/terraform-kubernetes-addons/tree/main/modules/aws
Kubernetes Monitoring Stack
When we deploy Kubernetes for customers, each cluster gets a standard monitoring stack: Prometheus (metrics collection), Alertmanager (alert routing), and Grafana (dashboards).
Simplified architecture:
Considerations
This architecture does not scale well as the number of clusters grows. Every cluster brings its own Grafana instance, which multiplies maintenance overhead. Because Prometheus stores metrics on local disk, you have to trade storage size against retention period, and long retention becomes expensive at scale.
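To make the storage versus retention trade-off concrete, here is a rough sizing sketch. It follows the capacity-planning formula from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The ingestion rate of 100,000 samples per second is an assumed figure for illustration, not a measurement from the demo clusters.

```python
# Rough local-disk estimate for a single Prometheus instance.
# Formula (Prometheus storage docs):
#   needed_disk_space = retention_seconds * samples_per_second * bytes_per_sample
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Return the approximate disk space in bytes for the given retention."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def to_gib(num_bytes: float) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Assumed ingestion rate, for illustration only.
    assumed_rate = 100_000
    for days in (15, 90, 365):
        size = estimate_disk_bytes(days, assumed_rate)
        print(f"{days:>3} days of retention -> {to_gib(size):,.0f} GiB per instance")
```

The estimate grows linearly with retention and applies to every Prometheus replica in every cluster, which is why keeping months of data on block storage gets costly.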
Solution
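Multiple Grafana data sources – expose each cluster's Prometheus endpoint externally and add it as a data source in a single Grafana, secured with TLS or basic authentication. Queries still cannot span several data sources.
Prometheus federation – a central Prometheus scrapes selected metrics from the other Prometheus instances. This works well only when the volume of federated metrics is low.
Prometheus remote write – each Prometheus pushes its metrics to a remote endpoint. We do not cover it in detail here; push-based metrics are a separate topic.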
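As a small illustration of the federation option: a federating Prometheus scrapes the `/federate` endpoint of another instance and selects series with one or more `match[]` parameters. The sketch below only builds such a URL. The host name and the two selectors are hypothetical and do not come from the demo repository.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, matchers: list[str]) -> str:
    """Build a /federate URL that selects series with the given matchers."""
    # Each selector is passed as a separate, URL-encoded match[] parameter.
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical endpoint and selectors, for illustration only.
    url = federate_url(
        "https://prometheus.observee.example.com",
        ['{job="kubelet"}', '{__name__=~"job:.*"}'],
    )
    print(url)
```

Every matched series is re-scraped by the central instance, so broad selectors quickly turn federation into a bottleneck.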
Enter Thanos
Thanos is an open-source project that turns Prometheus into a highly available system with long-term storage. Metrics are kept in object storage (e.g., S3): a Thanos sidecar running next to Prometheus uploads each TSDB block, roughly every two hours, which makes Prometheus almost stateless.
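Thanos components communicate via gRPC. The main ones are:
Thanos Sidecar – runs next to each Prometheus, uploads its blocks to object storage, and exposes its recent data through the Store API.
Thanos Store – serves the historical blocks held in object storage through the same Store API.
Thanos Query – implements the Prometheus query API, fans queries out to sidecars and stores, then merges and deduplicates the results.
Thanos Compactor – compacts and downsamples the blocks in object storage and applies retention.
Thanos Query Frontend – sits in front of Thanos Query, splitting long range queries and caching results.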
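To give a feel for what Thanos Query does when several Prometheus replicas scrape the same targets, here is a deliberately simplified sketch of replica deduplication. It drops the replica label (Thanos Query takes it from its `--query.replica-label` flag; `prometheus_replica` is the label used by kube-prometheus-stack) and merges samples by timestamp. This is not Thanos's actual algorithm, which is more careful about choosing between replicas; the function name and data shapes are assumptions made for illustration.

```python
from collections import defaultdict

# A series is (labels, samples); a sample is (timestamp_ms, value).
Series = tuple[dict[str, str], list[tuple[int, float]]]


def deduplicate(series: list[Series],
                replica_label: str = "prometheus_replica") -> dict:
    """Merge series that differ only by the replica label."""
    merged: dict[tuple, dict[int, float]] = defaultdict(dict)
    for labels, samples in series:
        # Identity of the series once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for timestamp, value in samples:
            # The first replica seen wins for a given timestamp;
            # later replicas only fill gaps.
            merged[key].setdefault(timestamp, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    replicas = [
        ({"__name__": "up", "cluster": "observee", "prometheus_replica": "a"},
         [(1000, 1.0), (3000, 1.0)]),           # replica a missed one scrape
        ({"__name__": "up", "cluster": "observee", "prometheus_replica": "b"},
         [(1000, 1.0), (2000, 1.0)]),           # replica b missed another
    ]
    for key, points in deduplicate(replicas).items():
        print(dict(key), points)
```

The point of the sketch is that a gap in one replica is filled from the other, so dashboards show a single continuous series per target.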
Multi‑Cluster Architecture
We deploy two EKS clusters (observer and observee) using the official kube‑prometheus‑stack and Bitnami Thanos charts. The repository provides a DRY Terraform layout that can scale across AWS accounts, regions, and clusters.
<code>.
├── env_tags.yaml
├── eu-west-1
│   ├── clusters
│   │   └── observer
│   │       ├── eks
│   │       │   ├── kubeconfig
│   │       │   └── terragrunt.hcl
│   │       ├── eks-addons
│   │       │   └── terragrunt.hcl
│   │       └── vpc
│   │           └── terragrunt.hcl
│   └── region_values.yaml
└── eu-west-3
    ├── clusters
    │   └── observee
    │       ├── cluster_values.yaml
    │       ├── eks
    │       │   ├── kubeconfig
    │       │   └── terragrunt.hcl
    │       ├── eks-addons
    │       │   └── terragrunt.hcl
    │       └── vpc
    │           └── terragrunt.hcl
    └── region_values.yaml</code>
Observer cluster runs Grafana, Prometheus, and Thanos components; observee cluster runs a minimal stack.
<code>kubectl -n monitoring get pods
NAME                                                READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0   2/2     Running   0          120m
kube-prometheus-stack-grafana-c8768466b-rd8wm       2/2     Running   0          120m
... (additional pod list) ...
thanos-query-7c74db546c-d7bp8                       1/1     Running   0          12m
thanos-storegateway-0                               1/1     Running   0          119m</code>
Verification
The logs of the TLS querier show the observee cluster's sidecar being added to its store set as a remote Store API; port-forwarding to the querier then confirms that queries work.
<code>level=info ts=2021-02-23T15:37:35.692346206Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=thanos-sidecar.thanos.teks-tg.clusterfrak-dynamics.io:443 extLset="{cluster=\"pio-thanos-observee\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-0\"}"</code>
Grafana Visualization
Grafana dashboards can now query across clusters, providing a unified view of Kubernetes metrics.
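The same cross-cluster view is available outside Grafana, because Thanos Query exposes the Prometheus-compatible HTTP API (`/api/v1/query`). The sketch below assumes the querier has been port-forwarded to `http://localhost:9090` and relies on the `cluster` external label seen in the querier logs. The address, the helper names, and the example PromQL expression are assumptions, not taken from the demo repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def values_by_cluster(response: dict) -> dict[str, float]:
    """Map the `cluster` label of each result series to its sample value."""
    return {
        item["metric"].get("cluster", ""): float(item["value"][1])
        for item in response["data"]["result"]
    }


def query(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run an instant query and return the decoded JSON response."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Assumes something like a kubectl port-forward to the querier on port 9090.
    result = query("http://localhost:9090", "count by (cluster) (up)")
    for cluster, targets in values_by_cluster(result).items():
        print(f"{cluster}: {targets:.0f} targets")
```

If both clusters are wired up correctly, the result should contain one entry per `cluster` label value.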
Conclusion
Thanos adds complexity but offers scalable, long‑term storage and multi‑cluster querying. The provided Terraform modules abstract much of the setup, and the solution can be adapted to other clouds.