
Boost Kubernetes Monitoring: Why Switch from Prometheus to Thanos for Scalable, Cost‑Effective Metrics

This article explores the limitations of a Prometheus‑based monitoring stack and demonstrates how adopting a Thanos‑based architecture improves metric retention, enables multi‑cluster querying, and reduces overall infrastructure costs while providing a scalable, cloud‑native solution.

Efficient Ops

Introduction

In this article we examine the limitations of a Prometheus‑based monitoring stack and explain why moving to a Thanos‑based stack can improve metric retention and reduce overall infrastructure cost.

Demo resources are available at the links below.

https://github.com/particuleio/teks/tree/main/terragrunt/live/thanos

https://github.com/particuleio/terraform-kubernetes-addons/tree/main/modules/aws

Kubernetes Monitoring Stack

When we deploy Kubernetes for customers, the standard monitoring stack consists of Prometheus (metrics collection), Alertmanager (alert routing), and Grafana (dashboards).

[Figure: simplified architecture of the per-cluster monitoring stack]

Considerations

This architecture does not scale well as the number of clusters grows. Running multiple Grafana instances increases maintenance overhead. Storing metrics on local disks forces a trade-off between disk size and retention period, so long retention becomes expensive at scale.
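To make the storage and retention trade-off concrete, here is a rough sizing sketch. It uses the capacity rule of thumb from the Prometheus documentation (disk is roughly retention in seconds times ingested samples per second times bytes per sample, with about 1 to 2 bytes per sample). The ingestion rate and retention periods are illustrative assumptions, not measurements from the demo clusters.

```python
def prometheus_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate local TSDB size with the Prometheus capacity rule of thumb.

    bytes_per_sample is an assumption; Prometheus typically needs 1 to 2 bytes.
    """
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    # Hypothetical cluster ingesting 100,000 samples per second.
    for days in (15, 90, 365):
        size = to_gib(prometheus_disk_bytes(days, 100_000))
        print(f"{days:>3} days of retention: about {size:,.0f} GiB per Prometheus replica")
```

The estimate grows linearly with retention and applies to every Prometheus replica in every cluster, which is why keeping long history on local disks gets expensive quickly.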

Solutions

Multiple Grafana data sources – expose Prometheus endpoints externally and add them as data sources to a single Grafana, securing with TLS or basic authentication.

Prometheus federation – have a central Prometheus scrape selected metrics from the other Prometheus instances; this is workable only when the volume of federated metrics is low.

Prometheus remote write – not covered in detail here; push‑based metrics are a separate topic.
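As an illustration of the federation option, the sketch below builds the URL a central Prometheus would scrape. The /federate endpoint and its match[] parameter are part of Prometheus; the hostname and the selectors are hypothetical.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that returns only the series matching the selectors."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical externally exposed Prometheus in a remote cluster.
    url = federate_url(
        "https://prometheus.observee.example.com",
        ['{job="kubelet"}', '{__name__=~"job:.*"}'],
    )
    print(url)
```

In a real setup the same selectors would go into the params of a scrape job with metrics_path set to /federate and honor_labels enabled. Keeping the selector list narrow is what keeps the federated volume low.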

Thanos, It’s Here

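Thanos is an open‑source, highly available Prometheus setup with long‑term storage capabilities. It keeps metrics in object storage (e.g., S3): the Thanos sidecar running next to Prometheus uploads each TSDB block once Prometheus completes it, roughly every two hours, which makes Prometheus almost stateless. Only the most recent data that has not been uploaded yet still lives on the local Prometheus disk.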

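Thanos components communicate via gRPC and include:

Thanos Sidecar – runs alongside each Prometheus instance, exposes its recent data to queriers, and uploads completed TSDB blocks to object storage.

Thanos Store – serves historical metrics from the blocks held in object storage (the thanos-storegateway pod in the demo).

Thanos Query – implements the Prometheus query API, fans queries out to sidecars and stores, and deduplicates results from replicated Prometheus instances.

Thanos Compactor – compacts and downsamples the blocks in object storage and applies retention.

Thanos Query Frontend – sits in front of Thanos Query and splits and caches queries to speed up long‑range requests.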
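To give a feel for what Thanos Query does with data coming from several stores, here is a deliberately simplified sketch of replica deduplication. It is not the Thanos algorithm, which chooses between replicas more carefully; it only shows the idea of dropping a replica label and merging the series that remain identical. The series layout is a hypothetical in-memory structure, and prometheus_replica is the replica label used by kube-prometheus-stack.

```python
def deduplicate(series_list, replica_label="prometheus_replica"):
    """Merge series that differ only by the replica label.

    Each series is a dict: {"labels": {...}, "samples": [(timestamp, value), ...]}.
    """
    merged = {}
    for series in series_list:
        # Identity of a series once the replica label is ignored.
        labels = {k: v for k, v in series["labels"].items() if k != replica_label}
        key = tuple(sorted(labels.items()))
        bucket = merged.setdefault(key, {"labels": labels, "samples": {}})
        for ts, value in series["samples"]:
            # Keep the first value seen per timestamp; later replicas only fill gaps.
            bucket["samples"].setdefault(ts, value)
    return [
        {"labels": b["labels"], "samples": sorted(b["samples"].items())}
        for b in merged.values()
    ]


if __name__ == "__main__":
    replicas = [
        {"labels": {"__name__": "up", "cluster": "observee", "prometheus_replica": "prometheus-0"},
         "samples": [(0, 1.0), (30, 1.0)]},
        {"labels": {"__name__": "up", "cluster": "observee", "prometheus_replica": "prometheus-1"},
         "samples": [(30, 1.0), (60, 1.0)]},
    ]
    for series in deduplicate(replicas):
        print(series)
```

Because a gap in one replica is filled from the other, a highly available Prometheus pair appears as a single continuous series to whoever queries it.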

Multi‑Cluster Architecture

We deploy two EKS clusters (observer and observee) using the official kube‑prometheus‑stack and Bitnami Thanos charts. The repository provides a DRY Terraform layout that can scale across AWS accounts, regions, and clusters.

<code>.
├── env_tags.yaml
├── eu-west-1
│  ├── clusters
│  │  └── observer
│  │      ├── eks
│  │      │  ├── kubeconfig
│  │      │  └── terragrunt.hcl
│  │      ├── eks-addons
│  │      │  └── terragrunt.hcl
│  │      └── vpc
│  │          └── terragrunt.hcl
│  └── region_values.yaml
└── eu-west-3
   ├── clusters
   │  └── observee
   │      ├── cluster_values.yaml
   │      ├── eks
   │      │  ├── kubeconfig
   │      │  └── terragrunt.hcl
   │      ├── eks-addons
   │      │  └── terragrunt.hcl
   │      └── vpc
   │          └── terragrunt.hcl
   └── region_values.yaml</code>

The observer cluster runs Grafana, Prometheus, and the Thanos components; the observee cluster runs a minimal stack.

<code>kubectl -n monitoring get pods
NAME                                                READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0   2/2     Running   0          120m
kube-prometheus-stack-grafana-c8768466b-rd8wm       2/2     Running   0          120m
... (additional pod list) ...
thanos-query-7c74db546c-d7bp8                       1/1     Running   0          12m
thanos-storegateway-0                               1/1     Running   0          119m</code>

Verification

The logs of the TLS querier show it adding the remote store exposed by the observee cluster, and port‑forwarding to the querier confirms that the metrics can be queried.

<code>level=info ts=2021-02-23T15:37:35.692346206Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=thanos-sidecar.thanos.teks-tg.clusterfrak-dynamics.io:443 extLset="{cluster=\"pio-thanos-observee\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-0\"}"</code>
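When checking many such lines, it can help to extract the store address and its external labels programmatically. The following is a minimal sketch of a logfmt parser applied to the log line above. The regular expressions cover only the shapes seen in this line and are not a complete logfmt implementation.

```python
import re

# key=value pairs where the value is either a quoted string or a bare token.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')
# label="value" pairs inside an external label set such as {cluster="x", ...}.
LABEL = re.compile(r'(\w+)="([^"]*)"')

LINE = (
    r'level=info ts=2021-02-23T15:37:35.692346206Z caller=storeset.go:387 '
    r'component=storeset msg="adding new storeAPI to query storeset" '
    r'address=thanos-sidecar.thanos.teks-tg.clusterfrak-dynamics.io:443 '
    r'extLset="{cluster=\"pio-thanos-observee\", '
    r'prometheus=\"monitoring/kube-prometheus-stack-prometheus\", '
    r'prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-0\"}"'
)


def parse_logfmt(line):
    """Return the key/value fields of one logfmt line as a dict."""
    fields = {}
    for key, raw in PAIR.findall(line):
        if raw.startswith('"'):
            # Strip the surrounding quotes and unescape embedded quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def parse_label_set(text):
    """Return the labels of an external label set as a dict."""
    return dict(LABEL.findall(text))


if __name__ == "__main__":
    fields = parse_logfmt(LINE)
    labels = parse_label_set(fields["extLset"])
    print(fields["address"], "->", labels["cluster"])
```

The cluster label in extLset is what lets the querier, and later Grafana, tell the series of one cluster from those of another.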

Grafana Visualization

Grafana dashboards can now query across clusters, providing a unified view of Kubernetes metrics.
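Thanos Query exposes the Prometheus-compatible HTTP API, so the same cross-cluster queries that Grafana issues can also be run by hand. The sketch below builds an instant query URL and sums a result per cluster label. The base URL assumes a local port-forward to the querier on port 9090, and the sample response and the observer cluster name are made up for illustration; only the response layout follows the Prometheus HTTP API.

```python
from collections import defaultdict
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build an instant query URL for a Prometheus-compatible API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def sum_by_cluster(response):
    """Sum the values of an instant vector result per 'cluster' label."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    totals = defaultdict(float)
    for item in response["data"]["result"]:
        cluster = item["metric"].get("cluster", "unknown")
        _, value = item["value"]  # [timestamp, "value as string"]
        totals[cluster] += float(value)
    return dict(totals)


# Illustrative response only; not captured from the demo clusters.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "pio-thanos-observee", "job": "kubelet"}, "value": [1614094655, "3"]},
            {"metric": {"cluster": "pio-thanos-observer", "job": "kubelet"}, "value": [1614094655, "2"]},
            {"metric": {"cluster": "pio-thanos-observee", "job": "node-exporter"}, "value": [1614094655, "1"]},
        ],
    },
}

if __name__ == "__main__":
    # Assumed port-forward target; adjust to the actual service and port.
    print(instant_query_url("http://localhost:9090", "up"))
    print(sum_by_cluster(SAMPLE_RESPONSE))
```

A Grafana dashboard would typically do the same grouping in PromQL, for example by aggregating on the cluster label, rather than in client code.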

Conclusion

Thanos adds complexity but offers scalable, long‑term storage and multi‑cluster querying. The provided Terraform modules abstract much of the setup, and the solution can be adapted to other clouds.
