Implementing Multi‑Cluster Monitoring with Prometheus and Thanos on Kubernetes
This article explains the limitations of a standard Prometheus monitoring stack on Kubernetes and demonstrates how to migrate to a Thanos‑based solution for long‑term metric retention, reduced infrastructure cost, and scalable multi‑cluster observability using Terraform and cloud‑native components.
In this article we examine the limitations of the Prometheus monitoring stack and why moving to a Thanos‑based stack can improve metric retention and lower overall infrastructure costs.
The standard Kubernetes Prometheus stack typically includes Prometheus for metric collection, Alertmanager for alert routing, and Grafana for visualization. However, this architecture faces scalability challenges as the number of clusters grows, and storing metric data on disk can become expensive.
To address these issues we explore several solutions:
Multiple Grafana data sources pointing to external Prometheus endpoints with TLS and basic authentication.
Prometheus federation for selective metric scraping.
Prometheus remote write (not covered in depth here).
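As a rough illustration of the federation option, the fragment below sketches a scrape job on a central Prometheus that pulls a selected subset of series from another cluster's /federate endpoint over HTTPS with basic authentication. The target hostname, the username, the password file path, and the match[] selectors are placeholders chosen for this example, not values from the original setup.

```yaml
# prometheus.yml fragment on the central (federating) Prometheus.
# Hostname, credentials and selectors below are illustrative assumptions.
scrape_configs:
  - job_name: federate
    scrape_interval: 30s
    honor_labels: true            # keep labels as exposed by the source Prometheus
    metrics_path: /federate
    scheme: https
    params:
      'match[]':                  # only series matching these selectors are pulled
        - '{job="kubelet"}'
        - '{__name__=~"job:.*"}'
    basic_auth:
      username: federation
      password_file: /etc/prometheus/secrets/federation-password
    static_configs:
      - targets:
          - prometheus.observed.example.com:443
```

Because only the selected series are copied, federation suits aggregated or pre-recorded metrics better than a full mirror of every cluster.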
We then introduce Thanos, an open‑source, highly available Prometheus system with long‑term storage capabilities. Thanos consists of several components that communicate via gRPC:
Thanos Sidecar runs alongside Prometheus, uploading metrics to object storage every two hours, making Prometheus effectively stateless.
Thanos Store acts as a gateway, translating queries into requests against remote object storage.
Thanos Compactor compacts and down‑samples data in object storage (deduplication is optional), reducing storage costs.
Thanos Query provides a PromQL‑compatible endpoint that aggregates queries across multiple stores.
Thanos Query Frontend splits large queries into smaller ones and caches results.
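The Sidecar, Store, and Compactor components all need to know which bucket to read from or write to. A minimal sketch of a Thanos object storage configuration for S3 is shown below; the bucket name and region are assumptions made for this example, and credentials are deliberately left out on the assumption that the pods obtain them from the environment (for instance through an IAM role).

```yaml
# Thanos object storage configuration, typically passed via
# --objstore.config-file or --objstore.config.
# Bucket name and region are illustrative placeholders.
type: S3
config:
  bucket: thanos-metrics-example
  endpoint: s3.eu-west-1.amazonaws.com
  region: eu-west-1
```

Pointing every component at the same configuration lets the Sidecar upload blocks that the Store can later serve and the Compactor can process.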
We demonstrate a multi‑cluster deployment on AWS using two EKS clusters (an observer cluster and an observed cluster). The observer cluster runs the full monitoring stack with Grafana, while the observed cluster runs a minimal Prometheus/Thanos installation. The components are provisioned with Terraform modules that deploy the kube-prometheus-stack and Bitnami Thanos charts.
├── env_tags.yaml
├── eu-west-1
│   ├── clusters
│   │   └── observer
│   │       ├── eks
│   │       │   ├── kubeconfig
│   │       │   └── terragrunt.hcl
│   │       ├── eks-addons
│   │       │   └── terragrunt.hcl
│   │       └── vpc
│   │           └── terragrunt.hcl
│   └── region_values.yaml
└── eu-west-3
    └── ...
Key Terraform snippets configure the kube-prometheus-stack with Thanos sidecar enabled, TLS certificates, and ingress settings for Grafana and Thanos components.
kube-prometheus-stack = {
  enabled                = true
  thanos_sidecar_enabled = true
  extra_values           = <<-EXTRA_VALUES
    grafana:
      deploymentStrategy:
        type: Recreate
      ingress:
        enabled: true
        annotations:
          kubernetes.io/ingress.class: nginx
          cert-manager.io/cluster-issuer: "letsencrypt"
        hosts:
          - grafana.thanos.example.com
        tls:
          - secretName: grafana.thanos.example.com
            hosts:
              - grafana.thanos.example.com
    prometheus:
      prometheusSpec:
        replicas: 1
        retention: 2d
        retentionSize: "10GB"
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ebs-sc
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
    EXTRA_VALUES
}
After deploying, we verify the pods and ingresses in both clusters using kubectl -n monitoring get pods and kubectl -n monitoring get ingress. Logs from the TLS querier show successful addition of new store APIs.
level=info ts=2021-02-23T15:37:35.692346206Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=thanos-sidecar.thanos.example.com:443 extLset="{cluster=\"pio-thanos-observee\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-0\"}"
Port‑forward commands allow us to access the Thanos querier UI and Grafana dashboards, confirming that metrics from multiple clusters are aggregated and visualized correctly.
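For Grafana to show the aggregated view, it needs a Prometheus-type data source that points at Thanos Query (or the Query Frontend) rather than at a single Prometheus. One possible way to express that in kube-prometheus-stack values is sketched below. The data source name and the in-cluster service URL are assumptions for illustration; the actual service name and port depend on how the Thanos chart was installed, and the Terraform module used in the article may already wire this up.

```yaml
# kube-prometheus-stack values fragment (could be placed under extra_values).
# The service name and port are hypothetical; check the Thanos release in
# the monitoring namespace for the real ones.
grafana:
  additionalDataSources:
    - name: Thanos
      type: prometheus          # Thanos Query exposes a Prometheus-compatible API
      access: proxy
      url: http://thanos-query-frontend.monitoring.svc.cluster.local:9090
      isDefault: false
```

With such a data source selected, a dashboard query can filter or group by the cluster label seen in the extLset of the log line above.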
In summary, Thanos provides a complex but powerful system for scalable, long‑term monitoring. The provided Terraform repository abstracts much of the complexity, especially the mTLS setup, and can be extended to other cloud providers.
For deeper exploration, refer to the official kube-thanos repository and its recommendations for cross‑cluster communication.