Migrating from Prometheus to Thanos for Scalable, Cost‑Effective Monitoring on Kubernetes
This article explains the limitations of a traditional Prometheus monitoring stack, shows how Thanos provides effectively unlimited long‑term retention via object storage at lower infrastructure cost, and walks through a complete multi‑cluster deployment on Kubernetes using Terraform and AWS.
In this article we examine the shortcomings of the classic Prometheus monitoring stack and why moving to a Thanos‑based stack can improve metric retention while reducing overall infrastructure cost.
The demo material referenced in the article is available in the linked GitHub repositories.
Kubernetes Prometheus Stack
When deploying Kubernetes infrastructure for our customers, we install a monitoring stack on each cluster. It typically consists of:
Prometheus – collects metrics
Alertmanager – sends alerts based on metric queries
Grafana – visualises dashboards
A simplified architecture diagram is shown below:
There are several practical concerns with this design:
Each cluster runs its own Grafana and set of dashboards, making maintenance cumbersome.
Prometheus stores metrics on local disks, forcing a trade‑off between storage size and retention period; long‑term storage on cloud block devices can become very expensive.
Running replication or sharding in production can double or quadruple storage requirements.
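The storage/retention trade-off above is governed by Prometheus's local retention settings. A sketch using kube‑prometheus‑stack values (the numbers are illustrative, not recommendations):

```yaml
# Illustrative kube-prometheus-stack values: local retention is bounded
# by both time and disk size, so longer history means bigger (costlier) volumes.
prometheus:
  prometheusSpec:
    retention: 15d          # delete blocks older than 15 days
    retentionSize: "50GB"   # or when the local TSDB exceeds 50 GB
    replicas: 2             # HA pair: storage requirements double
```

Whichever limit is hit first wins, so sizing persistent volumes for long retention quickly dominates cost.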
Possible solutions include:
Using a single Grafana instance with multiple data sources that point to external Prometheus endpoints secured with TLS.
Prometheus federation for modest metric volumes.
Prometheus remote write (implemented by Thanos receiver) – not covered in depth here.
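The single‑Grafana option from the list above can be sketched with a data source provisioning file (cluster names and hostnames are hypothetical):

```yaml
# Hypothetical Grafana provisioning file: one Grafana instance,
# one Prometheus data source per cluster, reached over TLS.
apiVersion: 1
datasources:
  - name: prometheus-cluster-a
    type: prometheus
    url: https://prometheus.cluster-a.example.com
    jsonData:
      tlsAuthWithCACert: true
  - name: prometheus-cluster-b
    type: prometheus
    url: https://prometheus.cluster-b.example.com
    jsonData:
      tlsAuthWithCACert: true
```

This centralises dashboards but still leaves each Prometheus with its own local, short‑lived storage.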
Thanos, It’s Here
Thanos is an open‑source, highly available Prometheus setup with long‑term storage capabilities. It stores data in object storage (e.g., S3, MinIO), providing effectively unlimited retention.
Thanos is composed of several components that communicate via gRPC:
Thanos Sidecar – runs alongside Prometheus and uploads its TSDB blocks to object storage every two hours (each time Prometheus cuts a new block), making Prometheus almost stateless.
Thanos Store – acts as a gateway that queries object storage and caches data locally.
Thanos Compactor – a singleton that down‑samples and compresses stored metrics to save space.
Thanos Query – the central query component exposing a PromQL‑compatible endpoint and dispatching queries to all stores.
Thanos Query Frontend – splits large queries into smaller ones and caches results.
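All components that touch the bucket share the same object storage configuration. A minimal sketch for S3, following the Thanos objstore format (the bucket name is an assumption):

```yaml
# Thanos objstore.config for an S3 bucket (bucket name is hypothetical)
type: S3
config:
  bucket: thanos-metrics-demo
  endpoint: s3.eu-west-1.amazonaws.com
  region: eu-west-1
```

The sidecar and compactor write to the bucket; the store gateway reads from it.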
Multi‑Cluster Architecture
There are many ways to deploy these components across multiple Kubernetes clusters; the article presents one example using two AWS EKS clusters (an observer and an observee) managed by the tEKS repository.
The directory layout of the demo repository is shown below:
.
├── env_tags.yaml
├── eu-west-1
│   ├── clusters
│   │   └── observer
│   │       ├── eks
│   │       │   ├── kubeconfig
│   │       │   └── terragrunt.hcl
│   │       ├── eks-addons
│   │       │   └── terragrunt.hcl
│   │       └── vpc
│   │           └── terragrunt.hcl
│   └── region_values.yaml
└── eu-west-3
    ├── clusters
    │   └── observee
    │       ├── cluster_values.yaml
    │       ├── eks
    │       │   ├── kubeconfig
    │       │   └── terragrunt.hcl
    │       ├── eks-addons
    │       │   └── terragrunt.hcl
    │       └── vpc
    │           └── terragrunt.hcl
    └── region_values.yaml

This DRY (Don't Repeat Yourself) infrastructure makes it easy to scale the number of AWS accounts, regions, and clusters.
The observer cluster runs the full monitoring stack (Prometheus, Grafana, Thanos sidecar) and uploads metrics to a dedicated bucket. TLS certificates are generated so that the sidecar trusts the observer's CA.
The observee cluster runs a minimal Prometheus/Thanos installation that is queried by the observer.
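The certificate handling mentioned above can be sketched with a cert‑manager Certificate for the sidecar's TLS endpoint (names and domain are assumptions; tEKS generates the real ones for you):

```yaml
# Hypothetical cert-manager Certificate exposing the observee's
# Thanos sidecar over TLS so the observer's querier can reach it.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: thanos-sidecar-tls
  namespace: monitoring
spec:
  secretName: thanos-sidecar-tls
  dnsNames:
    - thanos-sidecar.observee.example.com
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer
```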
Example Terraform configuration for the kube‑prometheus‑stack chart (observer cluster) is:
kube-prometheus-stack = {
  enabled                     = true
  allowed_cidrs               = dependency.vpc.outputs.private_subnets_cidr_blocks
  thanos_sidecar_enabled      = true
  thanos_bucket_force_destroy = true
  extra_values                = <<-EXTRA_VALUES
    grafana:
      deploymentStrategy:
        type: Recreate
      ingress:
        enabled: true
        annotations:
          kubernetes.io/ingress.class: nginx
          cert-manager.io/cluster-issuer: "letsencrypt"
        hosts:
          - grafana.${local.default_domain_suffix}
        tls:
          - secretName: grafana.${local.default_domain_suffix}
            hosts:
              - grafana.${local.default_domain_suffix}
      persistence:
        enabled: true
        storageClassName: ebs-sc
        accessModes: [ReadWriteOnce]
        size: 1Gi
    prometheus:
      prometheusSpec:
        replicas: 1
        retention: 2d
        retentionSize: "10GB"
        ruleSelectorNilUsesHelmValues: false
        serviceMonitorSelectorNilUsesHelmValues: false
        podMonitorSelectorNilUsesHelmValues: false
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ebs-sc
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
    EXTRA_VALUES
}

TLS querier and store‑gateway configuration, deployed on the observer cluster and targeting the observee:
thanos-tls-querier = {
  "observee" = {
    enabled                 = true
    default_global_requests = true
    default_global_limits   = false
    stores = [
      "thanos-sidecar.${local.default_domain_suffix}:443"
    ]
  }
}

thanos-storegateway = {
  "observee" = {
    enabled                 = true
    default_global_requests = true
    default_global_limits   = false
    bucket                  = "thanos-store-pio-thanos-observee"
    region                  = "eu-west-3"
  }
}

Thanos component deployment for the observee cluster (the query components are disabled there; only the compactor runs against the local bucket):
thanos = {
  enabled              = true
  bucket_force_destroy = true
  trusted_ca_content   = dependency.thanos-ca.outputs.thanos_ca
  extra_values         = <<-EXTRA_VALUES
    compactor:
      retentionResolution5m: 90d
    query:
      enabled: false
    queryFrontend:
      enabled: false
    storegateway:
      enabled: false
    EXTRA_VALUES
}

Running kubectl -n monitoring get pods on the observer cluster shows all Prometheus, Grafana, and Thanos components (sidecar, compactor, query, query‑frontend, store‑gateway, TLS querier). Similar commands on the observee cluster list its minimal set of pods.
Logs from the TLS querier confirm that it successfully adds the remote sidecar as a store endpoint. Port‑forwarding the TLS querier and the standard Thanos query demonstrates that both can query metrics from the observee cluster.
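As a sketch, that verification looks roughly like this (the service name and namespace are assumptions; this requires a live cluster):

```shell
# Port-forward the Thanos query HTTP endpoint (default HTTP port 10902)
kubectl -n monitoring port-forward svc/thanos-query 10902:10902 &

# Query a metric through the PromQL-compatible HTTP API; the result
# should include series labelled with the observee cluster
curl -s 'http://localhost:10902/api/v1/query?query=up'
```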
The final Grafana view shows the default Kubernetes dashboards working across the multi‑cluster setup.
Conclusion
Thanos is a complex system with many moving parts; this article only scratches the surface of its configuration. The tEKS repository provides a fairly complete AWS implementation that abstracts much of the complexity (especially mTLS) and is highly customizable. The terraform‑kubernetes‑addons modules can also be used independently, and future support for other cloud providers is planned. Feel free to open issues on the GitHub projects for assistance.
For deeper exploration, consult the official Thanos cross‑cluster TLS communication documentation and the kube‑thanos repository.