
Migrating from Prometheus to Thanos for Scalable, Cost‑Effective Monitoring on Kubernetes

This article explains the limitations of a traditional Prometheus monitoring stack, shows how Thanos provides effectively unlimited long‑term metric storage at lower infrastructure cost, and walks through a complete multi‑cluster deployment on Kubernetes using Terraform and AWS.

Java Captain

In this article we examine the shortcomings of the classic Prometheus monitoring stack and why moving to a Thanos‑based stack can improve metric retention while reducing overall infrastructure cost.

The demo material referenced in the article is available in the linked GitHub repositories.

Kubernetes Prometheus Stack

When deploying Kubernetes infrastructure for our customers, a monitoring stack is installed on each cluster. The stack typically consists of:

Prometheus – scrapes and stores metrics

Alertmanager – deduplicates and routes the alerts fired by Prometheus alerting rules

Grafana – visualises the metrics in dashboards
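Alerting in this stack is declarative: Prometheus evaluates rules and hands firing alerts to Alertmanager for routing. A minimal, illustrative rule file (the rule name and threshold are examples, not from the article):

```yaml
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up == 0          # PromQL: target failed its last scrape
        for: 5m                # must be failing for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
```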

A simplified architecture diagram is shown below:

There are several practical concerns with this design:

Each cluster runs its own Grafana and set of dashboards, making maintenance cumbersome.

Prometheus stores metrics on local disks, forcing a trade‑off between storage size and retention period; long‑term storage on cloud block devices can become very expensive.

Running replication or sharding in production can double or quadruple storage requirements.
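The local-disk trade-off can be quantified with the capacity rule of thumb from the Prometheus documentation: needed disk ≈ retention time × ingested samples per second × bytes per sample (roughly 1–2 bytes). A quick Python sketch (the 100k samples/s workload is an illustrative figure):

```python
def prometheus_disk_bytes(retention_days: float,
                          samples_per_sec: float,
                          bytes_per_sample: float = 2.0) -> float:
    """Rough local-storage estimate from the Prometheus capacity rule of thumb."""
    return retention_days * 86_400 * samples_per_sec * bytes_per_sample

# 100k samples/s retained for 30 days at ~2 bytes/sample
print(prometheus_disk_bytes(30, 100_000) / 1e9, "GB")  # ~518 GB
```

Doubling that for a replicated pair, before even considering sharding, is what makes block storage the dominant cost.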

Possible solutions include:

Using a single Grafana instance with multiple data sources that point to external Prometheus endpoints secured with TLS.

Prometheus federation for modest metric volumes.

Prometheus remote write (implemented by Thanos receiver) – not covered in depth here.
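For modest volumes, federation lets a central Prometheus scrape selected, pre-aggregated series from per-cluster instances via their /federate endpoint. A sketch of the scrape configuration (target hostnames and the match selector are placeholders):

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true             # keep the originating cluster's labels
    metrics_path: /federate
    params:
      "match[]":
        - '{job="kubernetes-nodes"}'   # only pull the selected series
    static_configs:
      - targets:
          - prometheus.cluster-a.example.com:9090
```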

Thanos, It’s Here

Thanos is an open‑source set of components that extends Prometheus into a highly available monitoring system with long‑term storage capabilities. It persists metrics in object storage (e.g., AWS S3 or MinIO), providing effectively unlimited retention.

Thanos is composed of several components that communicate via gRPC:

Thanos Sidecar – runs alongside Prometheus and uploads its two‑hour TSDB blocks to object storage, making Prometheus itself almost stateless.

Thanos Store – acts as a gateway that queries object storage and caches data locally.

Thanos Compactor – a singleton that down‑samples and compresses stored metrics to save space.

Thanos Query – the central query component exposing a PromQL‑compatible endpoint and dispatching queries to all stores.

Thanos Query Frontend – splits large queries into smaller ones and caches results.
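Every component that touches object storage (sidecar, store, compactor) shares the same bucket configuration, typically mounted from a Kubernetes secret. A minimal S3-style example (bucket name and endpoint are placeholders):

```yaml
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.eu-west-1.amazonaws.com
  # credentials can come from the environment or an IAM role
  # instead of access/secret keys embedded here
```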

Multi‑Cluster Architecture

There are many ways to deploy these components across multiple Kubernetes clusters; the article presents one example using two AWS EKS clusters (an observer and an observee) managed by the tEKS repository.

The directory layout of the demo repository is shown below:

.
├── env_tags.yaml
├── eu-west-1
│   ├── clusters
│   │   └── observer
│   │       ├── eks
│   │       │   ├── kubeconfig
│   │       │   └── terragrunt.hcl
│   │       ├── eks-addons
│   │       │   └── terragrunt.hcl
│   │       └── vpc
│   │           └── terragrunt.hcl
│   └── region_values.yaml
└── eu-west-3
    ├── clusters
    │   └── observee
    │       ├── cluster_values.yaml
    │       ├── eks
    │       │   ├── kubeconfig
    │       │   └── terragrunt.hcl
    │       ├── eks-addons
    │       │   └── terragrunt.hcl
    │       └── vpc
    │           └── terragrunt.hcl
    └── region_values.yaml

This DRY (Don’t Repeat Yourself) infrastructure makes it easy to scale the number of AWS accounts, regions, and clusters.
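Each terragrunt.hcl in the tree stays small by pulling shared configuration from parent folders. A hypothetical sketch of the pattern (the module source and input names are illustrative, not copied from the demo repository):

```hcl
# Hypothetical terragrunt.hcl for one cluster layer.
include {
  path = find_in_parent_folders()
}

terraform {
  source = "github.com/particuleio/terraform-kubernetes-addons//modules/aws"
}

inputs = {
  cluster-name = "observer"
}
```

The env/region/cluster values files referenced in the tree are merged into these inputs, so adding a region or cluster mostly means adding a folder.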

Observer cluster runs the full monitoring stack (Prometheus, Grafana, Thanos sidecar) and uploads metrics to a dedicated bucket. TLS certificates are generated so that the sidecar trusts the observer’s CA.

Observee cluster runs a minimal Prometheus/Thanos installation that is queried by the observer.

Example Terraform configuration for the kube‑prometheus‑stack chart (observer cluster) is:

kube-prometheus-stack = {
  enabled                 = true
  allowed_cidrs           = dependency.vpc.outputs.private_subnets_cidr_blocks
  thanos_sidecar_enabled  = true
  thanos_bucket_force_destroy = true
  extra_values = <<-EXTRA_VALUES
    grafana:
      deploymentStrategy:
        type: Recreate
      ingress:
        enabled: true
        annotations:
          kubernetes.io/ingress.class: nginx
          cert-manager.io/cluster-issuer: "letsencrypt"
        hosts:
          - grafana.${local.default_domain_suffix}
        tls:
          - secretName: grafana.${local.default_domain_suffix}
            hosts:
              - grafana.${local.default_domain_suffix}
      persistence:
        enabled: true
        storageClassName: ebs-sc
        accessModes: [ReadWriteOnce]
        size: 1Gi
    prometheus:
      prometheusSpec:
        replicas: 1
        retention: 2d
        retentionSize: "10GB"
        ruleSelectorNilUsesHelmValues: false
        serviceMonitorSelectorNilUsesHelmValues: false
        podMonitorSelectorNilUsesHelmValues: false
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ebs-sc
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
  EXTRA_VALUES
}
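Under the hood, thanos_sidecar_enabled roughly corresponds to setting the chart's prometheusSpec.thanos values and wiring in an object-storage secret. A hedged sketch of the equivalent raw Helm values (the secret name is illustrative, and the exact schema varies between chart versions):

```yaml
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        name: thanos-objstore    # secret holding the bucket configuration
        key: thanos.yaml         # key inside the secret
```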

TLS querier and store‑gateway configuration, deployed on the observer cluster to reach the observee:

thanos-tls-querier = {
  "observee" = {
    enabled                 = true
    default_global_requests = true
    default_global_limits   = false
    stores = ["thanos-sidecar.${local.default_domain_suffix}:443"]
  }
}

thanos-storegateway = {
  "observee" = {
    enabled                 = true
    default_global_requests = true
    default_global_limits   = false
    bucket                  = "thanos-store-pio-thanos-observee"
    region                  = "eu-west-3"
  }
}

Thanos component deployment for the observer cluster:

thanos = {
  enabled = true
  bucket_force_destroy = true
  trusted_ca_content = dependency.thanos-ca.outputs.thanos_ca
  extra_values = <<-EXTRA_VALUES
    compactor:
      retentionResolution5m: 90d
    query:
      enabled: false
    queryFrontend:
      enabled: false
    storegateway:
      enabled: false
  EXTRA_VALUES
}
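The retentionResolution5m setting controls how long the compactor keeps 5‑minute down‑sampled blocks. A back‑of‑the‑envelope Python sketch of why down‑sampling saves space (the 15s scrape interval is an assumption, and Thanos actually keeps several aggregates per down‑sampled window, so the real reduction is smaller than this ratio):

```python
SECONDS_PER_DAY = 86_400

def samples_per_day(resolution_seconds: int) -> int:
    """Sample slots per series per day at a given resolution."""
    return SECONDS_PER_DAY // resolution_seconds

raw = samples_per_day(15)           # 15s scrape interval
downsampled = samples_per_day(300)  # 5m down-sampled resolution
print(raw, downsampled, raw // downsampled)  # 20x fewer sample slots
```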

Running kubectl -n monitoring get pods on the observer cluster shows all Prometheus, Grafana, and Thanos components (sidecar, compactor, query, query‑frontend, store‑gateway, TLS querier). Similar commands on the observee cluster list its minimal set of pods.

Logs from the TLS querier confirm that it successfully adds the remote sidecar as a store endpoint. Port‑forwarding the TLS querier and the standard Thanos query demonstrates that both can query metrics from the observee cluster.
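Both queriers expose the Prometheus-compatible HTTP API, so after a kubectl port-forward you can hit /api/v1/query directly. A small Python helper for building such requests (the host, port, and PromQL expression are illustrative):

```python
from urllib.parse import urlencode

def build_query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

# e.g. against a port-forwarded Thanos Query component
print(build_query_url("http://localhost:10902", 'up{cluster="observee"}'))
```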

The final Grafana view shows the default Kubernetes dashboards working across the multi‑cluster setup.

Conclusion

Thanos is a complex system with many moving parts; this article only scratches the surface of its configuration. The tEKS repository provides a fairly complete AWS implementation that abstracts much of the complexity (especially mTLS) and is highly customizable. The terraform‑kubernetes‑addons modules can also be used independently, and future support for other cloud providers is planned. Feel free to open issues on the GitHub projects for assistance.

For deeper exploration, consult the official Thanos cross‑cluster TLS communication documentation and the kube‑thanos repository.

Tags: monitoring, cloud native, observability, Kubernetes, Prometheus, Terraform, Thanos
Written by Java Captain

Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.
