
Implementing Multi‑Cluster Monitoring with Prometheus and Thanos on Kubernetes

This article explains the limitations of a standard Prometheus monitoring stack on Kubernetes and demonstrates how to migrate to a Thanos‑based solution for long‑term metric retention, reduced infrastructure cost, and scalable multi‑cluster observability using Terraform and cloud‑native components.


In this article, we examine the limitations of the standard Prometheus monitoring stack and show why moving to a Thanos-based stack improves metric retention and lowers overall infrastructure cost.

The standard Kubernetes Prometheus stack typically includes Prometheus for metric collection, Alertmanager for alert routing, and Grafana for visualization. However, this architecture faces scalability challenges as the number of clusters grows, and storing metric data on disk can become expensive.

To address these issues we explore several solutions:

Multiple Grafana data sources pointing to external Prometheus endpoints with TLS and basic authentication.

Prometheus federation for selective metric scraping.

Prometheus remote write (not covered in depth here).
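Of these alternatives, federation is the most commonly attempted. As a minimal sketch, a central Prometheus can scrape the /federate endpoint of each cluster's Prometheus; the target address and match[] selectors below are illustrative, not taken from the article's deployment:

```yaml
# Hypothetical federation job on a central Prometheus.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    # Keep the original labels from the federated Prometheus
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Selectively pull only the series we need
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"kube_.*"}'
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example.com:9090'
```

Federation keeps only the selected series centrally, which limits cardinality but also means the central Prometheus never holds the full picture — one of the motivations for Thanos below.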

We then introduce Thanos, an open‑source, highly available Prometheus system with long‑term storage capabilities. Thanos consists of several components that communicate via gRPC:

Thanos Sidecar runs alongside each Prometheus instance, uploading its two-hour TSDB blocks to object storage; with data persisted externally, Prometheus becomes effectively stateless.

Thanos Store acts as a gateway, translating Store API queries into reads against remote object storage.

Thanos Compactor deduplicates and down‑samples data in object storage, reducing storage costs.

Thanos Query provides a PromQL‑compatible endpoint that aggregates queries across multiple stores.

Thanos Query Frontend splits large queries into smaller ones and caches results.
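To sketch how these components fit together, a Thanos Query instance is pointed at every Store API it should fan out to — the local store gateway plus any remote sidecars. The container arguments below are illustrative; the service names and endpoints are assumptions consistent with this article's setup, not its exact configuration:

```yaml
# Illustrative container args for a Thanos Query deployment.
args:
  - query
  - --http-address=0.0.0.0:10902
  - --grpc-address=0.0.0.0:10901
  # Deduplicate series coming from HA Prometheus replicas
  - --query.replica-label=prometheus_replica
  # Local store gateway discovered via DNS SRV records
  - --store=dnssrv+_grpc._tcp.thanos-storegateway.monitoring.svc.cluster.local
  # Remote cluster's sidecar, reached over TLS through its ingress
  - --store=thanos-sidecar.thanos.example.com:443
```

Because every component speaks the same gRPC Store API, adding an observed cluster is just a matter of adding another --store endpoint.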

We demonstrate a multi-cluster deployment on AWS using two EKS clusters: an observer cluster and an observed cluster. The observer cluster runs the full monitoring stack with Grafana, while the observed cluster runs a minimal Prometheus/Thanos installation. Terraform modules wrapping the kube-prometheus-stack and Bitnami Thanos Helm charts provision the components.

├── env_tags.yaml
├── eu-west-1
│   ├── clusters
│   │   └── observer
│   │       ├── eks
│   │       │   ├── kubeconfig
│   │       │   └── terragrunt.hcl
│   │       ├── eks-addons
│   │       │   └── terragrunt.hcl
│   │       └── vpc
│   │           └── terragrunt.hcl
│   └── region_values.yaml
└── eu-west-3
    └── ...

Key Terraform snippets configure the kube-prometheus-stack with Thanos sidecar enabled, TLS certificates, and ingress settings for Grafana and Thanos components.

kube-prometheus-stack = {
  enabled = true
  thanos_sidecar_enabled = true
  extra_values = <<-EXTRA_VALUES
    grafana:
      deploymentStrategy:
        type: Recreate
      ingress:
        enabled: true
        annotations:
          kubernetes.io/ingress.class: nginx
          cert-manager.io/cluster-issuer: "letsencrypt"
        hosts:
          - grafana.thanos.example.com
        tls:
          - secretName: grafana.thanos.example.com
            hosts:
              - grafana.thanos.example.com
    prometheus:
      prometheusSpec:
        replicas: 1
        retention: 2d
        retentionSize: "10GB"
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ebs-sc
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
  EXTRA_VALUES
}
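On the observer cluster, the Bitnami Thanos chart is configured through the same module pattern. The block below is a hypothetical sketch rather than the repository's exact values: the bucket name, region endpoint, and remote sidecar address are placeholders, and the chart keys (objstoreConfig, query.stores, storegateway, compactor) follow the Bitnami Thanos chart's value layout.

```hcl
thanos = {
  enabled = true
  extra_values = <<-EXTRA_VALUES
    objstoreConfig: |-
      type: S3
      config:
        bucket: thanos-metrics-example      # placeholder bucket
        endpoint: s3.eu-west-1.amazonaws.com
    query:
      # Must match the external label used by the Prometheus replicas
      replicaLabel:
        - prometheus_replica
      # Remote sidecar exposed by the observed cluster's ingress
      stores:
        - thanos-sidecar.thanos.example.com:443
    storegateway:
      enabled: true
    compactor:
      enabled: true
  EXTRA_VALUES
}
```

The store gateway and compactor only need to run once, on the observer side, since they operate directly on the shared object storage bucket.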

After deploying, we verify the pods and ingresses in both clusters using kubectl -n monitoring get pods and kubectl -n monitoring get ingress. The Thanos querier's logs confirm that the new store APIs were added successfully:

level=info ts=2021-02-23T15:37:35.692346206Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=thanos-sidecar.thanos.example.com:443 extLset="{cluster=\"pio-thanos-observee\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-0\"}"

Port‑forward commands allow us to access the Thanos querier UI and Grafana dashboards, confirming that metrics from multiple clusters are aggregated and visualized correctly.
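As a sketch, the port-forwards look like the following; the service names are assumptions and should be adjusted to the release names in your cluster:

```shell
# Thanos querier UI (service name assumed)
kubectl -n monitoring port-forward svc/thanos-query 10902:10902 &
# Grafana, as deployed by kube-prometheus-stack (service name assumed)
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80 &
# Then browse http://localhost:10902 and http://localhost:3000
```

In the querier UI's Stores page, both the local store gateway and the observed cluster's sidecar should appear as healthy endpoints.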

In summary, Thanos provides a complex but powerful system for scalable, long‑term monitoring. The provided Terraform repository abstracts much of the complexity, especially the mTLS setup, and can be extended to other cloud providers.

For deeper exploration, refer to the official kube-thanos repository and its recommendations for cross‑cluster communication.

Tags: monitoring, cloud native, observability, Kubernetes, Prometheus, Terraform, Thanos
Written by

Java Architect Essentials

Committed to sharing quality articles and tutorials to help Java programmers progress from junior to mid-level to senior architect. We curate high-quality learning resources, interview questions, videos, and projects from across the internet to help you systematically improve your Java architecture skills. Follow and reply '1024' to get Java programming resources. Learn together, grow together.
