Practical Prometheus on Kubernetes: Exporters, Scaling & Tips
This article shares practical experiences and best‑practice guidelines for using Prometheus in Kubernetes environments, covering version selection, inherent limitations, common exporters, Grafana dashboards, metric selection principles, multi‑cluster scraping, GPU monitoring, timezone handling, memory and storage planning, and alerting considerations.
Monitoring is a mature field with a long history; Prometheus, a new-generation open-source monitoring system, has become the de-facto standard of the cloud-native stack, which speaks to the popularity of its design.
The article shares problems and thoughts encountered in Prometheus practice; readers unfamiliar with K8s monitoring or Prometheus design can first refer to the container‑monitoring series.
Key Principles:
Monitoring is infrastructure; its goal is to solve problems. Avoid over-collecting metrics you do not need; they waste manpower and storage (B2B commercial monitoring products are an exception).
Only emit alerts that need to be handled, and every emitted alert must be addressed.
The simplest architecture is the best. When the business system crashes, monitoring must stay up. Avoid "magic" systems such as ML-based thresholds or auto-remediation, as Google SRE advises.
1. Version Selection
At the time of writing, the latest Prometheus version is 2.16; the project evolves rapidly, so use the newest version and ignore the 1.x series.
Version 2.16 includes an experimental UI that shows TSDB status, including top‑10 labels and metrics.
2. Limitations of Prometheus
Prometheus is metric‑based and does not handle logs, events, or tracing.
It uses a pull model by default; plan the network accordingly and avoid forwarding.
For clustering and horizontal scaling, there is no silver bullet; choose between federation, Cortex, Thanos, etc., wisely.
Monitoring systems usually favor availability over consistency, tolerating some data loss to ensure query success.
Prometheus does not guarantee data accuracy: functions like <code>rate</code> and <code>histogram_quantile</code> perform statistical inference that can produce counter-intuitive results, and long query ranges require down-sampling, which reduces precision.
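As a concrete illustration of this inference (the metric name is a typical histogram example, not from any specific system):

```promql
# Estimates the 99th-percentile latency by linearly interpolating inside
# the bucket that contains the quantile -- the result is an inference,
# not an observed value, and can differ noticeably from the true p99.
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```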
3. Common Exporters in a K8s Cluster
Prometheus, a CNCF project, has a rich exporter ecosystem, unlike traditional agent‑based monitoring such as Zabbix. Exporters can be official or community‑maintained, and you can also write custom exporters.
However, the openness brings selection and trial‑and‑error costs. Maintaining many exporters, especially during upgrades, can be painful, and unofficial exporters may contain bugs.
Typical exporters we use:
cAdvisor (built into kubelet)
kubelet (port 10255 unauthenticated, 10250 authenticated)
apiserver (port 6443, monitor request count, latency, etc.)
scheduler (port 10251)
controller‑manager (port 10252)
etcd (track write/read latency, storage capacity)
docker (enable the experimental feature and set <code>metrics-addr</code> to expose metrics such as container creation time)
kube‑proxy (default 127.0.0.1, port 10249; can expose on 0.0.0.0 for external scraping)
kube‑state‑metrics (official K8s project, collects pod, deployment metadata)
node‑exporter (official, gathers CPU, memory, disk metrics)
blackbox_exporter (network probing: DNS, ping, HTTP)
process‑exporter (process‑level metrics)
nvidia exporter (GPU monitoring)
node‑problem‑detector (reports node health, adds taints)
Application exporters: MySQL, Nginx, MQ, etc., based on business needs.
Custom exporters can also be built for specific scenarios such as log extraction.
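As an example of wiring one of these up, a minimal blackbox_exporter job might look like the following (the exporter address <code>blackbox-exporter:9115</code>, the probe target, and the module name are placeholders):

```yaml
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]          # module defined in blackbox.yml
  static_configs:
    - targets:
        - https://example.com   # site to probe
  relabel_configs:
    # pass the target URL as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # keep the probed URL as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # actually scrape the exporter, not the probed site
    - target_label: __address__
      replacement: blackbox-exporter:9115
```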
4. Monitoring Core K8s Components & Grafana Dashboards
Key components like kubelet and apiserver can be visualized in Grafana using the metrics from the exporters above.
Dashboards can be based on <code>dashboards-for-kubernetes-administrators</code> and tuned continuously, together with the alert thresholds.
Grafana supports templating for multi‑level dropdowns, but currently does not support templated alert rules (see related issue).
<code>It would be great to add templates support in alerts. Otherwise the feature looks useless a bit.</code>
5. All‑In‑One Collection Component
Exporters are independent; many exporters increase operational overhead, especially for resource control and version upgrades. Two approaches to combine them:
Launch N exporter processes from a main process; still follow community updates.
Use Telegraf to handle various input types (N‑in‑1).
Node‑exporter does not monitor processes; process‑exporter or Telegraf's <code>procstat</code> input can fill that gap.
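A minimal Telegraf configuration for the second approach might look like this sketch (the process pattern and listen port are placeholders):

```toml
# Collect per-process metrics for processes matching a pattern
[[inputs.procstat]]
  pattern = "nginx"

# Expose everything in Prometheus exposition format for scraping
[[outputs.prometheus_client]]
  listen = ":9273"
```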
6. Choosing Golden Metrics
Google's SRE book defines four golden signals: latency, traffic, errors, and saturation. In practice, apply the USE method to resources and the RED method to services.
USE method: Utilization, Saturation, Errors (e.g., cAdvisor data).
RED method: Rate, Errors, Duration (e.g., apiserver performance metrics).
Monitored services fall into three categories:
Online services (web servers, databases): monitor request rate, latency, and error rate (RED).
Offline services (log processing, message queues): monitor queue length, in-flight count, processing speed, and errors (USE).
Batch jobs (CI, K8s jobs/cronjobs): monitor duration and error count; short-lived tasks usually push results through the Pushgateway.
See “Container Monitoring Practice – Common K8s Metrics” for concrete examples.
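For short-lived batch jobs, a push looks like the following (the Pushgateway hostname and metric name are placeholders for illustration):

```shell
# Push a metric to the Pushgateway when the batch run finishes;
# Prometheus then scrapes the Pushgateway on its regular schedule.
echo "backup_duration_seconds 42" | \
  curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/nightly_backup
```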
7. cAdvisor Label Compatibility in K8s 1.16
K8s 1.16 removed the <code>pod_name</code> and <code>container_name</code> labels from cAdvisor metrics, replacing them with <code>pod</code> and <code>container</code>. Adjust queries and Grafana panels accordingly, or use a relabel config to restore the original names:
<code>metric_relabel_configs:
  - source_labels: [container]
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace</code>
Use <code>metric_relabel_configs</code>, not <code>relabel_configs</code>, here: relabel_configs rewrites target labels before the scrape, while metric_relabel_configs rewrites the scraped samples.
8. Scraping External or Multi‑Cluster K8s with Prometheus
When Prometheus runs inside a cluster, the built-in YAML makes scraping easy. For an external deployment, certificates and tokens are required, and the addresses must be replaced. Example job for scraping cAdvisor via the apiserver proxy:
<code>- job_name: cluster-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: https://xx:6443
      role: node
      bearer_token_file: token/cluster.token
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - separator: ;
      regex: __meta_kubernetes_node_label_(.+)
      replacement: $1
      action: labelmap
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: xx:6443
      action: replace
    - source_labels: [__meta_kubernetes_node_name]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      action: replace
  metric_relabel_configs:
    - source_labels: [container]
      separator: ;
      regex: (.+)
      target_label: container_name
      replacement: $1
      action: replace
    - source_labels: [pod]
      separator: ;
      regex: (.+)
      target_label: pod_name
      replacement: $1
      action: replace</code>
For endpoint-type exporters (e.g., kube-state-metrics), the job looks like:
<code>- job_name: cluster-service-endpoints
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: https://xxx:6443
      role: endpoints
      bearer_token_file: token/cluster.token
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      separator: ;
      regex: "true"
      action: keep
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      separator: ;
      regex: (https?)
      target_label: __scheme__
      replacement: $1
      action: replace
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: xxx:6443
      action: replace
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_endpoints_name, __meta_kubernetes_service_annotation_prometheus_io_port]
      separator: ;
      regex: (.+);(.+);(.*)
      target_label: __metrics_path__
      replacement: /api/v1/namespaces/${1}/services/${2}:${3}/proxy/metrics
      action: replace
    - separator: ;
      regex: __meta_kubernetes_service_label_(.+)
      replacement: $1
      action: labelmap
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: kubernetes_namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: kubernetes_name
      replacement: $1
      action: replace</code>
Multiple clusters can be handled by duplicating such job definitions, typically with three job types:
<code>role: node</code> jobs (cAdvisor, node-exporter, kubelet, etc.), <code>role: endpoints</code> jobs (kube-state-metrics, custom exporters), and generic jobs for etcd, apiserver, and process metrics.
9. Obtaining GPU Metrics
nvidia-smi shows GPU resources; cAdvisor exposes container-level GPU metrics such as:
<code>container_accelerator_duty_cycle
container_accelerator_memory_total_bytes
container_accelerator_memory_used_bytes</code>
For more detailed GPU data, install the dcgm exporter (requires K8s 1.13+).
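A query over these metrics might look like the following sketch (the grouping label is <code>pod_name</code> here; on K8s 1.16+ with the new cAdvisor labels it would be <code>pod</code>):

```promql
# Average GPU utilization (duty cycle, 0-100) per pod over recent samples
avg by (pod_name) (container_accelerator_duty_cycle)
```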
10. Changing Prometheus Display Timezone
Prometheus stores timestamps as Unix time in UTC and supports neither configuring a timezone in its config file nor reading the host's <code>/etc/timezone</code>. Visualization tools like Grafana can perform the timezone conversion, and the newer 2.16 UI includes a "Local Timezone" option.
11. Scraping Metrics Behind a Load Balancer
If Prometheus can only reach the LB but not the backend ReplicaSet (RS), options include adding a sidecar proxy to the RS service, deploying a local proxy on the Prometheus host, or configuring the LB to forward specific paths (e.g., <code>/backend1</code>, <code>/backend2</code>) that Prometheus can scrape.
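The path-forwarding option can be sketched as two jobs against the same LB address (all hostnames and paths are hypothetical and must match the LB's routing rules):

```yaml
- job_name: behind-lb-replica1
  metrics_path: /backend1/metrics   # LB routes this path to replica 1
  static_configs:
    - targets: ['lb.example.com:80']
- job_name: behind-lb-replica2
  metrics_path: /backend2/metrics   # LB routes this path to replica 2
  static_configs:
    - targets: ['lb.example.com:80']
```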
12. Prometheus Large‑Memory Issues
As scale grows, CPU and memory usage increase; memory often becomes the bottleneck. Causes include:
Prometheus keeps all data in memory for the two‑hour block before flushing to disk.
Loading historic data moves data from disk to memory; larger query ranges consume more memory.
Inefficient queries (e.g., a large <code>group</code> or a wide-range <code>rate</code>) increase memory usage.
Memory estimation can be done with a calculator based on series count and scrape interval. Example: 950k series retained for 2 h consumes roughly X GB (see accompanying charts).
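The estimate can be sketched as simple arithmetic. The ~8 KB-per-active-series figure below is an assumption for illustration only; real per-series cost varies by workload, so measure your own instance with pprof or the TSDB status page.

```python
# Back-of-envelope memory estimate based on active series count.
BYTES_PER_ACTIVE_SERIES = 8 * 1024  # assumed average; varies in practice

def estimate_memory_gb(active_series: int) -> float:
    """Rough head-block memory in GiB for a given number of active series."""
    return active_series * BYTES_PER_ACTIVE_SERIES / 1024 ** 3

# e.g. 950k active series under this assumption
print(round(estimate_memory_gb(950_000), 1))  # -> 7.2
```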
Optimization suggestions:
When series exceed ~2 M, move to sharding with solutions like VictoriaMetrics, Thanos, or Trickster.
Identify high‑cost metrics/labels and drop unnecessary ones (available from TSDB UI in 2.14+).
Avoid wide-range queries; keep the ratio of time range to step reasonable; limit the use of <code>group</code>.
Prefer relabeling to add labels instead of joining tables, as time‑series DBs are not relational.
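For example, a static cluster label can be attached at scrape time rather than joined in at query time (the label value is a placeholder):

```yaml
relabel_configs:
  # stamp every series from this job with its source cluster
  - target_label: cluster
    replacement: prod-cluster-1
    action: replace
```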
Memory profiling can be performed with pprof (see Robust Perception article). Historical memory usage for 1.x versions is documented in the linked articles.
13. Capacity Planning
Beyond memory, disk storage must be planned based on architecture:
Single‑node Prometheus: calculate local disk usage.
Remote‑write setups: share storage with existing TSDB.
Thanos: local disk holds only hot data (e.g., 2 h); main storage is object storage.
Prometheus compresses in‑memory data into blocks every two hours, storing chunks, indexes, tombstones, and metadata. Each sample occupies roughly 1‑2 bytes. Sample rate can be inspected with:
<code>rate(prometheus_tsdb_compaction_chunk_size_bytes_sum{instance="0.0.0.0:8890", job="prometheus"}[1h])
/
rate(prometheus_tsdb_compaction_chunk_samples_sum{instance="0.0.0.0:8890", job="prometheus"}[1h])</code>
(Label selectors must sit inside each metric selector, not after the expression.) Disk size can be approximated as:
<code>disk_size = retention_time_seconds * ingested_samples_per_second * bytes_per_sample</code>
To reduce disk demand without changing retention or bytes per sample, lower the ingestion rate (fewer series or longer scrape intervals). Example: a 30 s scrape interval, 1000 nodes, and 6000 metric types yields roughly 30 GB of disk usage.
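The formula above translates directly into code; the sizing numbers in this sketch (100k samples/s, 15-day retention, 2 bytes/sample) are hypothetical inputs, not measurements:

```python
def disk_size_bytes(retention_seconds: float,
                    samples_per_second: float,
                    bytes_per_sample: float = 2.0) -> float:
    """disk_size = retention_time_seconds * ingested_samples_per_second * bytes_per_sample"""
    return retention_seconds * samples_per_second * bytes_per_sample

fifteen_days = 15 * 24 * 3600  # 1,296,000 seconds
print(round(disk_size_bytes(fifteen_days, 100_000) / 1024 ** 3))  # -> 241 (GiB)
```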
14. Impact on Apiserver Performance
When using <code>kubernetes_sd_configs</code>, Prometheus's service-discovery queries pass through the Apiserver. At large scale this can increase Apiserver CPU usage, especially when proxy requests fail. Splitting clusters or monitoring the Apiserver's process metrics helps mitigate the impact.
15. Rate Calculation Logic
Prometheus counters exist primarily for <code>rate()</code> calculations. Counters reset on restart, and <code>rate()</code> handles resets automatically, providing an approximate per-second increase.
Because scrape intervals differ across targets, <code>rate()</code> values can jitter. Missing data points cause <code>rate()</code> to extrapolate from the surrounding trend, which may produce misleading spikes.
Best practice: set the range vector for <code>rate()</code> to at least four times the scrape interval (e.g., 4-5 minutes for a 1-minute scrape) so that at least two samples are available even after a missed scrape.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, who regularly publish original technical articles. It focuses on operations transformation and aims to accompany you throughout your operations career.