
Practical Prometheus in Kubernetes: Tips, Limits, and Scaling

This article shares practical experiences and best‑practice guidelines for deploying and operating Prometheus in Kubernetes, covering version selection, inherent limitations, exporter choices, metric design, multi‑cluster scraping, memory and storage planning, GPU monitoring, timezone handling, and alerting considerations.


Monitoring systems have a long history, and Prometheus, as a new‑generation open‑source solution, has become the de‑facto standard in cloud‑native environments.

This article shares practical issues and lessons encountered when running Prometheus in production; the container-monitoring series is recommended background reading.

Key principles:

Monitoring is infrastructure; collect only the metrics you actually need, to avoid wasting effort and storage (except for B2B commercial products, where broader collection may be required).

Only fire alerts that can be acted upon.

Keep the architecture simple; the monitoring system must stay up even if the business system fails. Avoid magic systems such as ML‑based thresholds or auto‑remediation.

1. Version selection

Use the latest Prometheus version (e.g., 2.16 at the time of writing); the 1.x series is obsolete. Version 2.16 includes an experimental UI for inspecting TSDB status, including the top labels and metrics by cardinality.

2. Limitations of Prometheus

Metric‑based monitoring; does not handle logs, events, or tracing.

Pull model by default; plan network topology to avoid unnecessary forwarding.

No silver‑bullet solution for clustering and horizontal scaling; choose between federation, Cortex, Thanos, etc.

Typically favors availability over consistency, tolerating some data loss.

Functions such as rate and histogram_quantile may produce unintuitive results, and long query ranges cause down-sampling and loss of precision.
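As an illustration of why quantile results can surprise: histogram_quantile interpolates within bucket boundaries, so the reported p99 depends heavily on how the buckets were chosen. A typical quantile query (the metric name below is a placeholder):

```promql
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```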

3. Common exporters in a K8s cluster

Prometheus, as a CNCF project, offers a rich ecosystem of exporters. Some frequently used exporters include:

cAdvisor (integrated in Kubelet)

Kubelet (port 10255 unauthenticated, 10250 authenticated)

apiserver (port 6443, metrics such as request count and latency)

scheduler (port 10251)

controller‑manager (port 10252)

etcd (write/read latency, storage capacity)

docker (daemon metrics require the experimental flag and a metrics-addr setting; exposes container creation time, etc.)

kube‑proxy (default 127.0.0.1, port 10249; can expose 0.0.0.0 for external scraping)

kube‑state‑metrics (metadata of pods, deployments, etc.)

node‑exporter (CPU, memory, disk metrics)

blackbox_exporter (network probes: DNS, ping, HTTP)

process‑exporter (process metrics)

nvidia exporter (GPU metrics)

node‑problem‑detector (reports node health taints)

Application exporters (MySQL, Nginx, MQ, etc.)

Custom exporters can also be written for specific scenarios.
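As a sketch of how one of these exporters is wired up, a minimal in-cluster discovery job for node-exporter (the service name is an assumption):

```yaml
- job_name: node-exporter
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only endpoints belonging to the node-exporter service
  - source_labels: [__meta_kubernetes_endpoints_name]
    regex: node-exporter
    action: keep
```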

4. Monitoring core K8s components with Grafana dashboards

Using the exporters above, Grafana can render dashboards for components such as kubelet and apiserver.

Dashboard templates can be based on dashboards-for-kubernetes-administrators and adjusted as needed. Grafana supports templated dropdown variables but currently lacks template-based alert rule configuration; as one user commented:

<code>It would be grate to add templates support in alerts. Otherwise the feature looks useless a bit.</code>

5. All‑in‑One collection component

Exporters are independent, increasing operational overhead. Two approaches to combine them:

Launch a main process that starts multiple exporter processes, still following community version updates.

Use Telegraf to handle various input types, consolidating N exporters into one.

Node-exporter does not monitor processes; a process-exporter, or Telegraf with the procstat input, can fill this gap.
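A minimal Telegraf sketch of this approach (the process pattern and listen port are assumptions): the procstat input gathers per-process metrics and the prometheus_client output exposes them for scraping.

```toml
# Collect metrics for processes matching a name pattern
[[inputs.procstat]]
  pattern = "nginx"

# Expose everything Telegraf gathers on a Prometheus endpoint
[[outputs.prometheus_client]]
  listen = ":9273"
```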

6. Choosing golden metrics

Google’s SRE handbook defines four golden signals: latency, traffic, errors, and saturation. In practice, apply the USE method to resources and the RED method to services.

USE method: Utilization, Saturation, Errors (e.g., cAdvisor data).

RED method: Rate, Errors, Duration (e.g., apiserver performance metrics).

Service categories:

Online services – focus on request rate, latency, error rate (RED).

Offline services – monitor queue length, in-flight count, processing speed, errors (USE).

Batch jobs – monitor duration and error count; often use Pushgateway for short‑lived jobs.
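The push at job completion can be sketched in plain Python (the gateway address and metric names are hypothetical); the Pushgateway accepts the text exposition format via a POST to /metrics/job/&lt;name&gt;:

```python
import urllib.request


def exposition(metrics):
    """Render {name: value} pairs in the Prometheus text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


def push(gateway, job, metrics):
    """POST metrics for a short-lived job to a Pushgateway."""
    req = urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=exposition(metrics).encode(),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Typical batch-job usage (gateway address is hypothetical):
# push("pushgateway:9091", "nightly-report",
#      {"batch_job_duration_seconds": 42.5, "batch_job_errors_total": 0})
```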

7. cAdvisor label compatibility in K8s 1.16

K8s 1.16 removed the pod_name and container_name labels from cAdvisor metrics, replacing them with pod and container. Adjust queries and Grafana panels accordingly, or use relabeling to restore the original names:

<code>metric_relabel_configs:
- source_labels: [container]
  regex: (.+)
  target_label: container_name
  replacement: $1
  action: replace
- source_labels: [pod]
  regex: (.+)
  target_label: pod_name
  replacement: $1
  action: replace</code>

8. Scraping external or multiple K8s clusters

When Prometheus runs outside a cluster, certificates or bearer tokens are required. Example job for scraping cAdvisor via the apiserver proxy:

<code>- job_name: cluster-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: https://xx:6443
    role: node
    bearer_token_file: token/cluster.token
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: xx:6443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
  metric_relabel_configs:
  - source_labels: [container]
    separator: ;
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    separator: ;
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace</code>

For endpoint-type services (e.g., kube-state-metrics), adjust __metrics_path__ accordingly.

9. Collecting GPU metrics

nvidia-smi shows GPU resources on a node; cAdvisor exposes container-level GPU metrics such as:

<code>container_accelerator_duty_cycle
container_accelerator_memory_total_bytes
container_accelerator_memory_used_bytes</code>
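For example, GPU utilization per pod can be read straight from the cAdvisor accelerator metrics (the label is pod or pod_name depending on your K8s version):

```promql
avg by (pod) (container_accelerator_duty_cycle)
```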

For richer GPU data, install the dcgm‑exporter (requires K8s 1.13+).

10. Changing Prometheus display timezone

Prometheus stores timestamps as Unix time (UTC) and does not support timezone configuration.

Grafana can perform timezone conversion for visualisation.

The Prometheus UI can show timestamps in the local timezone starting from version 2.16.

Modifying Prometheus code is possible but not recommended.

11. Scraping metrics behind a Load Balancer

Add a sidecar proxy to the backend service or deploy a proxy on the node to allow Prometheus access.

Configure the LB to forward specific paths (e.g., /backend1, /backend2) to the backends, then scrape the LB.
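The path-based variant can be sketched as two scrape jobs pointed at the LB (hostnames and paths are assumptions):

```yaml
- job_name: lb-backend1
  metrics_path: /backend1
  static_configs:
  - targets: ['lb.example.com:80']
- job_name: lb-backend2
  metrics_path: /backend2
  static_configs:
  - targets: ['lb.example.com:80']
```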

12. Prometheus large‑memory issues

Memory consumption grows with the ingestion rate because samples are held in memory for the two-hour head block before being flushed to disk. Large query ranges and expensive operations (e.g., group, rate over wide windows) also increase memory usage.

Optimization suggestions:

Shard when series exceed ~2 million; use Thanos, VictoriaMetrics, etc., for aggregation.

Identify and drop high‑cost metrics/labels.

Avoid broad queries; keep the time range and step ratio reasonable; limit use of group.

Prefer relabeling over joins for related data.
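To identify the high-cost metrics, the ten biggest metric names by series count can be listed with:

```promql
topk(10, count by (__name__) ({__name__=~".+"}))
```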

13. Capacity planning

Memory: depends on ingestion rate and block size; reduce series count or increase scrape interval.

Disk: calculate as retention_time_seconds × samples_per_second × bytes_per_sample. Reduce series count or sample rate to lower disk usage.

For single‑node Prometheus, estimate local disk usage; for remote‑write or Thanos, consider object‑storage size.

Example PromQL to monitor sample rate:

<code>rate(prometheus_tsdb_head_samples_appended_total[1h])</code>
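Plugging the formula into numbers (the ingestion rate here is an assumed example; Prometheus 2.x averages roughly 1-2 bytes per sample after compression):

```python
retention_seconds = 15 * 24 * 3600   # 15-day retention
samples_per_second = 100_000         # assumed ingestion rate for this example
bytes_per_sample = 2                 # ~1-2 bytes/sample after compression

# retention_time_seconds * samples_per_second * bytes_per_sample
disk_bytes = retention_seconds * samples_per_second * bytes_per_sample
print(f"need ~{disk_bytes / 1024**3:.0f} GiB of local disk")
```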

14. Impact on Apiserver performance

When using kubernetes_sd_configs, Prometheus queries the apiserver for service discovery, which can noticeably increase apiserver CPU load at large scale. Scraping nodes directly, rather than through the apiserver proxy, reduces this pressure.

15. Rate calculation logic

Prometheus's rate works on counter metrics and handles counter resets automatically. Because scrape intervals vary, rate values are approximations. Set the rate window to at least four times the scrape interval so each window contains enough samples.
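For instance, with a 30-second scrape interval the window should be at least two minutes (4 × 30s); the counter here is node-exporter's network byte counter:

```promql
rate(node_network_receive_bytes_total[2m])
```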

When data points are missing, rate extrapolates from the samples at the edges of the window, which may produce misleading results.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
