
Practical Prometheus in Kubernetes: Tips, Limits, and Scaling

This article shares practical experiences and best‑practice guidelines for deploying and operating Prometheus in Kubernetes, covering version selection, inherent limitations, exporter choices, metric design, multi‑cluster scraping, memory and storage planning, GPU monitoring, timezone handling, and alerting considerations.


Monitoring systems have a long history, and Prometheus, as a new‑generation open‑source solution, has become the de‑facto standard in cloud‑native environments.

This article shares practical issues and lessons encountered when running Prometheus in production; the container-monitoring series is recommended background reading.

Key principles:

Monitoring is infrastructure; collect only the metrics you actually need, to avoid wasting effort and storage (except for B2B commercial products, where broader collection may be required).

Only fire alerts that can be acted upon.

Keep the architecture simple; the monitoring system must stay up even if the business system fails. Avoid magic systems such as ML‑based thresholds or auto‑remediation.

1. Version selection

Use the latest Prometheus version (e.g., 2.16 at the time of writing); the 1.x series is obsolete. Version 2.16 includes an experimental UI for inspecting TSDB status, including the top labels and metrics by cardinality.

2. Limitations of Prometheus

Metric‑based monitoring; does not handle logs, events, or tracing.

Pull model by default; plan network topology to avoid unnecessary forwarding.

No silver‑bullet solution for clustering and horizontal scaling; choose between federation, Cortex, Thanos, etc.

Typically favors availability over consistency, tolerating some data loss.

Functions such as rate and histogram_quantile may produce unintuitive results, and long query ranges cause down-sampling and loss of precision.
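As an illustration of why quantile results can surprise: histogram_quantile interpolates within bucket boundaries, so the reported p99 depends heavily on how the buckets were chosen. A typical quantile query (the metric name below is a placeholder):

```promql
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```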

3. Common exporters in a K8s cluster

Prometheus, as a CNCF project, offers a rich ecosystem of exporters. Some frequently used exporters include:

cAdvisor (integrated in Kubelet)

Kubelet (port 10255 unauthenticated, 10250 authenticated)

apiserver (port 6443, metrics such as request count and latency)

scheduler (port 10251)

controller‑manager (port 10252)

etcd (write/read latency, storage capacity)

docker (daemon metrics require the experimental flag and a metrics-addr setting; exposes container creation time, etc.)

kube‑proxy (default 127.0.0.1, port 10249; can expose 0.0.0.0 for external scraping)

kube‑state‑metrics (metadata of pods, deployments, etc.)

node‑exporter (CPU, memory, disk metrics)

blackbox_exporter (network probes: DNS, ping, HTTP)

process‑exporter (process metrics)

nvidia exporter (GPU metrics)

node‑problem‑detector (reports node health taints)

Application exporters (MySQL, Nginx, MQ, etc.)

Custom exporters can also be written for specific scenarios.
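As a sketch of how one of these exporters is wired up, a minimal in-cluster discovery job for node-exporter (the service name is an assumption):

```yaml
- job_name: node-exporter
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only endpoints belonging to the node-exporter service
  - source_labels: [__meta_kubernetes_endpoints_name]
    regex: node-exporter
    action: keep
```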

4. Monitoring core K8s components with Grafana dashboards

Using the exporters above, Grafana can render dashboards for components such as kubelet and apiserver.

Dashboard templates can be based on dashboards-for-kubernetes-administrators and adjusted as needed. Grafana supports templated dropdown variables but currently lacks template-based alert rule configuration; as one user commented:

<code>It would be grate to add templates support in alerts. Otherwise the feature looks useless a bit.</code>

5. All‑in‑One collection component

Exporters are independent, increasing operational overhead. Two approaches to combine them:

Launch a main process that starts multiple exporter processes, still following community version updates.

Use Telegraf to handle various input types, consolidating N exporters into one.

Node-exporter does not monitor processes; a process-exporter, or Telegraf with the procstat input, can fill this gap.
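A minimal Telegraf sketch of this approach (the process pattern and listen port are assumptions): the procstat input gathers per-process metrics and the prometheus_client output exposes them for scraping.

```toml
# Collect metrics for processes matching a name pattern
[[inputs.procstat]]
  pattern = "nginx"

# Expose everything Telegraf gathers on a Prometheus endpoint
[[outputs.prometheus_client]]
  listen = ":9273"
```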

6. Choosing golden metrics

Google’s SRE handbook defines four golden signals: latency, traffic, errors, and saturation. In practice, apply the USE method to resources and the RED method to services.

USE method: Utilization, Saturation, Errors (e.g., cAdvisor data).

RED method: Rate, Errors, Duration (e.g., apiserver performance metrics).

Service categories:

Online services – focus on request rate, latency, error rate (RED).

Offline services – monitor queue length, in-flight count, processing speed, errors (USE).

Batch jobs – monitor duration and error count; often use Pushgateway for short‑lived jobs.
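The push at job completion can be sketched in plain Python (the gateway address and metric names are hypothetical); the Pushgateway accepts the text exposition format via a POST to /metrics/job/&lt;name&gt;:

```python
import urllib.request


def exposition(metrics):
    """Render {name: value} pairs in the Prometheus text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


def push(gateway, job, metrics):
    """POST metrics for a short-lived job to a Pushgateway."""
    req = urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=exposition(metrics).encode(),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Typical batch-job usage (gateway address is hypothetical):
# push("pushgateway:9091", "nightly-report",
#      {"batch_job_duration_seconds": 42.5, "batch_job_errors_total": 0})
```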

7. cAdvisor label compatibility in K8s 1.16

K8s 1.16 removed the pod_name and container_name labels from cAdvisor metrics, replacing them with pod and container. Adjust queries and Grafana panels accordingly, or use relabeling to restore the original names:

<code>metric_relabel_configs:
- source_labels: [container]
  regex: (.+)
  target_label: container_name
  replacement: $1
  action: replace
- source_labels: [pod]
  regex: (.+)
  target_label: pod_name
  replacement: $1
  action: replace</code>

8. Scraping external or multiple K8s clusters

When Prometheus runs outside a cluster, certificates or bearer tokens are required. Example job for scraping cAdvisor via the apiserver proxy:

<code>- job_name: cluster-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: https://xx:6443
    role: node
    bearer_token_file: token/cluster.token
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: xx:6443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
  metric_relabel_configs:
  - source_labels: [container]
    separator: ;
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    separator: ;
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace</code>

For endpoint-type services (e.g., kube-state-metrics), adjust __metrics_path__ accordingly.

9. Collecting GPU metrics

nvidia-smi shows GPU resources on a node; cAdvisor exposes container-level GPU metrics such as:

<code>container_accelerator_duty_cycle
container_accelerator_memory_total_bytes
container_accelerator_memory_used_bytes</code>
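For example, GPU utilization per pod can be read straight from the cAdvisor accelerator metrics (the label is pod or pod_name depending on your K8s version):

```promql
avg by (pod) (container_accelerator_duty_cycle)
```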

For richer GPU data, install the dcgm‑exporter (requires K8s 1.13+).

10. Changing Prometheus display timezone

Prometheus stores timestamps as Unix time (UTC) and does not support timezone configuration.

Grafana can perform timezone conversion for visualisation.

The Prometheus UI can show timestamps in the local timezone starting from version 2.16.

Modifying Prometheus code is possible but not recommended.

11. Scraping metrics behind a Load Balancer

Add a sidecar proxy to the backend service or deploy a proxy on the node to allow Prometheus access.

Configure the LB to forward specific paths (e.g., /backend1, /backend2) to the backends, then scrape the LB.
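The path-based variant can be sketched as two scrape jobs pointed at the LB (hostnames and paths are assumptions):

```yaml
- job_name: lb-backend1
  metrics_path: /backend1
  static_configs:
  - targets: ['lb.example.com:80']
- job_name: lb-backend2
  metrics_path: /backend2
  static_configs:
  - targets: ['lb.example.com:80']
```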

12. Prometheus large‑memory issues

Memory consumption grows with the ingestion rate because samples are held in memory for the two-hour head block before being flushed to disk. Large query ranges and expensive operations (e.g., group, rate over wide windows) also increase memory usage.

Optimization suggestions:

Shard when series exceed ~2 million; use Thanos, VictoriaMetrics, etc., for aggregation.

Identify and drop high‑cost metrics/labels.

Avoid broad queries; keep the time range and step ratio reasonable; limit use of group.

Prefer relabeling over joins for related data.
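To identify the high-cost metrics, the ten biggest metric names by series count can be listed with:

```promql
topk(10, count by (__name__) ({__name__=~".+"}))
```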

13. Capacity planning

Memory: depends on ingestion rate and block size; reduce series count or increase scrape interval.

Disk: calculate as retention_time_seconds × samples_per_second × bytes_per_sample. Reduce series count or sample rate to lower disk usage.

For single‑node Prometheus, estimate local disk usage; for remote‑write or Thanos, consider object‑storage size.

Example PromQL to monitor sample rate:

<code>rate(prometheus_tsdb_head_samples_appended_total[1h])</code>
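Plugging the formula into numbers (the ingestion rate here is an assumed example; Prometheus 2.x averages roughly 1-2 bytes per sample after compression):

```python
retention_seconds = 15 * 24 * 3600   # 15-day retention
samples_per_second = 100_000         # assumed ingestion rate for this example
bytes_per_sample = 2                 # ~1-2 bytes/sample after compression

# retention_time_seconds * samples_per_second * bytes_per_sample
disk_bytes = retention_seconds * samples_per_second * bytes_per_sample
print(f"need ~{disk_bytes / 1024**3:.0f} GiB of local disk")
```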

14. Impact on Apiserver performance

When using kubernetes_sd_configs, Prometheus queries the apiserver for service discovery, which can noticeably increase apiserver CPU load at large scale. Scraping nodes directly, rather than through the apiserver proxy, reduces this pressure.

15. Rate calculation logic

Prometheus's rate works on counter metrics and handles counter resets automatically. Because scrape intervals vary, rate values are approximations. Set the rate window to at least four times the scrape interval so each window contains enough samples.
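For instance, with a 30-second scrape interval the window should be at least two minutes (4 × 30s); the counter here is node-exporter's network byte counter:

```promql
rate(node_network_receive_bytes_total[2m])
```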

When data points are missing, rate extrapolates from the samples at the edges of the window, which may produce misleading results.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
