Practical Prometheus on Kubernetes: Exporters, Scaling & Tips
This article shares practical experiences and best‑practice guidelines for using Prometheus in Kubernetes environments, covering version selection, inherent limitations, common exporters, Grafana dashboards, metric selection principles, multi‑cluster scraping, GPU monitoring, timezone handling, memory and storage planning, and alerting considerations.
Monitoring is a mature field with a long history; Prometheus, a new-generation open-source monitoring system, has become the de-facto standard of the cloud-native stack, which speaks to the popularity of its design.
The article shares problems and thoughts encountered in Prometheus practice; readers unfamiliar with K8s monitoring or Prometheus design can first refer to the container‑monitoring series.
Key Principles:
Monitoring is infrastructure; its goal is to solve problems. Avoid over-collecting metrics you do not need; they waste manpower and storage (B2B commercial monitoring products are an exception).
Only emit alerts that need to be handled, and every emitted alert must be addressed.
The simplest architecture is the best. When the business system crashes, monitoring must stay up. Avoid "magic" systems such as ML-based thresholds or auto-remediation, as Google SRE advises.
1. Version Selection
At the time of writing, the latest Prometheus version is 2.16; the project evolves rapidly, so use the newest version and ignore the 1.x series.
Version 2.16 includes an experimental UI that shows TSDB status, including top‑10 labels and metrics.
2. Limitations of Prometheus
Prometheus is metric‑based and does not handle logs, events, or tracing.
It uses a pull model by default; plan the network accordingly and avoid forwarding.
For clustering and horizontal scaling, there is no silver bullet; choose between federation, Cortex, Thanos, etc., wisely.
Monitoring systems usually favor availability over consistency, tolerating some data loss to ensure query success.
Prometheus does not guarantee data accuracy: functions like <code>rate</code> and <code>histogram_quantile</code> perform statistical inference that can produce counter-intuitive results, and long query ranges require down-sampling, which reduces precision.
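As a concrete illustration of this inference (the metric name is a typical histogram example, not from any specific system):

```promql
# Estimates the 99th-percentile latency by linearly interpolating inside
# the bucket that contains the quantile -- the result is an inference,
# not an observed value, and can differ noticeably from the true p99.
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```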
3. Common Exporters in a K8s Cluster
Prometheus, a CNCF project, has a rich exporter ecosystem, unlike traditional agent‑based monitoring such as Zabbix. Exporters can be official or community‑maintained, and you can also write custom exporters.
However, the openness brings selection and trial‑and‑error costs. Maintaining many exporters, especially during upgrades, can be painful, and unofficial exporters may contain bugs.
Typical exporters we use:
cAdvisor (built into kubelet)
kubelet (port 10255 unauthenticated, 10250 authenticated)
apiserver (port 6443, monitor request count, latency, etc.)
scheduler (port 10251)
controller‑manager (port 10252)
etcd (track write/read latency, storage capacity)
docker (enable the experimental feature and set <code>metrics-addr</code> to expose metrics such as container creation time)
kube‑proxy (default 127.0.0.1, port 10249; can expose on 0.0.0.0 for external scraping)
kube‑state‑metrics (official K8s project, collects pod, deployment metadata)
node‑exporter (official, gathers CPU, memory, disk metrics)
blackbox_exporter (network probing: DNS, ping, HTTP)
process‑exporter (process‑level metrics)
nvidia exporter (GPU monitoring)
node‑problem‑detector (reports node health, adds taints)
Application exporters: MySQL, Nginx, MQ, etc., based on business needs.
Custom exporters can also be built for specific scenarios such as log extraction.
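As an example of wiring one of these up, a minimal blackbox_exporter job might look like the following (the exporter address <code>blackbox-exporter:9115</code>, the probe target, and the module name are placeholders):

```yaml
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]          # module defined in blackbox.yml
  static_configs:
    - targets:
        - https://example.com   # site to probe
  relabel_configs:
    # pass the target URL as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # keep the probed URL as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # actually scrape the exporter, not the probed site
    - target_label: __address__
      replacement: blackbox-exporter:9115
```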
4. Monitoring Core K8s Components & Grafana Dashboards
Key components like kubelet and apiserver can be visualized in Grafana using the metrics from the exporters above.
Dashboards can be based on <code>dashboards-for-kubernetes-administrators</code> and tuned continuously, together with the alert thresholds.
Grafana supports templating for multi‑level dropdowns, but currently does not support templated alert rules (see related issue).
<code>It would be great to add templates support in alerts. Otherwise the feature looks useless a bit.</code>
5. All‑In‑One Collection Component
Exporters are independent; many exporters increase operational overhead, especially for resource control and version upgrades. Two approaches to combine them:
Launch N exporter processes from a main process; still follow community updates.
Use Telegraf to handle various input types (N‑in‑1).
Node‑exporter does not monitor processes; process‑exporter or Telegraf's <code>procstat</code> input can fill that gap.
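A minimal Telegraf configuration for the second approach might look like this sketch (the process pattern and listen port are placeholders):

```toml
# Collect per-process metrics for processes matching a pattern
[[inputs.procstat]]
  pattern = "nginx"

# Expose everything in Prometheus exposition format for scraping
[[outputs.prometheus_client]]
  listen = ":9273"
```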
6. Choosing Golden Metrics
Google's SRE book defines four golden signals: latency, traffic, errors, and saturation. In practice, apply the USE method to resources and the RED method to services.
USE method: Utilization, Saturation, Errors (e.g., cAdvisor data).
RED method: Rate, Errors, Duration (e.g., apiserver performance metrics).
Monitored services fall into three categories:
Online services (web servers, databases): monitor request rate, latency, and error rate (RED).
Offline services (log processing, message queues): monitor queue length, in-flight count, processing speed, and errors (USE).
Batch jobs (CI, K8s jobs/cronjobs): monitor duration and error count; short-lived tasks usually push results through the Pushgateway.
See “Container Monitoring Practice – Common K8s Metrics” for concrete examples.
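For short-lived batch jobs, a push looks like the following (the Pushgateway hostname and metric name are placeholders for illustration):

```shell
# Push a metric to the Pushgateway when the batch run finishes;
# Prometheus then scrapes the Pushgateway on its regular schedule.
echo "backup_duration_seconds 42" | \
  curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/nightly_backup
```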
7. cAdvisor Label Compatibility in K8s 1.16
K8s 1.16 removed the <code>pod_name</code> and <code>container_name</code> labels from cAdvisor metrics, replacing them with <code>pod</code> and <code>container</code>. Adjust queries and Grafana panels accordingly, or use a relabel config to restore the original names:
<code>metric_relabel_configs:
  - source_labels: [container]
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace</code>
Use <code>metric_relabel_configs</code>, not <code>relabel_configs</code>, here: relabel_configs rewrites target labels before the scrape, while metric_relabel_configs rewrites the scraped samples.
8. Scraping External or Multi‑Cluster K8s with Prometheus
When Prometheus runs inside a cluster, the built-in YAML makes scraping easy. For an external deployment, certificates and tokens are required, and the addresses must be replaced. Example job for scraping cAdvisor via the apiserver proxy:
<code>- job_name: cluster-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: https://xx:6443
      role: node
      bearer_token_file: token/cluster.token
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - separator: ;
      regex: __meta_kubernetes_node_label_(.+)
      replacement: $1
      action: labelmap
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: xx:6443
      action: replace
    - source_labels: [__meta_kubernetes_node_name]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      action: replace
  metric_relabel_configs:
    - source_labels: [container]
      separator: ;
      regex: (.+)
      target_label: container_name
      replacement: $1
      action: replace
    - source_labels: [pod]
      separator: ;
      regex: (.+)
      target_label: pod_name
      replacement: $1
      action: replace</code>
For endpoint-type exporters (e.g., kube-state-metrics), the job looks like:
<code>- job_name: cluster-service-endpoints
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: https://xxx:6443
      role: endpoints
      bearer_token_file: token/cluster.token
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      separator: ;
      regex: "true"
      action: keep
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      separator: ;
      regex: (https?)
      target_label: __scheme__
      replacement: $1
      action: replace
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: xxx:6443
      action: replace
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_endpoints_name, __meta_kubernetes_service_annotation_prometheus_io_port]
      separator: ;
      regex: (.+);(.+);(.*)
      target_label: __metrics_path__
      replacement: /api/v1/namespaces/${1}/services/${2}:${3}/proxy/metrics
      action: replace
    - separator: ;
      regex: __meta_kubernetes_service_label_(.+)
      replacement: $1
      action: labelmap
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: kubernetes_namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: kubernetes_name
      replacement: $1
      action: replace</code>
Multiple clusters can be handled by duplicating such job definitions, typically with three job types:
<code>role: node</code> jobs (cAdvisor, node-exporter, kubelet, etc.), <code>role: endpoints</code> jobs (kube-state-metrics, custom exporters), and generic jobs for etcd, apiserver, and process metrics.
9. Obtaining GPU Metrics
nvidia-smi shows GPU resources; cAdvisor exposes container-level GPU metrics such as:
<code>container_accelerator_duty_cycle
container_accelerator_memory_total_bytes
container_accelerator_memory_used_bytes</code>
For more detailed GPU data, install the dcgm exporter (requires K8s 1.13+).
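A query over these metrics might look like the following sketch (the grouping label is <code>pod_name</code> here; on K8s 1.16+ with the new cAdvisor labels it would be <code>pod</code>):

```promql
# Average GPU utilization (duty cycle, 0-100) per pod over recent samples
avg by (pod_name) (container_accelerator_duty_cycle)
```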
10. Changing Prometheus Display Timezone
Prometheus stores timestamps as Unix time in UTC and supports neither configuring a timezone in its config file nor reading the host's <code>/etc/timezone</code>. Visualization tools like Grafana can perform the timezone conversion, and the newer 2.16 UI includes a "Local Timezone" option.
11. Scraping Metrics Behind a Load Balancer
If Prometheus can only reach the LB but not the backend ReplicaSet (RS), options include adding a sidecar proxy to the RS service, deploying a local proxy on the Prometheus host, or configuring the LB to forward specific paths (e.g., <code>/backend1</code>, <code>/backend2</code>) that Prometheus can scrape.
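The path-forwarding option can be sketched as two jobs against the same LB address (all hostnames and paths are hypothetical and must match the LB's routing rules):

```yaml
- job_name: behind-lb-replica1
  metrics_path: /backend1/metrics   # LB routes this path to replica 1
  static_configs:
    - targets: ['lb.example.com:80']
- job_name: behind-lb-replica2
  metrics_path: /backend2/metrics   # LB routes this path to replica 2
  static_configs:
    - targets: ['lb.example.com:80']
```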
12. Prometheus Large‑Memory Issues
As scale grows, CPU and memory usage increase; memory often becomes the bottleneck. Causes include:
Prometheus keeps all data in memory for the two‑hour block before flushing to disk.
Loading historic data moves data from disk to memory; larger query ranges consume more memory.
Inefficient queries (e.g., a large <code>group</code> or a wide-range <code>rate</code>) increase memory usage.
Memory estimation can be done with a calculator based on series count and scrape interval. Example: 950k series retained for 2 h consumes roughly X GB (see accompanying charts).
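The estimate can be sketched as simple arithmetic. The ~8 KB-per-active-series figure below is an assumption for illustration only; real per-series cost varies by workload, so measure your own instance with pprof or the TSDB status page.

```python
# Back-of-envelope memory estimate based on active series count.
BYTES_PER_ACTIVE_SERIES = 8 * 1024  # assumed average; varies in practice

def estimate_memory_gb(active_series: int) -> float:
    """Rough head-block memory in GiB for a given number of active series."""
    return active_series * BYTES_PER_ACTIVE_SERIES / 1024 ** 3

# e.g. 950k active series under this assumption
print(round(estimate_memory_gb(950_000), 1))  # -> 7.2
```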
Optimization suggestions:
When series exceed ~2 M, move to sharding with solutions like VictoriaMetrics, Thanos, or Trickster.
Identify high‑cost metrics/labels and drop unnecessary ones (available from TSDB UI in 2.14+).
Avoid wide-range queries; keep the ratio of time range to step reasonable; limit the use of <code>group</code>.
Prefer relabeling to add labels instead of joining tables, as time‑series DBs are not relational.
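For example, a static cluster label can be attached at scrape time rather than joined in at query time (the label value is a placeholder):

```yaml
relabel_configs:
  # stamp every series from this job with its source cluster
  - target_label: cluster
    replacement: prod-cluster-1
    action: replace
```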
Memory profiling can be performed with pprof (see Robust Perception article). Historical memory usage for 1.x versions is documented in the linked articles.
13. Capacity Planning
Beyond memory, disk storage must be planned based on architecture:
Single‑node Prometheus: calculate local disk usage.
Remote‑write setups: share storage with existing TSDB.
Thanos: local disk holds only hot data (e.g., 2 h); main storage is object storage.
Prometheus compresses in‑memory data into blocks every two hours, storing chunks, indexes, tombstones, and metadata. Each sample occupies roughly 1‑2 bytes. Sample rate can be inspected with:
<code>rate(prometheus_tsdb_compaction_chunk_size_bytes_sum{instance="0.0.0.0:8890", job="prometheus"}[1h])
/
rate(prometheus_tsdb_compaction_chunk_samples_sum{instance="0.0.0.0:8890", job="prometheus"}[1h])</code>
(Label selectors must sit inside each metric selector, not after the expression.) Disk size can be approximated as:
<code>disk_size = retention_time_seconds * ingested_samples_per_second * bytes_per_sample</code>
To reduce disk demand without changing retention or bytes per sample, lower the ingestion rate (fewer series or longer scrape intervals). Example: a 30 s scrape interval, 1000 nodes, and 6000 metric types yields roughly 30 GB of disk usage.
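The formula above translates directly into code; the sizing numbers in this sketch (100k samples/s, 15-day retention, 2 bytes/sample) are hypothetical inputs, not measurements:

```python
def disk_size_bytes(retention_seconds: float,
                    samples_per_second: float,
                    bytes_per_sample: float = 2.0) -> float:
    """disk_size = retention_time_seconds * ingested_samples_per_second * bytes_per_sample"""
    return retention_seconds * samples_per_second * bytes_per_sample

fifteen_days = 15 * 24 * 3600  # 1,296,000 seconds
print(round(disk_size_bytes(fifteen_days, 100_000) / 1024 ** 3))  # -> 241 (GiB)
```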
14. Impact on Apiserver Performance
When using <code>kubernetes_sd_configs</code>, Prometheus's service-discovery queries pass through the Apiserver. At large scale this can increase Apiserver CPU usage, especially when proxy requests fail. Splitting clusters or monitoring the Apiserver's process metrics helps mitigate the impact.
15. Rate Calculation Logic
Prometheus counters exist primarily for <code>rate()</code> calculations. Counters reset on restart, and <code>rate()</code> handles resets automatically, providing an approximate per-second increase.
Because scrape intervals differ across targets, <code>rate()</code> values can jitter. Missing data points cause <code>rate()</code> to extrapolate from the surrounding trend, which may produce misleading spikes.
Best practice: set the range vector for <code>rate()</code> to at least four times the scrape interval (e.g., 4-5 minutes for a 1-minute scrape) so that at least two samples are available even after a missed scrape.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, who regularly publish original technical articles. It focuses on operations transformation and aims to accompany you throughout your operations career.