Avoid These 6 Common Prometheus Mistakes When Getting Started
This guide translates and condenses six frequent mistakes new Prometheus users make: using high-cardinality labels, losing valuable labels during aggregation, using bare selectors, omitting the for field, choosing rate windows that are too short, and applying rate-related functions to the wrong metric types. Each comes with a practical fix to improve monitoring reliability.
This article is translated from https://promlabs.com/blog/2022/12/11/avoid-these-6-mistakes-when-getting-started-with-prometheus. The author’s summary of common Prometheus pitfalls is presented here for review and self‑reflection.
Mistake 1: High‑Cardinality Explosion
Prometheus identifies each time series by its metric name and set of labels. This is flexible, but it can cause severe performance problems (including OOM) when a label's values do not come from a small, bounded set. Adding a high-cardinality label such as a unique user ID creates a separate series for each value.
Example of a low‑cardinality metric:
<code>http_requests_total{method="POST"}
http_requests_total{method="GET"}
http_requests_total{method="PUT"}
http_requests_total{method="DELETE"}</code>
Adding a user_id label creates many series:
<code>http_requests_total{method="POST",user_id="1"}
http_requests_total{method="POST",user_id="2"}
... (many more) ...
http_requests_total{method="GET",user_id="16434313"}</code>
When the number of distinct users is large, memory usage spikes and can lead to OOM. Avoid high-cardinality label values such as public IP addresses, email addresses, full HTTP request paths containing dynamic IDs, and process IDs, unless they come from a limited set. For request paths, use placeholders (e.g., /api/users/{user_id}/posts/{post_id}) to reduce cardinality.
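The placeholder idea can be sketched in a few lines of Python. This is only an illustration: the route patterns, the `normalize_path` helper, and the `"other"` fallback bucket are assumptions made for this example, not part of Prometheus or any client library.

```python
import re

# Hypothetical route patterns; a real service would usually take these from its router.
ROUTE_PATTERNS = [
    (re.compile(r"^/api/users/[^/]+/posts/[^/]+$"), "/api/users/{user_id}/posts/{post_id}"),
    (re.compile(r"^/api/users/[^/]+$"), "/api/users/{user_id}"),
]


def normalize_path(path: str) -> str:
    """Map a concrete request path to a bounded route template for use as a label value."""
    for pattern, template in ROUTE_PATTERNS:
        if pattern.match(path):
            return template
    # Unknown paths collapse into one bucket so they cannot create unbounded series.
    return "other"


paths = ["/api/users/1/posts/7", "/api/users/2/posts/9", "/api/users/16434313", "/wp-login.php"]
# Four distinct request paths collapse into three possible label values.
print(sorted({normalize_path(p) for p in paths}))
```

However many users exist, the label can only take as many values as there are templates plus the fallback bucket.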
Mistake 2: Losing Valuable Labels During Aggregation
When writing alert rules, aggregations like sum() drop all labels by default, which can remove useful routing information such as the job label. Preserve needed labels with sum by(job), or use sum without(instance, type) to exclude only unwanted labels.
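To make the difference concrete, here is a small Python model of how the three forms treat labels. It is a simplified sketch, not Prometheus code: the sample data and the `aggregate_sum` helper are invented for illustration.

```python
from collections import defaultdict

# Hypothetical samples: (label set, value) pairs for one metric at one instant.
samples = [
    ({"job": "api", "instance": "a:9090", "type": "timeout"}, 3.0),
    ({"job": "api", "instance": "b:9090", "type": "refused"}, 2.0),
    ({"job": "db", "instance": "c:9090", "type": "timeout"}, 5.0),
]


def aggregate_sum(samples, by=None, without=None):
    """Sum values, grouping on the labels that survive `by` or `without`."""
    groups = defaultdict(float)
    for labels, value in samples:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        elif without is not None:
            kept = {k: v for k, v in labels.items() if k not in without}
        else:
            kept = {}  # plain sum(): every label is dropped
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)


print(aggregate_sum(samples))                                 # one result, no labels left
print(aggregate_sum(samples, by={"job"}))                     # one result per job
print(aggregate_sum(samples, without={"instance", "type"}))   # job survives as well
```

In this model the plain sum yields a single unlabeled result, so an alert built on it could no longer be routed by job.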
Mistake 3: Using Bare Selectors
Writing PromQL queries without restricting the selector (e.g., rate(errors_total[5m]) > 10) may pull in data from unrelated jobs that share the same metric name, causing false alerts and performance issues. Always scope queries with a label matcher such as {job="my-job"}.
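The effect of scoping can be modeled with a toy selector in Python. This is a sketch under simplifying assumptions: the series list and the `select` helper are hypothetical, and only equality matchers are modeled.

```python
# Hypothetical series from two unrelated jobs that happen to share a metric name.
series = [
    {"__name__": "errors_total", "job": "my-job", "instance": "a:8080"},
    {"__name__": "errors_total", "job": "other-team-job", "instance": "z:9100"},
]


def select(series, name, **matchers):
    """Return series with the given metric name whose labels equal every matcher."""
    return [
        s for s in series
        if s["__name__"] == name and all(s.get(k) == v for k, v in matchers.items())
    ]


bare = select(series, "errors_total")                  # like errors_total
scoped = select(series, "errors_total", job="my-job")  # like errors_total{job="my-job"}
print(len(bare), len(scoped))
```

The bare selection also returns the other team's series, which is how an alert can fire on a service it was never meant to cover.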
Mistake 4: Omitting the for Field in Alert Rules
The for field defines how long a condition must persist before an alert fires, helping to filter out transient spikes. Example without for:
<code>alert: InstanceDown
expr: up == 0</code>
Improved rule with for:
<code>alert: InstanceDown
expr: up == 0
for: 5m</code>
Adding for to most alerts makes them more robust, though it may increase detection latency.
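The following Python sketch models how a for duration filters out a short blip. It is a simplified model, not Prometheus's implementation: the `alert_states` helper, the 60 s evaluation interval, and the sequence of results are assumptions chosen for illustration.

```python
def alert_states(results, eval_interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation.

    `results` holds one boolean per evaluation: whether the expression matched.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, condition_true in enumerate(results):
        now = i * eval_interval_s
        if not condition_true:
            active_since = None  # any gap restarts the clock
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# One transient failure, a recovery, then a sustained outage.
results = [True, False, True, True, True, True, True, True]

print(alert_states(results, eval_interval_s=60, for_s=0))    # no `for`
print(alert_states(results, eval_interval_s=60, for_s=300))  # `for: 5m`
```

Without a for duration the single failed evaluation at the start already fires. With a five-minute duration, only the sustained outage does, at the cost of firing five minutes later.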
Mistake 5: Using Too‑Short Rate Windows
Rate functions need at least two samples within the window. If the window is too short relative to the scrape interval, it may contain fewer than two samples and the function returns no data. Choose a window at least four times the scrape interval to handle occasional scrape failures and alignment issues.
For example, with a 15 s scrape interval, a 20 s window often contains only a single sample and produces no result, while a window of 4× the interval (60 s) provides reliable results.
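The sample-count argument can be checked with a short Python simulation. It is a sketch under stated assumptions: scrapes land exactly on a fixed grid, the window is treated as left-open (a sample exactly at the window start is excluded), and both helper functions are hypothetical.

```python
def samples_in_window(eval_time_s, window_s, scrape_interval_s, offset_s=0):
    """Count scrape timestamps t with eval_time - window < t <= eval_time."""
    count = 0
    t = offset_s
    while t <= eval_time_s:
        if t > eval_time_s - window_s:
            count += 1
        t += scrape_interval_s
    return count


def min_samples(window_s, scrape_interval_s, horizon_s=600):
    """Smallest sample count seen over every whole-second evaluation time."""
    return min(
        samples_in_window(t, window_s, scrape_interval_s)
        for t in range(window_s, horizon_s)
    )


scrape_interval = 15
for window in (20, 30, 4 * scrape_interval):
    # rate() needs at least 2 samples; a single failed scrape removes one of them.
    print(window, min_samples(window, scrape_interval))
```

Under these assumptions a 20 s window can hold a single sample, a 30 s window holds two but cannot survive one failed scrape, and a 60 s window still has three samples left after a failure.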
Mistake 6: Applying Rate‑Related Functions to Wrong Metric Types
rate(), irate(), and increase() are designed for counter metrics, which only increase (apart from resets to zero, for example on process restart). Using them on gauges (e.g., memory usage) leads to incorrect results because every decrease is interpreted as a counter reset.
deriv() works on gauges but should not be used on counters, as it lacks reset compensation and can produce negative values.
To avoid these mistakes, verify metric types before applying functions, and consider tools like PromLens to help detect mismatches.
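A simplified Python model of reset compensation shows why the metric type matters. The `naive_increase` helper and the sample values are made up for illustration, and the model omits the extrapolation to the window edges that Prometheus's increase() performs.

```python
def naive_increase(values):
    """Sum the increases between consecutive samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Value dropped: assume the counter restarted from zero.
            total += cur
    return total


counter = [100, 110, 120, 5, 15]    # a counter that reset after reaching 120
gauge = [500, 480, 450, 470, 440]   # e.g. memory usage in MiB, moving up and down

print(naive_increase(counter))  # reset compensation recovers the true increase
print(naive_increase(gauge))    # every dip is mistaken for a reset
print(gauge[-1] - gauge[0])     # the gauge's actual net change
```

For the counter, the compensation gives the correct total across the reset. For the gauge, each ordinary decrease adds the full current value, so the result is far larger than the real change and even has the wrong sign.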
Conclusion
The six points above highlight frequent pitfalls for newcomers to Prometheus and provide practical tips to improve monitoring setups.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.