Mastering Prometheus: Build a Cloud‑Native Monitoring System from Scratch
This article explains how to design a Prometheus‑based cloud‑native monitoring solution, covering target selection, metric collection, server configuration, Grafana visualization, and alert management with practical examples and code snippets.
1. Monitoring Targets
Prometheus can monitor infrastructure, middleware, databases, containers, and SaaS services. Typical metrics for each layer (IaaS, PaaS, SaaS) and their collection frequencies are outlined.
IaaS layer: physical/virtual machines – server status, CPU, memory, disk I/O, network traffic, bandwidth, etc. (second‑ or minute‑level).
PaaS layer: databases – cluster status, connections, slow queries, locks, memory usage; middleware – status, connections, sessions; containers – runtime, pod status, resource usage (minute‑ or second‑level).
SaaS layer: application services – availability, request count, response time, HTTP status codes (second‑ or minute‑level).
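One way to map these layers onto Prometheus is a scrape job per layer, each with its own collection interval. The fragment below is only a sketch: the job names, intervals, and every target except 192.168.0.1:9100 are assumptions made for illustration. Ports 9100 and 9104 are the default listen ports of node_exporter and mysqld_exporter; the application port is hypothetical.

```yaml
# prometheus.yml fragment (hypothetical job names, hosts, and intervals)
global:
  scrape_interval: 60s              # default, minute-level collection
scrape_configs:
  - job_name: 'iaas-node'           # IaaS: host metrics from node_exporter
    scrape_interval: 15s            # second-level collection
    static_configs:
      - targets: ['192.168.0.1:9100']
  - job_name: 'paas-mysql'          # PaaS: database metrics from mysqld_exporter
    static_configs:                 # inherits the 60s global interval
      - targets: ['192.168.0.2:9104']
  - job_name: 'saas-app'            # SaaS: metrics exposed by the application itself
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['192.168.0.3:8080']
```

A per-job scrape_interval overrides the global value, so only the layers that need second-level resolution pay for it.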
2. Data Collection
Deploy the appropriate exporter or probe on each node. For example, node_exporter running on a server exposes metrics on port 9100, which Prometheus scrapes over HTTP.
Example command: curl localhost:9100/metrics
<code># TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 5.76669737e+06
# TYPE node_disk_info gauge
node_disk_info{device="dm-0",major="252",minor="0"} 1
# ... additional metric lines ...
</code>
The MySQL exporter provides metrics such as mysql_global_variables_thread_cache_size and mysql_global_variables_thread_stack.
<code># TYPE mysql_global_variables_thread_cache_size gauge
mysql_global_variables_thread_cache_size 9
# TYPE mysql_global_variables_thread_stack gauge
mysql_global_variables_thread_stack 262144
# ... additional metric lines ...
</code>
3. Prometheus Server Configuration
Example job for node metrics:
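<code>- job_name: 'node'
  metrics_path: /metrics              # HTTP path the exporter serves metrics on
  scheme: http
  scrape_interval: 30s                # how often each target is scraped
  scrape_timeout: 20s                 # must not exceed scrape_interval
  file_sd_configs:
    - files: ['/prom/targets/node.yml']   # targets are read from this file
      refresh_interval: 30s               # how often the file is re-read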
</code>
In the target file, labels attach key‑value pairs to the scraped series for later querying, and targets lists the actual endpoints (e.g., 192.168.0.1:9100).
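A minimal sketch of what the target file referenced by the job above might contain. The path and the endpoint come from the text, and the project label matches the selector used in the alert expression later on; the env label and its value are assumptions added for illustration.

```yaml
# /prom/targets/node.yml (read by file_sd_configs in the 'node' job)
- targets:
    - '192.168.0.1:9100'      # node_exporter endpoint from the text
  labels:
    project: 'xx'             # same label the alert expression filters on
    env: 'prod'               # hypothetical extra label
```

Prometheus re-reads this file on change and at each refresh_interval, so targets can be added or removed without restarting the server.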
4. Visualization with Grafana
Grafana connects to Prometheus as a data source and offers dashboards for resource overview, Kubernetes pod status, namespace statistics, and time‑range queries.
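The data source can be added through the Grafana UI or declared in a provisioning file. The sketch below assumes Prometheus is reachable at http://localhost:9090 (its default port) and that the file is placed in Grafana's provisioning/datasources directory; the data source name is arbitrary.

```yaml
# provisioning/datasources/prometheus.yml (URL and name are assumptions)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies the queries
    url: http://localhost:9090     # address of the Prometheus server
    isDefault: true
```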
Grafana also manages user permissions through concepts of org, team, role, and user.
5. Alert Management
Alert rules define a condition (expr), a duration (for), severity labels, and annotations. The example below covers high memory usage; a disk usage rule follows the same pattern.
<code>- alert: HighMemoryUsage
  expr: 100 - (node_memory_MemAvailable_bytes{project='xx'} / node_memory_MemTotal_bytes{project='xx'}) * 100 > 98
  for: 5m                     # condition must hold this long before the alert fires
  labels:
    severity: critical
    type: server
  annotations:
    summary: "{{ $labels.instance }} memory usage high!"
    description: "Memory usage exceeds 98% (current: {{ printf \"%.2f\" $value }}%)"
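</code>
An alert moves through three states: Inactive (the condition is not met), Pending (the condition is met but the for duration has not yet elapsed), and Firing (the condition has held for the full duration). Alerts can be sent via email, DingTalk, WeChat, SMS, or webhook.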
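Delivery is typically handled by Alertmanager, which receives firing alerts from Prometheus and routes them to receivers. The following is a minimal sketch rather than a production setup: every address, URL, and receiver name is a placeholder, SMTP authentication is omitted, and the matchers syntax assumes Alertmanager 0.22 or later. Channels such as DingTalk are commonly reached through a webhook adapter, represented here by a hypothetical local URL.

```yaml
# alertmanager.yml sketch (all hosts, addresses, and names are placeholders)
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
route:
  receiver: 'ops-email'              # default receiver
  group_by: ['alertname', 'project'] # batch alerts that share these labels
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h                # re-send while the alert keeps firing
  routes:
    - matchers:
        - 'severity="critical"'      # matches the label set in the rule above
      receiver: 'ops-webhook'
receivers:
  - name: 'ops-email'
    email_configs:
      - to: 'ops@example.com'
  - name: 'ops-webhook'
    webhook_configs:
      - url: 'http://127.0.0.1:8060/alert'   # hypothetical webhook adapter
```

Routing on the severity label keeps critical alerts on a faster channel while everything else falls through to email.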