Prometheus Architecture and Design Principles: A Deep Dive into Cloud-Native Monitoring
Prometheus, a CNCF-graduated, cloud-native monitoring system, combines pull-based collection with target discovery, a label-rich time-series data model, and four core metric types (gauge, counter, histogram, and summary) to provide near-real-time visibility, short-term retention, alerting via Alertmanager, and integration with Grafana and remote storage for scalable observability.
Prometheus was the second open-source project, after Kubernetes, to graduate from the CNCF. It was created at SoundCloud and was inspired by Google's Borgmon. This article explores Prometheus's architecture principles, target discovery mechanisms, metric models, and aggregation queries from the perspective of monitoring fundamentals.
A monitoring system is a productized solution for quantifying and managing technology and business services. It addresses two core problems: (1) Technology - digitizing and visualizing system functions and states to ensure stability and security; (2) Business - digitizing and visualizing business performance for analysis and timely intervention.
Basic Monitoring Principles:
Pre-emptive Monitoring: Monitoring must be considered during architecture design, not after deployment
What to Monitor: Take a global view and work top-down from the business; cover user-facing elements first
User-Friendly: Easy-to-use monitoring services with automated integration
Visualization: Clear data display through various charts
Alerting: Define what issues need notification, who to notify, how to notify, frequency, and escalation procedures
Prometheus Architecture:
Prometheus is a near-real-time monitoring system with a built-in time-series database. It focuses on recent data rather than long-term history; Facebook's Gorilla paper, for example, reports that at least 85% of time-series queries in its monitoring workload were for data from the past 26 hours. Prometheus primarily works in pull mode, scraping metrics over HTTP from endpoints that targets expose. The Pushgateway is available for cases that cannot be scraped, such as short-lived batch jobs.
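To make the pull model concrete, here is a minimal sketch in Python. It is not Prometheus code: `render_metrics` and `scrape` are hypothetical helpers, the text format is a simplified subset of the exposition format, and `fetch` stands in for an HTTP GET of a target's metrics endpoint.

```python
import time


def render_metrics(samples):
    """Render {(name, labels): value} as simplified exposition-format text.

    `labels` is a tuple of (key, value) pairs. Hypothetical helper, not a
    Prometheus API.
    """
    lines = []
    for (name, labels), value in sorted(samples.items()):
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def scrape(fetch, now=None):
    """Pull one snapshot from a target and stamp it with the scrape time.

    `fetch` is any callable returning the target's metrics text; in a real
    setup this would be an HTTP request made by the monitoring server.
    """
    ts = time.time() if now is None else now
    scraped = []
    for line in fetch().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        series, value = line.rsplit(" ", 1)
        scraped.append((series, float(value), ts))
    return scraped


# The target only exposes its current state; the scraper decides when to read it.
target_state = {
    ("http_requests_total", (("method", "get"),)): 5,
    ("up", ()): 1,
}
snapshot = scrape(lambda: render_metrics(target_state), now=100.0)
```

The design point this illustrates is that the target keeps no history and pushes nothing: the timestamp is assigned on the scraping side, once per pull.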
Target Discovery Methods:
Static Configuration: Manual configuration in prometheus.yml with target lists
File-based Service Discovery: Loading target lists from JSON or YAML files that Prometheus watches for changes
API-based Service Discovery: Integration with service registries and platform APIs such as Consul, Kubernetes, Amazon EC2, and Azure
DNS-based Discovery: Querying DNS records (for example, SRV records) for target lists
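As an illustration of file-based discovery, the sketch below generates a JSON target file in the shape that file-based service discovery reads: a list of groups, each with `targets` and optional `labels`. The helper names, the `env` label, and the addresses are assumptions made for this example. The file is written atomically so that a watcher never reads a half-written file.

```python
import json
import os
import tempfile


def build_target_groups(instances):
    """Group (address, env) pairs into target groups, one group per env label."""
    by_env = {}
    for address, env in instances:
        by_env.setdefault(env, []).append(address)
    return [
        {"targets": sorted(addresses), "labels": {"env": env}}
        for env, addresses in sorted(by_env.items())
    ]


def write_target_file(path, groups):
    """Write the groups to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    os.replace(tmp_path, path)


# Hypothetical inventory; in practice this might come from a CMDB or deploy tool.
groups = build_target_groups([
    ("10.0.0.2:9100", "prod"),
    ("10.0.0.1:9100", "prod"),
    ("10.0.1.1:9100", "staging"),
])
```

A script like this can be rerun whenever the inventory changes; the monitoring server then picks up the new target list from the file without a restart.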
Metric Types:
Gauge: Numeric values that can increase or decrease (e.g., memory usage)
Counter: Monotonically increasing values that can only go up or be reset to zero, for example on process restart (e.g., HTTP request count)
Histogram: Counts observations in configurable buckets to show their distribution (important for estimating latency percentiles)
Summary: Similar to a histogram, but computes quantiles on the client side; suitable for metrics that are not aggregated across instances, such as GC data
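Because a counter only ever increases or resets to zero, a query layer can recover the true increase even across a restart. The following Python sketch shows that idea under simplifying assumptions: `increase` and `per_second_rate` are hypothetical helpers, and they skip the window-boundary extrapolation that a real query engine would apply.

```python
def increase(samples):
    """Total increase over a list of counter values.

    Any drop in value is treated as a counter reset: the counter restarted
    from zero, so the new value itself is the increase since the reset.
    """
    total = 0.0
    prev = None
    for value in samples:
        if prev is not None:
            total += (value - prev) if value >= prev else value
        prev = value
    return total


def per_second_rate(samples, window_seconds):
    """Average per-second increase over the window covered by the samples."""
    return increase(samples) / window_seconds


# The counter restarts between 20 and 3; the increase is 5 + 5 + 3 + 5 = 18.
readings = [10, 15, 20, 3, 8]
total_increase = increase(readings)
```

A gauge has no such rule: a drop in a gauge is a real decrease, which is why the two types need different query functions.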
Three rules of thumb: use a Histogram when aggregating across multiple collection nodes; use a Histogram when you need to observe the data distribution; use a Summary for single-instance (non-cluster) metrics that require accurate percentiles.
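The first rule follows from how histograms are stored: cumulative bucket counts from different nodes can simply be added, and a quantile can then be estimated from the merged buckets. A summary's precomputed percentiles cannot be combined this way. The sketch below assumes cumulative buckets keyed by upper bound, with `math.inf` as the last bucket; the function names and sample numbers are made up for illustration, and the interpolation is a simplified version of what a query engine might do.

```python
import math


def merge_buckets(*instances):
    """Sum cumulative bucket counts from several instances, keyed by upper bound."""
    merged = {}
    for buckets in instances:
        for upper_bound, count in buckets.items():
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return merged


def estimate_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside the matching bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound in bounds:
        count = buckets[upper_bound]
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            width = upper_bound - lower_bound
            return lower_bound + width * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Request latency buckets (seconds) from two hypothetical nodes.
node_a = {0.1: 5, 0.5: 8, 1.0: 10, math.inf: 10}
node_b = {0.1: 15, 0.5: 32, 1.0: 38, math.inf: 40}
cluster = merge_buckets(node_a, node_b)
median = estimate_quantile(0.5, cluster)
```

The estimate is only as precise as the bucket boundaries, which is the trade-off behind the third rule: when a single instance needs accurate percentiles, a summary computed on the client side avoids the interpolation error.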
Data Model:
Prometheus uses the metric name plus its labels as the unique identifier of a time series. Each sample consists of a timestamp, metric name, labels (tags), and value. Labels describe different dimensions of the same metric; changing any label value creates a new time series.
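The identity rule can be sketched with a toy in-memory store. `series_key` and `TinyStore` are hypothetical names for this example and say nothing about how Prometheus stores data internally; they only show that the metric name plus the full label set decides which series a sample belongs to.

```python
def series_key(name, labels):
    """Identity of a time series: metric name plus its complete label set."""
    return (name, frozenset(labels.items()))


class TinyStore:
    """Toy store mapping each series identity to its list of samples."""

    def __init__(self):
        self.series = {}

    def append(self, name, labels, timestamp, value):
        key = series_key(name, labels)
        self.series.setdefault(key, []).append((timestamp, value))


store = TinyStore()
# Same name and labels (label order does not matter): one series, two samples.
store.append("http_requests_total", {"method": "get", "code": "200"}, 1000, 5)
store.append("http_requests_total", {"code": "200", "method": "get"}, 1015, 9)
# One label value changes: this is a different time series.
store.append("http_requests_total", {"method": "get", "code": "500"}, 1015, 1)
```

This is also why labels with many distinct values are costly: every new value multiplies the number of series that must be tracked.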
Retention: Prometheus is designed for short-term monitoring and alerting, and its local storage defaults to 15 days of retention. For longer-term storage, consider a remote storage backend such as InfluxDB.
Ecosystem: Includes Alertmanager for alerting, Pushgateway for push-mode data, Grafana for visualization, remote storage adapters for remote storage, mtail for log-to-metric conversion, and various exporters for monitoring applications, machines, databases, and message queues. Client libraries are available for Java, C, Python, and other languages.
vivo Internet Technology