Mastering IaaS Monitoring: Strategies for Servers, Networks, and Traffic
This article, the second in a monitoring and alerting series, explains how to monitor the IaaS layer comprehensively, covering server health, network device performance, and traffic analysis. It classifies resources and applies four dimensions (status, performance, capacity, and quality) to give operations a unified view.
Overview
This is the second article in a series on monitoring and alerting products. It focuses on IaaS-layer monitoring: server status and performance, network device status and performance, and network traffic analysis.
IaaS
IaaS, PaaS, and SaaS are the three cloud-computing layers. IaaS (Infrastructure as a Service) provides the tangible resources: servers, network devices, and storage. It is the foundation, so poor IaaS management affects the PaaS and SaaS layers built on top of it.
IaaS Monitoring
Monitoring IaaS means monitoring each resource object (physical servers, switches, dedicated lines, public IPs). Four dimensions are used: status, performance, capacity, and quality.
Status Monitoring: device alive, port status, power, fan.
Performance Monitoring: memory size, port traffic, CPU utilization.
Quality Monitoring: packet loss, error rate, latency.
Capacity Monitoring: load utilization of devices, bandwidth usage, etc.
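One way to make the four dimensions concrete is to tag every metric with the dimension it belongs to. The following is a minimal sketch, not part of any particular product; the metric names are hypothetical identifiers derived from the examples in the list above.

```python
from enum import Enum


class Dimension(Enum):
    STATUS = "status"
    PERFORMANCE = "performance"
    QUALITY = "quality"
    CAPACITY = "capacity"


# Hypothetical metric names, based on the examples listed in the article.
METRIC_DIMENSIONS = {
    "device_alive": Dimension.STATUS,
    "port_status": Dimension.STATUS,
    "power": Dimension.STATUS,
    "fan": Dimension.STATUS,
    "memory_size": Dimension.PERFORMANCE,
    "port_traffic": Dimension.PERFORMANCE,
    "cpu_utilization": Dimension.PERFORMANCE,
    "packet_loss": Dimension.QUALITY,
    "error_rate": Dimension.QUALITY,
    "latency": Dimension.QUALITY,
    "device_load_utilization": Dimension.CAPACITY,
    "bandwidth_usage": Dimension.CAPACITY,
}


def metrics_for(dimension):
    """Return the metric names filed under one dimension, sorted by name."""
    return sorted(
        name for name, dim in METRIC_DIMENSIONS.items() if dim is dimension
    )
```

Calling `metrics_for(Dimension.QUALITY)` returns the quality metrics, which makes it easy to build one dashboard or alarm type per dimension.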
Monitoring Product Layer Structure
Most commercial or open‑source monitoring‑alerting products adopt a layered architecture: data collection at the bottom, then storage, analysis, visualization, alerting, and handling.
Data Collection
Enterprise monitoring systems support multiple collection methods (agent reporting, SNMP, IPMI, etc.) and objects (servers, OS metrics, network devices, sessions, lines, etc.). Different objects use appropriate collection methods.
Basic Concepts
Monitoring and alerting involve collection, storage, analysis, display, alerting, and handling of a concrete object. The principle is to manage objects first, then monitor them.
Alarm (Monitoring) Object
Defined as a specific resource in CMDB or a custom CI, e.g., a physical server, a business tier, a TDSQL instance.
Alarm (Monitoring) Metric
One or more feature IDs or derived calculations, e.g., CPU usage, memory usage, or success rate = (successful requests / total requests) * 100.
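The derived success-rate metric can be sketched as a small function. The function name and the handling of the zero-request case are assumptions made for illustration; the article only gives the formula.

```python
def success_rate(successful_requests, total_requests):
    """Success rate as a percentage: (successful / total) * 100.

    Assumption: when there were no requests at all, return None
    ("no data") instead of reporting 0% or 100%.
    """
    if total_requests == 0:
        return None
    return (successful_requests / total_requests) * 100
```

Returning `None` for an idle service keeps a quiet period from being mistaken for a failure; a real system might choose a different convention.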
Alarm (Monitoring) Type
Specifies an algorithm applied to a set of metrics for an object, e.g., single‑machine performance alarm covering CPU, memory, etc.
Alarm Rule
Combines an object, a metric, a trigger condition, and notification convergence settings (threshold, count, duration) into a policy, e.g., alert when CPU usage on a switch exceeds 80%.
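A rule of this kind could be modeled as follows. This is a sketch under stated assumptions: the class name, field names, and the "N consecutive samples above the threshold" interpretation of the count setting are illustrative, and duration-based convergence is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class AlarmRule:
    object_id: str     # monitored object, e.g. a switch (hypothetical ID)
    metric: str        # metric name, e.g. "cpu_usage"
    threshold: float   # alert when the value is strictly above this
    min_breaches: int  # consecutive breaching samples required to alert

    def __post_init__(self):
        if self.min_breaches < 1:
            raise ValueError("min_breaches must be at least 1")

    def should_alert(self, samples):
        """True if the latest `min_breaches` samples all exceed the threshold."""
        if len(samples) < self.min_breaches:
            return False
        recent = samples[-self.min_breaches:]
        return all(value > self.threshold for value in recent)
```

With `AlarmRule("switch-01", "cpu_usage", 80.0, 3)`, a single spike above 80% does not alert; three breaching samples in a row do. Requiring repeated breaches is one simple way to reduce notification noise.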
Alarm Strategy
Groups multiple rules for an object and type; aims to simplify management for many services.
Flow
Flow is a data-exchange method: the device caches the first IP packet of a flow and aggregates subsequent packets that match it into the same record. A flow record contains fields such as source and destination IP, source and destination ports, protocol, ToS, and input interface.
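The aggregation idea can be illustrated in a few lines. This is a simplified sketch, not a real flow exporter: packets are assumed to be plain dictionaries, and the `length` field and the packet and byte counters are assumptions added for the example.

```python
from collections import namedtuple

# Key fields named in the article; packets sharing all of them form one flow.
FlowKey = namedtuple(
    "FlowKey", "src_ip dst_ip src_port dst_port protocol tos input_if"
)


def aggregate_flows(packets):
    """Group packets by flow key, counting packets and bytes per flow."""
    cache = {}
    for pkt in packets:
        key = FlowKey(*(pkt[field] for field in FlowKey._fields))
        # The first packet of a flow creates the cache entry;
        # later packets only update its counters.
        record = cache.setdefault(key, {"packets": 0, "bytes": 0})
        record["packets"] += 1
        record["bytes"] += pkt["length"]
    return cache
```

Two packets that differ only in payload size land in the same record, while a packet to a different destination port starts a new flow.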
Network Flow
Analyzing flow data helps answer questions about network usage, protocol mix, traffic direction, loss, and latency.
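As a small illustration of the usage and protocol questions, flow records can be summed by any field. The record layout here (dictionaries with a `bytes` counter) is an assumption for the sketch, not a standard export format.

```python
from collections import Counter


def bytes_by(records, field):
    """Total bytes per distinct value of `field`, largest first."""
    totals = Counter()
    for rec in records:
        totals[rec[field]] += rec["bytes"]
    return totals.most_common()
```

Calling `bytes_by(records, "src_ip")` gives a "top talkers" list, and `bytes_by(records, "protocol")` gives the protocol mix; loss and latency analysis needs additional measurements that plain byte counters do not provide.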
Network Device Monitoring
Focuses on device status, performance, quality, and capacity, including syslog monitoring, stack status, port traffic, and logical‑port performance. Syslog alerts require grouping devices by vendor, type, model, and purpose.
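The grouping step for syslog alerts might look like the sketch below. The device dictionaries and their keys are hypothetical; in practice this information would come from the CMDB.

```python
from collections import defaultdict


def group_devices(devices):
    """Group device names by (vendor, type, model, purpose)."""
    groups = defaultdict(list)
    for dev in devices:
        key = (dev["vendor"], dev["type"], dev["model"], dev["purpose"])
        groups[key].append(dev["name"])
    return dict(groups)
```

Each group can then be given its own syslog matching rules, since the message formats that matter usually differ by vendor and model.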
Server Monitoring
Monitors status (ping, agent timeout, power), performance (CPU, memory, traffic, packets), and capacity. Rich data enables deeper insight and diverse consumption scenarios.
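The status checks can be combined into a single verdict. The following is a sketch only: the function name, the status labels, and the 180-second default timeout are assumptions, not values from the article.

```python
def server_status(ping_ok, last_report_ts, now_ts, agent_timeout_s=180):
    """Combine a ping result and the agent's last report time into one status.

    Timestamps are in seconds. The agent is considered timed out when
    its last report is older than `agent_timeout_s` (assumed default).
    """
    agent_ok = (now_ts - last_report_ts) <= agent_timeout_s
    if ping_ok and agent_ok:
        return "ok"
    if ping_ok:
        return "agent_timeout"  # host reachable, agent stopped reporting
    if agent_ok:
        return "ping_failed"    # agent still reports, ping is blocked or lost
    return "down"               # neither signal is healthy
```

Keeping the two signals separate helps distinguish a dead host from a crashed agent or a filtered ICMP path.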
Summary
IaaS monitoring classifies resources and evaluates them across status, performance, capacity, and quality, providing a unified view for development and operations. Building a monitoring and alerting product is complex and requires consideration of technical capabilities, user and system perspectives, and permission models.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.