9 Essential Metrics for Effective Microservice Monitoring
This article outlines nine crucial microservice monitoring indicators—including request tracing, health checks, throughput, response time, success and error rates, concurrent connections, CPU/memory usage, and resource utilization—to help engineers assess performance and reliability in distributed systems.
To monitor microservices effectively, you need to track specific performance and health indicators. Below are nine key metrics commonly used in production environments.
Service Request Tracing
Request tracing records the call chain of a request across multiple services in a microservice architecture.
The typical tracing flow includes:
Request entry: A unique Trace ID is generated and attached to the incoming request.
Service call: The first service processes the request and may call downstream services, passing the Trace ID along.
Record information: Each service logs its processing details together with the Trace ID (e.g., latency, errors).
Pass Trace ID: When invoking another service, the Trace ID is forwarded to maintain the chain.
Request exit: After the final response is returned to the client, the complete trace is aggregated and stored.
Tracing enables performance analysis and fault isolation using tools such as SkyWalking, Zipkin, Spring Cloud Sleuth, Jaeger, or Pinpoint.
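The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the X-Trace-Id header name, the in-memory SPANS list, and the order_service and inventory_service functions are hypothetical and are not taken from any of the tools listed.

```python
import time
import uuid

# Hypothetical in-memory span store; a real system would ship spans to a tracing backend.
SPANS = []

def record_span(trace_id, service, started, error=None):
    """Record one service's processing details together with the Trace ID."""
    SPANS.append({
        "trace_id": trace_id,
        "service": service,
        "latency_ms": (time.perf_counter() - started) * 1000,
        "error": error,
    })

def inventory_service(headers):
    # Downstream service: reuse the Trace ID forwarded by the caller.
    trace_id = headers["X-Trace-Id"]
    started = time.perf_counter()
    result = {"in_stock": True}
    record_span(trace_id, "inventory", started)
    return result

def order_service(headers):
    trace_id = headers["X-Trace-Id"]
    started = time.perf_counter()
    # Pass the Trace ID along when invoking another service.
    stock = inventory_service({"X-Trace-Id": trace_id})
    record_span(trace_id, "order", started)
    return {"ok": stock["in_stock"]}

def handle_request():
    # Request entry: generate a unique Trace ID and attach it to the request.
    trace_id = uuid.uuid4().hex
    response = order_service({"X-Trace-Id": trace_id})
    # Request exit: aggregate every span recorded under this Trace ID.
    trace = [span for span in SPANS if span["trace_id"] == trace_id]
    return trace_id, response, trace
```

Calling handle_request() returns the Trace ID, the response, and the spans recorded for that request. The downstream span is recorded first because the inventory call finishes before the order service does.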
Service Instance Health Status
Health of a service instance is usually monitored through several mechanisms:
Heartbeat Check: Instances periodically send heartbeat signals to a registry or health‑check component. Missing heartbeats indicate the instance may be down.
Health Check: Instances perform self‑checks of critical metrics and dependencies. An instance that fails its check is removed from service discovery.
Load‑Balancer Health Check: Load balancers verify instance liveness; unhealthy instances stop receiving traffic.
Log and Metric Monitoring: Abnormal logs or error‑rate spikes help identify unhealthy instances.
Self‑Healing & Auto‑Scaling: Detected failures can trigger automatic restarts or scaling of new instances.
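As a rough sketch of the heartbeat mechanism described above, a registry can remember when each instance last reported in and treat instances that have been silent for too long as unhealthy. The HeartbeatRegistry class, the 30-second timeout, and the injectable clock are assumptions made for illustration, not the behavior of any particular registry product.

```python
import time

class HeartbeatRegistry:
    """Tracks the last heartbeat per instance (hypothetical, in-memory)."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # assumed threshold, tune per system
        self.clock = clock                      # injectable so tests can control time
        self.last_seen = {}

    def heartbeat(self, instance_id):
        # Called each time an instance sends its periodic heartbeat.
        self.last_seen[instance_id] = self.clock()

    def healthy_instances(self):
        # An instance is healthy if it reported within the timeout window.
        now = self.clock()
        return sorted(
            instance_id
            for instance_id, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )

    def evict_stale(self):
        # Remove instances whose heartbeats are missing, as service discovery would.
        now = self.clock()
        stale = sorted(
            instance_id
            for instance_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
        for instance_id in stale:
            del self.last_seen[instance_id]
        return stale
```

The stale list returned by evict_stale() is the natural place to hook in a restart or scale-out action.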
Throughput
Throughput measures the number of requests a service handles over a period, typically expressed as Requests Per Second (RPS).
For example, a service with a throughput of 100 RPS is processing 100 requests each second. Measured at peak load, this figure indicates the service's capacity.
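One simple way to measure RPS is to count request timestamps inside a sliding window and divide by the window length. The ThroughputMeter class below is a hypothetical sketch of that idea. Production systems more often export a monotonically increasing counter and let the monitoring backend compute the rate.

```python
import time
from collections import deque

class ThroughputMeter:
    """Sliding-window request counter (illustrative only)."""

    def __init__(self, window_seconds=1.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so tests can control time
        self.timestamps = deque()

    def record_request(self):
        self.timestamps.append(self.clock())

    def rps(self):
        # Drop timestamps that have fallen out of the window, then average.
        cutoff = self.clock() - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window_seconds
```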
Request Response Time
Response time measures the latency from receiving a request to sending the response. Shorter times generally reflect better performance; for example, a service might average 100 ms per request.
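A minimal sketch of measuring response time is a decorator that times each handler call and stores the latency in milliseconds. The timed decorator, the LATENCIES_MS list, and the handle function are hypothetical names used only for this example.

```python
import time
from functools import wraps

# Hypothetical in-memory sample store; real services usually export a histogram.
LATENCIES_MS = []

def timed(handler):
    """Record how long each call to handler takes, in milliseconds."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            # Recorded even if the handler raises, so failures are measured too.
            LATENCIES_MS.append((time.perf_counter() - started) * 1000)
    return wrapper

def average_ms(samples):
    """Arithmetic mean of latency samples; 0.0 when there are none."""
    return sum(samples) / len(samples) if samples else 0.0

@timed
def handle(request):
    return {"echo": request}
```

For samples of 80, 100, and 120 ms, average_ms returns 100.0.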
Request Success Rate
Success rate is the proportion of successfully processed requests out of the total, reflecting service availability.
Successful requests: 1,000
Total requests: 1,100
Success rate: 90.91 %
Error Rate
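Error rate indicates the fraction of requests that resulted in errors. Depending on how outcomes are classified, it is not necessarily the complement of the success rate.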
Error requests: 50
Total requests: 1,100
Error rate: 4.55 %
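Both the success rate and the error rate are a count divided by the total number of requests. The sketch below reproduces the two figures above. The summarize helper is hypothetical: it assumes HTTP status codes, treats 2xx as success and 5xx as errors, and uses 50 client-side 404 responses purely as an illustration of how both figures could come from the same 1,100 requests. The source does not say how its outcomes were classified.

```python
def rate_percent(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    if total == 0:
        return 0.0
    return round(count / total * 100, 2)

def summarize(status_codes):
    # Assumption: 2xx counts as success, 5xx as error, anything else as neither.
    total = len(status_codes)
    successes = sum(1 for code in status_codes if 200 <= code < 300)
    errors = sum(1 for code in status_codes if code >= 500)
    return {
        "success_rate": rate_percent(successes, total),
        "error_rate": rate_percent(errors, total),
    }

print(rate_percent(1000, 1100))  # 90.91
print(rate_percent(50, 1100))    # 4.55
```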
Concurrent Connections
Concurrent connections represent the number of client connections active at a given moment.
For instance, 100 concurrent connections mean 100 clients are communicating with the service simultaneously.
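Concurrent connections behave like a gauge: the value goes up when a connection opens and down when it closes. A minimal sketch, assuming a hypothetical ConnectionGauge class that a server would wrap around each connection's lifetime:

```python
import threading
from contextlib import contextmanager

class ConnectionGauge:
    """Counts connections that are currently open (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0   # connections open right now
        self.peak = 0     # highest value of active seen so far

    @contextmanager
    def track(self):
        # Increment on connect, decrement on disconnect, even if handling fails.
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        try:
            yield
        finally:
            with self._lock:
                self.active -= 1
```

A handler would run inside "with gauge.track():" so the count stays accurate even when the handler raises.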
CPU and Memory Usage
CPU and memory utilization indicate how much of the instance’s resources are being consumed.
For example, an instance might report CPU usage of 60 % and memory usage of 70 %.
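Percentages like these are usually derived from raw counters. The sketch below assumes two cumulative CPU time samples (busy time and total time, in the same unit) and a used/total memory pair. Where those raw numbers come from, such as OS counters or a monitoring agent, is left open, and both function names are hypothetical.

```python
def cpu_usage_percent(busy_before, total_before, busy_after, total_after):
    """CPU usage over an interval, from two cumulative (busy, total) samples."""
    total_delta = total_after - total_before
    if total_delta <= 0:
        return 0.0  # no time elapsed between samples
    return round((busy_after - busy_before) / total_delta * 100, 2)

def memory_usage_percent(used_bytes, total_bytes):
    """Share of memory in use, as a percentage."""
    if total_bytes <= 0:
        return 0.0
    return round(used_bytes / total_bytes * 100, 2)

# 6 busy seconds out of a 10-second interval
print(cpu_usage_percent(100.0, 1000.0, 106.0, 1010.0))     # 60.0
# 7 GiB used out of 10 GiB
print(memory_usage_percent(7 * 1024**3, 10 * 1024**3))     # 70.0
```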
Resource Utilization
Resource utilization covers other consumables such as database connection pools, helping assess whether resources are over‑ or under‑used.
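For a database connection pool, utilization can be expressed as connections in use divided by the pool's maximum size. The thresholds in this sketch (below 20 % as under-used, above 80 % as over-used) are arbitrary assumptions for illustration, not recommended values, and the function names are hypothetical.

```python
def pool_utilization(in_use, max_size):
    """Fraction of the pool's connections that are currently checked out."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return in_use / max_size

def classify(utilization, low=0.2, high=0.8):
    # Assumed thresholds; pick values that match your own workload.
    if utilization > high:
        return "over-used"
    if utilization < low:
        return "under-used"
    return "ok"
```

For example, 45 of 50 connections in use gives a utilization of 0.9, which this sketch classifies as over-used.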
These metrics should be selected based on specific business needs and system characteristics.
Mike Chen's Internet Architecture
Over ten years of BAT architecture experience, shared generously!
