Service Governance: Monitoring, Fault Management, Release and Capacity Planning
This article explains how to achieve 24/7 service availability through comprehensive monitoring, fault handling, release management, and capacity planning, covering alarm types, batch processing, traffic and resource metrics, fault causes and mitigation, deployment strategies, scaling commands, and service degradation techniques.
Hello everyone, I am Chen.
I was woken up at 4 am by a monitoring alarm indicating a failure in a production batch job. After a lengthy investigation I resolved the issue.
The failure occurred in a data‑validation batch that checks the correctness of data produced by previous jobs. Fortunately the core transaction system was unaffected.
Transaction systems are the entry point for business traffic; any outage directly impacts revenue.
The topic of this article is service governance, whose ultimate goal is uninterrupted "7 × 24" service.
1. Monitoring Alarms
The production alarm accurately identified the responsible owner and the failing batch task by monitoring the middleware execution results.
Typical alarm types are shown in the diagram below:
1.1 Batch Processing Efficiency
In most cases batch jobs do not block the business entry path, so they do not require real-time monitoring.
When batch jobs do block the entry, they must be monitored. Two business scenarios are:
A domain-name system compares DNS records against database records to find dirty data for transaction compensation; while the batch runs, customers may still query the dirty data.
Bank end‑of‑day batch processing disallows real‑time transactions, which conflicts with the "7 × 24" goal.
In these scenarios batch efficiency is a critical monitoring metric; timeout thresholds must be configured.
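As a hedged sketch of what such a timeout check might look like, the following assumes per-job thresholds configured by the job owner (the job names and values here are illustrative, not from the original system):

```python
# Illustrative per-job timeout thresholds, in seconds (hypothetical values).
BATCH_TIMEOUT_SECONDS = {"data-validation": 1800, "txn-compensation": 3600}

def batch_timed_out(job_name: str, started_at: float, now: float) -> bool:
    """True if the batch job has run longer than its configured threshold."""
    threshold = BATCH_TIMEOUT_SECONDS.get(job_name, 3600)  # default: 1 hour
    return now - started_at > threshold
```

A monitoring agent would call this periodically and page the job owner when it returns True, which is how the alarm at the start of this article reached the right person.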
1.2 Traffic Monitoring
Common throttling metrics are illustrated below:
Traffic monitoring should consider:
Different systems use different metrics; for example, Redis can be monitored by QPS, while transaction systems use TPS.
Configure appropriate thresholds based on testing and traffic forecasts.
Account for burst scenarios such as flash sales or coupon grabs.
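One common way to implement such a check is a sliding time window over recent request timestamps. A minimal single-threaded sketch (the window size and threshold are assumptions to be tuned from testing and traffic forecasts, as noted above):

```python
from collections import deque

class QpsMonitor:
    """Count requests in a sliding window and flag threshold breaches."""

    def __init__(self, threshold: int, window: float = 1.0):
        self.threshold = threshold      # max requests allowed in the window
        self.window = window            # window length in seconds
        self.timestamps = deque()

    def record(self, now: float) -> bool:
        """Record one request; return True if QPS now exceeds the threshold."""
        self.timestamps.append(now)
        # Evict events that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold
```

For burst scenarios like flash sales, the threshold would typically be raised (or a separate limiter engaged) ahead of the planned event rather than left at the steady-state value.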
1.3 Exception Monitoring
Exception monitoring is crucial because production environments cannot guarantee error‑free execution. Proper alerts help quickly locate and resolve issues, as demonstrated by the batch alarm at the article’s start.
Key points for exception monitoring:
Client read timeout – investigate server‑side causes promptly.
Set a response time threshold, e.g., 1 second, to trigger alerts.
Monitor business‑level failures such as error response codes.
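The three points above can be folded into one alert predicate. This is a hypothetical sketch; the 1-second threshold comes from the text, but the error codes are invented placeholders for whatever business failure codes the real service defines:

```python
SLOW_THRESHOLD_MS = 1000                          # from the 1-second example
ERROR_CODES = {"SYS_TIMEOUT", "DB_UNAVAILABLE"}   # illustrative business codes

def should_alert(latency_ms: int, response_code: str) -> bool:
    """Alert when a response is slow or carries a business error code."""
    return latency_ms > SLOW_THRESHOLD_MS or response_code in ERROR_CODES
```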
1.4 Resource Utilization
When provisioning production resources, predict usage. For example, estimate how long redis memory will last at the current growth rate, or when a database will exhaust disk space.
Set utilization thresholds, e.g., 70%; exceeding this should raise an alarm because performance degrades near saturation.
Thresholds must consider traffic spikes and reserve extra capacity.
Core services should have rate‑limiting to prevent overload.
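The growth-rate prediction mentioned above is simple arithmetic; a sketch of the two checks (the 70% threshold comes from the text, everything else is an illustrative assumption):

```python
def days_until_exhausted(used_gb: float, capacity_gb: float,
                         daily_growth_gb: float) -> float:
    """Estimate days until a resource (Redis memory, DB disk) hits capacity."""
    if daily_growth_gb <= 0:
        return float("inf")  # not growing; never exhausts at this rate
    return (capacity_gb - used_gb) / daily_growth_gb

def utilization_alarm(used_gb: float, capacity_gb: float,
                      threshold: float = 0.70) -> bool:
    """True when utilization crosses the alarm threshold (default 70%)."""
    return used_gb / capacity_gb > threshold
```

For example, a 100 GB disk at 40 GB used, growing 2 GB per day, has roughly 30 days of headroom, which is the kind of figure that drives provisioning requests before the alarm ever fires.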
1.5 Request Latency
Latency is hard to measure. The diagram below shows an e‑commerce system where a composite service calls order, inventory, and account services.
The composite service takes 2 s, the account service 3 s, and the client’s configured read timeout is 5 s.
Monitoring should trigger an alarm if, for example, 100 requests exceed 5 s within a 1‑second window.
The client's read timeout must not be too large; if the server fails, fast-fail (fail-fast) behavior prevents resource leakage.
1.6 Monitoring Considerations
Monitoring aims to let operators quickly detect production problems and pinpoint causes. Important considerations include:
Set sampling frequency based on monitoring goals; high frequency increases cost.
Achieve high coverage of core system metrics.
Balance the number of metrics – too many reduce alert effectiveness.
Alarm timeliness – batch jobs may use delayed alerts (e.g., trigger at 8 am).
Avoid using averages for long‑tail distributions; use bucketed latency counts instead.
Group latency by intervals (e.g., < 1 s, 1‑2 s, 2‑3 s) and configure thresholds exponentially.
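A minimal sketch of such bucketing, assuming exponentially spaced bounds (the bucket values here are illustrative; real bounds should follow the service's latency profile):

```python
import bisect

# Exponentially spaced bucket upper bounds, in seconds:
# <1s, 1-2s, 2-4s, 4-8s, >=8s (illustrative values).
BUCKET_BOUNDS = [1, 2, 4, 8]

def bucket_for(latency_s: float) -> int:
    """Return the index of the bucket this latency falls into."""
    return bisect.bisect_right(BUCKET_BOUNDS, latency_s)

def histogram(latencies):
    """Count latencies per bucket instead of averaging them away."""
    counts = [0] * (len(BUCKET_BOUNDS) + 1)
    for latency in latencies:
        counts[bucket_for(latency)] += 1
    return counts
```

With counts per bucket, a long tail shows up as a nonzero count in the last bucket even when the average looks healthy, which is exactly the failure mode averages hide.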
2. Fault Management
2.1 Common Fault Causes
Typical fault origins include:
Release upgrades
Hardware failures
System overload
Malicious attacks
Underlying service failures
2.2 Response Strategies
Fault handling proceeds in two steps:
Immediately fix the issue (e.g., correct bad data).
Identify the root cause via logs or tracing and resolve it.
2.2.1 Software Upgrade Faults
Some upgrade issues surface quickly, others only after prolonged operation. For the former, use canary releases; for the latter, improve test coverage.
2.2.2 Hardware Resource Faults
Hardware faults split into overload (e.g., insufficient memory) and aging. Overload is addressed by alerts and resource expansion; aging requires tracking and timely replacement.
2.3 System Overload
Overload may stem from traffic spikes (e.g., flash sales) or gradual growth. Mitigate by adding resources or applying rate limiting.
2.4 Malicious Attacks
Attack types include DDoS, malware, and browser exploits. Defenses include encrypting requests, firewalls, regular security scans, and running services on non-default ports.
2.5 Underlying Software Faults
All components besides business services are foundational software that must be highly available (see diagram).
3. Release Management
Release covers software (and hardware) upgrades for business systems.
3.1 Release Process
A typical upgrade workflow is illustrated below:
The final step is to deploy to production and verify the release succeeded.
3.2 Release Quality
Ensuring release quality requires a checklist and confirmation of all items before building and deploying.
3.2.1 Checklist
Typical checklist items include:
Correct SQL scripts
Complete production configuration
External dependencies verified
Routing permissions for new machines
Clear ordering of multi‑service releases
Fault‑response plan
3.2.2 Canary Release
Deploy a single server as a canary; if it runs fine, roll out to the rest, otherwise roll back.
3.2.3 Blue‑Green Deployment
Before upgrade, traffic goes to the green environment; after upgrade, switch to blue via load balancer. Keep green as fallback.
Blue‑green requires an extra set of machines, increasing cost compared to canary.
3.2.4 A/B Testing
Run multiple versions in production to compare UI or workflow differences; users choose the preferred version.
A/B versions are already validated, unlike canary releases.
3.2.5 Configuration Changes
When configuration is hard‑coded (e.g., in YAML files), each change requires a new release. To avoid this, use a configuration center or external storage.
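A hedged sketch of the idea: read a tunable from the environment, standing in for a configuration center, so changing it does not require rebuilding and redeploying. The variable name and default value are invented for illustration:

```python
import os

def get_rate_limit() -> int:
    """Read a tunable from the environment (a stand-in for a config center).

    The hard-coded value is only a fallback; operators can change the
    effective value without a new release. RATE_LIMIT_QPS is hypothetical.
    """
    return int(os.environ.get("RATE_LIMIT_QPS", "100"))
```

Real configuration centers (Apollo, Nacos, Spring Cloud Config, etc.) add push-based refresh on top of this, but the release-decoupling principle is the same.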
4. Capacity Management
Capacity management ensures traffic stays within system limits to prevent crashes. Overload causes include continuous business growth, resource shrinkage, slower request processing, retry‑induced traffic, and sudden spikes.
4.1 Retry
Retries improve user experience but must be limited. Two categories: connection‑timeout retries (usually harmless) and response‑timeout retries (can add load). Excessive retries across a long call chain can overwhelm downstream services.
Recommendations:
Do not retry non‑core services, or limit attempts.
Use exponential back‑off intervals.
Retry based on specific failure codes.
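The three recommendations above can be sketched together. This is a minimal illustration, not a production retry library; the failure codes are invented, and the back-off intervals double per attempt:

```python
import time

# Illustrative retryable codes; real codes depend on the service's protocol.
RETRYABLE_CODES = {"CONN_TIMEOUT", "SERVICE_BUSY"}

def call_with_retry(call, max_attempts: int = 3, base_delay: float = 0.1):
    """Retry only on retryable codes, sleeping with exponential back-off.

    `call` returns a (code, result) tuple; "OK" means success.
    """
    for attempt in range(max_attempts):
        code, result = call()
        if code == "OK":
            return result
        if code not in RETRYABLE_CODES or attempt == max_attempts - 1:
            raise RuntimeError(f"call failed with code {code}")
        time.sleep(base_delay * (2 ** attempt))  # 0.1 s, 0.2 s, 0.4 s, ...
```

Capping `max_attempts` and failing fast on non-retryable codes is what keeps a long call chain from multiplying retries into a downstream traffic storm.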
4.2 Burst Traffic
When traffic spikes, first scale resources. Example for Kubernetes:

```shell
kubectl scale deployment springboot-deployment --replicas=4
```

If resources are exhausted, apply rate limiting. Popular frameworks include Google Guava, Netflix/concurrency‑limits, and Sentinel.
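Most such limiters implement some variant of a token bucket. The sketch below is a single-threaded illustration of the technique, not the Guava or Sentinel API; the caller supplies the clock to keep it deterministic:

```python
class TokenBucket:
    """Minimal token-bucket rate limiter sketch (not production-ready)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = 0.0           # timestamp of the last refill

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or queue the request
```

A production limiter adds thread safety and a wall clock, which is what the listed frameworks provide out of the box.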
4.3 Capacity Planning
Early‑stage capacity planning estimates QPS, runs load tests, and adds a safety margin (e.g., 2×) to handle real‑world spikes.
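The planning arithmetic is straightforward; a sketch, assuming per-instance capacity measured in load tests (all numbers illustrative):

```python
import math

def machines_needed(peak_qps: float, per_machine_qps: float,
                    safety_factor: float = 2.0) -> int:
    """Instances to provision for estimated peak load plus a safety margin."""
    return math.ceil(peak_qps * safety_factor / per_machine_qps)
```

For example, an estimated peak of 5,000 QPS with instances load-tested to 1,000 QPS each, at a 2× margin, calls for 10 instances.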
4.4 Service Degradation
Degradation strategies:
Reject new requests when the server is overloaded.
Pause non‑core services to reserve resources for core services.
Clients monitor rejection ratios (e.g., 100 rejections out of 1000 requests in one minute) and trigger client‑side degradation if thresholds are exceeded.
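The client-side rejection-ratio check can be sketched as follows. The window size and 10% threshold are illustrative assumptions standing in for whatever the operators configure:

```python
from collections import deque

class DegradationGuard:
    """Track the server's rejection ratio over the last N responses."""

    def __init__(self, window: int = 1000, threshold: float = 0.10):
        self.window = deque(maxlen=window)  # True = request was rejected
        self.threshold = threshold

    def record(self, rejected: bool) -> None:
        self.window.append(rejected)

    def should_degrade(self) -> bool:
        """True when the rejection ratio exceeds the threshold."""
        if not self.window:
            return False
        return sum(self.window) / len(self.window) > self.threshold
```

When `should_degrade()` turns True, the client can stop sending non-core requests entirely, sparing the overloaded server the cost of rejecting them one by one.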
5. Summary
Microservice architectures bring many benefits but also challenges such as service discovery, load balancing, monitoring, release, and access control. Service governance addresses these challenges to keep systems stable.
The presented governance approach is a traditional solution that may involve some code intrusion and framework constraints.
In the cloud‑native era, the emergence of Service Mesh adds a new dimension to service governance, which will be explored in future talks.
Final Note (Please Support)
If this article helped you, please like, watch, share, and bookmark – your support keeps me going.
My Knowledge Planet is open for a 199 CNY subscription, offering extensive resources such as the "Code Monkey Chronic Disease Cloud Management" project, Spring full‑stack series, billion‑scale sharding practice, DDD microservice series, and many more.
To join, add my WeChat: special_coder
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn