Service Governance: Monitoring, Fault Management, Release and Capacity Planning
This article explains how to achieve 24/7 service availability through comprehensive monitoring, fault handling, release management, and capacity planning, covering alarm types, batch processing, traffic and resource metrics, fault causes and mitigation, deployment strategies, scaling commands, and service degradation techniques.
Hello everyone, I am Chen.
I was woken up at 4 am by a monitoring alarm indicating a failure in a production batch job. After a lengthy investigation I resolved the issue.
The failure occurred in a data‑validation batch that checks the correctness of data produced by previous jobs. Fortunately the core transaction system was unaffected.
Transaction systems are the entry point for business traffic; any outage directly impacts revenue.
The topic of this article is service governance, whose ultimate goal is uninterrupted "7 × 24" service.
1. Monitoring Alarms
The production alarm accurately identified the responsible owner and the failing batch task by monitoring the middleware execution results.
Typical alarm types are shown in the diagram below:
1.1 Batch Processing Efficiency
In most cases batch jobs do not block the business entry path, so they do not require real-time monitoring.
When batch jobs do block the entry, they must be monitored. Two business scenarios are:
A domain-name system compares DNS records against database records to find dirty data for transaction compensation; while the batch runs, customers may still query the dirty data.
Bank end‑of‑day batch processing disallows real‑time transactions, which conflicts with the "7 × 24" goal.
In these scenarios batch efficiency is a critical monitoring metric; timeout thresholds must be configured.
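As a hedged sketch of what such a timeout check might look like, the following assumes per-job thresholds configured by the job owner (the job names and values here are illustrative, not from the original system):

```python
# Illustrative per-job timeout thresholds, in seconds (hypothetical values).
BATCH_TIMEOUT_SECONDS = {"data-validation": 1800, "txn-compensation": 3600}

def batch_timed_out(job_name: str, started_at: float, now: float) -> bool:
    """True if the batch job has run longer than its configured threshold."""
    threshold = BATCH_TIMEOUT_SECONDS.get(job_name, 3600)  # default: 1 hour
    return now - started_at > threshold
```

A monitoring agent would call this periodically and page the job owner when it returns True, which is how the alarm at the start of this article reached the right person.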
1.2 Traffic Monitoring
Common throttling metrics are illustrated below:
Traffic monitoring should consider:
Different systems use different metrics; for example, Redis can be monitored by QPS, while transaction systems use TPS.
Configure appropriate thresholds based on testing and traffic forecasts.
Account for burst scenarios such as flash sales or coupon grabs.
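One common way to implement such a check is a sliding time window over recent request timestamps. A minimal single-threaded sketch (the window size and threshold are assumptions to be tuned from testing and traffic forecasts, as noted above):

```python
from collections import deque

class QpsMonitor:
    """Count requests in a sliding window and flag threshold breaches."""

    def __init__(self, threshold: int, window: float = 1.0):
        self.threshold = threshold      # max requests allowed in the window
        self.window = window            # window length in seconds
        self.timestamps = deque()

    def record(self, now: float) -> bool:
        """Record one request; return True if QPS now exceeds the threshold."""
        self.timestamps.append(now)
        # Evict events that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold
```

For burst scenarios like flash sales, the threshold would typically be raised (or a separate limiter engaged) ahead of the planned event rather than left at the steady-state value.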
1.3 Exception Monitoring
Exception monitoring is crucial because production environments cannot guarantee error‑free execution. Proper alerts help quickly locate and resolve issues, as demonstrated by the batch alarm at the article’s start.
Key points for exception monitoring:
Client read timeout – investigate server‑side causes promptly.
Set a response time threshold, e.g., 1 second, to trigger alerts.
Monitor business‑level failures such as error response codes.
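The three points above can be folded into one alert predicate. This is a hypothetical sketch; the 1-second threshold comes from the text, but the error codes are invented placeholders for whatever business failure codes the real service defines:

```python
SLOW_THRESHOLD_MS = 1000                          # from the 1-second example
ERROR_CODES = {"SYS_TIMEOUT", "DB_UNAVAILABLE"}   # illustrative business codes

def should_alert(latency_ms: int, response_code: str) -> bool:
    """Alert when a response is slow or carries a business error code."""
    return latency_ms > SLOW_THRESHOLD_MS or response_code in ERROR_CODES
```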
1.4 Resource Utilization
When provisioning production resources, predict usage. For example, estimate how long redis memory will last at the current growth rate, or when a database will exhaust disk space.
Set utilization thresholds, e.g., 70%; exceeding this should raise an alarm because performance degrades near saturation.
Thresholds must consider traffic spikes and reserve extra capacity.
Core services should have rate‑limiting to prevent overload.
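The growth-rate prediction mentioned above is simple arithmetic; a sketch of the two checks (the 70% threshold comes from the text, everything else is an illustrative assumption):

```python
def days_until_exhausted(used_gb: float, capacity_gb: float,
                         daily_growth_gb: float) -> float:
    """Estimate days until a resource (Redis memory, DB disk) hits capacity."""
    if daily_growth_gb <= 0:
        return float("inf")  # not growing; never exhausts at this rate
    return (capacity_gb - used_gb) / daily_growth_gb

def utilization_alarm(used_gb: float, capacity_gb: float,
                      threshold: float = 0.70) -> bool:
    """True when utilization crosses the alarm threshold (default 70%)."""
    return used_gb / capacity_gb > threshold
```

For example, a 100 GB disk at 40 GB used, growing 2 GB per day, has roughly 30 days of headroom, which is the kind of figure that drives provisioning requests before the alarm ever fires.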
1.5 Request Latency
Latency is hard to measure. The diagram below shows an e‑commerce system where a composite service calls order, inventory, and account services.
The composite service takes 2 s, the account service 3 s, and the client’s configured read timeout is 5 s.
Monitoring should trigger an alarm if, for example, 100 requests exceed 5 s within a 1‑second window.
The client's read timeout must not be too large; if the server fails, fast-fail (fail-fast) behavior prevents resource leakage.
1.6 Monitoring Considerations
Monitoring aims to let operators quickly detect production problems and pinpoint causes. Important considerations include:
Set sampling frequency based on monitoring goals; high frequency increases cost.
Achieve high coverage of core system metrics.
Balance the number of metrics – too many reduce alert effectiveness.
Alarm timeliness – batch jobs may use delayed alerts (e.g., trigger at 8 am).
Avoid using averages for long‑tail distributions; use bucketed latency counts instead.
Group latency by intervals (e.g., < 1 s, 1‑2 s, 2‑3 s) and configure thresholds exponentially.
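A minimal sketch of such bucketing, assuming exponentially spaced bounds (the bucket values here are illustrative; real bounds should follow the service's latency profile):

```python
import bisect

# Exponentially spaced bucket upper bounds, in seconds:
# <1s, 1-2s, 2-4s, 4-8s, >=8s (illustrative values).
BUCKET_BOUNDS = [1, 2, 4, 8]

def bucket_for(latency_s: float) -> int:
    """Return the index of the bucket this latency falls into."""
    return bisect.bisect_right(BUCKET_BOUNDS, latency_s)

def histogram(latencies):
    """Count latencies per bucket instead of averaging them away."""
    counts = [0] * (len(BUCKET_BOUNDS) + 1)
    for latency in latencies:
        counts[bucket_for(latency)] += 1
    return counts
```

With counts per bucket, a long tail shows up as a nonzero count in the last bucket even when the average looks healthy, which is exactly the failure mode averages hide.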
2. Fault Management
2.1 Common Fault Causes
Typical fault origins include:
Release upgrades
Hardware failures
System overload
Malicious attacks
Underlying service failures
2.2 Response Strategies
Fault handling proceeds in two steps:
Immediately fix the issue (e.g., correct bad data).
Identify the root cause via logs or tracing and resolve it.
2.2.1 Software Upgrade Faults
Some upgrade issues surface quickly, others only after prolonged operation. For the former, use canary releases; for the latter, improve test coverage.
2.2.2 Hardware Resource Faults
Hardware faults split into overload (e.g., insufficient memory) and aging. Overload is addressed by alerts and resource expansion; aging requires tracking and timely replacement.
2.3 System Overload
Overload may stem from traffic spikes (e.g., flash sales) or gradual growth. Mitigate by adding resources or applying rate limiting.
2.4 Malicious Attacks
Attack types include DDoS, malware, and browser exploits. Defenses include encrypting requests, firewalls, regular security scans, and running services on non-default ports.
2.5 Underlying Software Faults
All components besides business services are foundational software that must be highly available (see diagram).
3. Release Management
Release covers software (and hardware) upgrades for business systems.
3.1 Release Process
A typical upgrade workflow is illustrated below:
The final step is to deploy to production and verify the release succeeded.
3.2 Release Quality
Ensuring release quality requires a checklist and confirmation of all items before building and deploying.
3.2.1 Checklist
Typical checklist items include:
Correct SQL scripts
Complete production configuration
External dependencies verified
Routing permissions for new machines
Clear ordering of multi‑service releases
Fault‑response plan
3.2.2 Canary Release
Deploy a single server as a canary; if it runs fine, roll out to the rest, otherwise roll back.
3.2.3 Blue‑Green Deployment
Before upgrade, traffic goes to the green environment; after upgrade, switch to blue via load balancer. Keep green as fallback.
Blue‑green requires an extra set of machines, increasing cost compared to canary.
3.2.4 A/B Testing
Run multiple versions in production to compare UI or workflow differences; users choose the preferred version.
A/B versions are already validated, unlike canary releases.
3.2.5 Configuration Changes
When configuration is hard‑coded (e.g., in YAML files), each change requires a new release. To avoid this, use a configuration center or external storage.
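A hedged sketch of the idea: read a tunable from the environment, standing in for a configuration center, so changing it does not require rebuilding and redeploying. The variable name and default value are invented for illustration:

```python
import os

def get_rate_limit() -> int:
    """Read a tunable from the environment (a stand-in for a config center).

    The hard-coded value is only a fallback; operators can change the
    effective value without a new release. RATE_LIMIT_QPS is hypothetical.
    """
    return int(os.environ.get("RATE_LIMIT_QPS", "100"))
```

Real configuration centers (Apollo, Nacos, Spring Cloud Config, etc.) add push-based refresh on top of this, but the release-decoupling principle is the same.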
4. Capacity Management
Capacity management ensures traffic stays within system limits to prevent crashes. Overload causes include continuous business growth, resource shrinkage, slower request processing, retry‑induced traffic, and sudden spikes.
4.1 Retry
Retries improve user experience but must be limited. Two categories: connection‑timeout retries (usually harmless) and response‑timeout retries (can add load). Excessive retries across a long call chain can overwhelm downstream services.
Recommendations:
Do not retry non‑core services, or limit attempts.
Use exponential back‑off intervals.
Retry based on specific failure codes.
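The three recommendations above can be sketched together. This is a minimal illustration, not a production retry library; the failure codes are invented, and the back-off intervals double per attempt:

```python
import time

# Illustrative retryable codes; real codes depend on the service's protocol.
RETRYABLE_CODES = {"CONN_TIMEOUT", "SERVICE_BUSY"}

def call_with_retry(call, max_attempts: int = 3, base_delay: float = 0.1):
    """Retry only on retryable codes, sleeping with exponential back-off.

    `call` returns a (code, result) tuple; "OK" means success.
    """
    for attempt in range(max_attempts):
        code, result = call()
        if code == "OK":
            return result
        if code not in RETRYABLE_CODES or attempt == max_attempts - 1:
            raise RuntimeError(f"call failed with code {code}")
        time.sleep(base_delay * (2 ** attempt))  # 0.1 s, 0.2 s, 0.4 s, ...
```

Capping `max_attempts` and failing fast on non-retryable codes is what keeps a long call chain from multiplying retries into a downstream traffic storm.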
4.2 Burst Traffic
When traffic spikes, first scale resources. Example for Kubernetes:

```shell
kubectl scale deployment springboot-deployment --replicas=4
```

If resources are exhausted, apply rate limiting. Popular frameworks include Google Guava, Netflix/concurrency‑limits, and Sentinel.
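Most such limiters implement some variant of a token bucket. The sketch below is a single-threaded illustration of the technique, not the Guava or Sentinel API; the caller supplies the clock to keep it deterministic:

```python
class TokenBucket:
    """Minimal token-bucket rate limiter sketch (not production-ready)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = 0.0           # timestamp of the last refill

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or queue the request
```

A production limiter adds thread safety and a wall clock, which is what the listed frameworks provide out of the box.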
4.3 Capacity Planning
Early‑stage capacity planning estimates QPS, runs load tests, and adds a safety margin (e.g., 2×) to handle real‑world spikes.
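The planning arithmetic is straightforward; a sketch, assuming per-instance capacity measured in load tests (all numbers illustrative):

```python
import math

def machines_needed(peak_qps: float, per_machine_qps: float,
                    safety_factor: float = 2.0) -> int:
    """Instances to provision for estimated peak load plus a safety margin."""
    return math.ceil(peak_qps * safety_factor / per_machine_qps)
```

For example, an estimated peak of 5,000 QPS with instances load-tested to 1,000 QPS each, at a 2× margin, calls for 10 instances.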
4.4 Service Degradation
Degradation strategies:
Reject new requests when the server is overloaded.
Pause non‑core services to reserve resources for core services.
Clients monitor rejection ratios (e.g., 100 rejections out of 1000 requests in one minute) and trigger client‑side degradation if thresholds are exceeded.
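The client-side rejection-ratio check can be sketched as follows. The window size and 10% threshold are illustrative assumptions standing in for whatever the operators configure:

```python
from collections import deque

class DegradationGuard:
    """Track the server's rejection ratio over the last N responses."""

    def __init__(self, window: int = 1000, threshold: float = 0.10):
        self.window = deque(maxlen=window)  # True = request was rejected
        self.threshold = threshold

    def record(self, rejected: bool) -> None:
        self.window.append(rejected)

    def should_degrade(self) -> bool:
        """True when the rejection ratio exceeds the threshold."""
        if not self.window:
            return False
        return sum(self.window) / len(self.window) > self.threshold
```

When `should_degrade()` turns True, the client can stop sending non-core requests entirely, sparing the overloaded server the cost of rejecting them one by one.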
5. Summary
Microservice architectures bring many benefits but also challenges such as service discovery, load balancing, monitoring, release, and access control. Service governance addresses these challenges to keep systems stable.
The presented governance approach is a traditional solution that may involve some code intrusion and framework constraints.
In the cloud‑native era, the emergence of Service Mesh adds a new dimension to service governance, which will be explored in future talks.
Final Note (Please Support)
If this article helped you, please like, watch, share, and bookmark – your support keeps me going.
My Knowledge Planet is open for a 199 CNY subscription, offering extensive resources such as the "Code Monkey Chronic Disease Cloud Management" project, Spring full‑stack series, billion‑scale sharding practice, DDD microservice series, and many more.
To join, add my WeChat: special_coder
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn