
Improving System Availability: Stages, Influencing Factors, and Practical Measures

This article explains system availability, outlines three stages of incident handling, identifies key factors that degrade availability, such as human error, avalanche effects, untested releases, and infrastructure failures, and proposes technical and team‑oriented practices to enhance reliability and achieve higher "nines" of uptime.


The article opens with a story about the King of Wei and the physician Bian Que, which illustrates three stages of treating an illness; the author uses it as an analogy for managing system availability throughout its lifecycle.

System availability is defined as the proportion of time a system remains in a working state and is typically specified in Service Level Agreements (SLAs). The common "nines" metric quantifies availability, as shown in the table below.

| Availability % | Downtime per year | Downtime per month |
|----------------|-------------------|--------------------|
| 90%            | 36.5 days         | 72 hours           |
| 99%            | 3.65 days         | 7.20 hours         |
| 99.9%          | 8.76 hours        | 43.8 minutes       |
| 99.99%         | 52.56 minutes     | 4.38 minutes       |
| 99.999%        | 5.26 minutes      | 25.9 seconds       |
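The figures above follow from a simple calculation: allowed downtime equals (1 − availability) multiplied by the length of the period. A minimal Python sketch (the function name is illustrative, not from the article):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year

def downtime_seconds(availability: float,
                     period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Maximum allowed downtime for a given availability over a period."""
    return (1.0 - availability) * period_seconds

# 99.99% availability allows roughly 52.56 minutes of downtime per year
print(round(downtime_seconds(0.9999) / 60, 2))
```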

The article divides system incidents into three stages:

Pre‑incident: small, low‑cost interventions that can eliminate the root cause before it manifests.

Early incident: quick, targeted fixes when the problem is still minor.

Severe incident: heavyweight measures (e.g., emergency patches, rollbacks, or even major architectural changes) required to rescue a heavily impacted system.

Major factors that affect availability include:

Human errors such as accidental deletions, running destructive scripts in production, or executing test scripts on live databases.

Avalanche effect in distributed architectures, where a failing service cascades and exhausts resources of dependent services.

New releases that are insufficiently tested and introduce faults during routine deployments.

Infrastructure failures and scheduled maintenance (hardware faults, network issues, OS/database/middleware upgrades, backup and migration tasks).
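The avalanche effect above is commonly contained with a circuit breaker: once a dependency keeps failing, callers fail fast instead of tying up threads and connections waiting on it. A minimal sketch, assuming a Python service; the class name and thresholds are illustrative, not from the article:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency so its failures do not
    cascade and exhaust the caller's resources."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: the timeout elapsed, allow one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

While the breaker is open, callers get an immediate error they can handle (e.g., serve a cached or degraded response) rather than queuing up behind a dead dependency.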

To improve availability, the author proposes technical actions aligned with the three stages:

Pre‑incident: Implement a robust code‑quality management system, automated testing, permission controls, and other automation tools to prevent untested code from reaching production.

Early incident: Deploy comprehensive monitoring to detect problems early, and maintain CI/CD pipelines for rapid feedback and fast, reliable releases.

Severe incident: Establish release verification, rollback, rate‑limiting, circuit‑breaker, and degradation strategies, as well as disaster‑recovery plans and regular drills to minimize downtime.
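Rate limiting, one of the severe‑incident strategies above, is often implemented as a token bucket: requests beyond the sustainable rate are rejected so the remaining capacity stays healthy. A minimal illustrative sketch (class name and defaults are assumptions, not from the article):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: shed excess load so the capacity
    that remains can serve requests reliably."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Return True if a request may proceed, False if it is shed."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In practice such a limiter sits at the service entry point; rejected requests receive a fast "try again later" response instead of degrading the whole system.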

From a team perspective, achieving high availability requires a technology‑focused culture: leadership that respects engineering, expert members capable of implementing the above practices, avoidance of short‑term compromises, and strict adherence to team discipline.

In summary, pursuing high system availability is akin to maintaining personal health—continuous vigilance, strong engineering capabilities, and a disciplined, tech‑centric team are essential for reducing downtime and reaching higher "nines" of uptime.

Written by DevOps

DevOps shares premium content and events on trends, applications, and practices in development efficiency, AI, and related technologies. The IDCF (International DevOps Coach Federation) trains end‑to‑end development‑efficiency talent, linking high‑performance organizations and individuals to achieve excellence.
