High Availability: Principles and Practices for System Stability
High availability, measured in nines of uptime, requires partitioning systems, decoupling components, choosing robust technologies, deploying redundant instances with automatic failover, planning capacity, scaling rapidly, shaping traffic, isolating resources, protecting the system globally, maintaining observability, and managing change with discipline; together these practices yield stable, resilient services.
This article introduces the concept of high availability (HA) within the "three high" architecture of internet systems—high concurrency, high availability, and high performance—focusing on system stability.
HA is often quantified by the number of nines (e.g., 99.9% uptime), with many companies targeting four nines (≈53 minutes of downtime per year). Achieving this requires coordinated efforts across modules.
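The downtime budget implied by each nine follows directly from the year's total minutes; a quick sketch of the arithmetic:

```python
# Downtime budget implied by an availability target.
# 99.99% ("four nines") leaves roughly 53 minutes of downtime per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of allowed downtime per year for a given availability (0..1)."""
    return MINUTES_PER_YEAR * (1 - availability)

for nines, target in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines: {downtime_minutes_per_year(target):.1f} min/year")
```

At four nines the budget is about 52.6 minutes per year, which is why the article stresses coordinated effort across modules rather than heroics in any single one.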
Factors affecting stability are grouped into three categories: human factors (improper changes, external attacks), software factors (bugs, design flaws, GC issues, thread‑pool problems, upstream/downstream failures), and hardware factors (network or machine failures).
Key improvement strategies include:
1. System partitioning: split large systems into independent modules (access layer, service layer, database layer) to limit fault impact.
2. Decoupling: replace strong dependencies with weak ones, often using message queues.
3. Technology selection: evaluate middleware and databases based on suitability, community activity, and scalability.
4. Redundant deployment & automatic failover: run multiple service instances and use load‑balancer health checks to redirect traffic when a node fails.
5. Capacity assessment: define expected QPS, latency, CPU usage, and perform load testing to estimate required machine count and storage.
6. Rapid scaling & spill‑over: keep services stateless so new instances can be added quickly, verify that downstream database connection limits can absorb the added instances, and pre‑warm caches before shifting traffic onto them.
7. Traffic shaping & circuit breaking: limit request rates and isolate failing components using tools like Sentinel.
8. Resource isolation: allocate dedicated thread or connection pools per downstream service, so that one slow dependency exhausts only its own pool instead of starving calls to healthy dependencies and cascading the failure.
9. System‑wide protection: apply global rate limiting when overall load approaches critical thresholds.
10. Observability & alerting: rely on metrics, traces, and logs to quickly diagnose incidents and set proactive alerts.
11. Change‑management triad: gray releases, rollback mechanisms, and observability of changes to minimize failure risk.
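Step 2's decoupling via message queues can be illustrated with an in-process queue standing in for a real broker such as Kafka or RabbitMQ; the service names here are hypothetical:

```python
import queue
import threading

# Toy stand-in for a message broker: the order service publishes an event and
# returns immediately; the notification consumer works asynchronously, so a
# slow or failing consumer cannot block the producer (weak dependency).
events: "queue.Queue" = queue.Queue()
processed = []

def place_order(order_id: int) -> str:
    events.put({"type": "order_created", "order_id": order_id})  # fire and forget
    return "accepted"  # the producer does not wait for downstream work

def notification_worker() -> None:
    while True:
        event = events.get()
        if event is None:  # sentinel to stop the worker
            break
        processed.append(event["order_id"])  # e.g. send an email / push message

t = threading.Thread(target=notification_worker)
t.start()
place_order(1)
place_order(2)
events.put(None)  # shut the worker down for the demo
t.join()
print(processed)  # [1, 2]
```

The producer's latency is now independent of the consumer's: a notification outage delays notifications but never blocks order placement.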
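Step 4's health-check failover can be sketched as a load balancer that routes only to instances whose last probe succeeded; the probe is injected here so the example stays self-contained, whereas a real balancer would issue HTTP or TCP checks on an interval:

```python
import random

class LoadBalancer:
    """Minimal health-checked balancer: unhealthy nodes are ejected from rotation."""
    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)  # assume healthy until a probe fails

    def run_health_checks(self, probe) -> None:
        """probe(instance) -> bool; instances failing the probe stop receiving traffic."""
        self.healthy = {i for i in self.instances if probe(i)}

    def pick(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        return random.choice(sorted(self.healthy))

lb = LoadBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
lb.run_health_checks(lambda inst: inst != "10.0.0.2")  # .2 fails its probe
print(lb.pick() in {"10.0.0.1", "10.0.0.3"})  # True: traffic avoids the bad node
```

When the failed node recovers and passes its probe again, the next health-check cycle returns it to rotation automatically.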
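The machine-count estimate in step 5 is simple arithmetic once load testing has established per-machine capacity; a sketch with hypothetical numbers:

```python
import math

def machines_needed(peak_qps: float, qps_per_machine: float,
                    headroom: float = 0.5) -> int:
    """Machines required so steady-state utilization stays at `headroom`
    (0.5 means each machine runs at 50% of its load-tested capacity,
    leaving room for spikes and for losing a node)."""
    usable = qps_per_machine * headroom
    return math.ceil(peak_qps / usable)

# e.g. load testing shows one machine sustains 1,000 QPS; we expect a
# 10,000 QPS peak and want to run at 50% utilization:
print(machines_needed(10_000, 1_000))  # 20
```

The same shape of calculation applies to storage: expected write rate times retention period, divided by per-node capacity with headroom.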
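Step 7's rate limiting is commonly implemented as a token bucket; this is a minimal single-threaded sketch of the idea, not Sentinel's actual implementation:

```python
import time

class TokenBucket:
    """Token-bucket limiter: requests beyond `rate` per second (after an
    initial burst allowance) are rejected instead of overloading the service."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens added per second
        self.capacity = burst     # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill in proportion to elapsed time, capped at the bucket size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, burst=5)    # 1 req/s sustained, bursts up to 5
results = [bucket.allow() for _ in range(8)]
print(results.count(True))  # 5
```

The first five requests drain the burst allowance; the remaining three are shed immediately rather than queued, which is the point of traffic shaping under overload.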
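The circuit-breaking half of step 7 can be sketched as a breaker that trips after consecutive failures and then fails fast; a production breaker (e.g. Sentinel's) would also half-open after a cool-down, which is only stubbed here:

```python
class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures the
    circuit opens and calls fail fast instead of hitting the sick downstream."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
            raise
        self.failures = 0  # any success resets the failure streak
        return result

    def reset(self) -> None:  # a real breaker would half-open after a timeout
        self.failures, self.open = 0, False

cb = CircuitBreaker(threshold=2)
def flaky():
    raise IOError("downstream timeout")

for _ in range(2):
    try:
        cb.call(flaky)
    except IOError:
        pass
print(cb.open)  # True: further calls are rejected without touching downstream
```

Failing fast gives the struggling dependency time to recover and keeps caller threads from piling up behind timeouts.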
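Step 8's resource isolation can be sketched with a bounded concurrency budget per downstream dependency (the dependency names are hypothetical); in Java services this is typically done with separate thread pools, but a semaphore captures the same idea:

```python
import threading

# Each downstream gets its own bounded budget of in-flight calls, so a slow
# dependency exhausts only its own slots and cannot starve the others.
pools = {
    "order-db": threading.BoundedSemaphore(2),
    "user-svc": threading.BoundedSemaphore(2),
}

def call_downstream(name: str, fn):
    sem = pools[name]
    if not sem.acquire(blocking=False):  # budget exhausted: shed, don't queue
        raise RuntimeError(f"{name}: isolation pool exhausted")
    try:
        return fn()
    finally:
        sem.release()

# Simulate "order-db" saturated by two in-flight slow calls:
pools["order-db"].acquire()
pools["order-db"].acquire()
try:
    call_downstream("order-db", lambda: "x")
except RuntimeError as e:
    print(e)                                      # order-db: isolation pool exhausted
print(call_downstream("user-svc", lambda: "ok"))  # ok: unaffected by order-db
```

Calls to `user-svc` succeed even while `order-db` is saturated, which is exactly the cascading-failure containment the article describes.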
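For step 10, a proactive alert is ultimately a predicate over collected metrics; a sketch comparing a tail-latency percentile against a hypothetical SLO threshold:

```python
def percentile(samples, p):
    """Nearest-rank percentile; simple enough for a monitoring sketch."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

# One outlier request dominates the tail even though the average looks fine:
latencies_ms = [12, 15, 11, 14, 13, 900, 16, 12, 13, 14]
p99 = percentile(latencies_ms, 99)
SLO_MS = 200  # hypothetical latency SLO
print(f"p99={p99}ms, alert={'FIRE' if p99 > SLO_MS else 'ok'}")
```

This is why the article pairs metrics with traces and logs: the alert tells you the p99 is bad, but only a trace of the 900 ms request tells you why.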
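The gray-release leg of step 11's triad is often implemented as deterministic user bucketing, so the same user always sees the same version while the rollout percentage is widened gradually; a minimal sketch:

```python
import hashlib

def in_gray_release(user_id: str, percent: int) -> bool:
    """Deterministic bucketing: hash the user into one of 100 buckets and
    admit buckets below `percent`. Widening percent (1 -> 10 -> 100) grows
    the cohort; setting percent=0 is an instant rollback."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

print(in_gray_release("user-42", 100))  # True: everyone at 100%
print(in_gray_release("user-42", 0))    # False: full rollback
```

Because bucketing is a pure function of the user ID, rollback needs no per-user state: flipping the percentage back to zero routes all traffic to the old version immediately.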
The article concludes that while zero failures are impossible, a systematic approach to prevention and rapid recovery—aligned with business SLOs—can significantly improve system reliability.
DeWu Technology
A platform for sharing and discussing technical knowledge.