High Availability: Principles and Practices for System Stability
High availability, measured in nines of uptime, requires partitioning systems, decoupling components, choosing robust technologies, deploying redundant instances with automatic failover, planning capacity, scaling rapidly, shaping traffic, isolating resources, protecting the system globally, maintaining observability, and managing change with discipline; together these practices yield stable, resilient services.
This article introduces the concept of high availability (HA) within the "three high" architecture of internet systems—high concurrency, high availability, and high performance—focusing on system stability.
HA is often quantified by the number of nines (e.g., 99.9% uptime), with many companies targeting four nines (≈53 minutes of downtime per year). Achieving this requires coordinated efforts across modules.
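The downtime budget implied by each nine follows directly from the year's total minutes; a quick sketch of the arithmetic:

```python
# Downtime budget implied by an availability target.
# 99.99% ("four nines") leaves roughly 53 minutes of downtime per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of allowed downtime per year for a given availability (0..1)."""
    return MINUTES_PER_YEAR * (1 - availability)

for nines, target in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines: {downtime_minutes_per_year(target):.1f} min/year")
```

At four nines the budget is about 52.6 minutes per year, which is why the article stresses coordinated effort across modules rather than heroics in any single one.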
Factors affecting stability are grouped into three categories: human factors (improper changes, external attacks), software factors (bugs, design flaws, GC issues, thread‑pool problems, upstream/downstream failures), and hardware factors (network or machine failures).
Key improvement strategies include:
1. System partitioning: split large systems into independent modules (access layer, service layer, database layer) to limit fault impact.
2. Decoupling: replace strong dependencies with weak ones, often using message queues.
3. Technology selection: evaluate middleware and databases based on suitability, community activity, and scalability.
4. Redundant deployment & automatic failover: run multiple service instances and use load‑balancer health checks to redirect traffic when a node fails.
5. Capacity assessment: define expected QPS, latency, CPU usage, and perform load testing to estimate required machine count and storage.
6. Rapid scaling & spill‑over: keep services stateless so new instances can be added quickly, verify that downstream database connection limits can absorb the added instances, and pre‑warm caches before shifting traffic onto them.
7. Traffic shaping & circuit breaking: limit request rates and isolate failing components using tools like Sentinel.
8. Resource isolation: allocate dedicated thread or connection pools per downstream service, so that one slow dependency exhausts only its own pool instead of starving calls to healthy dependencies and cascading the failure.
9. System‑wide protection: apply global rate limiting when overall load approaches critical thresholds.
10. Observability & alerting: rely on metrics, traces, and logs to quickly diagnose incidents and set proactive alerts.
11. Change‑management triad: gray releases, rollback mechanisms, and observability of changes to minimize failure risk.
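Step 2's decoupling via message queues can be illustrated with an in-process queue standing in for a real broker such as Kafka or RabbitMQ; the service names here are hypothetical:

```python
import queue
import threading

# Toy stand-in for a message broker: the order service publishes an event and
# returns immediately; the notification consumer works asynchronously, so a
# slow or failing consumer cannot block the producer (weak dependency).
events: "queue.Queue" = queue.Queue()
processed = []

def place_order(order_id: int) -> str:
    events.put({"type": "order_created", "order_id": order_id})  # fire and forget
    return "accepted"  # the producer does not wait for downstream work

def notification_worker() -> None:
    while True:
        event = events.get()
        if event is None:  # sentinel to stop the worker
            break
        processed.append(event["order_id"])  # e.g. send an email / push message

t = threading.Thread(target=notification_worker)
t.start()
place_order(1)
place_order(2)
events.put(None)  # shut the worker down for the demo
t.join()
print(processed)  # [1, 2]
```

The producer's latency is now independent of the consumer's: a notification outage delays notifications but never blocks order placement.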
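Step 4's health-check failover can be sketched as a load balancer that routes only to instances whose last probe succeeded; the probe is injected here so the example stays self-contained, whereas a real balancer would issue HTTP or TCP checks on an interval:

```python
import random

class LoadBalancer:
    """Minimal health-checked balancer: unhealthy nodes are ejected from rotation."""
    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)  # assume healthy until a probe fails

    def run_health_checks(self, probe) -> None:
        """probe(instance) -> bool; instances failing the probe stop receiving traffic."""
        self.healthy = {i for i in self.instances if probe(i)}

    def pick(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        return random.choice(sorted(self.healthy))

lb = LoadBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
lb.run_health_checks(lambda inst: inst != "10.0.0.2")  # .2 fails its probe
print(lb.pick() in {"10.0.0.1", "10.0.0.3"})  # True: traffic avoids the bad node
```

When the failed node recovers and passes its probe again, the next health-check cycle returns it to rotation automatically.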
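The machine-count estimate in step 5 is simple arithmetic once load testing has established per-machine capacity; a sketch with hypothetical numbers:

```python
import math

def machines_needed(peak_qps: float, qps_per_machine: float,
                    headroom: float = 0.5) -> int:
    """Machines required so steady-state utilization stays at `headroom`
    (0.5 means each machine runs at 50% of its load-tested capacity,
    leaving room for spikes and for losing a node)."""
    usable = qps_per_machine * headroom
    return math.ceil(peak_qps / usable)

# e.g. load testing shows one machine sustains 1,000 QPS; we expect a
# 10,000 QPS peak and want to run at 50% utilization:
print(machines_needed(10_000, 1_000))  # 20
```

The same shape of calculation applies to storage: expected write rate times retention period, divided by per-node capacity with headroom.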
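Step 7's rate limiting is commonly implemented as a token bucket; this is a minimal single-threaded sketch of the idea, not Sentinel's actual implementation:

```python
import time

class TokenBucket:
    """Token-bucket limiter: requests beyond `rate` per second (after an
    initial burst allowance) are rejected instead of overloading the service."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens added per second
        self.capacity = burst     # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill in proportion to elapsed time, capped at the bucket size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, burst=5)    # 1 req/s sustained, bursts up to 5
results = [bucket.allow() for _ in range(8)]
print(results.count(True))  # 5
```

The first five requests drain the burst allowance; the remaining three are shed immediately rather than queued, which is the point of traffic shaping under overload.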
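The circuit-breaking half of step 7 can be sketched as a breaker that trips after consecutive failures and then fails fast; a production breaker (e.g. Sentinel's) would also half-open after a cool-down, which is only stubbed here:

```python
class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures the
    circuit opens and calls fail fast instead of hitting the sick downstream."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
            raise
        self.failures = 0  # any success resets the failure streak
        return result

    def reset(self) -> None:  # a real breaker would half-open after a timeout
        self.failures, self.open = 0, False

cb = CircuitBreaker(threshold=2)
def flaky():
    raise IOError("downstream timeout")

for _ in range(2):
    try:
        cb.call(flaky)
    except IOError:
        pass
print(cb.open)  # True: further calls are rejected without touching downstream
```

Failing fast gives the struggling dependency time to recover and keeps caller threads from piling up behind timeouts.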
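Step 8's resource isolation can be sketched with a bounded concurrency budget per downstream dependency (the dependency names are hypothetical); in Java services this is typically done with separate thread pools, but a semaphore captures the same idea:

```python
import threading

# Each downstream gets its own bounded budget of in-flight calls, so a slow
# dependency exhausts only its own slots and cannot starve the others.
pools = {
    "order-db": threading.BoundedSemaphore(2),
    "user-svc": threading.BoundedSemaphore(2),
}

def call_downstream(name: str, fn):
    sem = pools[name]
    if not sem.acquire(blocking=False):  # budget exhausted: shed, don't queue
        raise RuntimeError(f"{name}: isolation pool exhausted")
    try:
        return fn()
    finally:
        sem.release()

# Simulate "order-db" saturated by two in-flight slow calls:
pools["order-db"].acquire()
pools["order-db"].acquire()
try:
    call_downstream("order-db", lambda: "x")
except RuntimeError as e:
    print(e)                                      # order-db: isolation pool exhausted
print(call_downstream("user-svc", lambda: "ok"))  # ok: unaffected by order-db
```

Calls to `user-svc` succeed even while `order-db` is saturated, which is exactly the cascading-failure containment the article describes.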
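For step 10, a proactive alert is ultimately a predicate over collected metrics; a sketch comparing a tail-latency percentile against a hypothetical SLO threshold:

```python
def percentile(samples, p):
    """Nearest-rank percentile; simple enough for a monitoring sketch."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

# One outlier request dominates the tail even though the average looks fine:
latencies_ms = [12, 15, 11, 14, 13, 900, 16, 12, 13, 14]
p99 = percentile(latencies_ms, 99)
SLO_MS = 200  # hypothetical latency SLO
print(f"p99={p99}ms, alert={'FIRE' if p99 > SLO_MS else 'ok'}")
```

This is why the article pairs metrics with traces and logs: the alert tells you the p99 is bad, but only a trace of the 900 ms request tells you why.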
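The gray-release leg of step 11's triad is often implemented as deterministic user bucketing, so the same user always sees the same version while the rollout percentage is widened gradually; a minimal sketch:

```python
import hashlib

def in_gray_release(user_id: str, percent: int) -> bool:
    """Deterministic bucketing: hash the user into one of 100 buckets and
    admit buckets below `percent`. Widening percent (1 -> 10 -> 100) grows
    the cohort; setting percent=0 is an instant rollback."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

print(in_gray_release("user-42", 100))  # True: everyone at 100%
print(in_gray_release("user-42", 0))    # False: full rollback
```

Because bucketing is a pure function of the user ID, rollback needs no per-user state: flipping the percentage back to zero routes all traffic to the old version immediately.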
The article concludes that while zero failures are impossible, a systematic approach to prevention and rapid recovery—aligned with business SLOs—can significantly improve system reliability.
DeWu Technology
A platform for sharing and discussing technical knowledge.