Design Principles for High‑Availability System Architecture

The article outlines a comprehensive high‑availability architecture framework across six layers—development standards, application services, storage, product fallback, operations deployment, and emergency response—detailing design principles such as stateless services, elastic scaling, redundant storage, robust monitoring, gray releases, and chaos engineering to ensure resilient, continuously available systems.

Tencent Cloud Developer

During system development, many engineers struggle to achieve high availability. This article analyzes the architecture of a highly available system from six perspectives (development standards, the application-service layer, the storage layer, the product layer, the operations-deployment layer, and the exception-emergency layer) and gives concrete design considerations and best practices for each.

1. Availability and High‑Availability Concepts

Availability is a quantifiable metric: the probability that a system is working correctly, or the proportion of time it does so, within a given observation period. The industry commonly expresses it as a number of "9s" (e.g., 99.99% is four 9s). High availability means the service keeps providing its functionality under most circumstances, including when individual components fail.
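To make the "number of 9s" concrete, the short sketch below converts an availability target into a downtime budget. The function name and the 365-day observation period are assumptions chosen for illustration; the arithmetic is simply (1 - availability) multiplied by the length of the period.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


# Print the yearly budget for two to five 9s.
for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes of downtime per year")
```

Under these assumptions, four 9s leaves a budget of roughly 52.6 minutes per year, which is why each additional 9 is much harder to reach than the previous one.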

2. High‑Availability Design Philosophy

High availability requires a systematic engineering-management practice that covers product, development, operations, and infrastructure. The key ideas include:

Establish clear development specifications and standards.

Perform thorough capacity planning and QPS estimation.

Design stateless services with load‑balancing, elastic scaling, asynchronous decoupling, fault‑tolerance, and overload protection.

Implement redundant storage (clustered or distributed).

Define product‑level fallback strategies.

Adopt robust deployment, monitoring, and alerting mechanisms.

Prepare emergency response plans.
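The capacity-planning and QPS-estimation item above can be illustrated with a back-of-the-envelope calculation. This is only a sketch: the traffic-concentration rule (80% of requests arriving in 20% of the day), the per-instance throughput, the headroom, and the spare-instance count are all assumptions, and real values should come from load tests and production traffic data.

```python
import math


def estimate_peak_qps(daily_requests: int,
                      traffic_in_window: float = 0.8,
                      peak_window_fraction: float = 0.2) -> float:
    """Estimate peak QPS, assuming `traffic_in_window` of the daily requests
    arrive within `peak_window_fraction` of the day (assumed 80/20 rule)."""
    window_seconds = 24 * 3600 * peak_window_fraction
    return daily_requests * traffic_in_window / window_seconds


def instances_needed(peak_qps: float,
                     qps_per_instance: float,
                     headroom: float = 0.5,
                     spare_instances: int = 1) -> int:
    """Size the fleet so each instance runs below its measured limit.

    `headroom` is the fraction of per-instance capacity kept in reserve;
    `spare_instances` adds redundancy so one failure does not cause overload.
    """
    usable_qps = qps_per_instance * (1.0 - headroom)
    return math.ceil(peak_qps / usable_qps) + spare_instances


# Hypothetical service: 10 million requests per day, 200 QPS per instance.
peak = estimate_peak_qps(10_000_000)
fleet_size = instances_needed(peak, qps_per_instance=200)
print(f"peak is about {peak:.0f} QPS, plan for {fleet_size} instances")
```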

3. Development‑Standard Layer

This layer covers the practices that keep code quality and maintainability high: design-document templates, mandatory design reviews, coding standards, layout conventions, code review, unit tests (target coverage ≥50%), and logging standards (remote log collection and distributed tracing).
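One way a logging standard can support tracing is to require that every log line is structured and carries a trace ID. The sketch below uses only the Python standard library; the logger name, the field names, and the in-memory stream (standing in for a remote log collector) are assumptions for illustration, not a format prescribed by the article.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "trace_id": record.trace_id,
            "message": record.getMessage(),
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("order-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()            # stand-in for a remote log shipper
log = build_logger(buffer)
trace_id_var.set(uuid.uuid4().hex)  # normally taken from the incoming request
log.info("order created")
print(buffer.getvalue(), end="")
```

Because the trace ID travels with every line, logs from different services that handled the same request can be joined later in the log platform.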

4. Application‑Service Layer

Key designs include:

Stateless architecture plus load balancing (service discovery, health checks, automatic removal of faulty instances). If no microservice framework is available, use an external load balancer such as LVS, Nginx, or a cloud CLB.

Elastic scaling based on CPU or custom metrics, typically realized with Kubernetes HPA or custom scripts for bare‑metal deployments.

Asynchronous decoupling and traffic‑shaping via message queues (e.g., Kafka) to achieve peak‑smoothing and isolation between producers and consumers.

Fault‑tolerance design following the "fail‑fast" principle and self‑protection mechanisms (circuit‑breaker, degradation, fallback).

Overload protection (rate limiting, circuit‑breaker, degradation) to keep core functionality available under traffic spikes.
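The fail-fast, circuit-breaker, and fallback ideas in the list above can be sketched in a few lines. This is a minimal, single-threaded illustration under assumed thresholds, not a production implementation or the API of any particular framework; the class name, parameters, and the injectable clock are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures, fail fast while
    open, and allow a trial call once `reset_timeout` seconds have passed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, a trial call is let through.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast: do not touch the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Usage with a fake clock so the behaviour is deterministic.
attempts = []
now = [0.0]


def flaky_dependency() -> str:
    attempts.append(1)
    raise TimeoutError("downstream timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0,
                         clock=lambda: now[0])
results = [breaker.call(flaky_dependency, lambda: "cached default")
           for _ in range(5)]
print(len(attempts), results[0])
```

In this run the dependency is contacted only twice; the remaining three calls are answered by the fallback without waiting on a timeout, which is what keeps core functionality responsive while a downstream service is unhealthy.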

5. Storage Layer

High availability for data storage relies on redundancy:

Clustered storage (master‑slave or master‑multiple‑replicas) – focuses on data replication, master health detection, and automatic failover.

Distributed storage (e.g., COS, GooseFS, HDFS, HBase, Elasticsearch) – spreads data across many nodes, eliminating single‑point bottlenecks and enabling horizontal scaling.
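For the clustered case, master health detection and automatic failover can be sketched as a heartbeat check followed by promotion of the most up-to-date replica. The node fields, the timeout, and the promotion rule are assumptions for illustration; real systems also need agreement between observers and fencing of the old master, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    role: str               # "master" or "replica"
    last_heartbeat: float   # time of the last heartbeat, in seconds
    replicated_offset: int  # how much of the master's log has been applied


def check_and_failover(nodes: List[Node], now: float,
                       heartbeat_timeout: float = 5.0) -> str:
    """Return the name of the master after one health-check round."""
    master = next(n for n in nodes if n.role == "master")
    if now - master.last_heartbeat <= heartbeat_timeout:
        return master.name  # master is healthy, nothing to do

    # Only replicas that are themselves alive may be promoted.
    candidates = [n for n in nodes
                  if n.role == "replica"
                  and now - n.last_heartbeat <= heartbeat_timeout]
    if not candidates:
        raise RuntimeError("no healthy replica available for failover")

    # Promote the replica that has lost the least data.
    promoted = max(candidates, key=lambda n: n.replicated_offset)
    master.role = "replica"
    promoted.role = "master"
    return promoted.name


cluster = [
    Node("db-1", "master", last_heartbeat=0.0, replicated_offset=100),
    Node("db-2", "replica", last_heartbeat=9.0, replicated_offset=95),
    Node("db-3", "replica", last_heartbeat=9.5, replicated_offset=99),
]
print(check_and_failover(cluster, now=10.0))
```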

6. Product Layer

Product‑level fallback strategies include default pages when data is unavailable, graceful degradation during maintenance, and placeholder content for features such as lotteries.
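A product-level fallback often amounts to a small amount of code at the edge of a page: if the data cannot be loaded, serve agreed placeholder content and mark the response as degraded. The sketch below assumes a hypothetical lottery page; the function name, the placeholder text, and the response shape are illustrative only.

```python
from typing import Callable, Dict, List

# Hypothetical placeholder agreed with the product team in advance.
DEFAULT_PRIZES: List[str] = ["Thanks for participating"]


def lottery_page(fetch_prizes: Callable[[], List[str]]) -> Dict[str, object]:
    """Build the page data, falling back to placeholder content on failure."""
    try:
        prizes = fetch_prizes()
    except Exception:
        prizes = []  # treat a failed lookup like missing data
    if not prizes:
        return {"prizes": DEFAULT_PRIZES, "degraded": True}
    return {"prizes": prizes, "degraded": False}


def broken_backend() -> List[str]:
    raise ConnectionError("prize service unreachable")


print(lottery_page(broken_backend))
print(lottery_page(lambda: ["Coupon"]))
```

The `degraded` flag lets the front end and the monitoring system tell a real result from a fallback.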

7. Operations‑Deployment Layer

Key practices:

Gray‑release and interface testing during the development stage – release a few instances first, monitor, then gradually roll out.

Monitoring and alerting design – use open‑source stacks (ELK, Prometheus, OpenTracing, OpenTelemetry) to collect logs, metrics, and traces across infrastructure, OS, application, and business dimensions.

Alert system requirements: real‑time (seconds), full‑coverage, multi‑level severity, and multiple notification channels (SMS, email, dashboard).

Multi‑datacenter deployment for disaster recovery – services can be discovered across regions; storage multi‑region deployment is more complex due to state synchronization.

Chaos engineering (fault injection, e.g., simulating power loss, network cut) to validate system resilience.

Interface probing (periodic health checks) to detect abnormal endpoints and trigger alerts.
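The gray-release practice above (release a few instances, monitor, then roll out gradually) can be sketched as a staged rollout loop. The stage fractions, the error-rate threshold, and the `deploy`, `rollback`, and `error_rate` callables are assumptions standing in for a real deployment system and monitoring query.

```python
import math
from typing import Callable, Dict, List, Sequence


def gray_release(instances: Sequence[str],
                 deploy: Callable[[str], None],
                 rollback: Callable[[str], None],
                 error_rate: Callable[[List[str]], float],
                 stages: Sequence[float] = (0.1, 0.5, 1.0),
                 max_error_rate: float = 0.01) -> Dict[str, object]:
    """Deploy in growing batches; roll everything back if errors spike."""
    released: List[str] = []
    for fraction in stages:
        target = max(1, math.ceil(len(instances) * fraction))
        batch = list(instances[len(released):target])
        for instance in batch:
            deploy(instance)
        released.extend(batch)
        # Monitoring gate: check the new version before widening the rollout.
        if error_rate(released) > max_error_rate:
            for instance in reversed(released):
                rollback(instance)
            return {"status": "rolled_back", "touched": released}
    return {"status": "completed", "touched": released}


fleet = [f"app-{i}" for i in range(10)]
deployed: List[str] = []
reverted: List[str] = []

# Simulated monitoring: errors appear once five instances run the new version.
result = gray_release(
    fleet, deployed.append, reverted.append,
    error_rate=lambda live: 0.05 if len(live) >= 5 else 0.0,
)
print(result["status"], len(deployed), len(reverted))
```

In this simulated run the first stage touches one instance, the second stage widens to five, the gate trips, and only those five are rolled back while the other half of the fleet never sees the faulty version.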

8. Exception‑Emergency Layer

Even with all safeguards, unexpected incidents may occur. An emergency plan should be pre‑defined, covering rapid recovery procedures, communication flows, and post‑mortem analysis to minimize impact.

The article concludes with a mind‑map summarizing the six layers and their critical design points, encouraging developers to reference Tencent’s experience when building highly available systems.
