Combining FMEA and Chaos Engineering to Improve Software Architecture Availability

By integrating the proactive, static risk assessment of Failure Mode and Effects Analysis with the dynamic fault‑injection validation of chaos engineering, the article demonstrates how cloud‑native architectures—illustrated through a Tencent‑based e‑commerce case—can systematically identify, quantify, and mitigate availability risks, leading to continuous, measurable resilience improvements.

Tencent Cloud Developer

Jul 17, 2024

The article introduces Failure Mode and Effects Analysis (FMEA) as a proactive risk assessment tool originating from the US military in the 1940s, used to identify potential failures in design, process, product, or service and analyze their impacts.

It outlines the FMEA analysis steps applied to software architecture: identifying functional points from a user perspective, describing failure modes, assessing failure impacts, rating severity, analyzing failure causes and probabilities, calculating risk levels, determining existing mitigation measures, planning avoidance and resolution actions, and establishing follow‑up improvements.

The article then explains chaos engineering as a method for testing distributed systems by deliberately injecting faults to verify resilience. It argues that chaos engineering complements FMEA by providing dynamic, quantitative validation of failure impacts that static FMEA alone cannot reliably estimate.

Using a simple e‑commerce system as a case study, the authors show how to model the architecture with Tencent Cloud Advisor’s cloud architecture tool, perform a static FMEA analysis, and inject a host‑network latency fault via the Tencent Cloud Advisor chaos platform. They then measure the actual impact (e.g., at 300 QPS, 90% of users experience a 3 s login delay) and derive concrete optimization measures: adding redundant CLBs, deploying redundant user services, enhancing Nginx request distribution, and setting up cross‑AZ database failover.
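As a rough illustration of how a measured impact like the one above might be quantified, the following Python sketch computes the share of requests whose latency reaches a threshold during a fault-injection run. The function name, the 3-second threshold as a parameter, and the sample data are all hypothetical; the article does not describe how the chaos platform collects or reports latency.

```python
from typing import Sequence


def fraction_at_or_above(latencies_s: Sequence[float], threshold_s: float) -> float:
    """Return the fraction of requests whose latency is >= threshold_s."""
    if not latencies_s:
        raise ValueError("no latency samples provided")
    slow = sum(1 for latency in latencies_s if latency >= threshold_s)
    return slow / len(latencies_s)


# Hypothetical login latencies (seconds) recorded during the experiment:
# 10 requests unaffected by the fault, 90 requests delayed by it.
experiment_samples = [0.2] * 10 + [3.1] * 90

impact = fraction_at_or_above(experiment_samples, threshold_s=3.0)
print(f"{impact:.0%} of login requests took 3 s or longer")
```

Running the same calculation on a baseline run without the fault gives a reference value, so the difference can be attributed to the injected latency rather than to normal variation.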

The risk level is calculated as risk level = severity × failure probability. After the mitigations are implemented, a chaos validation experiment confirms that the fixes are effective. The article emphasizes the need for continuous availability governance, highlighting Tencent Cloud Advisor’s application management and chaos game‑day features for keeping availability scores up to date as the system evolves.
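The risk calculation (severity multiplied by failure probability) can be sketched in a few lines of Python. This is a minimal sketch, not the article's tooling: the 1 to 5 rating scales, the class and field names, and the example ratings are assumptions chosen for illustration, while the failure scenarios loosely follow the e‑commerce case study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    function_point: str   # user-facing function, e.g. "login"
    description: str      # what fails
    severity: int         # assumed scale: 1 (minor) to 5 (critical)
    probability: int      # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def risk_level(self) -> int:
        # risk level = severity x failure probability
        return self.severity * self.probability


def rank_by_risk(modes: List[FailureMode]) -> List[FailureMode]:
    """Order failure modes from highest to lowest risk level."""
    return sorted(modes, key=lambda m: m.risk_level, reverse=True)


# Hypothetical ratings for illustration only.
modes = [
    FailureMode("login", "single CLB unavailable", severity=5, probability=2),
    FailureMode("login", "host network latency on user service", severity=4, probability=3),
    FailureMode("order", "database outage in one AZ", severity=5, probability=1),
]

ranked = rank_by_risk(modes)
for mode in ranked:
    print(f"{mode.risk_level:>2}  {mode.function_point}: {mode.description}")
```

Sorting by the product gives a simple priority order for mitigation work; the ratings themselves are where a chaos experiment can replace an estimate with a measured value.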

The conclusion stresses that combining FMEA with chaos engineering yields a more effective and efficient approach to availability improvement in cloud‑native architectures.
