Combining FMEA and Chaos Engineering to Improve Software Architecture Availability

By integrating the proactive, static risk assessment of Failure Mode and Effects Analysis (FMEA) with the dynamic fault-injection validation of chaos engineering, this article shows how teams running cloud-native architectures can systematically identify, quantify, and mitigate availability risks. An e-commerce system on Tencent Cloud serves as the worked example, and the result is continuous, measurable improvement in resilience.

Tencent Cloud Developer

The article introduces Failure Mode and Effects Analysis (FMEA) as a proactive risk assessment tool originating from the US military in the 1940s, used to identify potential failures in design, process, product, or service and analyze their impacts.

It outlines the FMEA analysis steps applied to software architecture: identifying functional points from a user perspective, describing failure modes, assessing failure impacts, rating severity, analyzing failure causes and probabilities, calculating risk levels, determining existing mitigation measures, planning avoidance and resolution actions, and establishing follow‑up improvements.
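The steps above map naturally onto one worksheet row per failure mode. The following is a minimal sketch of such a row, not the article's or Tencent Cloud Advisor's data model: the class name `FmeaRow`, the field names, the 1-to-5 integer scales, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FmeaRow:
    """One FMEA worksheet row (hypothetical structure, one field per analysis step)."""
    function_point: str      # functional point, described from the user's perspective
    failure_mode: str        # what the user observes when it fails
    failure_impact: str      # who or what is affected, and how badly
    severity: int            # assumed scale: 1 (minor) to 5 (critical)
    failure_cause: str       # underlying cause of the failure mode
    probability: int         # assumed scale: 1 (rare) to 5 (frequent)
    existing_measures: List[str] = field(default_factory=list)
    avoidance_actions: List[str] = field(default_factory=list)
    resolution_actions: List[str] = field(default_factory=list)
    follow_up: List[str] = field(default_factory=list)


# Example row loosely based on the login scenario in the case study.
# The severity and probability scores are illustrative, not from the article.
row = FmeaRow(
    function_point="User login",
    failure_mode="Login responses are slow",
    failure_impact="Users wait several seconds to log in",
    severity=4,
    failure_cause="Host network latency",
    probability=3,
    avoidance_actions=["Deploy redundant user services"],
)
```

Keeping the mitigation fields as lists makes it easy to see at a glance which rows still have no planned action.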

The article then explains chaos engineering as a method for testing distributed systems by deliberately injecting faults to verify resilience. It argues that chaos engineering complements FMEA: fault injection provides dynamic, quantitative validation of failure impacts that static FMEA alone cannot reliably estimate.

Using a simple e-commerce system as a case study, the authors show how to model the architecture with Tencent Cloud Advisor's cloud architecture tool, perform a static FMEA analysis, and inject a host-network latency fault via the Tencent Cloud Advisor chaos platform. They then measure the actual impact (e.g., at 300 QPS, 90% of users experience a 3 s login delay) and derive concrete optimization measures: adding redundant CLBs, deploying redundant user services, enhancing Nginx request distribution, and setting up cross-AZ database failover.
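To make the measured impact concrete, here is a small local simulation of how a figure such as "90% of users experience a 3 s delay" could be computed from request latencies. It does not call the Tencent Cloud Advisor chaos platform or inject any real fault. The function names, the 50 ms baseline latency, and the idea that a fixed fraction of requests crosses the faulty host are assumptions for illustration only.

```python
import random


def simulate_login_latencies(n_requests, base_s, injected_s, affected_fraction, seed=0):
    """Return simulated login latencies in seconds.

    Each request takes base_s; with probability affected_fraction it is routed
    through the faulty host and pays an extra injected_s of latency.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same sample
    latencies = []
    for _ in range(n_requests):
        latency = base_s
        if rng.random() < affected_fraction:
            latency += injected_s
        latencies.append(latency)
    return latencies


def share_at_or_above(latencies, threshold_s):
    """Fraction of requests whose latency is at least threshold_s."""
    if not latencies:
        return 0.0
    return sum(1 for x in latencies if x >= threshold_s) / len(latencies)


# One second of traffic at 300 QPS, with an assumed 3 s injected delay.
sample = simulate_login_latencies(
    n_requests=300, base_s=0.05, injected_s=3.0, affected_fraction=0.9
)
impacted = share_at_or_above(sample, threshold_s=3.0)
print(f"{impacted:.0%} of simulated logins took 3 s or longer")
```

In a real experiment the latencies would come from load-test or monitoring data collected while the fault is active, and the same share calculation would feed the impact column of the FMEA worksheet.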

The risk level is calculated as: risk level = severity × failure probability.
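As a worked illustration of this formula, the sketch below scores a few failure modes and ranks them by risk level. The failure mode names, the 1-to-5 integer scales, and the scores themselves are hypothetical; the article does not publish its scoring table.

```python
def risk_level(severity: int, probability: int) -> int:
    """Risk level = severity x failure probability (both on an assumed 1-5 scale)."""
    return severity * probability


# (failure mode, severity, probability): illustrative values only.
failure_modes = [
    ("Single CLB failure", 5, 2),
    ("Host network latency on user service", 4, 3),
    ("Database AZ outage", 5, 1),
]

# Highest risk first, so mitigation work can be prioritized.
ranked = sorted(
    failure_modes,
    key=lambda fm: risk_level(fm[1], fm[2]),
    reverse=True,
)

for name, severity, probability in ranked:
    print(f"{name}: risk level {risk_level(severity, probability)}")
```

Because the static probability and severity scores are estimates, a chaos experiment that measures the real impact is a reasonable trigger for revisiting them and re-running the ranking.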

After implementing the mitigations, a chaos validation experiment confirms the effectiveness of the fixes, and the article emphasizes the need for continuous availability governance, highlighting Tencent Cloud Advisor’s application management and chaos game‑day features to keep availability scores up to date as the system evolves.

The conclusion stresses that combining FMEA with chaos engineering yields a more effective and efficient approach to availability improvement in cloud‑native architectures.

software architecture, cloud computing, Chaos Engineering, availability, FMEA, risk analysis