
Building a Chaos Engineering Platform for Financial Services: Key Lessons

This talk outlines the challenges of maintaining system stability in fast‑moving, cloud‑native financial services. It describes a risk‑identification model, high‑fidelity fault simulation, and a comprehensive stability‑engineering platform, and shares future plans for automated, data‑driven risk mitigation.

Efficient Ops

Background Overview

This article is based on a presentation delivered at GOPS 2023 in Shenzhen. The content is organized into three parts: background and significance, operational practice, and planning and outlook.

During digital transformation, the widespread adoption of cloud‑native, micro‑service, and agile development, together with DevOps, has created huge challenges for ensuring application stability. Three main issues arise: the need for a rapid‑iteration risk‑identification model, high‑fidelity fault simulation, and a systematic technical‑risk‑management framework.

1. Risk‑Identification Model

Fast‑changing market demands have broken monolithic architectures into independent micro‑service modules. While this enables rapid releases, it also increases deployment complexity, lengthens business chains, and makes traditional risk‑assessment methods ineffective. A new model is required to quickly identify and respond to emerging risks.

2. High‑Fidelity Fault Simulation

Traditional drills (manual network cuts, process termination) cannot meet the demand for large‑scale, realistic scenario testing. Both lossless (production‑like) and lossy (controlled) drills are needed, especially in the securities industry, where high‑availability requirements make large‑blast‑radius drills in production impractical.

3. Technical Risk Management Framework

While business‑risk control is mature in finance, technical‑risk management lacks a systematic approach. A comprehensive framework should include risk‑management culture, organizational structures, diverse drill formats, platform tools, quantitative evaluation, and closed‑loop governance.

Operational Practice

Three years ago the team introduced chaos engineering, leading to the construction of a stability‑engineering platform and the “Defend Bottom” (保卫波特姆) drill campaign. The campaign's second season reinforced risk identification through special actions, surprise drills, and red‑blue exercises, gradually establishing a risk‑management system.

The five major technical risks identified for the securities industry are:

Single‑point failure

Functional defect

Performance‑capacity

Data loss or corruption

Operational error

These are further divided into 28 secondary categories, forming a comprehensive risk‑prevention framework.

Key practice components include:

Risk discovery: internal/external event analysis, problem, incident, and change management, system testing, and vulnerability assessment.

Scenario construction: expert libraries covering architecture, business functions, steps, and capacity management.

Automated drills: platform‑driven execution (details omitted for brevity).

Measurement & improvement: multi‑dimensional risk analysis, system robustness evaluation, topology verification, drill analytics, and automated response.

The platform’s core capabilities are:

Fault injection: supports Linux and Windows, with over 200 fault scenarios and replay of historical faults.

Quantitative analysis: system stability assessment, user behavior metrics, and multi‑dimensional drill reports.

Automation: batch scenario creation, serial (sequential) and parallel (concurrent) drill execution, enabling efficient testing for engineers managing multiple systems.
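The serial/parallel distinction in the automation capability can be sketched in a few lines. This is not the platform's actual API, just a generic orchestration pattern, assuming each drill is an independent callable fault-injection step.

```python
# Illustrative sketch (not the platform's API) of serial vs. parallel
# drill execution, where each "drill" is a callable fault-injection step.
from concurrent.futures import ThreadPoolExecutor

def run_serial(drills):
    """Run drills one after another; results keep the input order."""
    return [d() for d in drills]

def run_parallel(drills, max_workers=4):
    """Run independent drills concurrently to shorten total wall time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda d: d(), drills))

# Example: three trivial stand-in drills.
drills = [lambda i=i: f"drill-{i}:ok" for i in range(3)]
print(run_serial(drills))    # ['drill-0:ok', 'drill-1:ok', 'drill-2:ok']
print(run_parallel(drills))
```

Serial execution suits dependent scenarios (e.g. inject, observe, recover); parallel execution is what enables one engineer to exercise many systems in a single batch.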

Platform views include management, user, dashboard, and API; functions cover drill workflow, resource and drill management, measurement, protection, and technical operations. Fault scenarios span basic, application, business, and special resources, aligned with the five risk categories.

Business‑resource fault types are described as “卡” (slow response), “吊” (process alive but no response), “死” (process dead, error response), and “错” (error return from service or interface).

The platform integrates daily operations, incident management, and risk governance, feeding identified risks into a unified system, executing drills, linking to emergency‑plan automation, and supporting automatic remediation.

Production‑event replay records incidents, generates technical risks, and closes the loop through platform‑driven validation.

Risk discovery also leverages internal/external event analysis, historical incident study, and red‑blue confrontation drills, where cross‑functional teams simulate attacks based on system characteristics.

Organizational structure includes a decision group (overall coordination), a support group (product development and technical support), and an execution group (interface liaison).

2022 goals focused on four business lines and seven drill specials:

Core system single‑point‑failure drill

Internal/external event simulation drill

Trading market feed drill

Full‑link performance‑capacity drill

Data backup‑recovery drill

Settlement reconciliation drill

Infrastructure joint drill

The platform now covers development, testing, and production environments. Because true production‑grade gray‑release (canary) testing is infeasible in securities, a full‑scale simulation environment mirroring production (monitoring, logging, dashboards) is used for red‑blue drills.

Over the past year, more than ten online/offline training sessions were held, establishing over 200 technical‑support groups to ensure rapid issue resolution.

Drill formats include:

Special drills: regular, covering the seven specials.

Surprise drills: unannounced tests to verify coverage, response, and emergency coordination.

Red‑blue confrontation: internal blue‑team simulations in the simulation environment.

Metrics show thousands of drills per quarter (peak 6,000), a 26‑fold efficiency increase, risk detection improvement, a 99% closure rate, a 16% rise in proactive monitoring detection, and a 23% reduction in mean‑time‑to‑resolve incidents.

The second season expanded to over 200 SRE engineers, 300+ systems, 3,600 deployment units, 8,000 hosts, and more than 13,000 drills—a 200% increase over the previous season.

Future Planning

Future work focuses on three areas:

Deepening risk mining: expanding the scenario library and building a capability matrix for intelligent drill recommendation.

Enhancing automation: improving preparation, scenario construction, execution, recovery, verification, and risk governance to achieve near‑zero‑human‑intervention operations.

Multi‑dimensional data analysis: analyzing system, personnel, and execution layers to drive comprehensive digital operations and continuous improvement.

Tags: risk management, operations, Chaos Engineering, SRE, financial services, Stability Platform
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
