Tag

Chaos Engineering

1 view collected around this technical thread.

FunTester
May 28, 2025 · Cloud Native

Extending Automated Thread Dumps: Log Collection, Resource Monitoring, Chaos Engineering, Performance Analysis, and Environment Cleanup

The article explores how automated thread dumps can be expanded into multiple testing scenarios—including log collection, resource monitoring, fault injection, performance result analysis, and environment cleanup—by leveraging Kubernetes APIs, Prometheus, Chaos Mesh, and scripting tools to improve efficiency, observability, and system resilience.

Chaos Engineering · Kubernetes · Log Collection
0 likes · 9 min read
FunTester
May 20, 2025 · Operations

Baseline Metrics for Initiating Chaos Engineering

The article outlines essential baseline metrics—including application, SEV, alert, and infrastructure indicators—required before launching chaos engineering experiments, describes a multi‑stage experiment sequence across known and unknown system areas, and presents best‑practice guidelines for safely conducting chaos tests in production environments.

Chaos Engineering · baseline metrics · distributed systems
0 likes · 9 min read
FunTester
May 19, 2025 · Operations

Chaos Engineering Tools, Theory, and Practices

Chaos engineering, a scientific method for improving system resilience, is explored through an overview of leading tools such as Gremlin, ChaosBlade, Chaos Mesh, Chaos Toolkit, and ChaosMeta, alongside core concepts, real-world case studies, common misconceptions, and the practical value of controlled fault injection in distributed systems.

Chaos Engineering · Fault Injection · Reliability
0 likes · 12 min read
FunTester
May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering is a discipline that deliberately injects faults into distributed systems to test and improve resilience, tracing its evolution from Netflix's Chaos Monkey to modern platforms, outlining its operational workflow, benefits, and core principles for reliable system design.

Chaos Engineering · Fault Injection · Reliability
0 likes · 9 min read
JD Tech
Apr 17, 2025 · Operations

Chaos Engineering: Principles, Core Steps, Tool Selection, and AI Integration

This article explains chaos engineering—its definition, core principles, experimental workflow, tool selection, AI‑driven enhancements, and practical case studies—providing a comprehensive guide for building resilient distributed systems across backend, cloud‑native, mobile, and AI‑enabled environments.

AI integration · Chaos Engineering · Fault Injection
0 likes · 26 min read
FunTester
Mar 25, 2025 · Operations

Integrating Chaos Engineering into Service Dependency Governance for Resilient Cloud‑Native Systems

This article explores how to embed chaos engineering practices into service dependency governance, detailing dynamic validation versus static analysis, fault injection techniques, multi‑point failure simulations, and data‑driven optimizations to build robust, self‑healing microservice architectures in cloud‑native environments.

Chaos Engineering · cloud native · microservices
0 likes · 18 min read
FunTester
Mar 23, 2025 · Operations

The Origin, Development, and Future of Chaos Engineering

Chaos engineering, introduced by Netflix in 2011 to proactively inject failures and test system resilience, has evolved over the past decade into a widely adopted practice integrated with SRE, automation, AI, and Kubernetes, offering best‑practice guidelines and future trends for improving distributed system reliability.

Chaos Engineering · Fault Injection · Kubernetes
0 likes · 8 min read
FunTester
Mar 14, 2025 · Operations

Fault Testing: Enhancing System Resilience through Controlled Failure Simulations

The article explains how fault testing, which deliberately injects failures in a controlled environment, helps identify system weaknesses, validate post‑mortem improvements, and drive architectural optimization, thereby increasing the availability and resilience of modern internet services.

Chaos Engineering · High Availability · fault testing
0 likes · 8 min read
FunTester
Mar 12, 2025 · Operations

Fault Injection Testing: Concepts, Scenarios, Process, and Best Practices

Fault injection testing deliberately introduces failures into a system to assess its resilience, helping identify weak points, improve retry and timeout mechanisms, and ensure robust operation across software, protocol, and infrastructure layers, with practical guidance on processes, tools, and Kubernetes-specific practices.

Chaos Engineering · Fault Injection · Kubernetes
0 likes · 8 min read
FunTester
Mar 7, 2025 · Operations

Fault Testing: Proactive Resilience Engineering for Distributed Systems

Fault testing deliberately injects failures into distributed and cloud‑native systems to expose weak points, verify recovery mechanisms, and improve overall reliability, acting as a shield that keeps the business running through unexpected disruptions.

Chaos Engineering · distributed systems · fault testing
0 likes · 11 min read
FunTester
Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

Chaos Engineering · Rate Limiting · circuit breaker
0 likes · 11 min read
Bilibili Tech
Nov 19, 2024 · Operations

Building a Lightweight Disaster‑Recovery Drill System at Bilibili: Architecture, Practices, and Lessons

Bilibili’s infrastructure team created a lightweight, multi‑layered disaster‑recovery drill platform—combining an atomic fault library, scenario catalogs, chaos‑experiment orchestration, real‑time observation, and a product‑level interface—backed by standardized governance and CI‑integrated automation, cutting drill preparation from weeks to days and boosting weekly resilience testing across the organization.

Chaos Engineering · High Availability · automation
0 likes · 39 min read
FunTester
Sep 20, 2024 · Operations

Chaos Engineering vs Fault Testing: Methods, Challenges, and Future Trends

This article compares chaos engineering and fault testing, outlines fault injection techniques, implementation layers, testing strategies, challenges, and future trends such as automation, AI-driven diagnostics, and cloud‑native integration, providing a comprehensive guide for improving system resilience and reliability.

Chaos Engineering · cloud native · fault testing
0 likes · 17 min read
FunTester
Sep 19, 2024 · Fundamentals

Software Antifragility: Rethinking Error Handling and Reliability

This paper introduces the concept of software antifragility, drawing on Taleb’s theory to argue that embracing errors through fault tolerance, automatic runtime repair, and fault injection can transform software systems into self‑improving, more robust entities, and discusses implications for development processes and product reliability.

Chaos Engineering · antifragility · fault tolerance
0 likes · 13 min read
Tencent Cloud Developer
Jul 17, 2024 · Operations

Combining FMEA and Chaos Engineering to Improve Software Architecture Availability

By integrating the proactive, static risk assessment of Failure Mode and Effects Analysis with the dynamic fault‑injection validation of chaos engineering, the article demonstrates how cloud‑native architectures—illustrated through a Tencent‑based e‑commerce case—can systematically identify, quantify, and mitigate availability risks, leading to continuous, measurable resilience improvements.

Chaos Engineering · FMEA · availability
0 likes · 16 min read
DataFunSummit
May 19, 2024 · Cloud Native

Design and Implementation of a Cloud‑Native Recommendation System Architecture

This article explains how to design and implement a recommendation system by leveraging a four‑layer cloud‑native stack, covering virtualization, micro‑service migration, service governance, elasticity, cloud‑native business capabilities, and chaos‑engineering‑based stability practices to achieve cost‑effective, high‑performance, and reliable recommendation services.

Chaos Engineering · Virtualization · architecture
0 likes · 10 min read
Bilibili Tech
Apr 9, 2024 · Operations

BCM – Building and Deploying Bilibili’s Chaos Engineering Platform

At the 2024 GOPS Global Operations Conference, Bilibili senior R&D engineer Gu Lintao will present BCM—Bilibili’s Chaos Engineering Platform—showcasing how its design and capabilities let developers, testers, and SREs safely inject faults, uncover hidden architectural risks, and improve service stability through real‑world drills and systematic reliability engineering.

Bilibili · Chaos Engineering · DevOps
0 likes · 3 min read
FunTester
Mar 29, 2024 · Operations

Implementing Chaos Engineering in WeChat Pay: Practices, Challenges, and Outcomes

This article describes how WeChat Pay applied chaos engineering to improve system reliability, detailing the business scenario, challenges of controlling fault injection radius, practical solutions, risk assessment, automation, and the resulting business and tool achievements.

Chaos Engineering · Fault Injection · High Availability
0 likes · 18 min read
High Availability Architecture
Mar 21, 2024 · Operations

Applying Chaos Engineering to WeChat Pay: Design, Implementation, and Outcomes

To improve the ultra‑high availability of WeChat Pay, the team introduced chaos engineering using multi‑partition isolation, controlled blast radius, automated fault injection, and systematic risk discovery, detailing the design, execution, automation, and results of this reliability‑focused initiative.

Chaos Engineering · Fault Injection · High Availability
0 likes · 18 min read
Tencent Cloud Developer
Mar 19, 2024 · Operations

Chaos Engineering in WeChat Pay: Design, Implementation, and Results

WeChat Pay’s team adopted Netflix‑style chaos engineering, building an automated, YAML‑driven fault‑injection platform that isolates experiments in multi‑zone partitions. The platform enabled more than 500 safe experiments in 2021‑2022 and uncovered critical bugs across core services while maintaining five‑nines availability with zero production incidents.

Chaos Engineering · Fault Injection · High Availability
0 likes · 18 min read