Tagged articles

66 articles

Apr 5, 2026 · Information Security

Adversarial Testing: Strengthening Software Security Beyond Traditional Reliability

The article examines how adversarial testing—originating from machine‑learning robustness checks—has expanded to a full‑stack security practice, detailing intelligent generation techniques, DevSecOps integration, real‑world incidents, and its emerging role as a core resilience standard.

AI Robustness · DevSecOps · Resilience

0 likes · 6 min read

ByteDance Data Platform

Feb 2, 2026 · Big Data

How StreamShield Powers Production‑Grade Resilience for Apache Flink at Massive Scale

ByteDance’s StreamShield delivers a three‑layer resiliency framework—engine self‑healing, hybrid replication at the cluster level, and chaos‑tested releases—that enables over 70,000 concurrent Flink jobs on 11 million CPU cores to meet strict SLAs with second‑level startup and robust fault tolerance.

Apache Flink · ByteDance · Real‑Time Computing

0 likes · 6 min read

Ray's Galactic Tech

Jan 27, 2026 · Backend Development

Resilient Go Microservices: Rate Limiting, Circuit Breaking & K8s Architecture

This guide walks you through implementing a complete stability engineering system for Go microservices—covering token‑bucket rate limiting, concurrency and Redis‑based throttling, circuit breakers with slow‑request detection, graceful degradation strategies, Kubernetes‑aware deployment, monitoring, dynamic configuration, and load‑testing to set safe thresholds.

Circuit Breaker · Rate Limiting · Resilience

0 likes · 10 min read

dbaplus Community

Dec 21, 2025 · Operations

5 Must‑Have Soft Skills for Ops Engineers to Future‑Proof Their Careers

In a rapidly changing tech landscape where Kubernetes and AI dominate, seasoned ops professionals share five core soft skills (communication, problem solving, ownership, resilience, and continuous learning) that amplify technical expertise and drive promotions, salary growth, and long‑term career value.

Resilience · career development · communication

0 likes · 11 min read

21CTO

Dec 3, 2025 · Operations

What My Biggest Developer Mistakes Taught Me About Operations and Resilience

A software engineer recounts three major mistakes—from accidentally deleting thousands of F5 URLs to leaking code externally and being laid off during COVID—highlighting how operational oversights, poor process controls, and personal resilience shape professional growth and underscore the value of empathy and systematic safeguards.

Infrastructure · Process Improvement · Resilience

0 likes · 14 min read

21CTO

Nov 18, 2025 · Operations

What Cloudflare’s Latest Outage Reveals About Cloud Dependency Risks

A massive Cloudflare outage on November 18, 2025 crippled DNS and CDN services, causing widespread failures for platforms like ChatGPT and Discord, and the article analyzes the incident, past failures, and offers four practical resilience strategies to mitigate over‑reliance on single cloud providers.

CDN · Cloudflare · DNS

0 likes · 7 min read

dbaplus Community

Nov 8, 2025 · Cloud Native

Why the 2025 AWS Outage Shows Kubernetes Is the Key to True Multi‑Cloud Resilience

The 2025 AWS us‑east‑1 outage exposed the fragility of single‑cloud architectures and demonstrates how Kubernetes can provide a cloud‑native abstraction that enables true multi‑cloud portability, faster CI/CD pipelines, and resilient, cost‑effective infrastructure for modern software development.

Cloud Native · Resilience · aws-outage

0 likes · 10 min read

Cognitive Technology Team

Oct 12, 2025 · Backend Development

Resilient Microservices: Practical Patterns to Keep Your Services Alive

Learn how to tame chaotic microservices with practical resilience patterns—circuit breakers, bulkheads, smart retries, timeouts with fallbacks, and event‑driven messaging—plus tool recommendations and observability tips that ensure your system stays responsive even when individual services fail.

Observability · Resilience · Retry

0 likes · 9 min read

Programmer DD

Oct 10, 2025 · Artificial Intelligence

How to Build a Resilient Multi‑LLM Chatbot with Spring AI

This tutorial demonstrates how to integrate multiple large language models from different providers into a Spring Boot application using Spring AI, configure primary, secondary, and tertiary models, and implement a fallback mechanism with Spring Retry to ensure high availability of the chatbot.

Java · LLM · Resilience

0 likes · 12 min read

IT Architects Alliance

Oct 2, 2025 · Cloud Native

Mastering Cloud‑Native Architecture: 6 Core Principles Every Engineer Should Know

This article outlines six fundamental cloud‑native architecture principles—immutable infrastructure, service mesh, observability, declarative APIs, resilient design, and shift‑left security—explaining their purpose, key practices, code examples, and how they interrelate to build scalable, reliable, and secure distributed systems.

Cloud Native · Declarative API · Observability

0 likes · 11 min read

FunTester

Sep 14, 2025 · Operations

Essential Fault Testing & Chaos Engineering Resources: Articles, Guides, and Byteman Tutorials

This curated collection presents dozens of Chinese articles and guides on fault testing, chaos engineering, and Byteman usage, covering topics such as SACK, delayed ACK, RTT, socket buffers, HTTP timeouts, and practical Byteman techniques, each with publication dates for quick reference.

Byteman · Resilience · chaos engineering

0 likes · 9 min read

Java Architecture Diary

Jul 28, 2025 · Backend Development

How Spring Framework 7.0 Simplifies Retry and Concurrency with Built‑in Resilience

Spring Framework 7.0 introduces built‑in resilience annotations @Retryable and @ConcurrencyLimit, eliminating the need for external spring‑retry dependencies and enabling declarative retry, exponential backoff, and concurrency throttling—including reactive support—so developers can write cleaner, more robust Java backend services.

ConcurrencyLimit · Java · Resilience

0 likes · 7 min read

Su San Talks Tech

Jul 13, 2025 · Backend Development

8 Proven Retry Strategies to Prevent Costly Failures in Distributed Systems

Discover why improper retry logic can cause massive financial losses, learn eight practical retry solutions—from simple loops to advanced Resilience4j and distributed lock techniques—and see how to avoid retry storms, ensure idempotency, and protect resources in high‑traffic backend services.

Distributed Systems · Idempotency · Resilience

0 likes · 13 min read

Java One

Jul 12, 2025 · Backend Development

Mastering Alibaba Sentinel: Flow Control, Circuit Breaking, and Hotspot Rules in Production

This guide walks through Alibaba Sentinel's core protection strategies—flow‑control rules (including QPS and concurrency limits, modes, and effects), circuit‑breaker mechanisms (principles and three strategies), and hotspot parameter limiting—providing detailed configuration steps, code samples, and visual illustrations for real‑world microservice environments.

Alibaba Sentinel · Circuit Breaker · Flow Control

0 likes · 18 min read

Ops Development & AI Practice

Jul 3, 2025 · Operations

Why Event-Driven Architecture Is the Secret Sauce for Resilient Ops

The article explains how Event‑Driven Architecture (EDA) transforms traditional request‑response systems into decoupled, asynchronous pipelines that boost system resilience, scalability, observability, and agility, and it demonstrates a practical AWS EventBridge image‑processing workflow.

AWS EventBridge · EDA · Event-Driven Architecture

0 likes · 10 min read

DaTaobao Tech

Apr 28, 2025 · Frontend Development

Front‑End Architecture and Performance Optimization for a Large‑Scale Chinese New Year Interactive Activity

The article details a large‑scale Chinese New Year interactive activity’s front‑end architecture, describing a layered system for business logic, data abstraction, and animation engines, unified data handling, dynamic animation rendering with downgrade paths, high‑concurrency QPS reduction, resilience measures, and extensive performance and workflow optimizations.

Animation · Data Management · Frontend

0 likes · 15 min read

Su San Talks Tech

Apr 27, 2025 · Backend Development

Mastering Microservices: Advantages, Challenges, and Essential Design Patterns

This article explains what microservices are, outlines their key advantages such as scalability and resilience, details the inherent challenges like complexity and security, and introduces essential design patterns—including Database‑Per‑Service, API Gateway, BFF, CQRS, Event Sourcing, Saga, Sidecar, Circuit Breaker, ACL, and Aggregator—to help architects build robust, maintainable systems.

Backend Architecture · Cloud Native · Microservices

0 likes · 23 min read

Cognitive Technology Team

Apr 11, 2025 · Backend Development

Hystrix Service Isolation: Thread‑Pool and Semaphore Isolation Patterns

The article explains how Hystrix uses thread‑pool and semaphore isolation to prevent cascading failures in microservice architectures, detailing implementation, configuration defaults, suitable scenarios, and recommendations for building resilient distributed systems.

Hystrix · Microservices · Resilience

0 likes · 5 min read

FunTester

Mar 31, 2025 · Operations

Performance Testing and Fault Testing: Complementary Pillars for System Stability

The article explains how performance testing measures system efficiency under load while fault testing validates resilience under abnormal conditions, highlighting their shared goals, differences, overlapping toolchains, and how their combined use drives architecture optimization and improves service level agreements in modern complex software systems.

Fault Injection · Operations · Resilience

0 likes · 14 min read

FunTester

Mar 25, 2025 · Operations

Integrating Chaos Engineering into Service Dependency Governance for Resilient Cloud‑Native Systems

This article explores how to embed chaos engineering practices into service dependency governance, detailing dynamic validation versus static analysis, fault injection techniques, multi‑point failure simulations, and data‑driven optimizations to build robust, self‑healing microservice architectures in cloud‑native environments.

Cloud Native · Microservices · Operations

0 likes · 18 min read

FunTester

Mar 7, 2025 · Operations

Fault Testing: Proactive Resilience Engineering for Distributed Systems

Fault testing, akin to a shield, deliberately injects failures into distributed and cloud‑native systems to expose weak points, verify recovery mechanisms, and improve overall reliability, ensuring business continuity even under unexpected disruptions.

Operations · Resilience · chaos engineering

0 likes · 11 min read

Architect

Jan 25, 2025 · Backend Development

HTTP Retry Strategies in Offline Store Systems: Simple Loop, Apache HttpClient, and MQ‑Based Asynchronous Retries

This article explores practical HTTP retry solutions for offline store applications, covering a basic loop retry, the built‑in retry mechanism of Apache HttpClient with custom handlers, and an asynchronous retry approach using message queues to achieve higher reliability and eventual consistency.

Apache HttpClient · HTTP · Java

0 likes · 12 min read

JavaEdge

Oct 21, 2024 · Operations

Why Move Beyond Microservices? Unlocking Resilience with Unitized Architecture

This article explores the advantages of unitized architecture over traditional microservices, detailing how its modular design, dedicated routing layer, and tailored observability practices enhance system resilience, fault‑tolerance, and operational insight for large‑scale distributed applications.

Distributed Systems · Resilience · fault tolerance

0 likes · 17 min read

dbaplus Community

Oct 3, 2024 · Operations

How Netflix Uses Chaos Engineering to Build Resilient Distributed Systems

This article explains Netflix's chaos engineering practice, detailing the challenges of microservice reliability, the implementation of the Chaos Monkey tool, the step‑by‑step methodology, guiding principles, and real‑world outcomes that demonstrate improved system availability.

Chaos Monkey · Distributed Systems · Netflix

0 likes · 6 min read

JavaEdge

Aug 13, 2024 · Backend Development

How to Use Circuit Breakers to Decouple Event Retrieval in Microservices

This article explains why tightly coupled request/response communication can overload downstream services, introduces the circuit‑breaker pattern (including its three‑state state machine), and shows step‑by‑step how to integrate a circuit breaker into event‑driven microservices to pause event retrieval, handle state transitions, and avoid dead‑letter queues.

Circuit Breaker · Resilience

0 likes · 9 min read

DevOps Coach

Jun 27, 2024 · Operations

How to Run Effective Incident Response Drills for Resilient Systems

This article explains why regular disaster role‑playing, systematic testing, and focused responder preparation are essential for building robust incident response capabilities and reducing operational risk in production environments.

Incident Response · Operations · Resilience

0 likes · 7 min read

Architect

Dec 22, 2023 · Operations

How Tencent Search Built a Multi‑Layered Stability Architecture to Slash MTTD and MTTR

The article details Tencent Search’s end‑to‑end stability engineering practice, covering a ten‑step architecture that combines redundancy, proactive detection, rapid emergency response, automated cut‑over, defensive caching, and continuous drills, and shows how these measures collectively reduced mean‑time‑to‑detect and mean‑time‑to‑recover by an order of magnitude while keeping service availability high.

Incident Management · Observability · Resilience

0 likes · 32 min read

政采云技术

Nov 29, 2023 · Frontend Development

API Failure Resilience Using CDN and IndexedDB Caching

The article presents a comprehensive strategy for handling API outages by storing data locally with IndexedDB, synchronizing updates through a CDN, and implementing Axios interceptors and Node‑based scheduled jobs to ensure seamless user experience without white‑screen failures.

API · CDN · Caching

0 likes · 12 min read

Spring Full-Stack Practical Cases

Nov 9, 2023 · Backend Development

Preventing Service Avalanche with Hystrix: Strategies and Code Samples

This article explains how synchronous service calls can cause thread exhaustion and cascading failures known as the avalanche effect, and demonstrates how to use Hystrix's circuit‑breaker, isolation, and fallback features with practical Java code to protect backend systems.

Hystrix · Resilience · avalanche effect

0 likes · 10 min read

Architects Research Society

Oct 3, 2023 · Cloud Native

Chaos Engineering: Concepts, History, Benefits, Challenges, and Getting Started

Chaos engineering is a disciplined approach to testing distributed systems by intentionally injecting failures to verify resilience, covering its definition, origins at Netflix, operational workflow, benefits, challenges, and practical steps for organizations to adopt resilient cloud‑native applications.

Observability · Resilience · chaos engineering

0 likes · 18 min read

MaGe Linux Operations

Jun 24, 2023 · Backend Development

12 Essential Microservice Patterns to Boost Scalability and Resilience

This article explains why microservice architecture matters and walks software engineers through twelve core design patterns—such as API Gateway, Service Discovery, Circuit Breaker, and Strangler—that together improve system scalability, fault‑tolerance, performance, and maintainability.

Microservices · Resilience · Scalability

0 likes · 17 min read

ByteDance SYS Tech

Feb 28, 2023 · Cloud Native

How ByteDance’s ARES Boosts Cloud‑Native Resilience with Chaos Engineering

This article explains ByteDance’s end‑to‑end chaos engineering practice for cloud‑native environments, covering its background, principles, comparison with traditional testing, the evolution of its internal platforms, and a detailed look at the Application Resilience Enhancement Service (ARES) and its core features.

Fault Injection · Kubernetes · Microservices

0 likes · 17 min read

Architects Research Society

Oct 10, 2022 · R&D Management

Future‑Ready CIO Leadership: Insights from Three Executives

The article explores how business‑driven CIOs are updating their leadership playbooks for the future of work, emphasizing adaptability, resilience, proactive problem‑solving, and a people‑first culture, based on interviews with CIOs from GEHA Health, Panera Bread, and Novant Health.

Adaptability · CIO · DigitalTransformation

0 likes · 10 min read

IT Architects Alliance

Jun 20, 2022 · Cloud Native

Building Resilient Microservices: Fault Tolerance, Graceful Degradation, and Reliability Patterns

This article explains how microservice architectures can achieve high availability by using fault‑tolerant designs such as graceful degradation, health checks, failover caching, circuit breakers, bulkheads, rate limiting, and systematic change‑management practices to mitigate network, hardware, and application errors.

Circuit Breaker · Microservices · Resilience

0 likes · 13 min read

Laiye Technology Team

Jun 10, 2022 · Backend Development

Understanding System Failures and Principles for Resilient Architecture

The article analyzes why modern software systems repeatedly collapse—due to growing business complexity, unpredictable changes, and architectural decay—and proposes principles such as decentralization, integration, and diversity, along with practical strategies like service mesh and eBPF, to design more sustainable, observable, and self‑evolving architectures.

Distributed Systems · Microservices · Resilience

0 likes · 12 min read

IT Architects Alliance

May 28, 2022 · Operations

Why Circuit Breaking and Degradation Are Essential for High‑Availability Microservices

The article explains how microservice architectures can suffer from cascading failures, why circuit breaking and degradation are critical for protecting service availability, compares popular libraries such as Sentinel, Hystrix and Resilience4j, and dives deep into Sentinel's degradation implementation, rule definition, data collection, verification, and execution flow.

Circuit Breaking · Microservices · Resilience

0 likes · 12 min read

DevOps

May 18, 2022 · Operations

Understanding and Preventing Cascading Failures in Distributed Systems

The article explains how cascading failures arise from positive feedback loops in distributed systems, illustrates real‑world incidents such as the 2015 DynamoDB outage, outlines anti‑patterns like unlimited retries and unchecked load, and presents practical mitigation techniques including load‑shedding, circuit breakers, exponential back‑off, and controlled replication to improve system resilience.

Circuit Breaker · Distributed Systems · Resilience

0 likes · 19 min read

Cloud Native Technology Community

May 11, 2022 · Cloud Native

Ensuring Resilience for Stateful Kubernetes: A Multi‑Cloud Scalable Storage Solution

The article explains the challenges of running stateful applications on Kubernetes—including complexity, vendor lock‑in, elasticity limits, and bloated infrastructure—and presents multi‑cloud, one‑click storage platforms that provide resilient, portable workloads without data loss.

Cloud Native · Kubernetes · Resilience

0 likes · 5 min read

Architecture Digest

May 8, 2022 · Fundamentals

Building Robust Distributed Systems: Reducing Dependencies and Enhancing Resilience

The article explains how to design resilient distributed systems by minimizing inter‑component dependencies, duplicating or denormalizing data, isolating failures with SLAs, protecting callers and callees, and adding buffers such as asynchronous messaging and elastic scaling to handle random faults as systems grow.

Microservices · Resilience · SLA

0 likes · 8 min read

Su San Talks Tech

Mar 14, 2022 · Backend Development

Master OpenFeign: From Basics to Advanced Timeout, Logging, and Resilience

This tutorial walks you through OpenFeign in Spring Cloud, explaining its purpose, differences from Feign, setup steps, various parameter passing methods, timeout handling, logging enhancement, HTTP client replacement, GZIP compression, and circuit‑breaker integration with Sentinel, all illustrated with code snippets and diagrams.

Java · Microservices · OpenFeign

0 likes · 19 min read

IT Architects Alliance

Mar 13, 2022 · Operations

30 Essential Architecture Patterns for Scalable, Resilient Systems

This article presents a comprehensive catalog of thirty architectural patterns (grouped into management, monitoring, performance, data management, design, messaging, resilience, and security categories), explaining their purpose, typical use cases, benefits, and implementation considerations to help engineers build robust, high‑performance distributed applications.

Architecture Patterns · Operations · Resilience

0 likes · 32 min read

IT Architects Alliance

Mar 10, 2022 · Backend Development

Building Resilient Microservices: Patterns and Practices for High Availability

This article explains the risks of microservice architectures and presents a collection of reliability patterns—including graceful degradation, change management, health checks, self‑healing, failover caching, retries, rate limiting, bulkheads, and circuit breakers—to help engineers design and operate highly available backend services.

Circuit Breaker · Microservices · Resilience

0 likes · 17 min read

Alibaba Cloud Native

Aug 27, 2021 · Operations

How Chaos Engineering Strengthens System Resilience: Building a Fault‑Injection Platform

This article explains why modern agile and DevOps environments need chaos engineering, describes the design and goals of a fault‑injection platform, outlines tool selection, details a five‑step exercise workflow, and shares a real‑world case study that demonstrates improved stability and SRE capabilities.

Platform · Resilience · SRE

0 likes · 10 min read

Java High-Performance Architecture

Jul 1, 2021 · R&D Management

How a Self‑Funded Small Team Built a $1M ARR Cross‑Platform Email Client

This article recounts how Missive’s four‑person, self‑funded team overcame technical and market challenges to create a cloud‑based, cross‑platform email client that reached $1 million ARR, highlighting funding strategy, team roles, architecture decisions, customer acquisition, and the importance of resilience.

Product Development · Resilience · Startup

0 likes · 10 min read

Top Architect

May 24, 2021 · Backend Development

Understanding Hystrix: Service Isolation, Circuit Breaking, and Monitoring in Spring Cloud

This article explains why Hystrix is needed for fault tolerance in distributed systems, describes its key features such as circuit breaking, thread and semaphore isolation, fallback mechanisms, request collapsing, and monitoring, and provides step‑by‑step configuration examples and code snippets for integrating Hystrix into Spring Cloud microservices.

Hystrix · Resilience · circuit-breaker

0 likes · 18 min read

Yang Money Pot Technology Team

May 18, 2021 · Backend Development

Understanding Hystrix: Resilience Patterns, Execution Flow, and Custom Extensions

This article explains how Hystrix implements resiliency patterns such as bulkhead, circuit breaker, retry, and degradation for microservice calls, details its execution workflow, core components, dynamic configuration, isolation strategies, metrics collection, and practical usage, and discusses future alternatives and extensions.

CircuitBreaker · DistributedSystems · Java

0 likes · 33 min read

Architects Research Society

Apr 30, 2021 · Operations

Health Management and Diagnostics in Microservices

The article explains how microservices can achieve resilience through health reporting, diagnostics, standardized logging, health‑check implementations, and orchestrator coordination to detect failures, restart services, handle upgrades, and recover from partial cloud‑based failures.

Observability · Orchestration · Resilience

0 likes · 9 min read

Wukong Talks Architecture

Oct 28, 2020 · Operations

From the Battle of Red Cliffs to Service Avalanche: Understanding Circuit Breaker and Resilience in Microservices

This article uses the historic Battle of Red Cliffs as an analogy to explain service avalanche in micro‑service architectures, analyzes its causes, presents real‑world scenarios, and details circuit‑breaker concepts, algorithms, recovery strategies, and practical mitigation techniques.

Circuit Breaker · Resilience · Service Avalanche

0 likes · 10 min read

Architects' Tech Alliance

Oct 12, 2020 · Operations

Designing Resilient Microservices: Patterns for Fault Tolerance and Failure Management

This article examines the inherent risks of microservice architectures and presents practical patterns—such as graceful degradation, change management, health checks, self‑healing, fallback caching, retries, rate limiting, bulkheads, and circuit breakers—to build highly available, fault‑tolerant services.

Circuit Breaker · Microservices · Rate Limiting

0 likes · 15 min read

Meituan Technology Team

Sep 30, 2020 · Information Security

Security Control Algorithms for Cyber‑Physical Systems

Professor Mo Yilin explained that securing cyber‑physical systems—such as autonomous vehicles and smart grids—requires a multi‑layered approach combining control‑theoretic redundancy, active watermark‑based intrusion detection, resilient estimation, and data‑driven design to maintain safe operation despite networked attacks and replay threats, ensuring reliability of critical infrastructure.

Resilience · control algorithms · cyber-physical systems

0 likes · 25 min read

IT Architects Alliance

Sep 27, 2020 · Backend Development

How Circuit Breakers Safeguard Microservices: A Deep Dive into Resilience

This article explains the concept, states, and practical benefits of circuit breaker mechanisms in microservice architectures, illustrating how they prevent cascading failures, improve system stability, and provide configurable recovery strategies for robust cloud‑native applications.

Circuit Breaker · Cloud Native · Microservices

0 likes · 13 min read

Java Architect Essentials

Aug 26, 2020 · Backend Development

A Comprehensive Guide to Evolving a Monolithic Online Store into a Robust Microservice Architecture

This article walks through the transformation of a simple online supermarket from a monolithic design to a fully fledged microservice system, explaining the motivations, architectural changes, component selection, common pitfalls, and best‑practice solutions such as service decomposition, database sharding, monitoring, tracing, service mesh, resilience patterns, and testing strategies.

Microservices · Monitoring · Resilience

0 likes · 22 min read

Efficient Ops

Mar 10, 2020 · Operations

How to Build Anti‑Fragile Operations in the Cloud Era

This article explains the anti‑fragility concept, illustrates how cloud‑based systems become increasingly vulnerable to unexpected events, and offers practical strategies—including risk reduction, choice diversification, proactive experimentation, and biologically inspired resilience—to transform operations and turn shocks into opportunities.

Anti-Fragility · Cloud Computing · DevOps

0 likes · 19 min read

JD Retail Technology

Mar 5, 2020 · Backend Development

Technical Implementation and Resilience Practices of JD.com PC Homepage

This article details the architectural redesign, fault‑tolerance mechanisms, performance optimizations, and monitoring strategies employed in JD.com’s PC homepage, illustrating how backend technologies such as OpenResty, Lua, Redis, and NGINX are orchestrated to achieve high availability and sub‑30 ms page loads.

Backend Development · Lua · OpenResty

0 likes · 12 min read

ITFLY8 Architecture Home

May 15, 2019 · Backend Development

Dubbo & Zookeeper Failure: How Services Stay Connected, Direct Links & Security

This article explains how Dubbo handles service communication when the Zookeeper registry crashes, the role of local caches, the differences between registry-based and direct point‑to‑point connections, Dubbo’s token‑based security, and how new providers are discovered or missed.

Direct Connection · Dubbo · Resilience

0 likes · 7 min read

Wukong Talks Architecture

Apr 27, 2019 · Backend Development

Implementing a Circuit Breaker Mechanism for Backend API Calls

This article explains a practical circuit‑breaker design for backend services, detailing detection logic, algorithm thresholds, time‑window statistics, recovery duration, manual overrides, a global switch, and how to monitor the breaker’s current state using Redis.

API · Circuit Breaker · Rate Limiting

0 likes · 6 min read

Wukong Talks Architecture

Apr 24, 2019 · Backend Development

Circuit Breaker Mechanism: Detection, Algorithm, Time Window, Duration, Manual Trigger, Global Switch, and Monitoring

This article explains a project's circuit breaker implementation, covering detection steps, the algorithm based on request count and failure rate, time‑window statistics, recovery duration, manual activation, a global enable switch, and how to monitor its current state.

Circuit Breaker · Resilience · failure rate

0 likes · 5 min read

Alibaba Cloud Developer

Mar 28, 2019 · Operations

How ChaosBlade Empowers You to Build Resilient Cloud‑Native Systems

ChaosBlade is an open‑source chaos engineering tool from Alibaba that lets you repeatedly inject failures into distributed systems, helping you measure fault tolerance, validate orchestration, test platform robustness, verify monitoring alerts, and improve emergency response capabilities for more reliable cloud‑native applications.

DevOps · Distributed Systems · Open Source

0 likes · 9 min read

High Availability Architecture

Jan 24, 2019 · Operations

Understanding Chaos Engineering: Principles, Practices, and Lessons from Netflix

This article explains chaos engineering, its origins at Netflix, core principles, practical steps for running experiments, and how organizations can use controlled failure injection to improve system resilience and operational confidence in complex distributed environments.

Distributed Systems · Netflix · Reliability

0 likes · 9 min read

High Availability Architecture

Sep 12, 2018 · Backend Development

Circuit Breaker and Retry Mechanisms in Microservices with Hystrix‑Go

This article explains the principles and operation of circuit breakers and retry mechanisms in microservice architectures, describes their three states, key configuration parameters, demonstrates a Hystrix‑Go implementation, and discusses back‑off strategies and the combined use of both techniques for resilient backend services.

Circuit Breaker · Microservices · Resilience

0 likes · 7 min read

HomeTech

Aug 28, 2018 · Backend Development

Understanding Hystrix Circuit Breaker: Concepts, Configuration, and Usage in Spring Cloud

This article explains the role of circuit breakers in microservice architectures, introduces Netflix Hystrix and its integration with Spring Cloud, and provides detailed configuration, usage examples, and best‑practice guidelines for building resilient Java backend services.

Circuit Breaker · Hystrix · Java

0 likes · 9 min read

DevOps

May 7, 2018 · Cloud Computing

Netflix’s Journey: From DVD Rental to Cloud‑Native Chaos Engineering on AWS

This article chronicles Netflix’s evolution from a DVD‑rental startup to a cloud‑native streaming giant, highlighting its partnership with AWS, the development of chaos‑engineering tools like Chaos Monkey and the Simian Army, and the open‑source technologies that underpin its resilient, scalable architecture.

Netflix · Resilience · Simian Army

0 likes · 14 min read

DevOpsClub

May 1, 2018 · Cloud Computing

How Netflix Uses Chaos Monkey and AWS to Build Resilient Cloud Services

The article traces Netflix’s evolution from DVD rentals to a cloud‑native streaming giant, explains how it leverages AWS for massive scale, and details its chaos‑engineering tools—Chaos Monkey, Simian Army, and related monkeys—that continuously test and improve system resilience.

DevOps · Netflix · Resilience

0 likes · 13 min read

ITFLY8 Architecture Home

Mar 5, 2018 · Backend Development

Mastering the Circuit Breaker Pattern: Design, Implementation, and Testing

This article explains the circuit breaker pattern for distributed systems, detailing its problem context, state machine solution, implementation in C#, key considerations, usage scenarios, and comprehensive unit tests, illustrating how to improve system resilience and prevent cascading failures.

C# · Circuit Breaker · Distributed Systems

0 likes · 21 min read

dbaplus Community

Oct 23, 2017 · Databases

How eBay Builds Resilient Multi‑Data‑Center Applications with MongoDB

The article explains eBay's use of MongoDB to create highly available, fault‑tolerant multi‑data‑center architectures, detailing design patterns, replica set configurations, read/write strategies, and recent MongoDB features that enable scalable, mission‑critical applications.

Database Design · MongoDB · Multi-Data Center

0 likes · 8 min read

Architecture Digest

Jun 19, 2016 · Backend Development

Preventing and Recovering from Service Overload Caused by Cache Failures

This article analyzes how introducing caches can cause service overload, examines five cache get patterns, and proposes prevention, recovery, and flow‑control strategies for both client and server sides to ensure system stability.

Cache · Distributed Systems · Flow Control

0 likes · 20 min read
