Tag

fault tolerance

1 views collected around this technical thread.

Tencent Cloud Developer
Tencent Cloud Developer
May 20, 2025 · Cloud Computing

Efficient and Resilient Cloud Gateway at Scale: Architecture, Key Technologies, and Operational Practices of Tencent TGW

The article presents a comprehensive analysis of Tencent's TGW cloud gateway, detailing its modular architecture, high‑performance forwarding plane, lossless state migration, rapid fault recovery, multi‑level redundancy, operational best practices, and security mechanisms that enable ultra‑low latency and high availability for large‑scale internet services.

cloud gatewayfault tolerancehigh-performance forwarding
0 likes · 13 min read
Efficient and Resilient Cloud Gateway at Scale: Architecture, Key Technologies, and Operational Practices of Tencent TGW
Xiaokun's Architecture Exploration Notes
Xiaokun's Architecture Exploration Notes
May 18, 2025 · Fundamentals

How Distributed Consensus Overcomes the FLP Impossibility Theorem

This article explores how to build fault‑tolerant distributed systems by formalizing consensus, outlines its core properties, explains the FLP impossibility theorem, and shows how algorithms like Raft sidestep its limits through timing constraints and recovery mechanisms.

ConsensusFLP theoremHigh Availability
0 likes · 8 min read
How Distributed Consensus Overcomes the FLP Impossibility Theorem
Xiaokun's Architecture Exploration Notes
Xiaokun's Architecture Exploration Notes
May 11, 2025 · Fundamentals

How Fencing Tokens Ensure Safety and Liveness in Distributed Lock Services

This article explores how fencing tokens can provide safety and liveness guarantees in distributed lock services, illustrating fault scenarios, token-based conflict resolution, and abstract system models that help engineers prioritize correctness while tolerating temporary unavailability.

distributed systemsfault tolerancefencing tokens
0 likes · 8 min read
How Fencing Tokens Ensure Safety and Liveness in Distributed Lock Services
Xiaokun's Architecture Exploration Notes
Xiaokun's Architecture Exploration Notes
May 11, 2025 · Fundamentals

Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them

Distributed systems suffer from network unreliability—including packet loss, out‑of‑order delivery, variable latency, and ambiguous node failures—making timeout settings and fault detection challenging, and this article explains these issues, compares synchronous and asynchronous networks, and discusses strategies to balance latency and resource utilization.

asynchronous networkdistributed systemsfault tolerance
0 likes · 8 min read
Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them
Cognitive Technology Team
Cognitive Technology Team
Apr 8, 2025 · Backend Development

Design and Implementation of RocketMQ NameServer: Core Functions, Architecture, and Optimization Strategies

The article explains RocketMQ NameServer's lightweight, stateless design, its core routing and metadata management functions, AP‑oriented architecture, fault‑tolerant mechanisms, scalability features, and practical optimization techniques for high availability and low operational cost.

Distributed MessagingNameServerbackend
0 likes · 6 min read
Design and Implementation of RocketMQ NameServer: Core Functions, Architecture, and Optimization Strategies
DataFunSummit
DataFunSummit
Mar 20, 2025 · Artificial Intelligence

Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training

The article traces the evolution of AI training stability from early manual operations on small GPU clusters to sophisticated, fault‑tolerant infrastructures for thousand‑card and ten‑thousand‑card models, detailing Baidu Baige’s metrics, monitoring, eBPF‑based diagnostics, and checkpoint strategies that reduce invalid training time and accelerate fault recovery.

AI trainingLarge-Scale Trainingcheckpointing
0 likes · 22 min read
Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training
FunTester
FunTester
Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

Chaos EngineeringRate Limitingcircuit breaker
0 likes · 11 min read
Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems
IT Services Circle
IT Services Circle
Feb 9, 2025 · Big Data

Understanding HDFS: Architecture, Data Blocks, Fault Tolerance, and High Availability

This article explains how HDFS, the Hadoop Distributed File System, splits large files into blocks, replicates them for fault tolerance, organizes the cluster into NameNode and DataNode components, and provides high‑availability and scalability mechanisms such as standby NameNode and federation, enabling reliable big‑data storage and access.

DataNodeHDFSHigh Availability
0 likes · 11 min read
Understanding HDFS: Architecture, Data Blocks, Fault Tolerance, and High Availability
IT Architects Alliance
IT Architects Alliance
Jan 14, 2025 · Backend Development

Microservice Architecture: Common Problems and Solutions

Microservice architecture, once a buzzword, breaks monolithic applications into independent services, but introduces challenges such as service governance, communication, gateway management, fault tolerance, and tracing; the article outlines these issues and presents practical solutions like Consul/Eureka, REST/RPC, API gateways, Hystrix, and tracing tools.

API GatewayDistributed TracingMicroservices
0 likes · 11 min read
Microservice Architecture: Common Problems and Solutions
High Availability Architecture
High Availability Architecture
Jan 13, 2025 · Operations

Comprehensive Guide to High‑Availability System Architecture and Practices

This article provides a systematic overview of high‑availability system design, covering availability metrics, fault prevention, detection, recovery, capacity planning, service tiering, data layer resilience, monitoring, and the responsibilities of architects, SREs, and developers to ensure reliable, scalable services.

High AvailabilitySystem Architecturecapacity planning
0 likes · 30 min read
Comprehensive Guide to High‑Availability System Architecture and Practices
IT Architects Alliance
IT Architects Alliance
Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

distributed systemsfault tolerancemonitoring
0 likes · 18 min read
Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies
DevOps Cloud Academy
DevOps Cloud Academy
Dec 2, 2024 · Artificial Intelligence

Key Kubernetes Features that Benefit AI Inference Workloads

This article explains how Kubernetes’ native scalability, resource optimization, performance tuning, portability, and fault‑tolerance features align with the demands of AI inference, helping organizations run large ML models efficiently, cost‑effectively, and reliably across diverse environments.

AI inferenceKubernetesfault tolerance
0 likes · 15 min read
Key Kubernetes Features that Benefit AI Inference Workloads
Sanyou's Java Diary
Sanyou's Java Diary
Nov 25, 2024 · Cloud Native

Designing Resilient Stateful Distributed Systems: From Theory to Microservice Architecture

This article explores the fundamentals of distributed systems, compares stateful and stateless services, examines monolithic, SOA, and microservice models, and provides practical guidance on access layers, fault tolerance, service discovery, scaling, and data storage for building robust cloud‑native architectures.

MicroservicesService Discoverycloud native
0 likes · 29 min read
Designing Resilient Stateful Distributed Systems: From Theory to Microservice Architecture
Zhuanzhuan Tech
Zhuanzhuan Tech
Nov 20, 2024 · Backend Development

Design and Implementation of a High‑Performance Message Notification System

This article presents a comprehensive design of a high‑performance, fault‑tolerant message notification system, covering service partitioning, system architecture, idempotent processing, dynamic error detection, thread‑pool management, retry mechanisms, and stability measures such as traffic‑spike handling, resource isolation, third‑party protection, monitoring, and active‑active deployment.

JavaMessage Notificationbackend architecture
0 likes · 16 min read
Design and Implementation of a High‑Performance Message Notification System
Cognitive Technology Team
Cognitive Technology Team
Nov 15, 2024 · Operations

Building Redundancy in Applications to Avoid Single Points of Failure

The article explains how to design resilient applications by identifying critical paths, adding redundant components, using formulas for overall availability, and applying best‑practice recommendations such as multi‑zone/region deployment, load‑balanced VMs, database replication, and thorough testing of failover mechanisms.

Cloud ArchitectureHigh AvailabilityLoad Balancing
0 likes · 6 min read
Building Redundancy in Applications to Avoid Single Points of Failure
Cognitive Technology Team
Cognitive Technology Team
Nov 14, 2024 · Operations

Designing Self‑Healing Applications for Fault Tolerance in Distributed Systems

To ensure distributed applications can recover automatically from hardware, network, or service failures, this guide outlines three core capabilities—fault detection, graceful handling, and monitoring—plus practical strategies such as asynchronous component separation, retries, circuit breakers, isolation, load shedding, failover, compensation, checkpointing, graceful degradation, rate limiting, leader election, fault injection, chaos engineering, and use of availability zones.

Self-healingcloud nativedistributed systems
0 likes · 7 min read
Designing Self‑Healing Applications for Fault Tolerance in Distributed Systems
360 Zhihui Cloud Developer
360 Zhihui Cloud Developer
Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI infrastructureGPU ClusterKubernetes
0 likes · 22 min read
How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling
IT Services Circle
IT Services Circle
Oct 4, 2024 · Databases

Understanding Redis Split‑Brain: Causes, Data Loss, and Prevention Strategies

This article explains Redis split‑brain behavior, describing its definition, causes such as network failures and Sentinel elections, the resulting data loss during master‑slave switches, and practical prevention measures including quorum configuration, timeout tuning, network monitoring, proxy layers, and the min‑slaves‑to‑write and min‑slaves‑max‑lag settings.

High AvailabilityMaster‑SlaveRedis
0 likes · 7 min read
Understanding Redis Split‑Brain: Causes, Data Loss, and Prevention Strategies
FunTester
FunTester
Sep 19, 2024 · Fundamentals

Software Antifragility: Rethinking Error Handling and Reliability

This paper introduces the concept of software antifragility, drawing on Taleb’s theory to argue that embracing errors through fault tolerance, automatic runtime repair, and fault injection can transform software systems into self‑improving, more robust entities, and discusses implications for development processes and product reliability.

Chaos Engineeringantifragilityfault tolerance
0 likes · 13 min read
Software Antifragility: Rethinking Error Handling and Reliability
OPPO Kernel Craftsman
OPPO Kernel Craftsman
Aug 30, 2024 · Cloud Native

Middleware Containerization and Cloud‑Native Transformation at OPPO

OPPO transformed its sprawling, manually‑provisioned middleware clusters into a cloud‑native, containerized platform by building custom Kubernetes controllers, IP‑preserving StatefulSets, resource‑isolated containers, automated monitoring and self‑healing workflows, enabling rapid provisioning, efficient utilization, fault‑tolerant scaling and future serverless and service‑mesh integration.

ContainerizationKubernetesOperator
0 likes · 20 min read
Middleware Containerization and Cloud‑Native Transformation at OPPO