Tag: disaster recovery

Bilibili Tech
Apr 22, 2025 · Operations

Client‑Side DCDN Disaster‑Recovery Drills and Automated Testing at Bilibili

Bilibili performed client-side DCDN disaster-recovery drills using a self-built HTTPDNS to simulate DNS, CDN, and SSL faults; automated scripts across Android, iOS, and Web injected errors, measured rendering latency, validated immediate downgrade to commercial services, refined fallback strategies, and demonstrated near-zero user impact during a real network incident.

Bilibili · DCDN · HttpDNS
0 likes · 13 min read
Alibaba Cloud Infrastructure
Jan 10, 2025 · Cloud Native

Service-Level Disaster Recovery with Alibaba Cloud Service Mesh (ASM) across Multi-Cluster and Multi-Region Deployments

This guide explains how to handle service‑level failures in Kubernetes by using Alibaba Cloud Service Mesh (ASM) to automatically detect faults, shift traffic based on geographic priority, and implement various multi‑cluster, multi‑region, and multi‑cloud topologies for high availability.

ASM · Kubernetes · Service Mesh
0 likes · 31 min read
Alibaba Cloud Infrastructure
Jan 8, 2025 · Cloud Native

Designing AZ‑Level Disaster Recovery with Alibaba Cloud ACK and Service Mesh ASM

This guide explains how to achieve zone‑level disaster recovery on Alibaba Cloud by deploying multi‑AZ ACK clusters, configuring Service Mesh ASM for observability and traffic shifting, and using Prometheus‑based metrics and alerts to detect and isolate failures, including step‑by‑step instructions and sample YAML manifests.

Alibaba Cloud · Kubernetes · Prometheus
0 likes · 24 min read
Alibaba Cloud Infrastructure
Jan 6, 2025 · Cloud Native

Regional Disaster Recovery Architecture Using ASM Service Mesh and GTM

This guide explains how to design and implement a multi‑region disaster‑recovery solution on Alibaba Cloud by deploying identical Kubernetes clusters, configuring ASM ingress gateways with Global Traffic Manager (GTM) for automatic failover, enabling intra‑cluster traffic retention, and validating the setup with load‑testing tools.

GTM · Kubernetes · Service Mesh
0 likes · 15 min read
Efficient Ops
Jan 1, 2025 · Operations

What 2024’s Biggest Outages Teach Us About Building Resilient Systems

Reviewing 2024's major service disruptions, from Alibaba Cloud to OpenAI, this article extracts key SRE lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning, urging enterprises to adopt resilient practices for a more stable future.

Outage Management · Reliability Engineering · SRE
0 likes · 6 min read
Yang Money Pot Technology Team
Dec 26, 2024 · Frontend Development

Design and Implementation of a Multi‑CDN Disaster Recovery Mechanism for Frontend Resource Loading

This article presents a comprehensive multi‑CDN disaster‑recovery solution for frontend static resources, detailing the background, current issues, goals, SDK‑based architecture, monitoring and retry strategies, data‑reporting mechanisms, evaluation results, and future dynamic scheduling improvements.

Monitoring · Retry · cdn
0 likes · 12 min read
Bilibili Tech
Nov 19, 2024 · Operations

Building a Lightweight Disaster‑Recovery Drill System at Bilibili: Architecture, Practices, and Lessons

Bilibili’s infrastructure team created a lightweight, multi‑layered disaster‑recovery drill platform—combining an atomic fault library, scenario catalogs, chaos‑experiment orchestration, real‑time observation, and a product‑level interface—backed by standardized governance and CI‑integrated automation, cutting drill preparation from weeks to days and boosting weekly resilience testing across the organization.

Automation · Chaos Engineering · High Availability
0 likes · 39 min read
Alibaba Cloud Infrastructure
Nov 18, 2024 · Cloud Native

Alibaba Cloud ACK Backup Center: Kubernetes Disaster Recovery and Migration with Resource Adjustment Strategies

This article explains how Alibaba Cloud ACK Backup Center simplifies Kubernetes disaster recovery and cross‑cluster migration by offering automated resource‑adjustment policies, detailed backup and restore workflows, and a step‑by‑step best‑practice example for migrating a stateful application with custom YAML configurations.

Ack · Backup · Kubernetes
0 likes · 10 min read
Efficient Ops
Nov 14, 2024 · Operations

Why Alipay Crashed: Lessons on Backup and Disaster Recovery

The recent Alipay outage during Double‑11 revealed a partial failure in its system message database, which left users facing payment errors, duplicate charges, and delayed withdrawals. The company’s response highlighted the importance of comprehensive backup, redundancy, disaster‑recovery planning, monitoring, and security measures for ensuring service continuity.

Alipay · Backup · SRE
0 likes · 10 min read
ByteDance Cloud Native
Nov 8, 2024 · Databases

Designing Reliable Cross-Cloud Database Disaster Recovery with Volcano Engine

This article explains how to design and implement cross-cloud database disaster recovery, covering background goals, common challenges, step-by-step migration stages, the role of Volcano Engine’s Database Transmission Service, cold-hot separation, HTAP analysis, and practical business value with real-world examples.

Cloud Computing · DTS · High Availability
0 likes · 12 min read
JD Tech
Aug 22, 2024 · Backend Development

Designing a Disaster‑Recovery Data Backup System for JD’s LBS C‑End SOA Service

This article explores the design and implementation of a disaster‑recovery data‑backup architecture for JD’s LBS C‑end SOA service, covering backup strategies, cost‑reduction techniques, grid‑based indexing with H3, client‑side caching, diff verification, and deployment considerations to balance reliability, performance, and expense.

LBS · SOA · backend
0 likes · 18 min read
IT Services Circle
Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented user login, playlist loading, and song search. Recovery took about two hours and was followed by brief free‑membership compensation. The incident highlights the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

NetEase Cloud Music · disaster recovery · gray release
0 likes · 6 min read
JD Retail Technology
Aug 21, 2024 · Operations

Designing a Disaster Recovery and Data Backup System for JD 秒送 (Instant Delivery) LBS C‑End SOA Services

This article explores the design of a disaster‑recovery framework for JD’s 秒送 (instant delivery) LBS C‑end SOA services, detailing data‑backup strategies, cost‑reduction techniques, grid‑based caching using H3, diff validation, client‑side caching, and deployment modules to balance reliability, performance, and expense.

Backend Services · LBS · cost-optimization
0 likes · 18 min read
JD Tech Talk
Aug 16, 2024 · Operations

Designing Cost‑Effective Disaster Recovery Data Backup for LBS‑Based SOA Services

This article details a comprehensive disaster‑recovery strategy for LBS‑driven SOA services, covering challenges of massive POI data backup, cost‑reduction via grid indexing (H3), selective caching, compression, diff validation, client‑side fallback, and deployment processes to achieve reliable, low‑cost data availability.

LBS · cost-optimization · data backup
0 likes · 19 min read
Efficient Ops
Jul 7, 2024 · Operations

Boost Business Continuity and IT System Stability: Practical Strategies

This article explains business continuity concepts, outlines the risks to IT system stability, and provides actionable steps—such as expanding monitoring coverage, improving fault detection, enhancing architecture resilience, and strengthening emergency coordination—to ensure continuous operation despite inevitable failures.

Business Continuity · IT Operations · Monitoring
0 likes · 7 min read
Architecture and Beyond
Jun 1, 2024 · Operations

Comprehensive Guide to Data Backup and Disaster Recovery Strategies

This article examines real-world backup failures, explains why backups are essential, outlines what data and system components should be backed up, describes backup principles, classifications, technologies, and disaster recovery planning, and offers practical guidance for building robust, multi-layered backup strategies.

Backup · IT Operations · cloud backup
0 likes · 13 min read
iQIYI Technical Product Team
May 24, 2024 · Operations

High Availability and Disaster Recovery Practices of iQIYI's Video Relay Service (VRS)

iQIYI’s Video Relay Service ensures uninterrupted video playback by employing a two‑region, three‑center hybrid cloud architecture, multi‑layer storage, cross‑AZ retry mechanisms, protective rate‑limiting and degradation paths, layered monitoring, and rigorous stress‑testing and chaos engineering to achieve high availability and disaster recovery.

High Availability · Monitoring · Video Streaming
0 likes · 18 min read
Baidu Geek Talk
Mar 4, 2024 · Databases

Bank Core System Transformation and GaiaDB-X Distributed Database Solutions for Financial Scenarios

To meet exploding transaction volumes, rapid innovation cycles, and strict regulatory demands, large banks are replacing mainframe core systems with distributed, horizontally‑scalable architectures, and Baidu’s GaiaDB‑X database—offering strong ACID consistency, zero‑RPO disaster recovery, and automated operations—has successfully powered core banking migrations for institutions such as Bank of China and state‑owned banks.

Distributed Database · GaiaDB-X · TSO consistency
0 likes · 26 min read
Efficient Ops
Feb 1, 2024 · Operations

How Tencent’s Public Gateway Overcomes Extreme Availability Challenges

The article details the architecture of Tencent's Public Gateway (TGW), including its forwarding and control planes, and presents two real‑world extreme failure cases (a NIC batch bug and a special IPv6 packet that caused core dumps), along with the multi‑level disaster‑recovery design and mitigation strategies employed to ensure high availability.

Tencent Cloud · availability · disaster recovery
0 likes · 8 min read
Tencent Cloud Developer
Nov 30, 2023 · Cloud Computing

X's Cloud Cost Reduction and the Shift Toward On‑Premises: Implications for Cloud Computing Trends

X (formerly Twitter) cut monthly cloud spending by 60% by shifting workloads and storage to on‑premises infrastructure, igniting a debate over whether de‑clouding is viable for all enterprises and whether it signals an inflection point in cloud computing. The upcoming TVP Tech Sleepless Nights series, featuring leading industry experts, examines which strategies should guide firms in balancing high availability, disaster recovery, and cost efficiency.

Cloud Computing · Enterprise IT · High Availability
0 likes · 7 min read