Tag

Incident Analysis


DevOps Operations Practice
May 20, 2024 · Cloud Computing

Google Cloud Data Deletion Incident at UniSuper: Causes, Impact, and Lessons Learned

Google Cloud mistakenly deleted the data and backups of Australian pension fund UniSuper, leaving more than 600,000 members without access for over a week. The incident highlights the risks of relying on a single provider, the importance of robust backup strategies, and the growing relevance of hybrid and multi-cloud architectures.

Backup · Cloud Computing · Data Loss
5 min read
Code Ape Tech Column
Dec 4, 2023 · Cloud Native

Analysis of Didi’s Kubernetes Outage and General Mitigation Strategies

The article reviews Didi’s 12‑hour P0 outage caused by a Kubernetes upgrade failure in a massive cluster, discusses the root causes, and proposes general solutions such as federation, careful upgrade planning, and multi‑master designs to avoid similar incidents.

Cloud Native · Cluster Scaling · Incident Analysis
8 min read
Java Captain
Nov 30, 2023 · Operations

Analysis of Didi's November 2023 System Outage and Potential Technical Causes

The article reviews Didi's late-November 2023 service disruption, detailing the timeline of failures and the official apologies, and surveys expert analyses of six possible technical causes (software bugs, server issues, third-party failures, DDoS, other attacks, and ransomware), while highlighting the role of a Kubernetes upgrade and cost-cutting pressures.

Cloud Native · Didi · Incident Analysis
7 min read
Efficient Ops
Feb 10, 2022 · Operations

Why Did a Metaspace Misconfiguration Crash Our Elastic Cloud Service?

A production incident on an elastic-cloud deployment revealed that capping the JVM Metaspace at 64 MiB, while the application needed around 76 MiB, triggered continuous Full GC, causing stop-the-world pauses, service-wide timeouts, and a costly rollback.

Elastic Cloud · GC · Incident Analysis
9 min read
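A mismatch like the 64 MiB limit versus ~76 MiB of actual need above can be caught at runtime with the JDK's standard `MemoryPoolMXBean`. A minimal sketch (the class name and output format are my own, not from the article):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceCheck {
    /** Returns the Metaspace pool's max size in MiB, or -1 when no limit is set. */
    static long metaspaceLimitMiB() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().equals("Metaspace")) {
                long max = pool.getUsage().getMax();   // -1 means unlimited
                return max < 0 ? -1 : max / (1024 * 1024);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long usedMiB = ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(p -> p.getName().equals("Metaspace"))
                .mapToLong(p -> p.getUsage().getUsed() / (1024 * 1024))
                .findFirst().orElse(-1);
        long limit = metaspaceLimitMiB();
        System.out.println("Metaspace used: " + usedMiB + " MiB, limit: "
                + (limit < 0 ? "unlimited" : limit + " MiB"));
        // A limit within ~20% of current usage is a Full-GC storm waiting to happen.
    }
}
```

Comparing used against max in a startup health check would have flagged the 64 MiB cap before the rollout completed.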
Efficient Ops
Sep 23, 2021 · Operations

Why Did Our New Deployment Crash? Uncovering Metaspace‑Induced Full‑GC

The article recounts a staged rollout of the Maybach service on elastic cloud, details the timeline of successful and failing deployments, analyzes JVM metrics revealing excessive Metaspace usage that triggered continuous full garbage collections, and explains how this caused system‑wide timeouts and a half‑hour outage.

Full GC · Incident Analysis · JVM
10 min read
Byte Quality Assurance Team
Dec 16, 2020 · Backend Development

Live Streaming Service Overload Incident Caused by Self-Referencing Push Configuration

A sudden surge in live‑stream traffic overloaded the core streaming service because a push configuration mistakenly pointed to the same stream URL, creating a self‑referencing loop that repeatedly generated duplicate streams until the service capacity was exhausted.

Incident Analysis · Live Streaming · Backend Bug
4 min read
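The loop described above — a push (relay) destination resolving to its own source — is cheap to reject before the relay starts. A minimal sketch under my own naming (the article does not describe Didi-style internals; `isSelfReferencing` and the RTMP URLs are illustrative):

```java
import java.net.URI;

public class PushConfigValidator {
    /**
     * Returns true when relaying srcUrl to dstUrl would point the stream
     * back at itself — the self-referencing loop that exhausted capacity.
     */
    static boolean isSelfReferencing(String srcUrl, String dstUrl) {
        URI src = URI.create(srcUrl).normalize();
        URI dst = URI.create(dstUrl).normalize();
        return src.equals(dst);   // same scheme, host, and path => loop
    }

    public static void main(String[] args) {
        System.out.println(isSelfReferencing(
                "rtmp://origin/live/room1", "rtmp://origin/live/room1")); // loop
        System.out.println(isSelfReferencing(
                "rtmp://origin/live/room1", "rtmp://edge/live/room1"));   // safe
    }
}
```

A stricter variant would also resolve DNS aliases before comparing, since a loop can hide behind two hostnames for the same endpoint.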
Baidu Intelligent Testing
Apr 5, 2016 · Operations

Hot Reload: Common Pitfalls and How to Avoid Them

This article examines the hidden risks of hot‑reload mechanisms in web services, illustrates real incidents caused by careless configuration updates, analyzes root causes, and offers practical steps for detecting and fixing such pitfalls to improve operational reliability.

Configuration Management · Incident Analysis · Risk Mitigation
7 min read
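One common fix for the hot-reload pitfalls this article covers is validate-then-swap: fully check the candidate configuration, then publish it atomically so readers never observe a half-applied or invalid state. A minimal sketch (the `Config` fields and class names are my own assumptions, not the article's):

```java
import java.util.concurrent.atomic.AtomicReference;

public class SafeReload {
    record Config(int maxConns, String upstream) {}

    private static final AtomicReference<Config> active =
            new AtomicReference<>(new Config(100, "http://backend"));

    /** Validate the candidate fully, then swap atomically; on failure keep the old config. */
    static boolean reload(Config candidate) {
        if (candidate == null
                || candidate.maxConns() <= 0
                || candidate.upstream() == null || candidate.upstream().isBlank()) {
            return false;              // reject: the last known-good config stays active
        }
        active.set(candidate);         // readers see old or new, never a mix
        return true;
    }

    static Config activeConfig() { return active.get(); }
}
```

The key property is that a bad reload is a no-op rather than an outage: the old configuration remains in force until a fully valid replacement exists.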