Tag: Incident

Selected Java Interview Questions
Mar 10, 2025 · Backend Development

Postmortem of a Server Crash Caused by a Mis‑managed Scheduled Task in a Backend Module

The article analyzes a server outage triggered by a module that repeatedly created a scheduled task without proper lifecycle control, examines the problematic Java code, lists four key issues, presents a corrected implementation, and reflects on development, testing, review, and logging practices to prevent similar incidents.
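
As a rough illustration of the lifecycle control the summary refers to, the sketch below keeps one shared `ScheduledExecutorService` and makes `start()` idempotent, so repeated calls cannot pile up duplicate tasks. The class name `ManagedPoller` and its methods are hypothetical and are not taken from the article's corrected implementation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class: one scheduler, at most one live periodic task.
public class ManagedPoller {
    // A single scheduler is created once and reused, never per call.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger runs = new AtomicInteger();
    private ScheduledFuture<?> task; // guarded by "this"

    /** Starts the periodic task; returns false if one is already running. */
    public synchronized boolean start(long periodMillis) {
        if (task != null && !task.isDone()) {
            return false; // refuse to schedule a duplicate
        }
        task = scheduler.scheduleAtFixedRate(
                runs::incrementAndGet, 0, periodMillis, TimeUnit.MILLISECONDS);
        return true;
    }

    /** Cancels the periodic task, if any, without interrupting a run in progress. */
    public synchronized void stop() {
        if (task != null) {
            task.cancel(false);
            task = null;
        }
    }

    /** Stops the task and releases the scheduler thread; call once on module shutdown. */
    public void shutdown() {
        stop();
        scheduler.shutdown();
        try {
            scheduler.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }

    public int runCount() {
        return runs.get();
    }

    public boolean isShutdown() {
        return scheduler.isShutdown();
    }
}
```

In this sketch the owner of the module would call `start` once at initialization and `shutdown` when the module is torn down; a second `start` while the task is live is simply rejected.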

Incident · Java · ScheduledExecutorService
5 min read
Ximalaya Technology Team
Sep 13, 2023 · Operations

Cache Instance Failure Incident Analysis and Root Cause Investigation

During a nighttime outage, an XCache (Codis + Pika) instance hung when a massive write load triggered low-level protection, and Sentinel switched masters. The proxy's accept queue then filled with timed-out sockets, blocking new connections. Scaling out the proxy layer and expanding capacity restored service, and the incident prompted follow-up work on automation, health checks, and queue-overflow alerts.

Cache · Incident · Proxy
7 min read
Big Data Technology Architecture
Jul 14, 2022 · Operations

Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage, caused by a Lua-induced bug that drove CPU usage to 100% in the OpenResty-based SLB. It covers the incident timeline, root-cause analysis, mitigation steps, and the subsequent technical and process improvements made to enhance reliability and multi-active deployment.

Incident · Lua · SLB
16 min read
IT Services Circle
Mar 22, 2022 · Backend Development

Cache Avalanche Incident: Root Cause, Response, and Prevention Strategies

A recent flash-sale failure was traced to a cache avalanche: every item had been cached with the same two-hour expiration, so the entries expired together and the resulting requests flooded the database. The post outlines detection steps, emergency mitigation, and three techniques to prevent recurrence: staggered expiration times, mutex locking, and never-expiring caches.
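
Two of the prevention techniques named in the summary, spreading expiration times out with random jitter and guarding cache rebuilds with a mutex, can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the post's code: the class `AvalancheGuard`, its method names, and the string-only cache are hypothetical, and a production system would apply the same ideas against its real cache and database.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

// Hypothetical example class: jittered TTLs plus a per-key mutex around reloads.
public class AvalancheGuard {
    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Map<String, Object> locks = new ConcurrentHashMap<>();
    private final long baseTtlMillis;
    private final long maxJitterMillis;

    public AvalancheGuard(long baseTtlMillis, long maxJitterMillis) {
        this.baseTtlMillis = baseTtlMillis;
        this.maxJitterMillis = maxJitterMillis;
    }

    /** Returns base plus a random offset in [0, maxJitter], so keys do not expire together. */
    public static long jitteredTtl(long base, long maxJitter) {
        return base + ThreadLocalRandom.current().nextLong(maxJitter + 1);
    }

    /** Returns the cached value, letting only one thread per key call the loader on a miss. */
    public String get(String key, Function<String, String> loader) {
        Entry e = cache.get(key);
        if (e != null && e.expiresAtMillis > System.currentTimeMillis()) {
            return e.value; // fast path: fresh entry
        }
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {
            // Re-check: another thread may have reloaded while we waited for the lock.
            e = cache.get(key);
            long now = System.currentTimeMillis();
            if (e != null && e.expiresAtMillis > now) {
                return e.value;
            }
            String value = loader.apply(key); // the single trip to the backing store
            cache.put(key, new Entry(value, now + jitteredTtl(baseTtlMillis, maxJitterMillis)));
            return value;
        }
    }
}
```

With a two-hour base TTL and, say, ten minutes of jitter, `jitteredTtl(7200, 600)` yields a value between 7200 and 7800 seconds, so entries written at the same moment expire at different times instead of all at once.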

Cache · Database · Incident
4 min read
Java Architect Essentials
Jun 30, 2021 · Operations

Recovering Accidentally Deleted Production Server Data Using ext3grep, extundelete, and MySQL Binlog

After a junior staff member mistakenly ran an unchecked `rm -rf` command that erased an entire production server, the author details a step-by-step recovery using ext3grep, custom shell scripts, extundelete, and MySQL binlog replay, and concludes with lessons on backup, monitoring, and change management.

Backup · Data Recovery · Incident
8 min read