Tagged articles

132 articles

Page 1 of 2

Apr 29, 2026 · Operations

Boosting Oncall Interception from 15% to 55%: KOncall’s AI‑Driven Evolution at Kuaishou

Kuaishou’s R&D efficiency team built the KOncall intelligent on‑call platform, integrating LLM‑based retrieval‑augmented generation, Redis Pub/Sub streaming, OCR multimodal parsing, FAQ knowledge ops, and custom reranking, which raised automated query interception from 15% to 55% and processed over 116,000 requests, turning on‑call from a bottleneck into a starting point for building capabilities.

AI Operations · Incident Management · LLM

0 likes · 26 min read

DevOps Coach

Mar 31, 2026 · Operations

How AI‑Driven Observability Can Cut MTTR: A 12‑Step Investigation Framework

This article explains how modern SRE teams can combine AI‑assisted observability with structured critical thinking to build a 12‑step investigation model that accelerates fault detection, hypothesis generation, telemetry validation, root‑cause analysis, and automated remediation, ultimately reducing MTTR and improving reliability.

AI · Incident Management · Observability

0 likes · 9 min read

Efficient Ops

Mar 11, 2026 · Operations

How an AI‑Powered Terraform Command Erased 2 Million Records – Lessons for Safe Ops

A single Terraform command executed by the AI assistant Claude Code mistakenly destroyed a production database of over two million records, exposing how over‑reliance on AI, missing state files, weak backup practices, and absent deletion protection can cause massive outages and what safeguards can prevent such incidents.

AI Ops · Incident Management · Infrastructure

0 likes · 6 min read

Code Wrench

Dec 16, 2025 · Operations

Rescuing a Failing System: 3‑Step Triage Playbook Every Go Engineer Needs

This article explains how to demonstrate real‑world system‑engineering expertise in Go interviews by mastering incident triage, diagnosing CPU, memory, GC, and goroutine problems, and applying a three‑step "stop‑bleed, diagnose, cure" strategy to keep services alive.

Go · Incident Management · Operations

0 likes · 11 min read

Liangxu Linux

Nov 20, 2025 · Operations

Avoid the ‘Delete‑Database‑and‑Run’ Nightmare: 10 Fatal Ops Pitfalls Revealed

A real 2018 incident where an ops engineer used rm -rf to wipe a production database sparked a deep dive into the high‑risk nature of operations, presenting Gartner statistics, psychological error factors, ten deadly pitfalls with concrete examples, and a comprehensive fault‑tolerance framework to prevent future catastrophes.

DevOps · Incident Management · automation

0 likes · 23 min read

DevOps Coach

Nov 11, 2025 · Cloud Computing

Why the US‑East‑1 AWS Outage Happened and How to Guard Against It

On October 19‑20 a massive AWS failure in the US‑East‑1 region crippled a large portion of the internet, exposing how a faulty internal monitoring tool, DynamoDB’s lack of cross‑region replication, and unchecked retry storms can cascade into a widespread outage, and offering concrete operational lessons for cloud teams.

Cloud Computing · DynamoDB · Incident Management

0 likes · 7 min read

Alibaba Cloud Developer

Oct 9, 2025 · Operations

How AI‑Powered Multi‑Agent Systems Turn Fault Postmortems into Proactive Risk Prevention

This article explains how an AI‑driven multi‑agent platform automates fault postmortem generation, enriches analysis with memory management, prompt engineering, and RAG techniques, and delivers actionable insights for SREs, developers, and non‑technical stakeholders, ultimately shifting incident handling from reactive to proactive.

AI · Incident Management · LLM

0 likes · 44 min read

Programmer DD

Oct 3, 2025 · Operations

How Netflix Turned Incident Management into a Scalable Engineer‑Owned Process

This article explains how Netflix’s engineering teams shifted incident handling from a centralized SRE function to a company‑wide, engineer‑driven practice by selecting the right tooling, standardizing processes, and reshaping culture, enabling rapid, reliable responses for hundreds of millions of viewers.

Incident Management · Netflix · SRE

0 likes · 10 min read

MaGe Linux Operations

Sep 28, 2025 · Operations

What Core Skills Do SRE Engineers Need to Master?

This article outlines the essential technical, incident‑response, reliability‑management, collaboration, and systemic‑thinking abilities that Site Reliability Engineering (SRE) professionals must develop to ensure high‑availability, stable services in modern internet environments.

Collaboration · Incident Management · SRE

0 likes · 5 min read

Ops Community

Sep 16, 2025 · Operations

Mastering SRE: Fast Incident Response and Prevention Strategies

This guide walks SRE engineers through a complete incident lifecycle—preventive multi‑layer monitoring, chaos‑testing drills, rapid 10‑minute response tactics, systematic root‑cause analysis, effective communication roles, post‑mortem reviews, and practical case studies—helping teams minimize downtime and business loss.

Incident Management · Root Cause Analysis · SRE

0 likes · 11 min read

Java Web Project

Jul 26, 2025 · Backend Development

How a Simple Pagination Change Triggered a P0 Outage and What We Learned

A seemingly trivial pagination update in a Java order service caused a P0 outage, leading to a 73‑minute disruption, 156 user complaints, and an estimated 650,000 CNY GMV loss; the post details the root cause, impact analysis, emergency response, and concrete process improvements to prevent recurrence.

Incident Management · Java · Microservices

0 likes · 14 min read

Ops Development & AI Practice

Jul 18, 2025 · Operations

Mastering Modern Software Operations: The Six Essential Steps for Success

Modern software operations have shifted from a post‑launch checklist to an ongoing, automated discipline, and this article outlines the six core phases—requirement planning, CI/CD automation, comprehensive monitoring, incident response, performance tuning, and security compliance—providing concrete examples and practical advice for building a resilient DevOps culture.

DevOps · Incident Management · Monitoring

0 likes · 9 min read

Volcano Engine Developer Services

May 22, 2025 · Artificial Intelligence

How LLMs Can Automate Ticket Escalation: Inside ByteBrain’s TickIt System

This article introduces TickIt, a ByteBrain system that leverages large language models to automatically identify and escalate critical Oncall tickets, detailing its multi‑class escalation, deduplication, and category‑guided fine‑tuning modules, experimental results, and the operational impact on cloud services.

Incident Management · LLM · Oncall analysis

0 likes · 13 min read

Architecture and Beyond

May 10, 2025 · Operations

What Heinrich’s 1:29:300 Rule Reveals About Preventing Online Outages

The article explains Heinrich's Law, its 1:29:300 accident pyramid, and how applying its principles—tracking minor incidents, hidden hazards, and systemic risks—can help software teams anticipate, diagnose, and prevent major online failures through systematic safety management and data‑driven practices.

Heinrich's Law · Incident Management · Operations

0 likes · 15 min read

Liangxu Linux

Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Incident Management · Observability

0 likes · 13 min read

Open Source Linux

Mar 27, 2025 · Operations

10 Critical Server Ops Mistakes to Avoid: Real-World Lessons

This article outlines ten critical server operation mistakes—ranging from forced power cuts to neglecting updates—illustrated with real-world incidents and practical advice, helping engineers adopt safer practices, proper backups, secure configurations, and effective monitoring to prevent costly outages.

Best Practices · Incident Management · server operations

0 likes · 6 min read

Efficient Ops

Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

DevOps · Incident Management · Operations

0 likes · 6 min read

JD Tech Talk

Feb 26, 2025 · Operations

Business Monitoring: Importance, Metric System Design, and Practical Implementation

This article explains the significance of business monitoring, distinguishes technical and business metrics, outlines a step‑by‑step process for building a business metric system, and shares practical experiences, tools, and common pitfalls to help teams improve operational reliability and decision‑making.

Incident Management · Metrics · Operations

0 likes · 13 min read

Alibaba Cloud Observability

Feb 17, 2025 · Cloud Native

Mastering Enterprise Alerting: Build a Robust Cloud‑Native Monitoring System

This article explores how to design and implement a comprehensive, enterprise‑grade alerting system—covering monitoring fundamentals, MTTF/MTTR concepts, multi‑layer metric collection, alert rule best practices, severity levels, notification channels, false‑positive reduction, and real‑world case studies—to ensure reliable cloud‑native operations.

Incident Management · MTTR · Operations

0 likes · 35 min read

JD Cloud Developers

Feb 6, 2025 · Operations

How to Build a Robust Stability Framework: Key Mechanisms for SRE Success

This article outlines a comprehensive stability framework for SRE teams, detailing essential mechanisms such as review processes, coding standards, incident management, on‑call responsibilities, and daily operational practices, while also highlighting the cultural shift needed to achieve reliable, high‑availability systems.

Incident Management · Operations · SRE

0 likes · 11 min read

Efficient Ops

Dec 22, 2024 · Operations

What Caused OpenAI’s Massive Outage? Inside the Kubernetes Failure and Recovery

On December 11, OpenAI suffered a severe outage across ChatGPT, its API, and Sora due to a misconfigured telemetry service that overloaded Kubernetes control planes worldwide, prompting a cascade of failures and a coordinated recovery effort.

Incident Management · Kubernetes · OpenAI

0 likes · 8 min read

dbaplus Community

Sep 13, 2024 · Operations

How Bilibili Built an Emergency Response Center to Slash MTTR and Boost System Stability

This article explains how Bilibili designed and implemented an Emergency Response Center (ERC) to manage the full fault lifecycle—detection, response, delimitation, localization, mitigation and recovery—using alert rules, automated recall, integrated customer feedback, clear role assignments, mobile support, and data‑driven post‑mortems, ultimately reducing MTTR and improving service reliability.

Incident Management · MTTR · Monitoring

0 likes · 23 min read

Ops Development Stories

Aug 21, 2024 · Operations

How Large Language Models Can Transform Ops Fault Handling: A Practical Guide

This article outlines a typical operations incident workflow, identifies four key stages where large language models can assist, discusses implementation challenges, introduces the Ops framework and Copilot design, and shares practical examples and a real‑world case to help engineers adopt AI‑driven fault management.

AI Ops · Incident Management · Operations

0 likes · 19 min read

Cognitive Technology Team

Aug 17, 2024 · Operations

GitHub Outage on August 14, 2024: Causes, Impact, and Recovery

On August 14, 2024, GitHub experienced a massive site-wide outage caused by a database infrastructure configuration change that disrupted traffic routing, leading to loss of database connections and affecting core services such as Pull Requests, Pages, Copilot, and the API, with full restoration confirmed later that evening.

Database · GitHub · Incident Management

0 likes · 2 min read

Bilibili Tech

Aug 16, 2024 · Operations

Design and Implementation of Bilibili's Emergency Response Center for Incident Management

Bilibili’s Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi‑dimensional alerts, automated collaboration, standardized updates, and post‑mortem analysis, targeting 1‑minute detection, 3‑minute response, 5‑minute scoping, and 10‑minute recovery, which has cut MTTR, achieved over 80% automatic recall accuracy, and met more than 60% of its 1‑3‑5‑10 performance goals.

Incident Management · MTTR · Platform Engineering

0 likes · 22 min read

21CTO

Aug 15, 2024 · Operations

Why GitHub’s Massive Outage Happened: Database Infrastructure Rollback Explained

A detailed account of GitHub’s recent worldwide outage reveals that a rollback of database infrastructure changes caused widespread service failures across GitHub.com, Pages, Copilot, and the API, highlighting the challenges of stateful database reliability in large platforms.

GitHub · Incident Management · Operations

0 likes · 4 min read

Continuous Delivery 2.0

Aug 13, 2024 · R&D Management

Is Amazon's COE Process Really Effective? Insights from SDEs

The article examines Amazon's Correction of Errors (COE) process, presenting both supportive and critical SDE perspectives, and discusses whether detailed post‑incident documentation truly improves engineering practices or merely adds bureaucratic overhead.

COE · Engineering Culture · Incident Management

0 likes · 10 min read

Open Source Linux

Jul 24, 2024 · Operations

Linux Emergency Handbook v1.2: Key Updates & New Incident Response Practices

Version 1.2 of the Linux Emergency Handbook introduces critical updates such as SSH key backdoor checks, detailed command timestamp logs, new journalctl log viewing techniques, enhanced password checks, added data USB guidance, and revamped post‑incident stages including routine security checks, loss assessment, and targeted investigations.

Incident Management · Linux · emergency response

0 likes · 3 min read

DevOps Coach

Jun 30, 2024 · Operations

Effective Incident Mitigation and Recovery: Practical SRE Strategies

The article outlines SRE‑based incident mitigation and recovery practices, covering urgent mitigations, impact reduction, key metrics such as TTD, TTR, TBF, and detailed strategies for shortening detection and repair times, preventing fatigue, improving observability, and designing resilient systems.

Incident Management · Operations · Reliability

0 likes · 23 min read

Efficient Ops

May 20, 2024 · Operations

Mastering Incident Blame: Proven Tactics to Navigate Fault Responsibility

This guide outlines practical principles and communication techniques for assigning responsibility during system incidents, helping operations teams stay calm, choose allies wisely, and protect themselves while ensuring effective fault resolution and continuous improvement.

Incident Management · blame assignment · communication tactics

0 likes · 9 min read

Efficient Ops

May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

Availability · Fault Injection · Incident Management

0 likes · 7 min read

ITPUB

May 10, 2024 · Databases

Choosing Low‑Risk Strategies for Critical DBA Outages

When a major operations incident strikes, the safest approach is to prioritize simple, low‑risk actions and accept limited responsibility, as illustrated by real DBA lessons from Oracle RAC failures and a data‑center power‑loss disaster.

DBA · Database · Incident Management

0 likes · 7 min read

Efficient Ops

May 7, 2024 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability

Drawing on twenty years of Google’s SRE experience, this article shares eleven practical lessons—from proportional incident mitigation and pre‑tested recovery mechanisms to canary releases, disaster‑resilience testing, and frequent deployments—aimed at improving reliability and operational efficiency.

Google · Incident Management · SRE

0 likes · 12 min read

DevOps Cloud Academy

Apr 22, 2024 · Cloud Native

Understanding Platform Engineering: Principles, Tools, and Emerging Trends

This article explains how platform engineering formalizes internal processes and tools to give developers a self‑service, automated "golden path," outlines its six core categories—including internal developer portals, infrastructure as code, and incident management—and discusses its growing impact on modern cloud‑native development.

Incident Management · Internal Developer Portal · Platform Engineering

0 likes · 9 min read

Efficient Ops

Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

Incident Management · Monitoring · SRE

0 likes · 14 min read

High Availability Architecture

Jan 9, 2024 · Operations

AIOps Practices for Incident Management at Meituan: From Risk Prevention to Post‑Operation

This article presents Meituan's two‑year exploration of AIOps in incident management, detailing risk‑prevention change detection, real‑time anomaly discovery, automated root‑cause diagnosis, multi‑dimensional KPI analysis, and similar‑event recommendation, while sharing architectural designs, algorithmic techniques, performance results, and future directions.

Anomaly Detection · Incident Management · Machine Learning

0 likes · 24 min read

dbaplus Community

Jan 8, 2024 · Operations

How a Simple Time Adjustment Sparked a Massive Outage: Real Ops Incident Stories

Three real-world operations mishaps are recounted—a mistaken system‑time change that logged out thousands of users, an accidental bulk delete of database accounts, and a failed glibc downgrade that stalled a software release—illustrating the cascading impact of small errors and the urgent remediation steps taken.

Database · Incident Management · Linux

0 likes · 8 min read

Efficient Ops

Jan 2, 2024 · Operations

Is 24/7 On‑Call Ops Really Terrifying? Real Insights from Chinese Ops Professionals

A collection of Zhihu answers reveals how large tech firms use multi‑timezone teams for 24/7 on‑call coverage, while smaller companies rely on rotation, backup, and automation to keep operations manageable, showing that constant availability need not be a nightmare.

24x7 · Incident Management · On-Call

0 likes · 8 min read

Architect

Dec 22, 2023 · Operations

How Tencent Search Built a Multi‑Layered Stability Architecture to Slash MTTD and MTTR

The article details Tencent Search’s end‑to‑end stability engineering practice, covering a ten‑step architecture that combines redundancy, proactive detection, rapid emergency response, automated cut‑over, defensive caching, and continuous drills, and shows how these measures collectively reduced mean‑time‑to‑detect and mean‑time‑to‑recover by an order of magnitude while keeping service availability high.

Incident Management · Observability · Resilience

0 likes · 32 min read

Meituan Technology Team

Dec 21, 2023 · Operations

AIOps for Incident Management: Practices and Insights from Meituan

Meituan’s service‑operations team applies AIOps across prevention, detection, and post‑incident stages—using change‑risk analysis, real‑time graph‑based anomaly detection, similarity‑driven root‑cause diagnosis, and NLP‑powered incident recommendation—to achieve sub‑second detection, high precision, 28% faster fault handling, and plans for intelligent log and change recognition.

Anomaly Detection · Incident Management · Operations

0 likes · 24 min read

Bilibili Tech

Dec 15, 2023 · Operations

Bilibili Alert Monitoring System: Design, Optimization, and Root‑Cause Analysis

Bilibili revamped its alert monitoring platform to meet rapid growth, focusing on effectiveness, timeliness, and coverage; it introduced a closed‑loop design and governance that cut weekly alerts by 90%, built a knowledge‑graph root‑cause system achieving 87.9% accuracy with sub‑minute latency, and integrated AIOps for ongoing refinement.

Alert Monitoring · Bilibili · Incident Management

0 likes · 21 min read

Architect

Dec 13, 2023 · Industry Insights

How Bilibili Engineered a 1.2 B‑Viewer Live Stream for the LoL World Championship

This article details Bilibili's end‑to‑end technical planning, traffic‑estimation models, and concrete optimizations—including hotspot caching, traffic dispersion, long‑connection isolation, and automated fault‑injection—that enabled the S13 League of Legends finals to serve over 1.2 billion viewers with stable, low‑latency streaming.

Incident Management · Observability · Performance Optimization

0 likes · 22 min read

Efficient Ops

Dec 11, 2023 · Operations

How a Simple System‑Time Change Sparked a Massive Outage

A junior ops engineer mistakenly set the production server clock ahead by a year, causing thousands of user accounts to expire, triggering a large‑scale outage, emergency fixes, financial loss, and harsh career consequences, while highlighting the need for proper permission and change management.

Database · Incident Management · permissions

0 likes · 7 min read

Advanced AI Application Practice

Nov 28, 2023 · Operations

Is a Didi Outage a P0‑Level Incident? Understanding Severity Classifications

The article explains the common P0‑to‑PX incident severity hierarchy used in software development, detailing what constitutes a P0 crash versus lower‑level issues, notes that definitions can vary across organizations, and adds a personal perspective on Didi’s service reliability.

Didi · Incident Management · Operations

0 likes · 3 min read

dbaplus Community

Nov 23, 2023 · Operations

How to Cut Alert Noise in Monitoring: Proven Strategies and Code Samples

This article explains why monitoring alert noise harms efficiency, presents metrics such as recall and accuracy, details rule‑based, blacklist/whitelist, ratio‑based, and intelligent noise‑reduction techniques, shares Java code examples, and shows measurable results after applying the governance process.

Alert Noise Reduction · Incident Management · Monitoring

0 likes · 13 min read

FunTester

Nov 21, 2023 · Industry Insights

What Alibaba’s Recent Outages Reveal About Testing and Team Safety

The article examines three major Alibaba service disruptions, analyzes how insufficient testing and a lack of psychological safety among engineers may have contributed to the failures, and suggests ways to improve testing practices and workplace transparency.

Alibaba · Cloud Services · Incident Management

0 likes · 7 min read

Architecture and Beyond

Oct 29, 2023 · Operations

Postmortem of the October 23 Yuque Service Outage: Lessons on Complex Systems and the KISS Principle

The October 23 Yuque outage, caused by a buggy upgrade tool and outdated storage hardware, highlighted the importance of thorough testing, robust disaster‑recovery, high‑availability architecture, clear communication, continuous learning, and applying the KISS principle to simplify complex systems and improve operational stability.

Complex Systems · Incident Management · KISS principle

0 likes · 10 min read

Qunhe Technology Quality Tech

Oct 13, 2023 · Operations

How KuJiaLe Built a Scalable Stability System: Real‑World SRE Lessons

This article shares KuJiaLe's experience tackling stability challenges caused by rapid user growth and system complexity, detailing their organizational, process, cultural, and technical approaches—including goal setting, a stability committee, monitoring, incident response, change control, and regular drills—to achieve measurable improvements in reliability and performance.

DevOps · Incident Management · SRE

0 likes · 20 min read

Architect

Sep 16, 2023 · Operations

Common Production Failures and Their Handling Procedures

This article outlines the most common production failures—including network, server, database, software bugs, security vulnerabilities, storage, configuration errors, and third‑party service issues—and provides detailed steps for detection, investigation, and resolution to ensure system stability and reliability.

Incident Management · Operations · Troubleshooting

0 likes · 28 min read

Tech Architecture Stories

Aug 7, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This article explains the essence, purpose, and step‑by‑step process of fault postmortems—including preparation, root‑cause analysis, improvement actions, and decision making—while covering PDCA and GRIA methodologies, industry examples, MTTR/MTBF metrics, and practical templates for lasting reliability.

GRIA · Incident Management · MTTR

0 likes · 24 min read

Aikesheng Open Source Community

Jul 24, 2023 · Operations

Exploring On‑Call Duty Models and SRE‑Driven Operations Management

This article examines the challenges of traditional on‑call duty systems for operations teams, proposes an SRE‑inspired rotation model that involves developers, defines concrete KPI targets, and describes how automation and chat‑bot tools can streamline incident response and reduce internal friction.

Incident Management · KPI · On-Call

0 likes · 12 min read

MaGe Linux Operations

Jun 30, 2023 · Operations

What Went Wrong When Vipshop Crashed? Lessons on High‑Concurrency Failures

The article examines the March 29 Vipshop data‑center outage that caused losses of over a billion yuan, explains the cooling‑system failure that triggered a 12‑hour P0 incident, discusses its impact on Tencent services, and analyzes why high‑concurrency crashes remain common, offering availability tier insights and mitigation strategies.

Availability · Incident Management · Operations

0 likes · 7 min read

Test Development Learning Exchange

May 25, 2023 · Operations

Online Incident Severity Level Definition Rules

This document defines the online incident severity grading system, outlining fault categories, influencing factors such as business metrics, capital loss, user impact, and public opinion, and presents detailed P0‑P3 grading rules with tables for capital‑based, C‑end, and B‑end user classifications.

Incident Management · fault classification · service reliability

0 likes · 8 min read

Efficient Ops

May 16, 2023 · Operations

How China Mobile Built a Scalable AIOps Platform to Cut Incident Resolution Time

This article shares China Mobile IT Center's four‑year journey of designing, deploying, and refining a centralized AIOps platform that automates anomaly detection, fault diagnosis, and remediation, cutting complaint‑ticket handling time from ten hours to six while scaling to billions of AI model calls per month.

AI · Incident Management · aiops

0 likes · 18 min read

IT Services Circle

May 1, 2023 · Operations

Understanding Internet Incident Levels and Prevention – The March 29 Tencent Outage

The article explains the classification of internet service incidents into four levels based on severity and impact, illustrates each level with the March 29 Tencent outage, and outlines practical prevention measures such as security defenses, backup plans, monitoring, training, and emergency response.

Incident Management · Operations · Tencent

0 likes · 5 min read

DataFunSummit

Apr 15, 2023 · Operations

Observability and Intelligent Alert Management Practices

This presentation outlines the observability ecosystem, the role and value of alerts within it, core functionalities of an intelligent alarm management platform, best‑practice recommendations, and a real‑world case study of deploying a unified observability solution for a large state‑owned investment group.

Alert Management · IT Operations · Incident Management

0 likes · 11 min read

Observability and Intelligent Alert Management Practices

MaGe Linux Operations

Mar 24, 2023 · Operations

Why Most Monitoring Strategies Fail and How the CAR Framework Fixes Them

This article explains why typical monitoring approaches miss the mark, outlines four root causes of persistent incidents, and introduces the CAR framework—Customer, Application, Resource—to build user‑centric observability that reduces noise, restores trust, and improves reliability.

CAR framework · Incident Management · Monitoring

0 likes · 11 min read

Why Most Monitoring Strategies Fail and How the CAR Framework Fixes Them

Architect's Guide

Mar 14, 2023 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

This article outlines a call‑center outage scenario, explains how operators diagnose and resolve the issue, and presents a comprehensive set of fault‑handling methods, monitoring enhancements, and emergency‑plan recommendations aimed at faster recovery and eventual self‑healing of services.

Incident Management · call center · fault-recovery

0 likes · 12 min read

Incident Handling and Fault Recovery Practices for Call Center Systems

DeWu Technology

Feb 8, 2023 · Operations

Container SRE Practices and Incident Management at DeWu

DeWu’s container SRE team combines software‑engineered reliability with routine operations, using defined on‑call roles, SLO/SLA targets, progressive change management, capacity forecasting, four‑metric monitoring, MTTR/MTTF tracking, kernel‑parameter tuning, and namespace‑protected security policies to swiftly resolve incidents such as Redis latency spikes.

Container · Incident Management · Performance Optimization

0 likes · 23 min read

Container SRE Practices and Incident Management at DeWu

HelloTech

Jan 31, 2023 · Operations

Stability Assurance Practices for Large‑Scale Promotional Events

The article outlines a comprehensive stability‑assurance framework for large‑scale promotional events—detailing planning, capacity and pressure‑test rehearsals, strict change‑freeze, internal gray releases, coordinated on‑call response, thorough link and capacity analysis, monitoring, emergency procedures, cross‑team collaboration, external partner coordination, and post‑event review to ensure resilient system performance.

Incident Management · Large-Scale Events · Monitoring

0 likes · 17 min read

Stability Assurance Practices for Large‑Scale Promotional Events

Wukong Talks Architecture

Dec 26, 2022 · Operations

Alibaba Cloud Hong Kong Region Outage Postmortem – December 18, 2022

On December 18, 2022, Alibaba Cloud's Hong Kong Region experienced a severe cooling‑system failure that caused a 14‑hour outage of ECS, OSS, EBS, RDS and other services, prompting extensive emergency procedures, service impact analysis, and a detailed post‑mortem with improvement actions.

Alibaba Cloud · Cloud Computing · Incident Management

0 likes · 14 min read

Alibaba Cloud Hong Kong Region Outage Postmortem – December 18, 2022

Xiaohe Frontend Team

Nov 15, 2022 · Operations

Mastering Incident Postmortems: Turn Failures into Learning Opportunities

This article explains why thorough, blameless incident postmortems are essential, outlines when to initiate them, describes the key components of an effective review, and offers practical steps to transform each outage into a continuous‑improvement opportunity for engineering teams.

Blameless Culture · Incident Management · Root Cause Analysis

0 likes · 6 min read

Mastering Incident Postmortems: Turn Failures into Learning Opportunities

Alibaba Cloud Developer

Sep 14, 2022 · Operations

Mastering System Stability: From Fault Prevention to Emergency Response

This article outlines a comprehensive safety‑production framework that covers pre‑incident fault prevention, incident response, and post‑mortem improvement, detailing design‑for‑failure principles such as redundancy, isolation, idempotence, monitoring, automation, disaster recovery, scaling, rate‑limiting, and continuous testing to ensure reliable, resilient services.

Incident Management · Monitoring · Reliability

0 likes · 16 min read

Mastering System Stability: From Fault Prevention to Emergency Response

DevOps

Aug 15, 2022 · R&D Management

Case Study: Unintended Data Upload Incident and Process Improvement Lessons

This article recounts a real-world incident where a junior engineer mistakenly uploaded production data to a pre‑release environment, analyzes the root causes, outlines concrete process improvements, and highlights broader lessons on risk‑aware development and the importance of holistic business‑logic security.

Incident Management · Process Improvement · R&D Management

0 likes · 8 min read

Case Study: Unintended Data Upload Incident and Process Improvement Lessons

Sanyou's Java Diary

Aug 11, 2022 · Operations

Rapidly Diagnose Production Bugs with Linux Tools, Performance Tricks & Design Patterns

This article guides developers through classifying system‑level and business‑level bugs, using Linux utilities like perf, ps, and vmstat for quick root‑cause analysis, and outlines effective code‑design patterns and architectural strategies—caching, rate‑limiting, and high‑availability—to prevent and resolve production incidents.

Incident Management · Linux performance · backend operations

0 likes · 13 min read

Rapidly Diagnose Production Bugs with Linux Tools, Performance Tricks & Design Patterns

Top Architect

Aug 2, 2022 · Operations

Effective Fault Handling, Monitoring, and Emergency Response for Call‑Center Systems

This article presents a comprehensive guide on diagnosing, monitoring, and quickly resolving call‑center system failures, covering common troubleshooting steps, monitoring enhancements, emergency‑plan design, and intelligent event‑handling techniques to improve operational reliability and response speed.

Incident Management · Operations · emergency response

0 likes · 15 min read

Effective Fault Handling, Monitoring, and Emergency Response for Call‑Center Systems

dbaplus Community

Jul 12, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

In July 2021, CPU usage on Bilibili's OpenResty‑based SLB suddenly hit 100%, causing widespread service outages. The emergency response rebuilt the load‑balancer clusters and traced the fault to a bug in a Lua _gcd function triggered by a weight passed as the string "0", and the incident led to extensive operational and architectural improvements.

Cloud Native · Incident Management · Lua

0 likes · 17 min read

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

Ops Development Stories

Jun 16, 2022 · Operations

How to Streamline Call Center Incident Management: From Rapid Diagnosis to Automated Recovery

This article outlines a comprehensive approach to handling call‑center incidents, covering fault boundary definition, emergency recovery actions, rapid root‑cause localization, enhanced monitoring strategies, clear alerting, proactive automation, and the creation of concise, regularly exercised emergency response plans.

Incident Management · Monitoring · Operations

0 likes · 14 min read

How to Streamline Call Center Incident Management: From Rapid Diagnosis to Automated Recovery

Top Architect

Jun 11, 2022 · Operations

Comprehensive Fault Handling and Emergency Response Guide for Call Center Systems

This guide details a call‑center system fault scenario and provides a step‑by‑step approach for operations teams to identify symptoms, assess impact, implement rapid recovery actions, improve monitoring, and maintain an effective emergency response plan, ensuring faster resolution and long‑term fault self‑healing.

Incident Management · Monitoring · Operations

0 likes · 12 min read

Comprehensive Fault Handling and Emergency Response Guide for Call Center Systems

Architecture Digest

Jun 2, 2022 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

The article outlines a comprehensive approach to diagnosing, responding to, and preventing call‑center system failures by describing typical fault scenarios, step‑by‑step recovery actions, monitoring enhancements, emergency plan components, and continuous improvement strategies for operations teams.

Incident Management · Monitoring · Operations

0 likes · 13 min read

Bilibili Tech

May 20, 2022 · Operations

Bilibili SRE Practices: Stability Operations, Incident Management, and Platform Enablement

Bilibili’s SRE team, confronting rapid growth and complex systems, built a systematic stability operation that includes emergency response, incident handling, on‑call scheduling, and an Event Operations Center platform, using metrics like MTTR, MTTI and AI‑assisted automation to reduce downtime and improve reliability.

Bilibili · Incident Management · Metrics

0 likes · 27 min read

Bilibili SRE Practices: Stability Operations, Incident Management, and Platform Enablement

dbaplus Community

Apr 10, 2022 · Operations

How to Build a Practical SRE Operations Framework for Large‑Scale Systems

This article presents a hands‑on SRE framework covering the full product lifecycle—code development, resource planning, deployment, operational reliability, and decommissioning—derived from real‑world practices at Xiaomi and Sina to help teams manage massive internet services efficiently and cost‑effectively.

Incident Management · Monitoring · Resource Management

0 likes · 16 min read

How to Build a Practical SRE Operations Framework for Large‑Scale Systems

Big Data Technology & Architecture

Apr 6, 2022 · Big Data

Data Quality Issues, Causes, and Practices in Big Data Platforms

This article explains the harms and root causes of data quality problems—such as integrity, latency, accuracy, and consistency issues—then outlines systematic prevention methods, baseline monitoring, and concrete NetEase YouShu platform practices, illustrated with real incidents, code snippets, and tag‑monitoring strategies.

Incident Management · data engineering

0 likes · 10 min read

Data Quality Issues, Causes, and Practices in Big Data Platforms

Open Source Linux

Apr 2, 2022 · Operations

How to Speed Up Call Center Incident Recovery with Proven Ops Strategies

This article walks through a real call‑center outage scenario, outlines systematic fault‑identification steps, practical emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent event‑handling to help operations teams resolve incidents faster and more reliably.

Incident Management · Monitoring · Operations

0 likes · 13 min read

How to Speed Up Call Center Incident Recovery with Proven Ops Strategies

Java Interview Crash Guide

Mar 23, 2022 · Operations

How to Streamline Call Center Incident Management: Proven Steps and Monitoring Strategies

This article walks through a real‑world call‑center outage scenario, outlines practical fault‑handling methods, shows how to improve monitoring and alerting, and presents a comprehensive emergency response plan that helps operations teams resolve incidents faster and prevent future failures.

Incident Management · automation · call center

0 likes · 13 min read

How to Streamline Call Center Incident Management: Proven Steps and Monitoring Strategies

Open Source Linux

Mar 8, 2022 · Operations

Master Kubernetes Troubleshooting: The Three Pillars Every Engineer Needs

This article breaks down Kubernetes troubleshooting into three essential steps—understanding the failure, managing the response, and preventing recurrence—while mapping key monitoring, observability, and incident‑response tools to each phase for reliable cloud‑native operations.

Incident Management · Kubernetes · Observability

0 likes · 8 min read

Master Kubernetes Troubleshooting: The Three Pillars Every Engineer Needs

Java Interview Crash Guide

Mar 1, 2022 · Operations

How to Accelerate Call Center Incident Recovery with Proactive Monitoring

This article outlines a comprehensive approach to handling call‑center system failures, covering rapid fault identification, emergency recovery steps, enhanced monitoring visualisation, and the creation of sustainable, automated incident‑response plans to improve overall operational resilience.

Incident Management · automation · call center

0 likes · 13 min read

How to Accelerate Call Center Incident Recovery with Proactive Monitoring

Efficient Ops

Dec 5, 2021 · Operations

Mastering ITIL Event Management: Strategies for Efficient IT Operations

This article explores the fundamentals of ITIL-based event management, detailing its relationship with ITSM, the challenges of unmanaged services, key processes, priority definitions, and three management models—centralized, self‑managed, and collaborative—to help organizations improve service stability and response efficiency.

ITIL · ITSM · Incident Management

0 likes · 14 min read

Mastering ITIL Event Management: Strategies for Efficient IT Operations

MaGe Linux Operations

Oct 6, 2021 · Operations

How to Accelerate Call Center Incident Resolution with Smart Monitoring and Automation

This article outlines a comprehensive approach to handling call‑center incidents, covering common troubleshooting steps, proactive monitoring enhancements, well‑structured emergency plans, and intelligent event‑driven automation to reduce downtime and improve operational efficiency.

Incident Management · Monitoring · Operations

0 likes · 12 min read

How to Accelerate Call Center Incident Resolution with Smart Monitoring and Automation

TAL Education Technology

Aug 19, 2021 · Operations

Comprehensive SRE Guide for Summer and Winter High‑Load Periods in an Online Education Platform

This document outlines a comprehensive SRE‑driven operational framework for ensuring stable, high‑availability online education services during peak summer and winter periods, detailing pre‑, during‑, and post‑maintenance phases, architectural principles, load testing, monitoring, capacity management, security hardening, chaos engineering, incident response, and post‑mortem practices.

Incident Management · SRE · capacity planning

0 likes · 17 min read

Comprehensive SRE Guide for Summer and Winter High‑Load Periods in an Online Education Platform

ByteDance SE Lab

Jul 30, 2021 · Operations

Inside Salesforce’s Global Outage: What Went Wrong and How to Prevent It

The article examines Salesforce’s five‑hour global outage caused by a shortcut DNS deployment and the subsequent recovery challenges, then explores a viral experiment where twenty smartphones generated artificial traffic congestion, illustrating how real‑time data feeds and operational safeguards can prevent large‑scale service disruptions.

Big Data · Cloud Computing · Incident Management

0 likes · 7 min read

Inside Salesforce’s Global Outage: What Went Wrong and How to Prevent It

DevOps

Jul 28, 2021 · Operations

Improving System Availability: Stages, Influencing Factors, and Practical Measures

This article explains system availability, outlines three stages of incident handling, identifies key factors that degrade availability such as human error, avalanche effects, untested releases and infrastructure failures, and proposes technical and team‑oriented practices to enhance reliability and achieve higher "nines" of uptime.

Incident Management · Operations · Reliability

0 likes · 11 min read

Improving System Availability: Stages, Influencing Factors, and Practical Measures

Efficient Ops

Jul 11, 2021 · Operations

Mastering Incident Management: Principles and Methods for Effective Fault Handling

This guide outlines essential incident management principles—prioritizing business recovery and timely escalation—and presents practical methodologies such as restart, isolation, and downgrade, while detailing user impact handling, organizational roles, and post‑mortem best practices.

Incident Management · Operations · escalation

0 likes · 10 min read

Mastering Incident Management: Principles and Methods for Effective Fault Handling

Full-Stack Internet Architecture

Jun 19, 2021 · Operations

Solving Monitoring Pain Points: Unified Framework, Alert Prioritization, and Classification

The article discusses common monitoring challenges such as fragmented tooling and noisy alerts, and proposes solutions including consolidating to a single monitoring framework, prioritizing runtime exceptions, and classifying business alerts with codes and trace information to improve incident response.

Incident Management · Observability · alerting

0 likes · 6 min read

Solving Monitoring Pain Points: Unified Framework, Alert Prioritization, and Classification

Alibaba Cloud Native

May 24, 2021 · Operations

How to Build a Data‑Driven Stability Assurance System for Kubernetes Clusters

This article presents a systematic, data‑model‑driven approach to Kubernetes stability assurance, detailing the sources of complexity, a four‑diagram and three‑table data model, insight and pre‑plan structures, global visualisation concepts, deployment patterns, operational workflows, and competitive analysis to enable effective, iterative, and sustainable cluster stability management.

Incident Management · Kubernetes · data modeling

0 likes · 15 min read

How to Build a Data‑Driven Stability Assurance System for Kubernetes Clusters

dbaplus Community

May 18, 2021 · Operations

Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation

This guide explains why monitoring is essential throughout a product lifecycle, outlines monitoring modes and methods, compares health checks, logs, tracing and metric solutions, and provides a detailed Prometheus‑based monitoring architecture with concrete metric definitions, alerting rules, and incident‑response procedures.

Incident Management · Metrics · Monitoring

0 likes · 25 min read

Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation

macrozheng

Apr 24, 2021 · Operations

How a Single Code Change Caused Million-Dollar Loss and What It Taught Me About Release Discipline

A routine release introduced a tiny code change that triggered a massive production outage, causing millions in losses; the team’s swift rollback, post‑mortem analysis, and reflections on code discipline, testing, and process compliance highlight essential lessons for reliable backend operations.

Incident Management · code quality · release process

0 likes · 9 min read

How a Single Code Change Caused Million-Dollar Loss and What It Taught Me About Release Discipline

Liangxu Linux

Feb 8, 2021 · Operations

How a Single “return null” Caused Million-Dollar Loss: Lessons in Release Management

A seemingly harmless code change that returned null triggered a massive production outage, costing millions, and the author recounts the incident, the emergency rollback, root‑cause analysis, and the broader lessons about code review, testing, monitoring, and disciplined release practices.

Code Review · Incident Management · Monitoring

0 likes · 7 min read

How a Single “return null” Caused Million-Dollar Loss: Lessons in Release Management

Alibaba Cloud Developer

Jan 27, 2021 · Operations

How to Build Sustainable System Stability: Architecture, Ops, and Team Practices

This article shares practical insights from a technical leader on designing robust system architecture, implementing comprehensive capacity planning, establishing reliable operations processes, strengthening security, and cultivating team awareness to achieve long‑term stability for large‑scale internet services.

Incident Management · Operations · architecture design

0 likes · 24 min read

How to Build Sustainable System Stability: Architecture, Ops, and Team Practices

MaGe Linux Operations

Jan 24, 2021 · Operations

How to Speed Up Call Center Incident Resolution with Proven Ops Strategies

This article walks through a real call‑center outage, outlines why traditional ad‑hoc debugging fails, and presents a structured approach—including symptom identification, rapid root‑cause isolation, enhanced monitoring, concise emergency playbooks, and intelligent automation—to dramatically reduce recovery time and move toward self‑healing operations.

Incident Management · automation · call center

0 likes · 13 min read

How to Speed Up Call Center Incident Resolution with Proven Ops Strategies

Aikesheng Open Source Community

Jan 22, 2021 · Databases

MySQL Data Recovery Using Binlog Analysis and Reverse Binlog Generation

This article details a real‑world MySQL production data loss incident, explains how to identify the relevant binlog range, use binlog2sql and MyFlash to generate reverse SQL or binary logs, and outlines the step‑by‑step recovery process and post‑mortem reflections for operations teams.

Data Recovery · Database operations · Incident Management

0 likes · 12 min read

MySQL Data Recovery Using Binlog Analysis and Reverse Binlog Generation

ITPUB

Oct 9, 2020 · Operations

How to Streamline Call Center Incident Management: Practical Steps and Best Practices

This guide walks through a real‑world call‑center slowdown incident, outlines common fault‑handling techniques, proposes monitoring enhancements, details a comprehensive emergency‑response plan, and introduces intelligent event‑processing concepts to help operations teams resolve outages faster and more reliably.

Incident Management · Monitoring · Operations

0 likes · 15 min read

How to Streamline Call Center Incident Management: Practical Steps and Best Practices

Open Source Linux

Sep 12, 2020 · Operations

Mastering Incident Response: Core Principles and Practical Methods

This guide outlines essential incident‑response principles—prioritizing business restoration and timely escalation—while detailing practical methods such as restart, isolation, and degradation, and explains how to organize response teams and conduct thorough post‑incident reviews.

Incident Management · Restart · degradation

0 likes · 11 min read

Mastering Incident Response: Core Principles and Practical Methods

Efficient Ops

Sep 9, 2020 · Operations

Mastering Incident Management: Core Principles and Practical Methods

This guide outlines essential incident management principles—prioritizing business restoration and timely escalation—followed by detailed methodologies such as restart, isolation, and degradation, and explains role responsibilities, user impact handling, and post‑incident summarization for continuous improvement.

Incident Management · Operations · fault handling

0 likes · 10 min read

Mastering Incident Management: Core Principles and Practical Methods

Efficient Ops

Sep 8, 2020 · Operations

From Firefighting to Arson: Mastering Ops Availability in Three Stages

The article outlines a three‑stage ops maturity model—firefighting, fire prevention, and arson—explains how proactive fault‑injection drills, continuous availability improvements, and aligning technical metrics with business value can transform operations from reactive responders into strategic value creators.

Availability · Fault Injection · Incident Management

0 likes · 8 min read

From Firefighting to Arson: Mastering Ops Availability in Three Stages

Didi Tech

Jun 3, 2020 · Backend Development

Stability Guidelines and Anti‑Patterns for Backend Services

Drawing on five years of incident reviews, the article defines a comprehensive stability framework for backend services—mandating timeout hierarchies, weak dependencies, service-discovery integration, staged gray releases, robust monitoring, capacity planning, and strict change management—while cataloguing common anti-patterns such as over-aggressive circuit breaking, static retries, improper timeouts, tight coupling, and insufficient isolation, and urging regular rehearsal of these practices.

Incident Management · backend stability · deployment best practices

0 likes · 21 min read

Stability Guidelines and Anti‑Patterns for Backend Services

Efficient Ops

Apr 12, 2020 · Operations

Master Incident Management: Definitions, Processes, and Best Practices

This guide explains fault management fundamentals—from ITIL‑based definitions and why it matters, to fault level classification, monitoring, emergency response, recovery, post‑mortem analysis, continuous improvement, and practical advice for practitioners—providing a comprehensive, actionable framework for reliable operations.

Continuous Improvement · ITIL · Incident Management

0 likes · 11 min read

Master Incident Management: Definitions, Processes, and Best Practices

Continuous Delivery 2.0

Mar 6, 2020 · Operations

Google Incident Postmortem Checklist

The article presents a detailed Google‑derived post‑mortem checklist covering event data collection, root‑cause analysis, lessons learned, actionable improvement items, and review procedures to ensure systematic, non‑blame‑focused incident handling.

Incident Management · Operations · Root Cause Analysis

0 likes · 5 min read

G7 EasyFlow Tech Circle

Dec 27, 2019 · Operations

Mastering Incident Reviews: The Three Golden Questions for Real Improvement

This article explains how focusing on three key questions during incident post‑mortems, balancing business speed with system stability, and establishing clear SLOs can turn failures into actionable improvements and better fault‑tolerance strategies.

Incident Management · Operations · SLO

0 likes · 8 min read

Mastering Incident Reviews: The Three Golden Questions for Real Improvement

360 Tech Engineering

Oct 31, 2019 · Operations

AIOps Implementation Practice at 360: Architecture, Models, and Automation

The article details 360's AIOps deployment, covering external speaker insights, internal architecture, data collection pipelines, AI models for resource recycling, alarm reduction, and correlation, as well as visualization dashboards, labeling platforms, and self‑healing mechanisms, illustrating a comprehensive AI‑driven operations framework.

AI Monitoring · Incident Management · Operations Automation

0 likes · 14 min read

AIOps Implementation Practice at 360: Architecture, Models, and Automation