Tag

incident management

1 view collected around this technical thread.

Efficient Ops
Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

DevOps · SRE · incident management
0 likes · 6 min read
Efficient Ops
Mar 18, 2025 · Operations

Is 24/7 On‑Call a Nightmare? Real Ops Insights from Zhihu Discussions

This article compiles diverse Zhihu comments on the reality of 24 × 7 on‑call duties, contrasting exaggerated myths with practical team‑based solutions, global shift models, backup strategies, and actionable tips for improving operations without sacrificing personal life.

SRE · automation · incident management
0 likes · 7 min read
Efficient Ops
Mar 16, 2025 · Operations

Ops Jargon Cheat Sheet: Decode the Most Common SRE Slang

A comprehensive guide to the most frequently used operations slang, covering incident severity codes, deployment terms, monitoring alerts, system maintenance phrases, and self‑deprecating jokes that every seasoned SRE should understand.

DevOps · incident management · jargon
0 likes · 5 min read
Efficient Ops
Mar 4, 2025 · Operations

Mastering SRE: How to Define SLIs, SLOs, and Build Reliable Cloud‑Native Systems

This article explains how SRE teams should collaboratively define Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs), and then covers reliability, performance, observability signals, error budgeting, risk management, incident handling, and the engineering work needed to build robust cloud-native platforms.

Error Budget · Observability · SLI
0 likes · 13 min read
JD Tech Talk
Feb 26, 2025 · Operations

Business Monitoring: Importance, Metric System Design, and Practical Implementation

This article explains the significance of business monitoring, distinguishes technical and business metrics, outlines a step‑by‑step process for building a business metric system, and shares practical experiences, tools, and common pitfalls to help teams improve operational reliability and decision‑making.

Business Monitoring · incident management · metrics
0 likes · 13 min read
JD Tech Talk
Feb 6, 2025 · Operations

Stability Assurance Mechanisms and Practices for Site Reliability Engineering (SRE)

This article outlines comprehensive stability assurance mechanisms—including standards, process workflows, the distinction between developers and SREs, personal responsibilities, and practical construction directions—to guide teams in building resilient, high‑availability systems through proactive, daily, and incident‑response practices.

Reliability Engineering · SRE · incident management
0 likes · 10 min read
Efficient Ops
Jan 14, 2025 · Operations

What Ops Professionals Learn from Real-World Incident Stories

This article compiles real‑world operations incidents—from accidental database deletions and faulty deployments to hidden data tampering and network device failures—highlighting how quick diagnosis, preventive maintenance, and SRE practices can mitigate impact on users, reputation, and revenue.

DevOps · SRE · case studies
0 likes · 6 min read
Efficient Ops
Dec 22, 2024 · Operations

What Caused OpenAI’s Massive Outage? Inside the Kubernetes Failure and Recovery

On December 11, OpenAI suffered a severe outage across ChatGPT, its API, and Sora due to a misconfigured telemetry service that overloaded Kubernetes control planes worldwide, prompting a cascade of failures and a coordinated recovery effort.

Kubernetes · OpenAI · cloud operations
0 likes · 8 min read
Cognitive Technology Team
Aug 17, 2024 · Operations

GitHub Outage on August 14, 2024: Causes, Impact, and Recovery

On August 14, 2024, GitHub experienced a massive site-wide outage caused by a database infrastructure configuration change that disrupted traffic routing, leading to loss of database connections and affecting core services such as Pull Requests, Pages, Copilot, and the API, with full restoration confirmed later that evening.

Database · GitHub · incident management
0 likes · 2 min read
Bilibili Tech
Aug 16, 2024 · Operations

Design and Implementation of Bilibili's Emergency Response Center for Incident Management

Bilibili's Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi-dimensional alerts, automated collaboration, standardized updates, and post-mortem analysis. It targets 1-minute detection, 3-minute response, 5-minute scoping, and 10-minute recovery (the 1-3-5-10 goals). According to the article, it has cut MTTR, achieved over 80% automatic recall accuracy, and met more than 60% of its 1-3-5-10 performance goals.

MTTR · SRE · automation
0 likes · 22 min read
Continuous Delivery 2.0
Aug 13, 2024 · R&D Management

Is Amazon's COE Process Really Effective? Insights from SDEs

The article examines Amazon's Correction of Errors (COE) process, presenting both supportive and critical SDE perspectives, and discusses whether detailed post‑incident documentation truly improves engineering practices or merely adds bureaucratic overhead.

CoE · Engineering Culture · SDE
0 likes · 10 min read
Continuous Delivery 2.0
Jul 1, 2024 · Artificial Intelligence

How Meta Uses Llama2 to Accelerate Incident Response and Root‑Cause Analysis in AIOps

This article explains how Meta applies AI, specifically a fine‑tuned Llama2 model, to improve AIOps by automating incident monitoring, providing real‑time summaries, assisting responders with contextual information, and efficiently narrowing down root‑cause changes, ultimately reducing incident resolution time from hours to minutes.

AI · AIOps · Llama2
0 likes · 13 min read
Efficient Ops
May 20, 2024 · Operations

Mastering Incident Blame: Proven Tactics to Navigate Fault Responsibility

This guide outlines practical principles and communication techniques for assigning responsibility during system incidents, helping operations teams stay calm, choose allies wisely, and protect themselves while ensuring effective fault resolution and continuous improvement.

DevOps · blame assignment · communication tactics
0 likes · 9 min read
Efficient Ops
May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

Fault Injection · SRE · availability
0 likes · 7 min read
Efficient Ops
May 7, 2024 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability

Drawing on twenty years of Google’s SRE experience, this article shares eleven practical lessons—from proportional incident mitigation and pre‑tested recovery mechanisms to canary releases, disaster‑resilience testing, and frequent deployments—aimed at improving reliability and operational efficiency.

Google · SRE · incident management
0 likes · 12 min read
DevOps Cloud Academy
Apr 22, 2024 · Cloud Native

Understanding Platform Engineering: Principles, Tools, and Emerging Trends

This article explains how platform engineering formalizes internal processes and tools to give developers a self‑service, automated "golden path," outlines its six core categories—including internal developer portals, infrastructure as code, and incident management—and discusses its growing impact on modern cloud‑native development.

DevOps · Infrastructure as Code · cloud native
0 likes · 9 min read
Cognitive Technology Team
Apr 15, 2024 · Operations

Tencent Cloud Service Outage on April 8: Root Cause, Impact, and Improvement Measures

On April 8, Tencent Cloud experienced a major service outage caused by a cloud API failure that prevented console login and disrupted several public cloud services for 87 minutes, prompting a detailed post‑mortem that outlines the root cause, impact, and a series of operational and change‑management improvements.

Change Management · Tencent Cloud · cloud API
0 likes · 4 min read
Efficient Ops
Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

SRE · fault handling · incident management
0 likes · 14 min read
High Availability Architecture
Jan 9, 2024 · Operations

AIOps Practices for Incident Management at Meituan: From Risk Prevention to Post‑Operation

This article presents Meituan's two‑year exploration of AIOps in incident management, detailing risk‑prevention change detection, real‑time anomaly discovery, automated root‑cause diagnosis, multi‑dimensional KPI analysis, and similar‑event recommendation, while sharing architectural designs, algorithmic techniques, performance results, and future directions.

AIOps · Anomaly Detection · NLP
0 likes · 24 min read
Efficient Ops
Jan 2, 2024 · Operations

Is 24/7 On‑Call Ops Really Terrifying? Real Insights from Chinese Ops Professionals

A collection of Zhihu answers reveals how large tech firms use multi‑timezone teams for 24/7 on‑call coverage, while smaller companies rely on rotation, backup, and automation to keep operations manageable, showing that constant availability need not be a nightmare.

24x7 · automation · incident management
0 likes · 8 min read