Tag

Incident Response

1 view collected around this technical thread.

Cognitive Technology Team
Jun 17, 2025 · Cloud Computing

What a Single NullPointerException Taught Us About Cloud Reliability

The June 2025 Google Cloud outage, caused by an untested code change that triggered a NullPointerException, crippled over 70 core services worldwide, prompting a rapid technical fix, public apology, and industry‑wide reflections on cloud stability, fault tolerance, and deployment practices.

Incident Response · cloud outage · google cloud
0 likes · 7 min read

Efficient Ops
Jun 9, 2025 · Operations

How OnCall Platforms Transform Incident Management and Reduce Manual Overhead

This article explains the purpose and key features of OnCall platforms, compares popular solutions like PagerDuty, Opsgenie, Grafana OnCall and Alibaba Cloud ARMS, clarifies webhooks with a simple analogy, and summarizes how centralized on‑call management boosts operational efficiency while minimizing manual intervention.

Incident Response · Monitoring · Oncall
0 likes · 5 min read

Efficient Ops
May 20, 2025 · Information Security

How an Overseas Hacker Group Disrupted a Guangzhou Tech Company's Services

A coordinated overseas cyber‑attack breached a Guangzhou tech firm's self‑service equipment backend, causing hours of service outage, data leakage, and significant losses, prompting swift police investigation, evidence preservation, and a detailed technical analysis of the attackers' methods.

Incident Response · Information Security · china
0 likes · 4 min read

360 Zhihui Cloud Developer
Feb 27, 2025 · Operations

How 360’s Unified Alert Service Boosts System Reliability and Cuts MTTR

This article explains the importance, pain points, architecture, core capabilities, and future roadmap of the 360 Zhihui Cloud "Yunzhou" unified alert service, showing how it improves observability, reduces alert noise, and accelerates incident response for modern cloud‑native systems.

Alerting · Cloud Native · Incident Response
0 likes · 14 min read

Efficient Ops
Feb 20, 2025 · Information Security

How a Maintenance Staff Leak Exposed Security Gaps and How to Prevent It

A recent case where a maintenance worker exploited device‑management flaws to steal confidential files for foreign spies highlights the need for heightened vigilance, strict self‑discipline, and prompt reporting, offering practical steps to safeguard against similar security breaches.

Incident Response · Information Security · data leakage
0 likes · 4 min read

DataFunSummit
Feb 13, 2025 · Information Security

Building and Optimizing a Comprehensive Security System: Practices, Innovations, and Future Outlook

This article presents a detailed walkthrough of constructing a robust security architecture, covering single‑person security team strategies, risk perception and quantification, rapid incident response, automated detection, precise strike mechanisms, deterrence tactics, and forward‑looking plans for intelligent, data‑driven risk management.

Automation · Incident Response · Security
0 likes · 21 min read

Raymond Ops
Dec 26, 2024 · Information Security

How to Detect and Recover from a Linux Server Intrusion: A Step‑by‑Step Guide

This article details a real‑world Linux server breach, describing the symptoms, investigative commands, log analysis, malicious script removal, file attribute unlocking, and practical remediation steps, while highlighting key lessons and preventive measures for future security.

Incident Response · Intrusion Detection · Linux
0 likes · 16 min read

DevOps Operations Practice
Dec 8, 2024 · Information Security

Incident Report: Investigating and Removing a Server Malware Causing 100% CPU Usage

This article documents a step‑by‑step investigation of a compromised Linux server that exhibited 100% CPU usage, detailing process, network, and startup‑service analysis, the discovery of cryptomining malware, and the complete removal procedure.

Incident Response · Information Security · Linux
0 likes · 5 min read

Efficient Ops
Nov 12, 2024 · Operations

How to Build Robust Online Stability: Practices, Metrics, and Team Strategies

This article outlines a comprehensive approach to online stability, covering preventive measures, service governance, capacity planning, incident detection, multi‑dimensional monitoring, alerting, R&D efficiency improvements, team building, and practical guidelines for simplifying, standardizing, automating, and scaling stability initiatives across an organization.

Automation · Incident Response · Monitoring
0 likes · 15 min read

Ops Development Stories
Nov 8, 2024 · Operations

Building a Simple Cloud‑Native Alert Platform: Features, Architecture & Roadmap

This article describes the design and implementation of a lightweight cloud‑native alert platform, outlining its core features, future enhancements, system architecture, and demo screenshots, offering practical insights for SREs and operations teams handling growing monitoring workloads.

Cloud Native · Incident Response · Monitoring
0 likes · 6 min read

Java Architect Essentials
Oct 7, 2024 · Information Security

Insider Ransomware Attack by a Former Engineer: Case Study and Security Lessons

A disgruntled former infrastructure engineer at a U.S. industrial firm deleted backups, locked administrators, and demanded $750,000 in Bitcoin, leading to his arrest and highlighting the severe risks, legal consequences, and mitigation strategies associated with insider ransomware threats.

IT governance · Incident Response · Information Security
0 likes · 10 min read

Efficient Ops
Aug 20, 2024 · Information Security

Building a One‑Person PCI DSS Image‑Signing Service and Surviving a P0 Outage

This article recounts how a solo developer built a Django‑based Docker image signing service to meet PCI DSS requirements, faced two severe incidents—including a 17.5‑hour P0 outage caused by concurrency limits and a misconfigured Rekor service—and shares the operational lessons learned for reliable SRE practice.

Django · Incident Response · PCI DSS
0 likes · 9 min read

JD Tech
Jul 8, 2024 · Operations

System Stability Practices: From Development to Production

This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.

Backend Development · Incident Response · Monitoring
0 likes · 26 min read

Efficient Ops
May 21, 2024 · Operations

What Is an SRE? Roles, Skills, and Best Practices Explained

This article demystifies Site Reliability Engineering (SRE) by explaining its origins, core responsibilities, essential skill sets, and key practices such as observability, incident response, testing, capacity planning, automation, user support, on‑call duties, and the definition of SLI/SLO/SLA, providing a comprehensive guide for modern operations teams.

Automation · Capacity Planning · Incident Response
0 likes · 29 min read

Wukong Talks Architecture
Apr 15, 2024 · Operations

Post‑mortem of the April 8 Tencent Cloud API Outage and Improvement Measures

On April 8, a Tencent Cloud API outage caused console login failures for nearly 2,000 customers and affected several dependent services for 87 minutes. This post‑mortem presents the detailed root‑cause analysis and the follow‑up improvement actions aimed at strengthening system resilience and change management.

API · Incident Response · Operations
0 likes · 8 min read

Zhuanzhuan Tech
Feb 21, 2024 · Operations

Network Operations Incident Report: BGP Routing Failure and Resolution

This report details a network operations incident where a BGP routing change caused an EBGP neighbor to go idle, outlines the step‑by‑step troubleshooting, analysis of the root cause, and the implemented solution involving a new L3 node and redundant EBGP peers.

BGP · Cloud Networking · Incident Response
0 likes · 8 min read

Efficient Ops
Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

Incident Response · Monitoring · Operations
0 likes · 14 min read

Architect
Nov 17, 2023 · Information Security

A Real-World Incident of Accidental Public Snapshot Sharing and Lessons Learned

The author recounts a 2018 incident where a cloud disk snapshot was unintentionally made public, exposing customer data, and shares a detailed reflection on the operational mistakes, risk management failures, and recommended safeguards for high‑risk cloud operations.

Data Security · Incident Response · cloud computing
0 likes · 9 min read

JD Retail Technology
Nov 13, 2023 · Information Security

Red‑Blue Adversarial Testing for a Big Data Platform: Process, Benefits, and Best Practices

This article outlines the red‑blue adversarial testing process for a big‑data platform during the Double‑Eleven promotion, detailing its purpose, benefits, step‑by‑step execution, common issues, and recommendations to improve system reliability and security.

Big Data · Chaos Engineering · Incident Response
0 likes · 12 min read

Architecture and Beyond
Nov 12, 2023 · Frontend Development

Designing a Yellow Banner System for User Notification During Service Outages

The article explains how a configurable yellow banner system can be used on web interfaces to promptly inform users about service disruptions, guide their actions, increase transparency, improve experience, and outline implementation considerations such as configurability, persistence, and independent deployment.

Incident Response · frontend · notification
0 likes · 6 min read