Tagged articles
8 articles
Architect-Kip
Mar 4, 2026 · Operations

Essential SRE Monitoring and Alerting Standards: From Metrics to Incident Response

This guide outlines comprehensive SRE monitoring and alerting standards, covering core principles, log instrumentation, health‑check requirements, baseline resource and application metrics, alarm severity tiers, response SLAs, on‑call rotation, continuous optimization, and noise‑reduction mechanisms to ensure reliable service operation.

Metrics · Monitoring · Operations
0 likes · 14 min read
DevOps Coach
Dec 1, 2025 · Backend Development

Designing for the Worst Day: Mission‑Critical Backend Practices

This article explores how mission‑critical backend engineers shift from sprint‑focused development to designing systems for the worst‑case scenario, outlining three hard rules, four practical habits, concrete code examples, and actionable steps for ordinary teams to improve reliability and safety.

backend · best-practices · code-quality
0 likes · 12 min read
Efficient Ops
Nov 12, 2024 · Operations

How to Build Robust Online Stability: Practices, Metrics, and Team Strategies

This article outlines a comprehensive approach to online stability, covering preventive measures, service governance, capacity planning, incident detection, multi‑dimensional monitoring, alerting, R&D efficiency improvements, team building, and practical guidelines for simplifying, standardizing, automating, and scaling stability initiatives across an organization.

Team Collaboration · incident-response · stability
0 likes · 15 min read
ITPUB
Nov 17, 2023 · Operations

How Bilibili Overcame a Massive CDN Outage: Cloud‑Edge Incident Response Lessons

This article details the August 2023 Bilibili CDN failure, analyzes its root causes, describes the 1‑5‑10 emergency recovery framework, and presents cloud‑side SLB/BFS optimizations and edge‑side scheduling and fallback strategies that together restored service and improved future resilience.

CDN · Operations · cloud-native
0 likes · 20 min read
ITPUB
Jun 30, 2023 · Operations

How Tencent Search Supercharged Reliability: Inside Its Stability Governance Playbook

This article details Tencent Search’s end‑to‑end stability engineering framework, covering a layered reliability architecture, disaster‑recovery mechanisms, fast detection and monitoring, emergency response acceleration, pre‑release interception, automated defense, and collaborative governance that together improve MTTD and MTTR by an order of magnitude.

Monitoring · Reliability · automation
0 likes · 30 min read
ITPUB
Aug 5, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

On July 13, 2021, Bilibili’s OpenResty‑based SLB suffered an outage in which CPU usage was pinned at 100%. The cause was a bug in a Lua _gcd function, triggered when a service’s weight was set to the string “0”. The multi‑hour incident was resolved by rebuilding the SLB clusters and disabling JIT compilation.

Load Balancer · OpenResty · SRE
0 likes · 17 min read
ITPUB
Feb 25, 2019 · Information Security

How the DDG Variant Malware Infects Linux Servers and How to Clean It

In February 2019 a DDG‑variant cryptomining worm spread across Linux servers by exploiting unauthenticated Redis instances, hijacking system binaries via LD_PRELOAD, and using SSH known_hosts for lateral movement, prompting a detailed technical analysis and step‑by‑step remediation guide.

cryptocurrency · incident-response
0 likes · 11 min read