From Firefighting to Arson: Mastering Ops Availability in Three Stages
This article outlines a three-stage ops maturity model (firefighting, fire prevention, and arson) and explains how proactive fault-injection drills, continuous availability improvements, and alignment of technical metrics with business value can transform operations from reactive responders into strategic value creators.
01 The Original Intent of Operations
Availability is the foundation of ops; when a service is down, any effort is wasted.
Ops availability capability evolves through three stages:
Firefighting Stage – Keep the mean time to recovery (MTTR) of core modules under 20 minutes. In this early stage, engineers spend most of their time locating faults, relying heavily on personal experience and their understanding of service dependencies.
Fire-Prevention Stage – Focus on runbooks, high-availability design, disaster recovery, automated alerts, and graceful degradation. Faults can often be contained through rapid isolation or degradation before the root cause is fully investigated, which shortens MTTR considerably.
Arson Stage – Aim to keep services stable by deliberately creating controlled failures; the goal is to build resilience rather than let the system self‑destruct.
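As a rough illustration of the 20-minute MTTR target from the firefighting stage, the sketch below computes MTTR from a list of incidents. The incident timestamps are made-up sample data, and measuring from detection to recovery is an assumption; teams may define the interval differently.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: (detected_at, recovered_at).
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 12)),
    (datetime(2024, 1, 9, 2, 30), datetime(2024, 1, 9, 3, 0)),
    (datetime(2024, 1, 20, 18, 45), datetime(2024, 1, 20, 18, 57)),
]

# Target taken from the firefighting stage described above.
MTTR_TARGET = timedelta(minutes=20)


def mttr(records):
    """Mean time to recovery: average of (recovered_at - detected_at)."""
    seconds = mean((end - start).total_seconds() for start, end in records)
    return timedelta(seconds=seconds)


current = mttr(incidents)
print(f"MTTR: {current}, under target: {current < MTTR_TARGET}")
```

Tracking the spread of individual recovery times alongside the mean can also show whether recovery depends on who happens to be on call.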
Long‑running stable systems often hide catastrophic “black‑swan” failures; lack of incident experience can turn a minor issue into a major disaster.
Prerequisite for the Arson Stage: The team must have passed the firefighting stage and have clear, documented remediation procedures.
How to practice: Conduct manual fault-injection drills. Simulate failures in production, observe the impact, and verify that response processes meet expectations.
Common misconception: Assigning a “blue team” to create faults and a “red team” to fix them without communication leads to meaningless exercises.
Fault-drill workflow: Reduce traffic → inject fault → intervene → recover → restore traffic → post-mortem.
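The drill workflow above can be sketched as an ordered runner. This is a minimal sketch, not a real drill platform: the step names mirror the workflow, while `run_drill`, the callback interface, and the rule that later steps are skipped after a failure (with the post-mortem always running) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Step order taken from the fault-drill workflow in the text.
DRILL_STEPS = [
    "reduce_traffic",
    "inject_fault",
    "intervene",
    "recover",
    "restore_traffic",
    "post_mortem",
]


def run_drill(actions: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run each step in order; each action returns True on success.

    Assumption: once a step fails, the remaining steps are skipped,
    except the post-mortem, which always runs.
    """
    log = []
    failed = False
    for step in DRILL_STEPS:
        if failed and step != "post_mortem":
            log.append((step, "skipped"))
            continue
        ok = actions[step]()
        log.append((step, "ok" if ok else "failed"))
        if not ok:
            failed = True
    return log


# Placeholder actions that always succeed; real ones would call your tooling.
demo_actions = {step: (lambda: True) for step in DRILL_STEPS}
for step, status in run_drill(demo_actions):
    print(f"{step}: {status}")
```

Skipping `restore_traffic` after a failed recovery keeps traffic away from a system that is still broken; a real drill would hand over to the normal incident process at that point.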
In the long term, use a platform to inject random faults without prior notice (e.g., Netflix’s Chaos Monkey).
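To make the idea of unannounced random injection concrete, here is a small sketch of the selection step only. It does not use or imitate Chaos Monkey's actual interface; the instance names, fault types, and the `pick_fault` helper are all hypothetical.

```python
import random

# Hypothetical inventory and fault catalogue.
INSTANCES = ["web-1", "web-2", "api-1", "api-2"]
FAULTS = ["kill_process", "add_latency", "drop_network"]


def pick_fault(rng, instances, faults, protected=frozenset()):
    """Choose a random (instance, fault) pair, never touching protected hosts.

    Returns None when every instance is protected.
    """
    candidates = [name for name in instances if name not in protected]
    if not candidates:
        return None
    return rng.choice(candidates), rng.choice(faults)


rng = random.Random()  # pass a seed here to make a drill reproducible
plan = pick_fault(rng, INSTANCES, FAULTS, protected={"api-1"})
print(plan)
```

A protected set is one simple way to limit the blast radius while the team is still building confidence in its remediation procedures.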
02 Continuously Raising Availability
With availability as the goal, ops can contribute to many valuable activities:
Offline environments (development, testing, pre‑release)
Release strategies (canary, staged rollouts)
Rapid loss mitigation
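The release strategies and loss mitigation listed above can be combined in a staged rollout that stops as soon as the new version looks unhealthy. The sketch below assumes a hypothetical `error_rate_at` callback that reports the observed error rate at a given traffic share; the stage percentages and the 1% threshold are illustrative values, not recommendations.

```python
# Hypothetical rollout stages (percent of traffic) and abort threshold.
STAGES = [1, 10, 50, 100]
MAX_ERROR_RATE = 0.01


def staged_rollout(error_rate_at, stages=STAGES, max_error_rate=MAX_ERROR_RATE):
    """Advance stage by stage; roll back on the first unhealthy stage.

    Returns (percent_on_new_version, rolled_back).
    """
    reached = 0
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # Rapid loss mitigation: send all traffic back to the old version.
            return 0, True
        reached = percent
    return reached, False


# A healthy release reaches 100%; one that degrades at 50% is rolled back.
print(staged_rollout(lambda percent: 0.001))
print(staged_rollout(lambda percent: 0.05 if percent >= 50 else 0.001))
```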
All incidents stem from changes—code, environment, network, hardware degradation, or metric thresholds. The challenge is to detect and resolve these changes swiftly.
Relying on intuition to locate faults creates two problems: inexperienced staff cannot find issues quickly, and luck dominates, making MTTR unpredictable.
Solution: Build real-time dashboards that visualize metrics, configuration, and product health (e.g., Grafana + Prometheus), so that every change is immediately visible and traceable.
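One way to make changes traceable is to keep a single change log and, when an incident starts, list everything that changed shortly before it. The sketch below is independent of Grafana or Prometheus; the change records, the 30-minute window, and the `recent_changes` helper are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical change log: (timestamp, kind, description).
changes = [
    (datetime(2024, 5, 1, 9, 0), "deploy", "checkout-service release"),
    (datetime(2024, 5, 1, 13, 40), "config", "connection pool size lowered"),
    (datetime(2024, 5, 1, 13, 55), "deploy", "api-gateway rollout"),
    (datetime(2024, 5, 1, 14, 30), "network", "switch firmware update"),
]


def recent_changes(log, incident_at, window=timedelta(minutes=30)):
    """Return changes made within `window` before the incident, newest first."""
    hits = [c for c in log if incident_at - window <= c[0] <= incident_at]
    return sorted(hits, key=lambda c: c[0], reverse=True)


incident_at = datetime(2024, 5, 1, 14, 0)
for when, kind, description in recent_changes(changes, incident_at):
    print(f"{when:%H:%M} {kind}: {description}")
```

Listing the newest change first reflects the usual working assumption that the most recent change is the first suspect, which gives less experienced engineers a starting point that does not depend on intuition.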
03 The Purpose of Operations
Automation, tooling, and platforms are means, not ends.
The true purpose is to continuously enhance product value throughout its lifecycle, thereby increasing the ops team’s impact.
If ops is seen only as hard work, its value is missed. Identify the root cause of the toil, whether a flawed process or inefficient execution, and strive to be recognized for achievement, not sympathy.
How to Demonstrate Your Value
In reports, focus on outcomes and business impact rather than task lists; explain the value generated, future potential, and the path to optimal performance.
Technical Metrics vs. Business Metrics
Metrics like QPS or load must be translated into core business indicators (revenue, PV, brand influence). For example, show how an optimization expands cluster capacity and reduces resource cost.
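The translation from a technical metric to a business figure can be a few lines of arithmetic. In the sketch below, every number (per-node QPS before and after an optimization, peak traffic, monthly node cost) is made up for illustration, and `optimization_impact` is a hypothetical helper.

```python
import math


def optimization_impact(qps_before, qps_after, peak_qps, monthly_cost_per_node):
    """Translate a per-node QPS gain into capacity and cost figures."""
    nodes_before = math.ceil(peak_qps / qps_before)
    nodes_after = math.ceil(peak_qps / qps_after)
    return {
        # Extra headroom if the cluster size stays the same.
        "capacity_gain_percent": (qps_after - qps_before) / qps_before * 100,
        # Nodes that could be released if peak traffic stays the same.
        "nodes_saved": nodes_before - nodes_after,
        "monthly_saving": (nodes_before - nodes_after) * monthly_cost_per_node,
    }


# Hypothetical figures: 800 -> 1000 QPS per node, 40,000 QPS at peak.
impact = optimization_impact(800, 1000, 40_000, monthly_cost_per_node=300)
print(impact)
```

Reporting "25% more capacity on the same cluster, or 10 fewer nodes at peak" speaks to the business more directly than "per-node QPS rose from 800 to 1000".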
Availability is an ops engineer's calling card; maintaining it while creating product value is what defines a competent internet ops professional.
Source: Adapted from the “High‑Efficiency Ops” public account.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.