From Firefighting to Arson: Mastering Ops Availability in Three Stages

This article outlines a three-stage ops maturity model (firefighting, fire prevention, and arson) and explains how proactive fault-injection drills, continuous availability improvements, and aligning technical metrics with business value can turn operations teams from reactive responders into strategic value creators.

Efficient Ops

Sep 8, 2020

01 The Original Intent of Operations

Availability is the foundation of ops; when a service is down, any effort is wasted.

Ops availability capability evolves through three stages:

Firefighting Stage – Keep the MTTR (mean time to recovery) of core modules under 20 minutes. In this early stage, engineers spend most of their time locating faults, relying heavily on personal experience and their understanding of service dependencies.

Fire-Prevention Stage – Focus on runbooks, high-availability design, disaster recovery, automated alerts, and graceful degradation. A fault can often be isolated or degraded around before its root cause is fully investigated, which makes MTTR much shorter.

Arson Stage – Aim to keep services stable by deliberately creating controlled failures; the goal is to build resilience rather than let the system self‑destruct.
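The 20-minute MTTR target from the firefighting stage can be checked with a few lines of Python. This is a minimal sketch: the incident timestamps are made up for illustration, and `mttr` and `MTTR_TARGET` are hypothetical names, not part of any tool mentioned in this article.

```python
from datetime import datetime, timedelta
from statistics import mean

# Target taken from the firefighting stage described above.
MTTR_TARGET = timedelta(minutes=20)

def mttr(incidents):
    """Mean time to recovery over (detected_at, recovered_at) pairs."""
    durations = [
        (recovered - detected).total_seconds()
        for detected, recovered in incidents
    ]
    return timedelta(seconds=mean(durations))

# Hypothetical incident records, for illustration only.
incidents = [
    (datetime(2020, 9, 1, 10, 0), datetime(2020, 9, 1, 10, 12)),
    (datetime(2020, 9, 3, 14, 30), datetime(2020, 9, 3, 15, 5)),
    (datetime(2020, 9, 5, 2, 15), datetime(2020, 9, 5, 2, 25)),
]

current = mttr(incidents)
print(f"MTTR: {current}, within target: {current <= MTTR_TARGET}")
```

Note that one 35-minute incident still fits inside a 19-minute average, so it may be worth tracking the worst case alongside the mean.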

Long‑running stable systems often hide catastrophic “black‑swan” failures; lack of incident experience can turn a minor issue into a major disaster.

Prerequisite for the Arson Stage: The team must have passed the firefighting stage and have clear, documented remediation procedures.

How to practice: Conduct manual fault injection drills. Simulate failures in production, observe impact, and verify that response processes meet expectations.

Common misconception: Assigning a “blue team” to create faults and a “red team” to fix them without communication leads to meaningless exercises.

Fault-drill workflow: Reduce traffic → inject fault → intervene → recover → restore traffic → post-mortem.
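The drill workflow can be sketched as an ordered list of steps that halts at the first failure, so the drill never proceeds on a broken assumption. The step names come from the workflow above; `run_drill` and the callables in `actions` are hypothetical stand-ins for whatever your team actually runs at each step.

```python
from typing import Callable, Dict, List, Tuple

# Step order taken from the fault-drill workflow above.
DRILL_STEPS = [
    "reduce traffic",
    "inject fault",
    "intervene",
    "recover",
    "restore traffic",
    "post-mortem",
]

def run_drill(actions: Dict[str, Callable[[], bool]]) -> List[Tuple[str, bool]]:
    """Run each step in order and stop at the first one that reports failure."""
    log = []
    for step in DRILL_STEPS:
        ok = actions[step]()
        log.append((step, ok))
        if not ok:
            break  # hand over to a human; later steps assume this one succeeded
    return log

# Hypothetical actions: every step succeeds except the intervention.
actions = {step: (lambda: True) for step in DRILL_STEPS}
actions["intervene"] = lambda: False

drill_log = run_drill(actions)
for step, ok in drill_log:
    print(f"{step}: {'ok' if ok else 'FAILED'}")
```

Stopping at the first failure is a deliberate simplification. A real drill would more likely jump straight to recovery and traffic restoration when a step fails.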

In the long term, use a platform that injects random faults without prior notice, as Netflix’s Chaos Monkey does.
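The core of unannounced random injection is choosing a target at random while keeping some instances off limits. The sketch below illustrates only that selection step. It is not how Chaos Monkey is implemented, and `pick_target`, the instance names, and the `protected` set are all assumptions made for this example.

```python
import random

def pick_target(instances, protected=frozenset(), rng=random):
    """Pick one random instance to fail, skipping anything marked protected."""
    candidates = [name for name in instances if name not in protected]
    if not candidates:
        return None  # nothing is safe to break, so do nothing
    return rng.choice(candidates)

# Hypothetical fleet; the primary database is excluded from drills.
instances = ["web-1", "web-2", "web-3", "db-primary"]
target = pick_target(instances, protected={"db-primary"}, rng=random.Random(42))
print(f"Would inject a fault into: {target}")
```

Passing a seeded `random.Random` makes a drill reproducible in tests, while the default module-level generator keeps production runs unpredictable.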

02 Continuously Raising Availability

With availability as the goal, ops can extend its reach into many valuable activities:

Offline environments (development, testing, pre‑release)

Release strategies (canary, staged rollouts)

Rapid loss mitigation

All incidents stem from changes—code, environment, network, hardware degradation, or metric thresholds. The challenge is to detect and resolve these changes swiftly.

Relying on intuition to locate faults creates two problems: inexperienced staff cannot find issues quickly, and luck dominates, making MTTR unpredictable.

Solution: Real-time system dashboards that visualize metrics, configuration, and product health (e.g., Grafana + Prometheus), making every change instantly visible and traceable.
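One way to make changes traceable is to record each one as a timestamped event and, when an incident starts, list what changed shortly beforehand. The plain-Python sketch below shows that idea only. It does not use the Grafana or Prometheus APIs, and `recent_changes`, the 30-minute window, and the sample events are assumptions for illustration.

```python
from datetime import datetime, timedelta

def recent_changes(changes, incident_at, window=timedelta(minutes=30)):
    """Return changes that landed within `window` before the incident, newest first."""
    hits = [c for c in changes if incident_at - window <= c["at"] <= incident_at]
    return sorted(hits, key=lambda c: c["at"], reverse=True)

# Hypothetical change log covering code, configuration, and network changes.
changes = [
    {"at": datetime(2020, 9, 8, 9, 0), "kind": "code", "what": "deploy checkout v2"},
    {"at": datetime(2020, 9, 8, 9, 50), "kind": "config", "what": "raise connection pool size"},
    {"at": datetime(2020, 9, 8, 10, 5), "kind": "network", "what": "update load balancer rule"},
]

suspects = recent_changes(changes, incident_at=datetime(2020, 9, 8, 10, 10))
for change in suspects:
    print(f"{change['at']:%H:%M} [{change['kind']}] {change['what']}")
```

Listing suspects newest first gives an inexperienced engineer the same starting point a veteran would reach by intuition.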

03 The Purpose of Operations

Automation, tooling, and platforms are means, not ends.

The true purpose is to continuously enhance product value throughout its lifecycle, thereby increasing the ops team’s impact.

If ops is seen only as hard work, its value is missed. Identify the root causes of toil, such as flawed processes or low efficiency, and aim to earn recognition for results rather than sympathy for effort.

How to Demonstrate Your Value

In reports, focus on outcomes and business impact rather than task lists; explain the value generated, future potential, and the path to optimal performance.

Technical Metrics vs. Business Metrics

Technical metrics such as QPS or load must be translated into core business indicators (revenue, page views, brand influence). For example, show how an optimization expands cluster capacity and reduces resource cost.
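The capacity example can be made concrete with simple arithmetic: a per-node throughput gain becomes a node count, and the node count becomes a monthly cost. All figures below are hypothetical, and `capacity_saving` is an illustrative helper rather than an established formula.

```python
import math

def capacity_saving(peak_qps, qps_per_node_before, qps_per_node_after, node_cost_per_month):
    """Translate a per-node throughput gain into nodes and monthly cost saved."""
    nodes_before = math.ceil(peak_qps / qps_per_node_before)
    nodes_after = math.ceil(peak_qps / qps_per_node_after)
    nodes_saved = nodes_before - nodes_after
    return {
        "nodes_before": nodes_before,
        "nodes_after": nodes_after,
        "nodes_saved": nodes_saved,
        "monthly_saving": nodes_saved * node_cost_per_month,
    }

# Hypothetical numbers: an optimization lifts each node from 500 to 800 QPS.
report = capacity_saving(
    peak_qps=40_000,
    qps_per_node_before=500,
    qps_per_node_after=800,
    node_cost_per_month=300,  # cost per node in an arbitrary currency unit
)
print(report)
```

Reported this way, "QPS per node rose from 500 to 800" becomes "the same peak traffic now needs 30 fewer nodes", which is a statement a business stakeholder can act on.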

Availability is an ops engineer’s reputation; maintaining it while creating product value is what defines a competent internet ops professional.

Source: Adapted from the “High‑Efficiency Ops” public account.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
