Exploring On‑Call Duty Models and SRE‑Driven Operations Management
This article examines the challenges of traditional on‑call duty systems for operations teams, proposes an SRE‑inspired rotation model that involves developers, defines concrete KPI targets, and describes how automation and chat‑bot tools can streamline incident response and reduce internal friction.
Today is a holiday for operations engineers, and to mark it the author recommends the book "The Way of High Performance" to help ops staff work more efficiently.
Operations staff often have to carry laptops on weekend outings, staying on standby for emergencies. In 2014, a team of over ten ops members trekked up Qingcheng Mountain with five‑kilogram laptops, highlighting an incomplete on‑call system.
The problems identified include:
Multiple business lines requiring different ops personnel.
Feedback from developers, customer service, and testers creating a chaotic process that leaves ops in a passive role.
A backlog of unresolved issues that slows the ops team's own internal improvement work and increases internal friction.
Initial attempts, such as a dedicated on‑call phone line, failed: the person holding the phone could not resolve every issue alone and still had to coordinate with other ops staff or teams, which reduced efficiency. The breakthrough came after learning about Alibaba's NOC model for local life merchants.
This article describes how to put a rotating on‑call culture into practice and how an SRE perspective amplifies its value.
Investigation of On‑Call Models
The core purpose of on‑call duty is to ensure immediate response when production incidents occur and to distribute urgent tasks to the appropriate teams, relieving pressure on others. However, because incidents are rare, on‑call staff may become complacent, leading to skill decay and ineffective emergency handling.
Note: In day‑to‑day emergencies, missed calls and on‑call staff without a computer at hand are common, so improving on‑call responsiveness is an important part of preventive governance.
Involving Developers in On‑Call
Many on‑call systems include only ops staff and rarely involve developers. When a fault occurs, developers who are not on standby have to be pulled in, which creates friction and extends incident resolution time.
The article suggests the following improvements:
Assign on‑call personnel per business line based on technical and business structures.
Developers on duty must be available 24/7 with primary and backup roles.
Developer on‑call duties include monitoring service logs, responding to incidents, documenting fault scenarios, and synchronizing on‑call rosters with other teams, especially customer service.
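The per‑business‑line rotation with primary and backup roles can be sketched as a small roster builder. This is a minimal illustration, not the author's tooling: the `Shift` type, the `build_roster` function, the daily rotation period, and the sample names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Shift:
    """One day of on-call coverage for a single business line (hypothetical type)."""
    business_line: str
    day: date
    primary: str
    backup: str


def build_roster(engineers_by_line, start, days):
    """Rotate each business line's engineers through primary and backup slots.

    Assumes a daily rotation; the backup is simply the next engineer in the
    list, so today's backup becomes tomorrow's primary.
    """
    roster = []
    for line, engineers in engineers_by_line.items():
        if len(engineers) < 2:
            raise ValueError(f"{line} needs at least two engineers for primary and backup")
        for offset in range(days):
            primary = engineers[offset % len(engineers)]
            backup = engineers[(offset + 1) % len(engineers)]
            roster.append(Shift(line, start + timedelta(days=offset), primary, backup))
    return roster


if __name__ == "__main__":
    # Sample data is invented for illustration only.
    for shift in build_roster({"payments": ["ana", "bo", "chen"]}, date(2024, 1, 1), 4):
        print(shift.day, shift.business_line, "primary:", shift.primary, "backup:", shift.backup)
```

Promoting the backup to primary on the following day is one possible design choice: it means the incoming primary has already followed the previous day's incidents.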
Defining KPI
Stability metrics often exist only on paper and are not effectively enforced, so stability improves less than it should. The author proposes making the on‑call culture the first KPI target, so that each business line experiences front‑line emergencies first‑hand and sees the link between stability work and KPI value.
Example KPI goals:
Increase stability by 10% compared to the previous quarter.
Ensure 95% of incidents can be matched to the responsible developer through fault descriptions.
Achieve a 1‑minute response time for on‑call alerts, improving the qualified response rate by 30%.
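The response and matching targets above can be measured from alert records. The sketch below is one possible way to compute them, under stated assumptions: the `Alert` record and its fields are hypothetical, "qualified" is taken to mean acknowledged within 60 seconds, and the 30% improvement is read as a relative change against the previous period. The article does not define these details.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record; the field names are assumptions."""
    seconds_to_ack: Optional[float]  # None means nobody acknowledged the alert
    matched_owner: bool              # True if the fault description led to the responsible developer


def qualified_response_rate(alerts, threshold_seconds=60.0):
    """Share of alerts acknowledged within the threshold (1 minute by default)."""
    if not alerts:
        return 0.0
    qualified = sum(
        1 for a in alerts
        if a.seconds_to_ack is not None and a.seconds_to_ack <= threshold_seconds
    )
    return qualified / len(alerts)


def owner_match_rate(alerts):
    """Share of incidents matched to the responsible developer."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if a.matched_owner) / len(alerts)


def relative_improvement(previous, current):
    """Relative change against the previous period, e.g. 0.3 for a 30% improvement."""
    if previous == 0:
        raise ValueError("previous rate must be non-zero")
    return (current - previous) / previous


if __name__ == "__main__":
    # Invented sample data for illustration.
    sample = [Alert(30, True), Alert(90, True), Alert(None, False), Alert(60, True)]
    print("qualified response rate:", qualified_response_rate(sample))
    print("owner match rate:", owner_match_rate(sample))
```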
Boundary Discussion for On‑Call Personnel
Common dilemmas include how to handle tasks outside one's expertise and whether to intervene directly or let someone else lead. Either can slow down the "stop the bleeding" actions that matter most early in an incident.
The author presents a practical workflow (see image) to help on‑call staff manage tasks they are not proficient in, encouraging the creation of tool‑based solutions or comprehensive knowledge bases.
On‑Call Bot
Typical on‑call pain points include:
On‑call rosters sent via email, making it hard for other teams to find the right person.
Unreachable on‑call staff requiring fallback lists.
Inability to monitor devices continuously, yet needing a 1‑minute response.
To reduce lookup costs, the team added an enterprise bot that serves as the on‑call interface. The bot can:
Publish daily on‑call personnel information.
Interact with customer‑service messages, parse fault keywords, and automatically locate the appropriate on‑call staff, escalating if the primary contact does not answer within a minute.
Require on‑call staff to sign in for each incident, ensuring presence.
Automatically create incident groups with relevant participants based on fault descriptions.
Provide various SRE automation tools.
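The keyword matching and one‑minute escalation described above could look roughly like the following. This is a simplified sketch, not the team's actual bot: the keyword table, the `OnCall` type, and both function names are hypothetical, and a real bot would sit behind an enterprise chat platform's API, which is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OnCall:
    """Today's primary and backup for one business line (hypothetical type)."""
    primary: str
    backup: str


# Assumed mapping from fault keywords to business lines; invented for illustration.
KEYWORD_TO_LINE = {
    "payment": "payments",
    "refund": "payments",
    "login": "accounts",
}


def match_business_line(message: str, keyword_to_line=KEYWORD_TO_LINE) -> Optional[str]:
    """Return the business line for the first known keyword found in the message."""
    text = message.lower()
    for keyword, line in keyword_to_line.items():
        if keyword in text:
            return line
    return None


def next_contact(on_call: OnCall, seconds_waiting: float, timeout_seconds: float = 60.0) -> str:
    """Pick who to page: the primary first, the backup once the timeout has passed."""
    if seconds_waiting >= timeout_seconds:
        return on_call.backup
    return on_call.primary


if __name__ == "__main__":
    today = {"payments": OnCall("ana", "bo"), "accounts": OnCall("chen", "dee")}
    line = match_business_line("Customers report payment failures at checkout")
    if line is not None:
        print("page:", next_contact(today[line], seconds_waiting=0))
        print("after one minute unanswered, page:", next_contact(today[line], seconds_waiting=60))
```

Plain substring matching is the simplest option and will misroute ambiguous messages. A message that matches no keyword returns `None` here, which a real bot would need to hand to a human dispatcher.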
Excerpt from "The Way of High Performance: Operations Architecture Practice from an SRE Perspective" (Electronic Industry Press). The book offers valuable SRE‑focused operational architecture experience and is highly recommended.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.