Mastering On-Call: Practical Lessons from Google SRE for Effective Ops
This article shares practical insights from Google SRE on on‑call duty, covering why on‑call is needed, common challenges, effective scheduling, evaluation methods, and actionable tips to improve team resilience and reduce stress for operations engineers.
Preface
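This is the third in a series of reflections on the book Site Reliability Engineering: How Google Runs Production Systems (the Chinese edition's title translates roughly as “SRE: Google Operations Secrets”), focusing on on‑call duty. The author, who still works in frontline operations, shares personal experience and lessons learned.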
On‑Call Duty
Operations staff must ensure online services remain stable and respond to incidents immediately, requiring 24/7 coverage. Since humans need rest, on‑call rotations are introduced so that a designated person stays ready to handle emergencies while the rest of the team can rest.
How to Do On‑Call
Keep your phone reachable and have your laptop and VPN with you.
Try to solve problems yourself; if you cannot, seek help promptly.
Do not put yourself under excessive mental pressure; treat each incident as a chance to grow.
Challenges of On‑Call
On‑call engineers face psychological pressure from unpredictable incidents, a tendency to become complacent when no alerts occur, and the need to stay ready even during personal activities. These factors make on‑call feel demanding.
On‑Call Issues
Two key questions are whether one person or a pair should be on‑call, and how long each rotation should last (daily vs. weekly). With a single on‑call person, responsibility is clear, but that person is a single point of failure. A pair provides backup, but each may assume the other will handle the problem, so work gets passed back and forth.
Short daily rotations prevent fatigue but can cause incomplete handovers, while longer weekly rotations encourage deeper problem resolution.
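The trade-off between daily and weekly rotations can be made concrete by counting handovers. The following is a minimal sketch, not a description of any particular scheduling tool: the team names, the start date, and the helper functions `build_rotation` and `count_handovers` are all hypothetical and exist only for illustration.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, total_days, shift_days):
    """Assign engineers round-robin to consecutive shifts.

    Returns a list of (shift_start, shift_end, engineer) tuples
    covering total_days starting at start.
    """
    shifts = []
    day = 0
    index = 0
    while day < total_days:
        # The last shift may be shorter if total_days is not a multiple.
        length = min(shift_days, total_days - day)
        shift_start = start + timedelta(days=day)
        shift_end = shift_start + timedelta(days=length - 1)
        shifts.append((shift_start, shift_end, engineers[index % len(engineers)]))
        day += length
        index += 1
    return shifts


def count_handovers(shifts):
    """Each boundary between two consecutive shifts is one handover."""
    return max(len(shifts) - 1, 0)


team = ["alice", "bob", "carol", "dave"]  # hypothetical team
start = date(2024, 1, 1)                  # arbitrary start date

daily = build_rotation(team, start, total_days=28, shift_days=1)
weekly = build_rotation(team, start, total_days=28, shift_days=7)

print(count_handovers(daily))   # 27 handovers in four weeks
print(count_handovers(weekly))  # 3 handovers in four weeks
```

Over the same four weeks, the daily rotation has nine times as many points at which context can be lost, which is the cost to weigh against shorter, less tiring shifts.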
How to Evaluate On‑Call
Incident reports alone do not show how well on‑call is working. A fuller evaluation also asks whether root causes were eliminated, whether the system was optimized as a result, and whether the team became more efficient.
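One rough way to check whether root causes are actually being eliminated is to measure how often the same cause comes back. The sketch below is illustrative only: the incident records, the `root_cause` field, and the `recurrence_rate` function are assumptions made for this example, and it assumes the incidents are listed in chronological order.

```python
def recurrence_rate(incidents):
    """Fraction of incidents whose root cause appeared in an earlier incident.

    Assumes incidents are sorted from oldest to newest and that each
    record carries a 'root_cause' label (a hypothetical field).
    """
    if not incidents:
        return 0.0
    seen = set()
    repeats = 0
    for incident in incidents:
        cause = incident["root_cause"]
        if cause in seen:
            repeats += 1
        seen.add(cause)
    return repeats / len(incidents)


# Hypothetical incident log for one quarter.
incidents = [
    {"id": 1, "root_cause": "disk-full"},
    {"id": 2, "root_cause": "bad-deploy"},
    {"id": 3, "root_cause": "disk-full"},
    {"id": 4, "root_cause": "disk-full"},
]

print(recurrence_rate(incidents))  # 0.5
```

A rate that falls from one period to the next suggests that fixes are addressing causes and not only symptoms; a rate that stays high suggests the same problems are being handled repeatedly.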
Responsibility Misconceptions
On‑call is not only a test of technical skill; it also tests responsiveness and diligence. Appealing to a sense of “responsibility” alone, however, tends to produce only short‑term motivation that fades over time. A better approach is to invest in processes and tools that make on‑call work more efficient.
SRE Insights
Google SRE uses a primary‑secondary on‑call model, with weekly (or longer) rotations, structured problem‑assessment processes, standardized communication, a focus on solving issues rather than blaming individuals, and distributed teams across time zones to share the load.
These practices aim to mitigate human weaknesses, improve on‑call experience, and enhance overall operational effectiveness.
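A primary‑secondary weekly rotation can be sketched in a few lines. This is an illustration under stated assumptions, not Google's actual tooling: the engineer names, the rule that each week's secondary is the following week's primary, and the 15‑minute escalation threshold are all hypothetical choices made for the example.

```python
from datetime import date, timedelta


def primary_secondary_schedule(engineers, start, weeks):
    """Build a weekly schedule with a primary and a secondary on-call.

    Assumption: the secondary is the next person in the list, so each
    engineer serves as backup the week before becoming primary.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary/secondary")
    schedule = []
    for week in range(weeks):
        schedule.append({
            "week_start": start + timedelta(weeks=week),
            "primary": engineers[week % len(engineers)],
            "secondary": engineers[(week + 1) % len(engineers)],
        })
    return schedule


def who_to_page(entry, minutes_unacknowledged, escalate_after=15):
    """Page the primary first; escalate to the secondary once the alert
    has gone unacknowledged for escalate_after minutes (hypothetical value)."""
    if minutes_unacknowledged < escalate_after:
        return entry["primary"]
    return entry["secondary"]


team = ["alice", "bob", "carol"]  # hypothetical team
schedule = primary_secondary_schedule(team, date(2024, 1, 1), weeks=4)

for entry in schedule:
    print(entry["week_start"], entry["primary"], entry["secondary"])

print(who_to_page(schedule[0], minutes_unacknowledged=5))   # alice
print(who_to_page(schedule[0], minutes_unacknowledged=20))  # bob
```

Making the escalation rule explicit removes the ambiguity about who acts when: the primary owns the alert, and the secondary is involved only after a defined delay.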
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.