Google Incident Postmortem Checklist
The article presents a detailed Google‑derived post‑mortem checklist covering event data collection, root‑cause analysis, lessons learned, actionable improvement items, and review procedures, so that incidents are handled systematically and without blame.
Having participated in many team retrospectives, the author observes that post‑mortems, especially for failures or incidents, often yield disappointing results: leaders lack the skills to run them, companies lack tooling to reconstruct the timeline, and meetings devolve into blame‑shifting.
The following Google‑originated post‑mortem checklist is intended to be applied continuously, emphasizing that a post‑mortem should be more than a single meeting.
Post‑mortem Checklist
1. Event Data Collection
❐ Summarize the most significant impacts in the executive summary.
❐ Define the impact scope: affected users, regions, and customers.
❐ Clearly state the severity level.
❐ Provide a complete event timeline to calculate the MTTx metric (Mean Time to …).
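The reason a complete timeline matters for MTTx can be shown with a minimal sketch. It assumes each incident record carries a timestamp for when impact started and for each stage reached; the field names (`started`, `detected`, `mitigated`, `resolved`) and the sample timestamps are made up for illustration and do not come from the original checklist.

```python
from datetime import datetime
from statistics import mean

# Hypothetical timeline records; field names and values are illustrative only.
incidents = [
    {
        "started": datetime(2024, 1, 5, 10, 0),
        "detected": datetime(2024, 1, 5, 10, 12),
        "mitigated": datetime(2024, 1, 5, 10, 50),
        "resolved": datetime(2024, 1, 5, 12, 0),
    },
    {
        "started": datetime(2024, 2, 9, 22, 30),
        "detected": datetime(2024, 2, 9, 22, 38),
        "mitigated": datetime(2024, 2, 9, 23, 0),
        "resolved": datetime(2024, 2, 10, 1, 30),
    },
]


def mean_time_to(stage, records):
    """Average minutes from the start of impact to the given stage."""
    durations = [
        (record[stage] - record["started"]).total_seconds() / 60
        for record in records
    ]
    return mean(durations)


for stage in ("detected", "mitigated", "resolved"):
    print(f"Mean time to {stage}: {mean_time_to(stage, incidents):.0f} min")
```

If any stage timestamp is missing from a timeline, the corresponding MTTx value cannot be computed for that incident, which is why the checklist asks for a complete timeline.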
2. Root‑Cause Analysis
❐ Thoroughly describe all fundamental causes that led to the incident.
❐ Apply the “5 Whys” method or other root‑cause techniques to ensure sufficient depth.
❐ Identify the trigger point.
❐ Classify the incident into its root‑cause category.
3. Lessons Learned and Action‑Item Design
❐ Identify what was done well, what was ineffective, and any lucky factors.
❐ Derive improvement actions from these learnings.
❐ Ensure each action item is linked to a tracking system for follow‑up.
❐ Verify that actions cover both mitigation and prevention.
4. Action‑Item Review Checklist
❐ Are the actions realistic and reviewed by owners?
❐ Have improvements for prevention and resolution time been considered?
❐ Are similar or related incidents accounted for with corresponding plans?
❐ Have automation methods been explored to avoid human error?
❐ Does the post‑mortem contain at least one high‑priority action, and if not, have stakeholders accepted the residual risk?
❐ Have you consulted the owners responsible for executing the actions?
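Some of the checks in sections 3 and 4 are mechanical enough to automate. The sketch below is one possible way to do that, not part of the original checklist: the `ActionItem` fields, the category names, and the `TICKET-1` identifier are all hypothetical, and questions such as whether an action is realistic still need human review.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical shape of a post-mortem action item."""
    title: str
    kind: str                          # "mitigation" or "prevention"
    priority: str                      # e.g. "high", "medium", "low"
    owner: Optional[str] = None        # person responsible for executing it
    tracker_id: Optional[str] = None   # link into the tracking system


def review_action_items(items: List[ActionItem]) -> List[str]:
    """Return a list of checklist problems; an empty list means none found."""
    problems = []
    for item in items:
        if not item.tracker_id:
            problems.append(f"'{item.title}' is not linked to a tracking system")
        if not item.owner:
            problems.append(f"'{item.title}' has no owner")
    # Actions should cover both mitigation and prevention.
    kinds = {item.kind for item in items}
    for required in ("mitigation", "prevention"):
        if required not in kinds:
            problems.append(f"no {required} action item")
    # Without a high-priority action, stakeholders must accept the risk.
    if not any(item.priority == "high" for item in items):
        problems.append(
            "no high-priority action; confirm stakeholders accept the residual risk"
        )
    return problems


items = [
    ActionItem("Add alert on queue depth", "mitigation", "medium",
               owner="alice", tracker_id="TICKET-1"),
]
problems = review_action_items(items)
for problem in problems:
    print("-", problem)
```

In this example the single action item is tracked and owned, so the only findings are the missing prevention action and the missing high‑priority action.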
5. Review / Approval / Publication
❐ Has the post‑mortem passed your team’s review or approval process?
❐ Has any accusatory language been removed or revised?
❐ Was the result shared with the original incident stakeholders?
❐ Was the result shared with the broader team?
❐ Is the report accessible on a dashboard or similar shared tool?
❐ Is the post‑mortem non‑blaming and focused on systemic improvement?
Glossary
● Severity level – a metric to help analyze incident seriousness.
● 5‑Whys – a technique for digging deeper into root causes (see Wikipedia).
● Trigger point – the moment the incident began affecting production.
● Similar incident – events of comparable nature that may not be exact repeats.
● Executive summary – a report for senior leaders unfamiliar with production details.
● MTTx – average time for a specific stage (discovery, escalation, mitigation, resolution).
Continuous Delivery 2.0
Tech and case studies on organizational management, team management, and engineering efficiency