Master Incident Management: Definitions, Processes, and Best Practices
This guide covers the fundamentals of fault management: ITIL-based definitions, why it matters, fault level classification, monitoring, emergency response, recovery, postmortem analysis, continuous improvement, and practical advice for practitioners. Together these form a comprehensive, actionable framework for reliable operations.
1. Fault and Fault Management Definition
Fault management as practiced in the internet industry evolved from ITIL, with its processes streamlined to suit lean, fast-iterating environments.
1. Definitions in ITIL
Fault : (1) Unplanned IT service interruption or performance degradation; (2) Failure of a configuration item even if it does not affect service.
Fault Management : The process for handling all faults.
Goal of Fault Management : Quickly restore services to normal operation while minimizing adverse impact on business, thereby maintaining service quality and availability.
2. More Comprehensive Industry Definitions
Fault : Any cause (except user environment or user actions) that leads to service interruption, quality degradation, or a poorer user experience.
Fault Management : A series of activities and processes covering the fault lifecycle, including fault level definition, detection, response, emergency handling, recovery, post‑mortem, and continuous improvement.
Goal of Fault Management : Prevent foreseeable problems, quickly recover from unforeseen ones, and avoid repeat occurrences.
2. Why Implement Fault Management
Both theory and practice show that if a fault can happen, it will happen. To keep the business stable, organizations must detect risks early, locate causes promptly, recover quickly, and make effective improvements that prevent recurrence. This requires a standardized, closed-loop fault management system.
3. How to Do Fault Management
Fault management covers the entire fault lifecycle, forming a closed‑loop system with continuous improvement.
1. Fault Level Definition
1.1 Fault Sequence
Fault management teams can define several fault sequences, each measuring impact along a different dimension. A sequence typically has four levels (for example, P1 to P4), with lower numbers indicating higher severity.
P (PRIORITY) sequence: Technical priority for overall fault handling.
D (DATA) sequence: Data‑quality sequence, combining data asset level and impact factors.
R (RISK) sequence: Public‑opinion risk sequence.
S (SLA) sequence: Measures the severity of the impact on SLAs.
1.2 Fault Grading
Using the P sequence as an example, fault grading is divided into generic and business types. The business-type grade must never be less severe than the generic grade for the same fault.
Generic fault levels are defined by the fault management department and may be based on criteria such as the number of affected users or merchants, the increase in complaints, and financial loss. They serve as a fallback when no business-type scenario covers the fault.
Business fault levels are defined jointly by the fault management department and business teams from the user perspective. Internal tools can also adopt this template for inclusion in fault management.
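As a minimal sketch of how the two gradings might be combined, the example below grades a fault with a generic rule and a business rule and keeps the more severe result (the smaller number). Every threshold, field name, and function name here is a hypothetical assumption for illustration; real criteria would come from the fault management department and the business teams.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    affected_users: int        # hypothetical generic criterion
    financial_loss: float      # hypothetical generic criterion, in currency units
    order_failure_rate: float  # hypothetical business criterion, 0.0 to 1.0


def generic_level(impact: Impact) -> int:
    """Generic P level (1 = most severe, 4 = least severe). Thresholds are illustrative."""
    if impact.affected_users >= 100_000 or impact.financial_loss >= 1_000_000:
        return 1
    if impact.affected_users >= 10_000 or impact.financial_loss >= 100_000:
        return 2
    if impact.affected_users >= 1_000 or impact.financial_loss >= 10_000:
        return 3
    return 4


def business_level(impact: Impact) -> int:
    """Business P level for a hypothetical ordering service. Thresholds are illustrative."""
    if impact.order_failure_rate >= 0.5:
        return 1
    if impact.order_failure_rate >= 0.2:
        return 2
    if impact.order_failure_rate >= 0.05:
        return 3
    return 4


def final_level(impact: Impact) -> int:
    """Keep the more severe grade, so the generic grade acts as a floor."""
    return min(generic_level(impact), business_level(impact))
```

With these illustrative thresholds, a fault that affects only 500 users but fails 60% of orders is graded P1 by the business rule, while a fault that affects 20,000 users with a 1% order failure rate falls back to the generic P2.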
2. Monitoring and Alerts
The core of monitoring is to tie business metrics to the fault level definitions so that faults are detected promptly.
Alerting should use intelligent techniques such as dynamic thresholds, baselines, and root-cause analysis algorithms to improve accuracy.
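One simple way to picture a baseline-driven threshold is sketched below: the current value of a metric is compared with the mean of recent history, and an alert fires only when the deviation exceeds k standard deviations. This is an illustrative assumption rather than the method of any particular monitoring product; the function name, `k`, and `min_points` are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0, min_points=5):
    """Return True if `current` deviates from the historical baseline by more than k sigma."""
    if len(history) < min_points:
        return False  # too little data to form a baseline, so do not alert
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) > k * spread


# Invented sample: orders per minute over the last six minutes.
recent_orders = [100, 102, 98, 101, 99, 100]
print(is_anomalous(recent_orders, 60))   # a sharp drop
print(is_anomalous(recent_orders, 101))  # normal fluctuation
```

A static threshold would need retuning whenever traffic patterns change, whereas a baseline like this adapts to the history it is given. Production systems usually also account for seasonality, which this sketch ignores.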
3. Incident Emergency Response
When an issue escalates to a fault, the fault management team promptly announces the incident, creates a handling group or conference call, coordinates, follows up, and supervises resolution until recovery.
Because fault management requires 24×7 emergency response, companies can follow the model of teams such as Google SRE or Alibaba GOC and distribute members across time zones for continuous coverage.
4. Incident Recovery
The first priority after a fault occurs is to restore the business. Options include executing predefined plans, restarting services, degrading functionality, isolating the faulty component, shifting traffic, or saturation-style emergency measures.
5. Postmortem
5.1 Postmortem Timeliness
To ensure that issues receive sufficient attention and that improvement measures are timely, it is recommended to complete postmortems for P1 and P2 faults within one workday and for P3 and P4 faults within three workdays. Faults graded on the other sequences follow the timing of the equivalent P level.
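The timing rule can be sketched as a small deadline calculator. The sketch assumes that workdays are Monday to Friday with no public holidays and that the clock starts on the day the fault is recovered; the source specifies neither, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Workdays allowed per P level, as recommended in the text above.
WORKDAYS_BY_LEVEL = {1: 1, 2: 1, 3: 3, 4: 3}


def add_workdays(start: date, workdays: int) -> date:
    """Advance `start` by the given number of workdays, skipping Saturday and Sunday."""
    current = start
    remaining = workdays
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def postmortem_deadline(recovered_on: date, level: int) -> date:
    """Latest date by which the postmortem should be completed."""
    return add_workdays(recovered_on, WORKDAYS_BY_LEVEL[level])


# A P1 fault recovered on Friday 2024-03-01 is due the following Monday.
print(postmortem_deadline(date(2024, 3, 1), 1))  # 2024-03-04
```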
5.2 Preparation for Postmortem
Fault managers (postmortem hosts) should gather the following before the meeting:
Fault handling process: a timeline covering when the fault was introduced, occurrence, detection, response, initial cause location, execution of recovery, full recovery, root-cause identification, and any other key steps.
Business impact: specific downtime periods, decline percentages, and financial loss.
User/merchant impact: estimated number of affected users or merchants, support call volume, and online inquiries.
Root cause and classification: hardware failure, code issue, process gap, disaster recovery, capacity, etc.
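Once the handling timeline has been collected, the durations that the postmortem examines can be derived from it directly. The sketch below assumes the timeline is recorded as a dictionary of timestamps keyed by stage name; the key names, metric names, and sample timestamps are invented for illustration.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def stage_durations(timeline):
    """Derive detection, response, and recovery durations from a fault timeline."""
    return {
        "time_to_detect": minutes_between(timeline["occurrence"], timeline["detection"]),
        "time_to_respond": minutes_between(timeline["detection"], timeline["response"]),
        "time_to_recover": minutes_between(timeline["occurrence"], timeline["full_recovery"]),
    }


# Invented example timeline for a single fault.
timeline = {
    "occurrence": datetime(2024, 3, 1, 10, 0),
    "detection": datetime(2024, 3, 1, 10, 4),
    "response": datetime(2024, 3, 1, 10, 7),
    "full_recovery": datetime(2024, 3, 1, 10, 45),
}

print(stage_durations(timeline))
# {'time_to_detect': 4.0, 'time_to_respond': 3.0, 'time_to_recover': 45.0}
```

Recording these figures for every fault makes detection, response, and recovery times comparable across incidents.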
5.3 Key Focus Points in Postmortem
Prevention: whether changes triggered the incident.
Discovery: detection time, source, monitoring optimization.
Emergency response: response duration.
Recovery: recovery time, documentation of measures, improvement.
Improvement actions: verifiable measures, deadlines, owners.
6. Continuous Operation
Beyond reporting fault data and promoting a reliability culture, the main goal of continuous operation is to analyze fault data, identify weak points and risks at each lifecycle stage, and implement targeted improvements.
Examples include tightening change controls for repeated major incidents, building a rapid‑recovery culture when recovery relies heavily on code releases, and creating fast‑recovery playbooks for common fault scenarios.
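A minimal sketch of this kind of analysis follows, assuming each fault record is a dictionary with a root-cause category and a flag for whether a change triggered it. The record layout, function names, and sample data are hypothetical.

```python
from collections import Counter


def top_root_causes(faults, top_n=3):
    """Count faults per root-cause category and return the most frequent ones."""
    counts = Counter(fault["root_cause"] for fault in faults)
    return counts.most_common(top_n)


def change_triggered_share(faults):
    """Fraction of faults that were triggered by a change (0.0 if there are no faults)."""
    if not faults:
        return 0.0
    triggered = sum(1 for fault in faults if fault["change_triggered"])
    return triggered / len(faults)


# Invented sample of fault records.
faults = [
    {"root_cause": "code issue", "change_triggered": True},
    {"root_cause": "code issue", "change_triggered": True},
    {"root_cause": "process gap", "change_triggered": False},
    {"root_cause": "code issue", "change_triggered": True},
    {"root_cause": "capacity", "change_triggered": False},
    {"root_cause": "process gap", "change_triggered": False},
]

print(top_root_causes(faults, top_n=2))  # [('code issue', 3), ('process gap', 2)]
print(change_triggered_share(faults))    # 0.5
```

In this invented sample, half of the faults were triggered by changes, which is the kind of signal that would justify tightening change controls.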
4. Advice for Fault Management Practitioners
Fault management is a long and challenging journey; the following suggestions aim to help practitioners succeed.
Be proactive and responsible.
Risks and problems that are not followed up increase fault frequency.
Poor fault follow‑up expands impact.
Unclear root cause leads to ineffective improvements.
Ineffective measures cause repeat incidents.
Dare to question.
Is monitoring detection timely?
Can the handling process be optimized? Any human errors?
Is the business impact assessment accurate?
Is the identified cause truly the root cause?
Are improvement measures reasonable?
Self‑improvement. Fault managers should act like architects, pinpointing issues at every stage and independently driving optimization projects.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.