Incident Management — From Detection to Learning
A structured process to detect, respond to, and learn from production failures — fast mitigation first, root cause second.
When to use
- Any production failure affecting users
- Establishing on-call culture and escalation paths
Tradeoffs
- Over-process slows mitigation — stop the bleeding first, document later
- Under-process leads to chaos, no learning, and the same incident repeating
Severity levels
| Severity | Condition | Response |
|---|---|---|
| SEV1 | Full outage or critical data loss | All-hands, immediate escalation |
| SEV2 | Major feature degraded, large user impact | On-call + team lead |
| SEV3 | Partial degradation, workaround exists | On-call handles |
| SEV4 | Minor impact, no user-facing effect | Ticket, no page |
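The severity matrix above can be sketched as a small routing table, so paging rules live in code rather than tribal knowledge. All names here (`Severity`, `RESPONSE`, `route`, the notify targets) are hypothetical, not from any specific paging tool:

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Full outage or critical data loss
    SEV2 = 2  # Major feature degraded, large user impact
    SEV3 = 3  # Partial degradation, workaround exists
    SEV4 = 4  # Minor impact, no user-facing effect


# Hypothetical routing table mirroring the severity matrix above.
RESPONSE: dict[Severity, dict] = {
    Severity.SEV1: {"page": True, "notify": ["all-hands", "leadership"]},
    Severity.SEV2: {"page": True, "notify": ["on-call", "team-lead"]},
    Severity.SEV3: {"page": True, "notify": ["on-call"]},
    Severity.SEV4: {"page": False, "notify": ["ticket-queue"]},
}


def route(severity: Severity) -> dict:
    """Return who gets paged and notified for a given severity."""
    return RESPONSE[severity]
```

Keeping the table declarative makes it trivial to audit during a postmortem: the question "who should have been paged?" has exactly one answer.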
Roles
- Incident Commander (IC): coordinates the response, owns decisions and the communication cadence, unblocks responders
- Tech Lead: drives mitigation and the root-cause investigation
- Comms: executes the IC's updates to the status page and stakeholders
Postmortem template
## Postmortem: [Incident Title]
**Date:** | **Severity:** | **Duration:**
**Impact:** (users affected, revenue impact, SLO impact)
**Timeline:** (key events with timestamps)
**Root Cause:** (system cause, not human error)
**Contributing Factors:**
**Action Items:** (owner, due date, prevents recurrence)
Gotcha: "Human error" is never a root cause. It's a symptom of a system that made the error easy to make. Ask why the system allowed it.
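Postmortems are easy to leave half-finished once the incident is closed. A minimal sketch of a completeness check against the template above, assuming the hypothetical helper name `missing_sections` and that drafts use the template's bold `**Section:**` markers:

```python
# Required sections, taken from the postmortem template above.
REQUIRED_SECTIONS = [
    "Impact",
    "Timeline",
    "Root Cause",
    "Contributing Factors",
    "Action Items",
]


def missing_sections(postmortem_md: str) -> list[str]:
    """Return template sections absent from a postmortem draft."""
    return [s for s in REQUIRED_SECTIONS if f"**{s}:**" not in postmortem_md]
```

Wiring a check like this into CI (or a bot that nags the incident channel) turns "write the postmortem" from a norm into a gate.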