Incident Management — From Detection to Learning
A structured process to detect, respond to, and learn from production failures — fast mitigation first, root cause second.
When to use
- Any production failure affecting users
- Establishing on-call culture and escalation paths
Tradeoffs
- Over-process slows mitigation — stop the bleeding first, document later
- Under-process leads to chaos, no learning, and the same incident repeating
Severity levels
| Severity | Condition | Response |
|---|---|---|
| SEV1 | Full outage or critical data loss | All-hands, immediate escalation |
| SEV2 | Major feature degraded, large user impact | On-call + team lead |
| SEV3 | Partial degradation, workaround exists | On-call handles |
| SEV4 | Minor impact, no user-facing effect | Ticket, no page |
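The severity matrix above can be sketched as a small routing table, so paging rules live in code rather than tribal knowledge. All names here (`Severity`, `RESPONSE`, `route`, the notify targets) are hypothetical, not from any specific paging tool:

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Full outage or critical data loss
    SEV2 = 2  # Major feature degraded, large user impact
    SEV3 = 3  # Partial degradation, workaround exists
    SEV4 = 4  # Minor impact, no user-facing effect


# Hypothetical routing table mirroring the severity matrix above.
RESPONSE: dict[Severity, dict] = {
    Severity.SEV1: {"page": True, "notify": ["all-hands", "leadership"]},
    Severity.SEV2: {"page": True, "notify": ["on-call", "team-lead"]},
    Severity.SEV3: {"page": True, "notify": ["on-call"]},
    Severity.SEV4: {"page": False, "notify": ["ticket-queue"]},
}


def route(severity: Severity) -> dict:
    """Return who gets paged and notified for a given severity."""
    return RESPONSE[severity]
```

Keeping the table declarative makes it trivial to audit during a postmortem: the question "who should have been paged?" has exactly one answer.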
Roles
- Incident Commander (IC): coordinates the response, owns decisions and the communication cadence, unblocks responders
- Tech Lead: drives mitigation and the root-cause investigation
- Comms: executes the IC's updates to the status page and stakeholders
Postmortem template
## Postmortem: [Incident Title]
**Date:** | **Severity:** | **Duration:**
**Impact:** (users affected, revenue impact, SLO impact)
**Timeline:** (key events with timestamps)
**Root Cause:** (system cause, not human error)
**Contributing Factors:**
**Action Items:** (owner, due date, prevents recurrence)
Gotcha: "Human error" is never a root cause. It's a symptom of a system that made the error easy to make. Ask why the system allowed it.
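Postmortems are easy to leave half-finished once the incident is closed. A minimal sketch of a completeness check against the template above, assuming the hypothetical helper name `missing_sections` and that drafts use the template's bold `**Section:**` markers:

```python
# Required sections, taken from the postmortem template above.
REQUIRED_SECTIONS = [
    "Impact",
    "Timeline",
    "Root Cause",
    "Contributing Factors",
    "Action Items",
]


def missing_sections(postmortem_md: str) -> list[str]:
    """Return template sections absent from a postmortem draft."""
    return [s for s in REQUIRED_SECTIONS if f"**{s}:**" not in postmortem_md]
```

Wiring a check like this into CI (or a bot that nags the incident channel) turns "write the postmortem" from a norm into a gate.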