Incident Management — From Detection to Learning

A structured process to detect, respond to, and learn from production failures — fast mitigation first, root cause second.

When to use

  • Any production failure affecting users
  • Establishing on-call culture and escalation paths

Tradeoffs

  • Over-process slows mitigation — stop the bleeding first, document later
  • Under-process leads to chaos, no learning, and the same incident repeating

Severity levels

| Severity | Condition | Response |
| --- | --- | --- |
| SEV1 | Full outage or critical data loss | All-hands, immediate escalation |
| SEV2 | Major feature degraded, large user impact | On-call + team lead |
| SEV3 | Partial degradation, workaround exists | On-call handles |
| SEV4 | Minor impact, no user-facing effect | Ticket, no page |
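The severity-to-response mapping above can be encoded so paging decisions are consistent rather than ad hoc. A minimal sketch (the `Severity` enum, `RESPONSE` table, and `should_page` helper are illustrative names, not part of any real paging tool):

```python
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # Full outage or critical data loss
    SEV2 = 2  # Major feature degraded, large user impact
    SEV3 = 3  # Partial degradation, workaround exists
    SEV4 = 4  # Minor impact, no user-facing effect

# Who responds at each level, mirroring the table above.
RESPONSE = {
    Severity.SEV1: "All-hands, immediate escalation",
    Severity.SEV2: "On-call + team lead",
    Severity.SEV3: "On-call handles",
    Severity.SEV4: "Ticket, no page",
}

def should_page(sev: Severity) -> bool:
    """SEV4 gets a ticket only; everything above it pages a human."""
    return sev is not Severity.SEV4
```

Keeping the mapping in code (or alerting config) means the escalation path is reviewed like any other change, instead of living in responders' heads.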

Roles

  • Incident Commander (IC): coordinates response, owns communication, unblocks responders
  • Tech Lead: drives mitigation and root cause investigation
  • Comms: updates status page and stakeholders

Postmortem template

## Postmortem: [Incident Title]

**Date:** | **Severity:** | **Duration:**

**Impact:** (users affected, revenue impact, SLO impact)

**Timeline:** (key events with timestamps)

**Root Cause:** (system cause, not human error)

**Contributing Factors:**

**Action Items:** (owner, due date, prevents recurrence)
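To lower the friction of starting a postmortem, the template above can be stamped out automatically when an incident closes. A sketch, assuming a hypothetical `new_postmortem` helper (not a standard tool):

```python
from datetime import date

# The template mirrors the section headings above.
TEMPLATE = """\
## Postmortem: {title}

**Date:** {date} | **Severity:** {severity} | **Duration:** {duration}

**Impact:** (users affected, revenue impact, SLO impact)

**Timeline:** (key events with timestamps)

**Root Cause:** (system cause, not human error)

**Contributing Factors:**

**Action Items:** (owner, due date, prevents recurrence)
"""

def new_postmortem(title: str, severity: str, duration: str) -> str:
    """Return a pre-filled postmortem skeleton for responders to complete."""
    return TEMPLATE.format(
        title=title,
        date=date.today().isoformat(),
        severity=severity,
        duration=duration,
    )
```

Generating the skeleton at incident close means the Date, Severity, and Duration fields arrive pre-filled, and responders only write the parts that need human judgment.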

Gotcha: "Human error" is never a root cause. It's a symptom of a system that made the error easy to make. Ask why the system allowed it.