Alerting & On-Call — Signal vs Noise

Alert on symptoms that affect users, not on causes. Every page must have an owner and a runbook.

When to use

  • Defining what wakes on-call engineers
  • Reviewing and pruning alert rules quarterly

Tradeoffs

  • Too few alerts = blind spots; too many = alert fatigue, pages get ignored
  • SLO-based alerting requires SLOs to be defined first
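To make the burn-rate arithmetic concrete, here is a minimal sketch assuming a 99.9% availability SLO over a 30-day window (both values are illustrative, matching the rule below):

```python
# Burn-rate arithmetic for a 99.9% availability SLO (illustrative values).
slo = 0.999
error_budget = 1 - slo                      # fraction of requests allowed to fail (~0.001)

burn_rate = 5                               # page when budget burns 5x faster than sustainable
alert_threshold = burn_rate * error_budget  # error ratio that triggers the page (~0.5%)

window_days = 30                            # SLO evaluation window
days_to_exhaustion = window_days / burn_rate  # at 5x, the whole budget is gone in 6 days

print(f"page when error ratio exceeds {alert_threshold:.2%}")
print(f"at that rate the budget lasts {days_to_exhaustion:.0f} days")
```

This is why a 5x burn rate is page-worthy: left alone, it exhausts a month of error budget in under a week.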
Anti-pattern              Better alternative
Alert on CPU > 80%        Alert on latency p99 > SLO threshold
Alert on every error      Alert on error rate exceeding burn rate
No runbook                Runbook URL in every alert annotation
# Prometheus alert: SLO burn rate — fires when error budget burns 5x faster than normal
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorBurnRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
          sum(rate(http_requests_total[1h])) > 5 * (1 - 0.999)
        for: 5m
        labels:
          severity: page
          team: backend
        annotations:
          summary: "High SLO error burn rate on {{ $labels.service }}"
          runbook: "https://wiki.example.com/runbooks/high-error-burn-rate"
          description: "Error budget burning at {{ $value | humanizePercentage }} of allowed rate."

Gotcha: Alerts that fire and get acknowledged without action are noise. Track the alert-to-action rate: if an alert fires 10 times without producing a fix, either remove the alert or fix the root cause.
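One way to compute an alert-to-action rate is to join paging history with the action taken on each page. A hypothetical sketch (the alert names and history format are invented for illustration; real data would come from your paging tool's API):

```python
from collections import Counter

# Hypothetical paging history: (alert_name, action_taken) per fired page.
# action_taken is None when the page was acked but nothing was done.
history = [
    ("HighErrorBurnRate", "rolled back deploy"),
    ("HighErrorBurnRate", None),
    ("DiskAlmostFull", None),
    ("DiskAlmostFull", None),
    ("DiskAlmostFull", None),
]

fired = Counter(alert for alert, _ in history)
actioned = Counter(alert for alert, action in history if action)

for alert, count in fired.items():
    rate = actioned[alert] / count  # alert-to-action rate
    if rate == 0:
        print(f"{alert}: fired {count}x, never actionable -> remove or fix root cause")
    else:
        print(f"{alert}: alert-to-action rate {rate:.0%}")
```

A zero-action alert that keeps firing is the clearest pruning candidate a quarterly review can surface.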