# Alerting & On-Call — Signal vs Noise
Alert on symptoms that affect users, not on causes. Every page must have an owner and a runbook.
## When to use
- Defining what wakes on-call engineers
- Reviewing and pruning alert rules quarterly
## Tradeoffs
- Too few alerts = blind spots; too many = alert fatigue, and pages get ignored
- SLO-based alerting requires SLOs to be defined first (see the recording-rule sketch after this list)
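Before a burn-rate alert makes sense, the SLI itself has to exist. A minimal sketch of defining it as a Prometheus recording rule, assuming the same `http_requests_total` metric used by the alert below; the rule name `service:http_error_ratio:rate1h` is illustrative:

```yaml
groups:
- name: slo_recordings
  rules:
  # Fraction of requests that failed over the last hour, per service.
  # This ratio is the SLI; the error budget for a 99.9% SLO is 1 - 0.999 = 0.001.
  - record: service:http_error_ratio:rate1h
    expr: |
      sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum by (service) (rate(http_requests_total[1h]))
```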
| Anti-pattern | Better alternative |
|---|---|
| Alert on CPU > 80% | Alert on latency p99 > SLO threshold |
| Alert on every error | Alert when the error-budget burn rate exceeds a threshold |
| No runbook | Runbook URL in every alert annotation |
The burn-rate alternative from the table, as a Prometheus alerting rule:

```yaml
# Prometheus alert: SLO burn rate. Fires when the error budget of a 99.9%
# SLO is being spent 5x faster than normal (error ratio > 0.5%).
groups:
- name: slo_alerts
  rules:
  - alert: HighErrorBurnRate
    expr: |
      # Aggregate by service so {{ $labels.service }} resolves in the summary.
      sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum by (service) (rate(http_requests_total[1h])) > 5 * (1 - 0.999)
    for: 5m
    labels:
      severity: page
      team: backend
    annotations:
      summary: "High SLO error burn rate on {{ $labels.service }}"
      runbook: "https://wiki.example.com/runbooks/high-error-burn-rate"
      description: "Error ratio is {{ $value | humanizePercentage }}; the 99.9% SLO allows 0.1%."
```
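One refinement worth knowing: a single 1h window is slow to catch fast burns and slow to resolve once the burn stops. The multiwindow, multi-burn-rate pattern from the Google SRE Workbook pairs a long and a short window. A sketch against the same metric and SLO; the 14.4 factor is the Workbook's threshold for spending 2% of a 30-day budget in one hour:

```yaml
  # Sketch: fire only while both the 1h and 5m windows exceed 14.4x burn,
  # so the page arrives quickly and clears as soon as the burn stops.
  - alert: HighErrorBurnRateFast
    expr: |
      (
        sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
          / sum by (service) (rate(http_requests_total[1h]))
        > 14.4 * (1 - 0.999)
      )
      and
      (
        sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
        > 14.4 * (1 - 0.999)
      )
    labels:
      severity: page
```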
Gotcha: alerts that fire and get acknowledged without action are noise. Track the alert-to-action rate: if an alert has fired 10 times without producing a fix, remove it or fix the root cause. One way to find repeat offenders is sketched below.
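Prometheus writes the internal `ALERTS_FOR_STATE` series (the start time of each active alert episode), so `changes()` over it approximates distinct firings. A query sketch, assuming that series is retained for the window; restarts can skew the count:

```promql
# Roughly how many separate times each alert fired in the last 30 days.
# Each new activation records a new start timestamp, so changes() counts episodes.
sort_desc(sum by (alertname) (changes(ALERTS_FOR_STATE[30d])))
```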