# Alerting & On-Call — Signal vs Noise
Alert on symptoms that affect users, not on causes. Every page must have an owner and a runbook.
## When to use
- Defining what wakes on-call engineers
- Reviewing and pruning alert rules quarterly
## Tradeoffs
- Too few alerts = blind spots; too many = alert fatigue, and pages get ignored
- SLO-based alerting requires SLOs to be defined first (see the recording-rule sketch after this list)
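Before a burn-rate alert makes sense, the SLI itself has to exist. A minimal sketch of defining it as a Prometheus recording rule, assuming the same `http_requests_total` metric used by the alert below; the rule name `service:http_error_ratio:rate1h` is illustrative:

```yaml
groups:
- name: slo_recordings
  rules:
  # Fraction of requests that failed over the last hour, per service.
  # This ratio is the SLI; the error budget for a 99.9% SLO is 1 - 0.999 = 0.001.
  - record: service:http_error_ratio:rate1h
    expr: |
      sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum by (service) (rate(http_requests_total[1h]))
```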
| Anti-pattern | Better alternative |
|---|---|
| Alert on CPU > 80% | Alert on latency p99 > SLO threshold |
| Alert on every error | Alert when the error-budget burn rate exceeds a threshold |
| No runbook | Runbook URL in every alert annotation |
The burn-rate alternative from the table, as a Prometheus alerting rule:

```yaml
# Prometheus alert: SLO burn rate. Fires when the error budget of a 99.9%
# SLO is being spent 5x faster than normal (error ratio > 0.5%).
groups:
- name: slo_alerts
  rules:
  - alert: HighErrorBurnRate
    expr: |
      # Aggregate by service so {{ $labels.service }} resolves in the summary.
      sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum by (service) (rate(http_requests_total[1h])) > 5 * (1 - 0.999)
    for: 5m
    labels:
      severity: page
      team: backend
    annotations:
      summary: "High SLO error burn rate on {{ $labels.service }}"
      runbook: "https://wiki.example.com/runbooks/high-error-burn-rate"
      description: "Error ratio is {{ $value | humanizePercentage }}; the 99.9% SLO allows 0.1%."
```
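One refinement worth knowing: a single 1h window is slow to catch fast burns and slow to resolve once the burn stops. The multiwindow, multi-burn-rate pattern from the Google SRE Workbook pairs a long and a short window. A sketch against the same metric and SLO; the 14.4 factor is the Workbook's threshold for spending 2% of a 30-day budget in one hour:

```yaml
  # Sketch: fire only while both the 1h and 5m windows exceed 14.4x burn,
  # so the page arrives quickly and clears as soon as the burn stops.
  - alert: HighErrorBurnRateFast
    expr: |
      (
        sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
          / sum by (service) (rate(http_requests_total[1h]))
        > 14.4 * (1 - 0.999)
      )
      and
      (
        sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
        > 14.4 * (1 - 0.999)
      )
    labels:
      severity: page
```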
Gotcha: alerts that fire and get acknowledged without action are noise. Track the alert-to-action rate: if an alert has fired 10 times without producing a fix, remove it or fix the root cause. One way to find repeat offenders is sketched below.
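Prometheus writes the internal `ALERTS_FOR_STATE` series (the start time of each active alert episode), so `changes()` over it approximates distinct firings. A query sketch, assuming that series is retained for the window; restarts can skew the count:

```promql
# Roughly how many separate times each alert fired in the last 30 days.
# Each new activation records a new start timestamp, so changes() counts episodes.
sort_desc(sum by (alertname) (changes(ALERTS_FOR_STATE[30d])))
```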