SLI / SLO / SLA / Error Budgets — Measuring Reliability
SLI = what you measure. SLO = your target for it. SLA = the contractual promise. Error budget = how much you can fail before violating the SLO.
When to use
- Any production service requiring reliability guarantees
- Making the velocity vs reliability tradeoff explicit and data-driven
Tradeoffs
- Wrong SLIs measure activity, not user impact (e.g., CPU instead of request latency)
- SLO too tight = no error budget for deploys; too loose = false confidence
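To make the "too tight vs too loose" tradeoff concrete, here is a minimal sketch that prints the downtime budget for a few candidate targets. It assumes a 30-day window, and the name `budget_minutes` is illustrative rather than part of any library.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day window, in minutes


def budget_minutes(slo_target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Return the minutes of full downtime allowed by an SLO target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1 - slo_target) * window_minutes


# Each extra "nine" cuts the budget by a factor of ten.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {budget_minutes(target):.1f} min per 30 days")
# 99.00% -> 432.0 min per 30 days
# 99.90% -> 43.2 min per 30 days
# 99.99% -> 4.3 min per 30 days
```

At 99.99% a single slow rollback can use the whole window's budget, which is what "no error budget for deploys" looks like in practice.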
| Concept | Definition | Example |
|---|---|---|
| SLI | Measured ratio of good events / total events | successful requests / total requests |
| SLO | Internal target for SLI | 99.9% of requests succeed within 200ms |
| SLA | External commitment with consequences | 99.5% uptime or credit issued |
| Error budget | (1 − SLO) × time window | 43.2 minutes of downtime per 30 days at 99.9% (about 43.8 for an average calendar month) |
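The SLI and SLO rows in the table can be sketched as a good-events ratio checked against a target. This is a minimal illustration: the `Request` type, the sample data, and the choice to treat an empty window as meeting the objective are all assumptions. The 200 ms threshold and the 99.9% target come from the table's example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the request returned a non-error response
    latency_ms: float  # observed latency in milliseconds


def compute_sli(requests: list[Request], latency_threshold_ms: float = 200.0) -> float:
    """Fraction of requests that succeeded within the latency threshold."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for r in requests if r.ok and r.latency_ms <= latency_threshold_ms)
    return good / len(requests)


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """Compare the measured SLI against the internal target."""
    return sli >= slo_target


# Hypothetical sample: 997 fast successes, 2 slow successes, 1 failure.
sample = [Request(True, 120.0)] * 997 + [Request(True, 350.0)] * 2 + [Request(False, 90.0)]
sli = compute_sli(sample)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli)}")
# SLI = 0.9970, meets 99.9% SLO: False
```

Slow successes count as bad events here because the example SLO is stated in terms of latency, not just status codes.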
Example: error budget calculation, first in Go, then in Python.
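package slo

const (
	sloTarget     = 0.999        // 99.9%
	windowMinutes = 30 * 24 * 60 // 30-day window
)

// BudgetStatus reports error budget usage in minutes over the window.
type BudgetStatus struct {
	TotalMinutes     float64
	UsedMinutes      float64
	RemainingMinutes float64
	Exhausted        bool
}

// CalcBudget converts an observed error rate (0.0 to 1.0) over the window
// into budget minutes used and remaining.
func CalcBudget(errorRate float64) BudgetStatus {
	total := (1 - sloTarget) * windowMinutes
	used := errorRate * windowMinutes
	remaining := total - used
	return BudgetStatus{
		TotalMinutes:     total,
		UsedMinutes:      used,
		RemainingMinutes: remaining,
		Exhausted:        remaining <= 0,
	}
}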
# Python: same calculation as the Go version
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window


def calc_budget(error_rate: float) -> dict:
    """Convert an observed error rate (0.0 to 1.0) into budget minutes."""
    total = (1 - SLO_TARGET) * WINDOW_MINUTES
    used = error_rate * WINDOW_MINUTES
    remaining = total - used
    return {
        "total_minutes": round(total, 2),
        "used_minutes": round(used, 2),
        "remaining_minutes": round(remaining, 2),
        "exhausted": remaining <= 0,
    }
Gotcha: If you're never spending your error budget, your SLO is too conservative — you're over-investing in reliability at the cost of shipping velocity.
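One way to spot the situation in this gotcha is to track what fraction of the budget has actually been consumed. The sketch below is illustrative only: `budget_consumed`, `review_hint`, and the 10% `low_water` threshold are assumptions, not standard values.

```python
def budget_consumed(error_rate: float, slo_target: float = 0.999) -> float:
    """Fraction of the error budget used, given the error rate over the window."""
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return error_rate / allowed


def review_hint(consumed: float, low_water: float = 0.1) -> str:
    """Turn a consumed fraction into a rough planning hint."""
    # low_water is an illustrative threshold, not a standard value.
    if consumed >= 1.0:
        return "budget exhausted: prioritize reliability work"
    if consumed < low_water:
        return "budget barely touched: SLO may be too conservative"
    return "budget in use: tradeoff looks balanced"


print(review_hint(budget_consumed(0.00005)))  # about 5% of the budget used
print(review_hint(budget_consumed(0.0005)))   # about 50% of the budget used
print(review_hint(budget_consumed(0.002)))    # about 200% of the budget used
```

A hint like this is most useful when reviewed over several consecutive windows, since one quiet month does not by itself show that the target is too conservative.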