
SLI / SLO / SLA / Error Budgets — Measuring Reliability

SLI = what you measure. SLO = your target for that measurement. SLA = the contractual promise to customers. Error budget = how much you can fail before violating the SLO.

When to use

  • Any production service requiring reliability guarantees
  • Making the velocity vs reliability tradeoff explicit and data-driven

Tradeoffs

  • Poorly chosen SLIs measure system activity, not user impact (e.g., CPU utilization instead of request latency)
  • SLO too tight = no error budget left for deploys; too loose = false confidence
Concept      | Definition                                   | Example
SLI          | Measured ratio of good events / total events | successful requests / total requests
SLO          | Internal target for the SLI                  | 99.9% of requests succeed within 200 ms
SLA          | External commitment with consequences        | 99.5% uptime or credit issued
Error budget | (1 − SLO) × time window                      | 43.8 minutes downtime per average month at 99.9% (43.2 over a 30-day window)
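The error budget row is plain arithmetic, and the result depends on which window you pick. The sketch below works it through for a 30-day window and for an average calendar month (365.25 days / 12). `BudgetFor`, `ThirtyDays`, and `AverageMonth` are hypothetical names introduced for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

const (
	day = 24 * time.Hour

	// ThirtyDays is a fixed 30-day window.
	ThirtyDays = 30 * day

	// AverageMonth is 365.25 days divided by 12 (about 30.44 days).
	AverageMonth = (365*day + 6*time.Hour) / 12
)

// BudgetFor returns the allowed downtime for an SLO target over a window,
// computed as (1 - SLO) × window. Assumes 0 <= slo <= 1.
func BudgetFor(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

func main() {
	for _, slo := range []float64{0.99, 0.999, 0.9999} {
		fmt.Printf("%.2f%%: %.1f min per 30 days, %.1f min per average month\n",
			slo*100,
			BudgetFor(slo, ThirtyDays).Minutes(),
			BudgetFor(slo, AverageMonth).Minutes())
	}
}
```

At 99.9% this gives roughly 43.2 minutes for the 30-day window and 43.8 minutes for the average month, which is why the two figures both appear in discussions of "three nines".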
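const (
	sloTarget     = 0.999        // SLO target: 99.9%
	windowMinutes = 30 * 24 * 60 // 30-day window in minutes (43,200)
)

// BudgetStatus summarizes error budget consumption over the window.
type BudgetStatus struct {
	TotalMinutes     float64 // total budget: (1 - SLO) × window
	UsedMinutes      float64 // budget consumed so far
	RemainingMinutes float64 // goes negative once the budget is overspent
	Exhausted        bool    // true when nothing remains
}

// CalcBudget converts an observed error rate (the fraction of the window
// that was bad, between 0 and 1) into error budget consumption.
func CalcBudget(errorRate float64) BudgetStatus {
	total := (1 - sloTarget) * windowMinutes // 43.2 minutes at 99.9%
	used := errorRate * windowMinutes
	remaining := total - used
	return BudgetStatus{
		TotalMinutes:     total,
		UsedMinutes:      used,
		RemainingMinutes: remaining,
		Exhausted:        remaining <= 0,
	}
}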
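
CalcBudget takes an error rate, but what a service usually records is event counts. One way to bridge the two, following the table's definition of the SLI as good events / total events, is sketched below. `SLI` and `ErrorRate` are hypothetical helpers and the request counts are made up for the example.

```go
package main

import "fmt"

// SLI returns good/total as a ratio between 0 and 1.
// ok is false when there is no traffic or the counts are inconsistent,
// because no meaningful ratio exists in either case.
func SLI(good, total uint64) (ratio float64, ok bool) {
	if total == 0 || good > total {
		return 0, false
	}
	return float64(good) / float64(total), true
}

// ErrorRate is the complement of the SLI: the share of bad events.
func ErrorRate(sli float64) float64 {
	return 1 - sli
}

func main() {
	// Hypothetical counts: 1,000,000 requests, 999,200 of them good.
	sli, ok := SLI(999200, 1000000)
	if !ok {
		fmt.Println("no usable traffic in this window")
		return
	}
	fmt.Printf("SLI=%.4f errorRate=%.4f meetsSLO=%t\n",
		sli, ErrorRate(sli), sli >= 0.999)
}
```

The resulting error rate (0.0008 for these counts) is the kind of value that could be passed to CalcBudget. Returning `ok` rather than a ratio of 0 or 1 for an empty window keeps "no traffic" from being mistaken for either an outage or a perfect score.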

Gotcha: If you're never spending your error budget, your SLO is too conservative — you're over-investing in reliability at the cost of shipping velocity.
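
A rough way to spot this pattern is to look at how much of the budget each past window actually consumed. The sketch below assumes you keep a history of used minutes per window; `ConsumedFraction`, `ChronicallyUnderspent`, and the 10% cutoff are hypothetical and chosen only for illustration, not a recommended threshold.

```go
package main

import "fmt"

// underspendThreshold is an illustrative cutoff: a window that consumes
// less than 10% of its budget counts as "unspent".
const underspendThreshold = 0.10

// ConsumedFraction returns the share of the error budget used in one window.
// A non-positive budget is treated as fully consumed.
func ConsumedFraction(usedMinutes, totalMinutes float64) float64 {
	if totalMinutes <= 0 {
		return 1
	}
	return usedMinutes / totalMinutes
}

// ChronicallyUnderspent reports whether every window in the history
// used less than underspendThreshold of its budget.
// An empty history gives no evidence, so it returns false.
func ChronicallyUnderspent(usedPerWindow []float64, totalMinutes float64) bool {
	if len(usedPerWindow) == 0 {
		return false
	}
	for _, used := range usedPerWindow {
		if ConsumedFraction(used, totalMinutes) >= underspendThreshold {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical history: budget minutes used in the last four 30-day windows,
	// against a 43.2-minute budget (99.9% SLO).
	history := []float64{1.5, 0.0, 2.1, 0.8}
	fmt.Println(ChronicallyUnderspent(history, 43.2))
}
```

A `true` result here is a prompt to revisit the SLO or the release pace, not an automatic verdict; a quiet few windows can also reflect low traffic or luck.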