SLI / SLO / SLA / Error Budgets — Measuring Reliability

SLI = what you measure. SLO = your target. SLA = contractual promise. Error budget = how much you can fail before violating SLO.

When to use

Any production service requiring reliability guarantees
Making the velocity vs reliability tradeoff explicit and data-driven

Tradeoffs

Wrong SLIs measure activity, not user impact (e.g., CPU instead of request latency)
SLO too tight = no error budget for deploys; too loose = false confidence
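
To make the tight-versus-loose tradeoff concrete, the sketch below computes the allowed downtime for a few SLO targets. It assumes a 30-day window; `budget_minutes` is a hypothetical helper name, not part of any library.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumption: 30-day window (43,200 minutes)


def budget_minutes(slo_target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Allowed downtime in minutes: (1 - SLO) x window."""
    return (1 - slo_target) * window_minutes


# Each extra "nine" cuts the budget by a factor of ten.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {budget_minutes(target):.2f} minutes per 30 days")
# 99.00% -> 432.00 minutes per 30 days
# 99.90% -> 43.20 minutes per 30 days
# 99.99% -> 4.32 minutes per 30 days
```

At 99.99% a single slow rollback can consume the whole window's budget, which is what "no error budget for deploys" looks like in practice.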

Concept	Definition	Example
SLI	Measured ratio of good events / total events	successful requests / total requests
SLO	Internal target for SLI	99.9% of requests succeed within 200ms
SLA	External commitment with consequences	99.5% uptime or credit issued
Error budget	(1 − SLO) × time window	43.2 minutes downtime per 30-day window at 99.9% (about 43.8 for an average calendar month)
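
A minimal sketch of the SLI row, using the SLO example from the table (a request is "good" if it succeeds within 200 ms). The `Request` record shape and the sample data are assumptions for illustration; real systems would usually derive these counts from metrics or logs.

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 200  # from the SLO example: "succeed within 200ms"
SLO_TARGET = 0.999          # 99.9%


@dataclass
class Request:  # hypothetical record shape
    ok: bool
    latency_ms: float


def compute_sli(requests: list[Request]) -> float:
    """Good events / total events."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for r in requests if r.ok and r.latency_ms <= LATENCY_THRESHOLD_MS)
    return good / len(requests)


# Made-up sample: one slow success and one fast failure are both "bad" events.
sample = [Request(True, 120), Request(True, 250), Request(False, 90), Request(True, 80)]
sli = compute_sli(sample)
print(f"SLI = {sli:.3f}, meets SLO: {sli >= SLO_TARGET}")
```

Note that the slow-but-successful request counts against the SLI: the measurement follows user impact, not server-side success alone.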

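Go

package slo // package name is illustrative

const (
    sloTarget     = 0.999        // 99.9%
    windowMinutes = 30 * 24 * 60 // 30-day window
)

// BudgetStatus reports error-budget consumption in minutes.
type BudgetStatus struct {
    TotalMinutes     float64
    UsedMinutes      float64
    RemainingMinutes float64
    Exhausted        bool
}

// CalcBudget converts an error rate (fraction of the window spent failing,
// 0 to 1) into budget consumption for the window.
func CalcBudget(errorRate float64) BudgetStatus {
    total := (1 - sloTarget) * windowMinutes // minutes of allowed failure
    used := errorRate * windowMinutes        // minutes already spent
    remaining := total - used
    return BudgetStatus{
        TotalMinutes:     total,
        UsedMinutes:      used,
        RemainingMinutes: remaining,
        Exhausted:        remaining <= 0,
    }
}

Python

SLO_TARGET = 0.999  # 99.9%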
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window

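def calc_budget(error_rate: float) -> dict:
    """Convert an error rate (fraction of the window spent failing, 0 to 1)
    into error-budget consumption in minutes."""
    total = (1 - SLO_TARGET) * WINDOW_MINUTES  # minutes of allowed failure
    used = error_rate * WINDOW_MINUTES         # minutes already spent
    remaining = total - used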
    return {
        "total_minutes": round(total, 2),
        "used_minutes": round(used, 2),
        "remaining_minutes": round(remaining, 2),
        "exhausted": remaining <= 0,
    }
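
The functions above express the budget in minutes. Since the SLI is defined as good events / total events, the same budget can also be counted in requests. The sketch below is one way to do that; `request_budget` and its field names are hypothetical.

```python
def request_budget(total_requests: int, failed_requests: int,
                   slo_target: float = 0.999) -> dict:
    """Event-based error budget: allowed failures = (1 - SLO) x total requests."""
    allowed = (1 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": round(allowed, 2),
        "failed": failed_requests,
        "remaining_failures": round(remaining, 2),
        # Fraction of the budget already spent; None when there was no traffic.
        "consumed_fraction": round(failed_requests / allowed, 3) if allowed > 0 else None,
        "exhausted": remaining <= 0,
    }


print(request_budget(total_requests=2_000_000, failed_requests=500))
```

With 2,000,000 requests at 99.9%, the budget is 2,000 failed requests, so 500 failures consume a quarter of it. Counting events rather than minutes weights failures by traffic, so an outage at peak costs more budget than one at 3 a.m.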

Gotcha: If you never spend your error budget, you are probably over-investing in reliability at the cost of shipping velocity. Spend the budget on faster or riskier releases, or revisit the target.
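
One way to turn budget usage into a release decision is sketched below. The three labels and the 20% `underspend_threshold` are assumptions chosen for illustration, not an established standard; teams would pick their own cutoffs.

```python
def budget_policy(total_minutes: float, used_minutes: float,
                  underspend_threshold: float = 0.2) -> str:
    """Classify budget use at the end of a window.

    underspend_threshold is a hypothetical cutoff: consuming less than this
    fraction of the budget is treated as a sign of over-investing in reliability.
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    consumed = used_minutes / total_minutes
    if consumed >= 1:
        return "exhausted"   # pause risky releases, prioritize reliability work
    if consumed < underspend_threshold:
        return "underspent"  # room to ship faster or take more release risk
    return "healthy"         # keep shipping at the current pace


print(budget_policy(total_minutes=43.2, used_minutes=2.0))   # underspent
print(budget_policy(total_minutes=43.2, used_minutes=20.0))  # healthy
print(budget_policy(total_minutes=43.2, used_minutes=50.0))  # exhausted
```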