
Evaluating a System — The Four Golden Signals + Capacity

Assess system health by measuring the four golden signals (latency, traffic, errors, and saturation), then factor in capacity headroom and operational burden.

When to use

  • Quarterly architecture reviews
  • Pre-migration assessment
  • Incident retrospectives to baseline system state

Tradeoffs

  • Metrics without context (baseline, trends, seasonality) are misleading
  • Capacity planning requires understanding traffic growth curves, not just current state

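// Prometheus queries for the four golden signals.
// Metric and label names follow common exporter conventions; adjust them to your instrumentation.

// Latency: p99 request duration across all instances
// (the result is in seconds; multiply by 1000 before comparing against a millisecond threshold)
// histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

// Traffic: requests per second across all instances
// sum(rate(http_requests_total[5m]))

// Errors: fraction of requests that return 5xx
// (aggregate both sides; without sum() each 5xx series is divided by itself and the ratio is always 1)
// sum(rate(http_requests_total{status=~"5.."}[5m]))
// / sum(rate(http_requests_total[5m]))

// Saturation: CPU utilization as a fraction from 0 to 1
// 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

// Capacity headroom: fraction of memory still available
// node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes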

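// goldenSignalAlert reports "degraded" when any signal crosses its threshold:
// p99 latency above 1000 ms, error rate above 1%, or CPU utilization above 80%.
func goldenSignalAlert(p99Ms, errorRate, cpuUtil float64) string {
	if p99Ms > 1000 || errorRate > 0.01 || cpuUtil > 0.80 {
		return "degraded"
	}
	return "healthy"
}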

Gotcha: A 99.9% success rate sounds great until you realize that at 1M RPM it means 1,000 failed requests every minute, or 60,000 per hour. Always contextualize percentages with absolute numbers.
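
The conversion from a percentage to absolute numbers is a one-line calculation. This is a minimal sketch, assuming RPM means requests per minute; the helper name failedPerMinute is illustrative.

```go
package main

import "fmt"

// failedPerMinute converts a success ratio (0 to 1) and a request volume
// in requests per minute into the number of failed requests per minute.
func failedPerMinute(successRatio, requestsPerMinute float64) float64 {
	return (1 - successRatio) * requestsPerMinute
}

func main() {
	perMin := failedPerMinute(0.999, 1_000_000)
	// Scale to an hour to see what the percentage means in practice.
	fmt.Printf("%.0f failed requests/minute, %.0f/hour\n", perMin, perMin*60)
}
```

Reporting both figures side by side keeps a success percentage that looks healthy from masking a large number of affected requests.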