Evaluating a System — The Four Golden Signals + Capacity
Assess system health by measuring the four golden signals: latency, traffic, errors, saturation — then add capacity headroom and operational burden.
When to use
- Quarterly architecture reviews
- Pre-migration assessment
- Incident retrospectives to baseline system state
Tradeoffs
- Metrics without context (baseline, trends, seasonality) are misleading
- Capacity planning requires understanding traffic growth curves, not just current state
- Go
- Python
// Prometheus queries for the four golden signals
// Latency — p99 request duration
// histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
// Traffic — requests per second
// rate(http_requests_total[5m])
// Errors — error rate
// rate(http_requests_total{status=~"5.."}[5m])
// / rate(http_requests_total[5m])
// Saturation — CPU utilization
// 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
// Capacity headroom — memory available vs total
// node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
func goldenSignalAlert(p99Ms, errorRate, cpuUtil float64) string {
if p99Ms > 1000 || errorRate > 0.01 || cpuUtil > 0.80 {
return "degraded"
}
return "healthy"
}
# Prometheus queries for the four golden signals
# Latency — p99 request duration
# histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Traffic — requests per second
# rate(http_requests_total[5m])
# Errors — error rate
# rate(http_requests_total{status=~"5.."}[5m])
# / rate(http_requests_total[5m])
# Saturation — CPU utilization
# 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Capacity headroom
# node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
def golden_signal_status(p99_ms: float, error_rate: float, cpu_util: float) -> str:
if p99_ms > 1000 or error_rate > 0.01 or cpu_util > 0.80:
return "degraded"
return "healthy"
Gotcha: A system with 99.9% success rate sounds great — until you realize that's 1,000 errors/hour at 1M RPM. Always contextualize percentages with absolute numbers.