Best Practice: Defining and Measuring SLOs & SLAs
Context
Service Level Objectives (SLOs) are the measurable foundation of any reliability strategy. Without clear SLOs, a team cannot tell whether it is investing too much or too little in reliability; without error budgets, it lacks an operational framework for trading release velocity against stability.
Typical problems without a structured SLO practice:

- Incidents are dismissed as "normal" because there is no shared understanding of what counts as unacceptable degradation
- Reliability investments are prioritized by political pressure, not data
- Customer-facing SLAs rest on estimates rather than measurements
- On-call teams do not know the severity thresholds for escalation
Related Controls

- WAF-REL-010 – SLA & SLO Definition Documented
- WAF-REL-100 – Reliability Debt Register & Quarterly Review
Target State

A mature SLO program is:

- Service-specific: every production service has an SLO document
- Measurable: SLIs are calculated continuously from real request data
- Operationally anchored: error budget burn rate alerts drive release decisions
- Stakeholder-communicated: SLOs and error budget status are visible to stakeholders on a regular cadence
Technical Implementation
Step 1: Structure the SLO Document
```yaml
# docs/slos/payment-api-slo.yml
service: "payment-api"
version: "1.2"
effective_date: "2026-01-01"
owner: "payments-team"
reviewed_by: "architecture-board"

slos:
  availability:
    description: "Percentage of requests returning 2xx or 3xx responses"
    sli: "sum(rate(http_requests_total{status=~'2..|3..'}[5m])) / sum(rate(http_requests_total[5m]))"
    target: 0.999            # 99.9%
    window: "30d"
    error_budget_minutes: 43.2
  latency_p99:
    description: "99th percentile request latency"
    sli: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
    target_seconds: 0.5      # < 500 ms
    coverage: 0.99           # 99% of requests must meet this
  error_rate:
    description: "Fraction of requests resulting in 5xx errors"
    sli: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
    target: 0.001            # < 0.1% error rate
    window: "30d"

alert_policy:
  burn_rate_fast:
    window: "1h"
    burn_rate: 14.4          # exhausts the monthly budget in ~2 days
    severity: "SEV2"
  burn_rate_slow:
    window: "6h"
    burn_rate: 6             # exhausts the monthly budget in ~5 days
    severity: "SEV3"
```
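The budget and burn-rate numbers in the document follow mechanically from the target. A minimal Python sketch of the arithmetic (function names are illustrative, not part of any tooling):

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    return (1 - target) * window_days * 24 * 60

def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, days until the window's budget is fully consumed."""
    return window_days / burn_rate

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes for 99.9% over 30 days
print(round(days_to_exhaustion(14.4), 2))      # 2.08 days (fast-burn threshold)
print(days_to_exhaustion(6))                   # 5.0 days (slow-burn threshold)
```

This is why a 14.4x burn rate is the conventional fast-burn threshold: sustained, it empties a 30-day budget in roughly two days.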
Step 2: SLO Monitoring with Prometheus
```yaml
# prometheus/rules/slo-payment-api.yml
groups:
  - name: slo.payment-api
    interval: 30s
    rules:
      # SLI: availability ratio
      - record: job:slo_availability:ratio_rate5m
        expr: >
          sum(rate(http_requests_total{job="payment-api",status=~"2..|3.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))

      # Error budget burn rate (fast window, 1h per the alert policy)
      - alert: SloBurnRateFast
        expr: >
          (
            avg_over_time((1 - job:slo_availability:ratio_rate5m)[1h:5m])
            / (1 - 0.999)
          ) > 14.4
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API: fast error budget burn rate (>14.4x)"
          description: "Current burn rate is {{ $value | humanize }}x; sustained, this exhausts the monthly error budget in about two days."
          runbook: "https://wiki.example.com/runbooks/payment-api-slo-burn"

      # Error budget burn rate (slow window, 6h)
      - alert: SloBurnRateSlow
        expr: >
          (
            avg_over_time((1 - job:slo_availability:ratio_rate5m)[6h:5m])
            / (1 - 0.999)
          ) > 6
        for: 15m
        labels:
          severity: warning
          service: payment-api
        annotations:
          summary: "Payment API: slow error budget burn rate (>6x)"
          runbook: "https://wiki.example.com/runbooks/payment-api-slo-burn"
```

Note that annotation templates cannot do arithmetic such as `30 / $value`; keep computed figures in the expression or state them in prose, as above.
Step 3: CloudWatch SLO Monitoring (AWS)
```hcl
# terraform/monitoring/slo-alarms.tf
# Note: alarming on an absolute 5XX count per minute is an approximation of the
# burn-rate alert; a true error ratio would need a second metric_query on total
# request count plus a metric-math expression.
resource "aws_cloudwatch_metric_alarm" "slo_error_rate_fast_burn" {
  alarm_name          = "slo-payment-api-fast-burn"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  datapoints_to_alarm = 4

  metric_query {
    id          = "error_rate"
    return_data = true

    metric {
      metric_name = "5XXError"
      namespace   = "AWS/ApiGateway"
      period      = 60
      stat        = "Sum"
      dimensions = {
        ApiName = "payment-api"
        Stage   = "production"
      }
    }
  }

  threshold          = 10
  alarm_description  = "SLO: Fast burn rate – Payment API 5XX errors. Runbook: https://wiki/runbooks/payment-api-slo"
  alarm_actions      = [aws_sns_topic.oncall_critical.arn]
  ok_actions         = [aws_sns_topic.oncall_critical.arn]
  treat_missing_data = "notBreaching"

  tags = var.mandatory_tags
}
```
Step 4: Grafana SLO Dashboard
```json
{
  "dashboard": {
    "title": "Payment API – SLO Dashboard",
    "panels": [
      {
        "title": "Availability (30d Rolling)",
        "type": "stat",
        "targets": [{
          "expr": "avg_over_time(job:slo_availability:ratio_rate5m[30d]) * 100",
          "legendFormat": "Availability %"
        }],
        "thresholds": {"mode": "absolute", "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 99.5},
          {"color": "green", "value": 99.9}
        ]}
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "targets": [{
          "expr": "(1 - (1 - avg_over_time(job:slo_availability:ratio_rate5m[30d])) / (1 - 0.999)) * 100"
        }]
      }
    ]
  }
}
```
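The "Error Budget Remaining" gauge expression is simply one minus the ratio of consumed to allotted error budget. A small Python equivalent of the same formula (the function name is illustrative):

```python
def budget_remaining_pct(avg_availability_30d: float, target: float = 0.999) -> float:
    """Same formula as the Grafana gauge: share of the 30d error budget still left."""
    consumed = (1 - avg_availability_30d) / (1 - target)  # fraction of budget used
    return (1 - consumed) * 100

# 99.95% measured against a 99.9% target: exactly half the budget consumed
print(round(budget_remaining_pct(0.9995), 1))  # 50.0
```

A negative result means the SLO was violated for the window: more budget was consumed than the target allots.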
Typical Anti-Patterns

- SLO set too ambitiously: a 99.99% target without the infrastructure to back it leads to a permanently exhausted error budget
- SLI measures the wrong thing: a load balancer health check as the SLI instead of the real user journey
- No error budget policy: the SLO is defined, but exhausting the budget has no operational consequences
- SLO never reviewed: SLOs that remain unchanged after a year usually no longer reflect reality
Metrics

- Error Budget Remaining: % of the monthly budget still available (target: > 50% at mid-month)
- Burn Rate: current error budget consumption relative to the sustainable rate of 1x
- MTTD: time from first error to the alert firing (target: < 5 minutes)
- SLO Compliance Rate: % of 30-day windows in which the SLO was met
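The compliance-rate metric can be computed directly from historical window data. A minimal sketch, with illustrative names and made-up sample data:

```python
def slo_compliance_rate(window_availabilities: list[float], target: float = 0.999) -> float:
    """% of 30-day windows in which measured availability met the SLO target."""
    met = sum(1 for a in window_availabilities if a >= target)
    return 100 * met / len(window_availabilities)

# Twelve monthly windows, two of which missed the 99.9% target
history = [0.9995, 0.9991, 0.9988, 0.9993, 0.9997, 0.9990,
           0.9999, 0.9985, 0.9994, 0.9992, 0.9996, 0.9991]
print(round(slo_compliance_rate(history), 1))  # 83.3 -> 10 of 12 windows compliant
```

Tracking this per service over time shows whether a target is chronically missed (set too ambitiously) or trivially met (room to raise it or spend budget on velocity).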