WAF++

Best Practice: Defining and Measuring SLOs & SLAs

Context

Service Level Objectives (SLOs) are the measurable foundation of every reliability strategy. Without clear SLOs, a team cannot judge whether too much or too little is being invested in reliability. Without error budgets, there is no operational framework for trading release velocity against stability.
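The size of an error budget follows directly from the SLO target and the measurement window; a quick sketch of the arithmetic (plain Python, no external dependencies):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) implied by an availability SLO over the window."""
    return window_days * 24 * 60 * (1 - slo_target)

print(error_budget_minutes(0.999))   # 99.9% over 30d -> 43.2 minutes
print(error_budget_minutes(0.9999))  # 99.99% over 30d -> ~4.3 minutes
```

This is why an ambitious target is expensive: one extra nine shrinks the monthly budget by a factor of ten.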

Typical problems without a structured SLO practice:

  • Incidents are dismissed as "normal" because no shared understanding of "unacceptable degradation" exists

  • Reliability investments are prioritized by political pressure, not data

  • Customer-facing SLAs rest on estimates rather than measurements

  • On-call teams do not know severity thresholds for escalation

Related WAF++ controls:

  • WAF-REL-010 – SLA & SLO Definition Documented

  • WAF-REL-100 – Reliability Debt Register & Quarterly Review

Target State

A mature SLO program is:

  • Service-specific: Every production service has an SLO document

  • Measurable: SLIs are continuously calculated from real request data

  • Operationally anchored: Error budget burn rate alerts drive release decisions

  • Communicated: SLOs and current error budget status are regularly visible to stakeholders

Technical Implementation

Step 1: Structure the SLO Document

# docs/slos/payment-api-slo.yml
service: "payment-api"
version: "1.2"
effective_date: "2026-01-01"
owner: "payments-team"
reviewed_by: "architecture-board"

slos:
  availability:
    description: "Percentage of requests returning 2xx or 3xx responses"
    sli: "sum(rate(http_requests_total{status=~'2..|3..'}[5m])) / sum(rate(http_requests_total[5m]))"
    target: 0.999       # 99.9%
    window: "30d"
    error_budget_minutes: 43.2

  latency_p99:
    description: "99th percentile request latency"
    sli: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
    target_seconds: 0.5  # < 500ms
    coverage: 0.99       # 99% of requests must meet this

  error_rate:
    description: "Fraction of requests resulting in 5xx errors"
    sli: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
    target: 0.001        # < 0.1% error rate
    window: "30d"

alert_policy:
  burn_rate_fast:
    window: "1h"
    burn_rate: 14.4    # Will exhaust monthly budget in 2 days
    severity: "SEV2"
  burn_rate_slow:
    window: "6h"
    burn_rate: 6       # Will exhaust monthly budget in 5 days
    severity: "SEV3"
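The burn-rate figures in the alert_policy above can be derived rather than memorized: a burn rate of 1x means the budget lasts exactly the full window, so window length divided by burn rate gives the time to exhaustion, and the SLO's error target multiplied by the burn rate gives the observed error rate at which the alert fires. An illustrative sketch:

```python
def days_to_exhaust(window_days: float, burn_rate: float) -> float:
    """Days until the full error budget is gone at a constant burn rate."""
    return window_days / burn_rate

def error_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Observed error rate that corresponds to the given burn rate."""
    return (1 - slo_target) * burn_rate

print(days_to_exhaust(30, 14.4))           # ~2.08 days (the "fast" SEV2 alert)
print(days_to_exhaust(30, 6))              # 5.0 days (the "slow" SEV3 alert)
print(error_rate_threshold(0.999, 14.4))   # ~0.0144 -> 1.44% observed errors
```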

Step 2: SLO Monitoring with Prometheus

# prometheus/rules/slo-payment-api.yml
groups:
  - name: slo.payment-api
    interval: 30s
    rules:
      # SLI: Availability ratio
      - record: job:slo_availability:ratio_rate5m
        expr: >
          sum(rate(http_requests_total{job="payment-api",status=~"2..|3.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))

      # Error budget burn rate (fast window)
      - alert: SloBurnRateFast
        expr: >
          (
            (1 - job:slo_availability:ratio_rate5m) / (1 - 0.999)
          ) > 14.4
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API: Fast error budget burn rate (>14.4x)"
          description: "Current burn rate: {{ $value | humanize }}x. At this rate the monthly error budget is exhausted in roughly 2 days."
          runbook: "https://wiki.example.com/runbooks/payment-api-slo-burn"

      # Error budget burn rate (slow window)
      - alert: SloBurnRateSlow
        expr: >
          (
            avg_over_time((1 - job:slo_availability:ratio_rate5m)[6h:5m])
            / (1 - 0.999)
          ) > 6
        for: 15m
        labels:
          severity: warning
          service: payment-api
        annotations:
          summary: "Payment API: Slow error budget burn rate (>6x)"
          runbook: "https://wiki.example.com/runbooks/payment-api-slo-burn"
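The decision logic encoded in the two rules above can also be expressed outside Prometheus, which is handy for unit-testing thresholds before changing them. A minimal sketch; the availability values in the example calls are illustrative:

```python
def burn_rate(availability: float, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return (1 - availability) / (1 - slo_target)

def alert_severity(avail_1h, avail_6h):
    """Mirror of the Prometheus rules: SEV2 on fast burn, SEV3 on slow burn."""
    if burn_rate(avail_1h) > 14.4:
        return "SEV2"
    if burn_rate(avail_6h) > 6:
        return "SEV3"
    return None

print(alert_severity(avail_1h=0.980, avail_6h=0.999))  # SEV2 (20x burn rate)
print(alert_severity(avail_1h=0.999, avail_6h=0.990))  # SEV3 (10x burn rate)
```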

Step 3: CloudWatch SLO Monitoring (AWS)

# terraform/monitoring/slo-alarms.tf

resource "aws_cloudwatch_metric_alarm" "slo_error_rate_fast_burn" {
  alarm_name          = "slo-payment-api-fast-burn"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  datapoints_to_alarm = 4

  # Actual error rate: 5XX count divided by total request count
  metric_query {
    id          = "error_rate"
    expression  = "errors / requests * 100"
    label       = "5XX error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "5XXError"
      namespace   = "AWS/ApiGateway"
      period      = 60
      stat        = "Sum"
      dimensions = {
        ApiName = "payment-api"
        Stage   = "production"
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "Count"
      namespace   = "AWS/ApiGateway"
      period      = 60
      stat        = "Sum"
      dimensions = {
        ApiName = "payment-api"
        Stage   = "production"
      }
    }
  }

  # 0.1% SLO error target x 14.4 fast burn rate = 1.44% observed error rate
  threshold          = 1.44
  alarm_description  = "SLO: Fast burn rate – Payment API 5XX errors. Runbook: https://wiki/runbooks/payment-api-slo"
  alarm_actions      = [aws_sns_topic.oncall_critical.arn]
  ok_actions         = [aws_sns_topic.oncall_critical.arn]
  treat_missing_data = "notBreaching"

  tags = var.mandatory_tags
}

Step 4: Grafana SLO Dashboard

{
  "dashboard": {
    "title": "Payment API – SLO Dashboard",
    "panels": [
      {
        "title": "Availability (30d Rolling)",
        "type": "stat",
        "targets": [{
          "expr": "avg_over_time(job:slo_availability:ratio_rate5m[30d]) * 100",
          "legendFormat": "Availability %"
        }],
        "thresholds": {"mode": "absolute", "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 99.5},
          {"color": "green", "value": 99.9}
        ]}
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "targets": [{
          "expr": "(1 - (1 - avg_over_time(job:slo_availability:ratio_rate5m[30d])) / (1 - 0.999)) * 100"
        }]
      }
    ]
  }
}
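The gauge's PromQL expression implements a simple formula: the share of the budget consumed is the observed error fraction divided by the allowed error fraction. A Python sketch of the same calculation:

```python
def error_budget_remaining_pct(avg_availability_30d: float,
                               slo_target: float = 0.999) -> float:
    """Same formula as the Grafana gauge: share of the 30-day budget still unspent."""
    consumed = (1 - avg_availability_30d) / (1 - slo_target)
    return (1 - consumed) * 100

print(error_budget_remaining_pct(0.9995))  # ~50.0 -> half the budget left
print(error_budget_remaining_pct(0.999))   # ~0.0  -> budget fully spent
```

Note that the result goes negative once availability drops below the target, which is exactly the "budget overspent" condition an error budget policy should react to.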

Typical Anti-Patterns

  • SLO set too ambitiously: 99.99% SLO without corresponding infrastructure leads to a permanently empty error budget

  • SLI measures the wrong thing: Using the load balancer health check as the SLI instead of the real user journey

  • No error budget policy: SLO defined but no operational consequences when budget is exhausted

  • SLO never reviewed: SLOs that remain unchanged after a year usually do not reflect current reality

Metrics

  • Error Budget Remaining: % of monthly budget still available (target: > 50% at mid-month)

  • Burn Rate: Rate at which the error budget is being consumed; 1x means the budget lasts exactly the full window

  • MTTD: Time from first error to alarm triggered (target: < 5 minutes)

  • SLO Compliance Rate: % of 30-day windows in which the SLO was maintained
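The SLO Compliance Rate can be computed from a series of rolling-window availability values; a minimal sketch (the window values below are made up for illustration):

```python
def slo_compliance_rate(window_availabilities, slo_target: float = 0.999) -> float:
    """Share of 30-day windows (in %) in which the SLO was met."""
    met = sum(1 for a in window_availabilities if a >= slo_target)
    return met / len(window_availabilities) * 100

# Twelve consecutive 30-day windows, two of which missed the 99.9% target:
windows = [0.9995, 0.9992, 0.9985, 0.9991, 0.9996, 0.9993,
           0.9990, 0.9988, 0.9994, 0.9997, 0.9991, 0.9992]
print(slo_compliance_rate(windows))  # ~83.3 -> 10 of 12 windows compliant
```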

Maturity Level

Level 1 – No SLOs, reactive incident response
Level 2 – SLOs documented, no monitoring
Level 3 – SLOs monitored with error budget alerts
Level 4 – Error budget policy, release freeze when budget is exhausted
Level 5 – Adaptive SLOs, customer dashboards, predictive alerting