WAF++

Best Practice: Defining and Measuring SLOs & SLAs

Context

Service Level Objectives (SLOs) are the measurable foundation of every reliability strategy. Without clear SLOs, a team cannot judge whether too much or too little is being invested in reliability. Without error budgets, there is no operational framework for trading release velocity against stability.
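The size of an error budget follows directly from the SLO target and the measurement window; a quick sketch of the arithmetic (plain Python, no external dependencies):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) implied by an availability SLO over the window."""
    return window_days * 24 * 60 * (1 - slo_target)

print(error_budget_minutes(0.999))   # 99.9% over 30d -> 43.2 minutes
print(error_budget_minutes(0.9999))  # 99.99% over 30d -> ~4.3 minutes
```

This is why an ambitious target is expensive: one extra nine shrinks the monthly budget by a factor of ten.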

Typical problems without a structured SLO practice:

  • Incidents are dismissed as "normal" because no shared understanding of "unacceptable degradation" exists

  • Reliability investments are prioritized by political pressure, not data

  • Customer-facing SLAs rest on estimates rather than measurements

  • On-call teams do not know severity thresholds for escalation

Related WAF++ controls:

  • WAF-REL-010 – SLA & SLO Definition Documented

  • WAF-REL-100 – Reliability Debt Register & Quarterly Review

Target State

A mature SLO program is:

  • Service-specific: Every production service has an SLO document

  • Measurable: SLIs are continuously calculated from real request data

  • Operationally anchored: Error budget burn rate alerts drive release decisions

  • Communicated: SLOs and current error budget status are regularly visible to stakeholders

Technical Implementation

Step 1: Structure the SLO Document

# docs/slos/payment-api-slo.yml
service: "payment-api"
version: "1.2"
effective_date: "2026-01-01"
owner: "payments-team"
reviewed_by: "architecture-board"

slos:
  availability:
    description: "Percentage of requests returning 2xx or 3xx responses"
    sli: "sum(rate(http_requests_total{status=~'2..|3..'}[5m])) / sum(rate(http_requests_total[5m]))"
    target: 0.999       # 99.9%
    window: "30d"
    error_budget_minutes: 43.2

  latency_p99:
    description: "99th percentile request latency"
    sli: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
    target_seconds: 0.5  # < 500ms
    coverage: 0.99       # 99% of requests must meet this

  error_rate:
    description: "Fraction of requests resulting in 5xx errors"
    sli: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
    target: 0.001        # < 0.1% error rate
    window: "30d"

alert_policy:
  burn_rate_fast:
    window: "1h"
    burn_rate: 14.4    # Will exhaust monthly budget in 2 days
    severity: "SEV2"
  burn_rate_slow:
    window: "6h"
    burn_rate: 6       # Will exhaust monthly budget in 5 days
    severity: "SEV3"
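The burn-rate figures in the alert_policy above can be derived rather than memorized: a burn rate of 1x means the budget lasts exactly the full window, so window length divided by burn rate gives the time to exhaustion, and the SLO's error target multiplied by the burn rate gives the observed error rate at which the alert fires. An illustrative sketch:

```python
def days_to_exhaust(window_days: float, burn_rate: float) -> float:
    """Days until the full error budget is gone at a constant burn rate."""
    return window_days / burn_rate

def error_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Observed error rate that corresponds to the given burn rate."""
    return (1 - slo_target) * burn_rate

print(days_to_exhaust(30, 14.4))           # ~2.08 days (the "fast" SEV2 alert)
print(days_to_exhaust(30, 6))              # 5.0 days (the "slow" SEV3 alert)
print(error_rate_threshold(0.999, 14.4))   # ~0.0144 -> 1.44% observed errors
```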

Step 2: SLO Monitoring with Prometheus

# prometheus/rules/slo-payment-api.yml
groups:
  - name: slo.payment-api
    interval: 30s
    rules:
      # SLI: Availability ratio
      - record: job:slo_availability:ratio_rate5m
        expr: >
          sum(rate(http_requests_total{job="payment-api",status=~"2..|3.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))

      # Error budget burn rate (fast window)
      - alert: SloBurnRateFast
        expr: >
          (
            (1 - job:slo_availability:ratio_rate5m) / (1 - 0.999)
          ) > 14.4
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API: Fast error budget burn rate (>14.4x)"
          description: "Current burn rate: {{ $value | humanize }}x. At this rate the monthly error budget is exhausted in roughly 2 days."
          runbook: "https://wiki.example.com/runbooks/payment-api-slo-burn"

      # Error budget burn rate (slow window)
      - alert: SloBurnRateSlow
        expr: >
          (
            avg_over_time((1 - job:slo_availability:ratio_rate5m)[6h:5m])
            / (1 - 0.999)
          ) > 6
        for: 15m
        labels:
          severity: warning
          service: payment-api
        annotations:
          summary: "Payment API: Slow error budget burn rate (>6x)"
          runbook: "https://wiki.example.com/runbooks/payment-api-slo-burn"
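The decision logic encoded in the two rules above can also be expressed outside Prometheus, which is handy for unit-testing thresholds before changing them. A minimal sketch; the availability values in the example calls are illustrative:

```python
def burn_rate(availability: float, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return (1 - availability) / (1 - slo_target)

def alert_severity(avail_1h, avail_6h):
    """Mirror of the Prometheus rules: SEV2 on fast burn, SEV3 on slow burn."""
    if burn_rate(avail_1h) > 14.4:
        return "SEV2"
    if burn_rate(avail_6h) > 6:
        return "SEV3"
    return None

print(alert_severity(avail_1h=0.980, avail_6h=0.999))  # SEV2 (20x burn rate)
print(alert_severity(avail_1h=0.999, avail_6h=0.990))  # SEV3 (10x burn rate)
```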

Step 3: CloudWatch SLO Monitoring (AWS)

# terraform/monitoring/slo-alarms.tf

resource "aws_cloudwatch_metric_alarm" "slo_error_rate_fast_burn" {
  alarm_name          = "slo-payment-api-fast-burn"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  datapoints_to_alarm = 4

  # Actual error rate: 5XX count divided by total request count
  metric_query {
    id          = "error_rate"
    expression  = "errors / requests * 100"
    label       = "5XX error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "5XXError"
      namespace   = "AWS/ApiGateway"
      period      = 60
      stat        = "Sum"
      dimensions = {
        ApiName = "payment-api"
        Stage   = "production"
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "Count"
      namespace   = "AWS/ApiGateway"
      period      = 60
      stat        = "Sum"
      dimensions = {
        ApiName = "payment-api"
        Stage   = "production"
      }
    }
  }

  # 0.1% SLO error target x 14.4 fast burn rate = 1.44% observed error rate
  threshold          = 1.44
  alarm_description  = "SLO: Fast burn rate – Payment API 5XX errors. Runbook: https://wiki/runbooks/payment-api-slo"
  alarm_actions      = [aws_sns_topic.oncall_critical.arn]
  ok_actions         = [aws_sns_topic.oncall_critical.arn]
  treat_missing_data = "notBreaching"

  tags = var.mandatory_tags
}

Step 4: Grafana SLO Dashboard

{
  "dashboard": {
    "title": "Payment API – SLO Dashboard",
    "panels": [
      {
        "title": "Availability (30d Rolling)",
        "type": "stat",
        "targets": [{
          "expr": "avg_over_time(job:slo_availability:ratio_rate5m[30d]) * 100",
          "legendFormat": "Availability %"
        }],
        "thresholds": {"mode": "absolute", "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 99.5},
          {"color": "green", "value": 99.9}
        ]}
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "targets": [{
          "expr": "(1 - (1 - avg_over_time(job:slo_availability:ratio_rate5m[30d])) / (1 - 0.999)) * 100"
        }]
      }
    ]
  }
}
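The gauge's PromQL expression implements a simple formula: the share of the budget consumed is the observed error fraction divided by the allowed error fraction. A Python sketch of the same calculation:

```python
def error_budget_remaining_pct(avg_availability_30d: float,
                               slo_target: float = 0.999) -> float:
    """Same formula as the Grafana gauge: share of the 30-day budget still unspent."""
    consumed = (1 - avg_availability_30d) / (1 - slo_target)
    return (1 - consumed) * 100

print(error_budget_remaining_pct(0.9995))  # ~50.0 -> half the budget left
print(error_budget_remaining_pct(0.999))   # ~0.0  -> budget fully spent
```

Note that the result goes negative once availability drops below the target, which is exactly the "budget overspent" condition an error budget policy should react to.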

Typical Anti-Patterns

  • SLO set too ambitiously: 99.99% SLO without corresponding infrastructure leads to a permanently empty error budget

  • SLI measures the wrong thing: Using the load balancer health check as the SLI instead of the real user journey

  • No error budget policy: SLO defined but no operational consequences when budget is exhausted

  • SLO never reviewed: SLOs that remain unchanged after a year usually do not reflect current reality

Metrics

  • Error Budget Remaining: % of monthly budget still available (target: > 50% at mid-month)

  • Burn Rate: Rate at which the error budget is being consumed; 1x means the budget lasts exactly the full window

  • MTTD: Time from first error to alarm triggered (target: < 5 minutes)

  • SLO Compliance Rate: % of 30-day windows in which the SLO was maintained
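The SLO Compliance Rate can be computed from a series of rolling-window availability values; a minimal sketch (the window values below are made up for illustration):

```python
def slo_compliance_rate(window_availabilities, slo_target: float = 0.999) -> float:
    """Share of 30-day windows (in %) in which the SLO was met."""
    met = sum(1 for a in window_availabilities if a >= slo_target)
    return met / len(window_availabilities) * 100

# Twelve consecutive 30-day windows, two of which missed the 99.9% target:
windows = [0.9995, 0.9992, 0.9985, 0.9991, 0.9996, 0.9993,
           0.9990, 0.9988, 0.9994, 0.9997, 0.9991, 0.9992]
print(slo_compliance_rate(windows))  # ~83.3 -> 10 of 12 windows compliant
```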

Maturity Level

Level 1 – No SLOs, reactive incident response
Level 2 – SLOs documented, no monitoring
Level 3 – SLOs monitored with error budget alerts
Level 4 – Error budget policy, release freeze when budget is exhausted
Level 5 – Adaptive SLOs, customer dashboards, predictive alerting