WAF-REL-010 – SLA & SLO Definition Documented

Description

Every production workload MUST have documented Service Level Objectives (SLOs) for availability, latency and error rate. SLOs MUST be tracked on monitoring dashboards, with alerting on error budget burn rate. Service Level Agreements (SLAs) MUST reference the underlying SLOs.

Without SLOs, reliability is not measurable. All other WAF-REL controls assume that targets exist to measure against.

Rationale

SLOs transform reliability from a subjective perception into a measurable, controllable discipline. Error budgets quantify how much unreliability is still tolerable and enable data-driven decisions about release velocity vs. stability. Without SLOs, teams make reliability decisions based on gut feeling and political pressure, which is not a sustainable approach.

Threat Context

| Risk | Description |
|------|-------------|
| Unmeasurable Degradation | Without an SLO it is unclear when a system is considered degraded; incidents are detected too late. |
| Missing Error Budget | Without an error budget the operational framework for velocity-vs-stability decisions is absent. |
| SLA Without Foundation | External SLAs not based on measured SLOs are promises without evidence. |
| No Escalation Thresholds | On-call teams cannot consistently classify severity without defined thresholds. |

Requirement

Every production workload MUST:

  • Document availability SLO (%), latency SLO (p99 ms) and error rate SLO (%)

  • Define a measurement window (typically 30 days rolling)

  • Calculate and automatically track error budget

  • Configure multi-window burn rate alerts (fast burn: 1h, slow burn: 6h)

  • Keep the SLO document versioned in a code repository

  • Conduct and document a quarterly SLO review
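As a rough illustration of what the versioned SLO document has to capture, the fields from the list above can be modelled as a small validated record. This is a minimal sketch, not a prescribed schema: the class name `SloDefinition`, its field names and the example values (99.9 %, 300 ms, 0.1 %) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloDefinition:
    """Hypothetical record holding the SLO fields this control requires."""

    workload: str
    availability_pct: float   # availability SLO in percent, e.g. 99.9
    latency_p99_ms: float     # p99 latency SLO in milliseconds
    error_rate_pct: float     # error rate SLO in percent
    window_days: int = 30     # rolling measurement window

    def __post_init__(self) -> None:
        # A 100 % target leaves no error budget, so it is rejected here.
        if not 0 < self.availability_pct < 100:
            raise ValueError("availability_pct must be between 0 and 100 (exclusive)")
        if self.latency_p99_ms <= 0:
            raise ValueError("latency_p99_ms must be positive")
        if not 0 <= self.error_rate_pct < 100:
            raise ValueError("error_rate_pct must be in [0, 100)")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


# Illustrative values only; real targets come from the workload's SLO review.
payment_slo = SloDefinition(
    workload="payment-svc",
    availability_pct=99.9,
    latency_p99_ms=300,
    error_rate_pct=0.1,
)
```

Validating at construction time means an incomplete or implausible SLO definition fails early, for example in a CI job that loads the versioned document.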

Implementation Guidance

  1. Create SLO document: YAML or Markdown, version-controlled, with availability, latency, error rate

  2. Instrument SLIs: Prometheus or CloudWatch metrics for all SLIs

  3. Calculate error budget: (1 - SLO_target) * measurement_window_seconds, with SLO_target expressed as a fraction (e.g. 0.999)

  4. Multi-window alerts: Fast Burn (1h window, 14.4x burn rate) + Slow Burn (6h window, 6x burn rate)

  5. Create dashboard: Grafana or native CloudWatch Dashboard with SLO Compliance + Error Budget

  6. Reference SLA: Link external SLAs to the SLO document

  7. Review calendar: Quarterly review in the team calendar as a fixed meeting
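The error budget formula and the two burn rate thresholds from the steps above can be worked through in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the SLO target is passed as a fraction, and the alert fires when either window exceeds its threshold.

```python
# (alert window in hours, burn rate multiplier) as given in the guidance above
FAST_BURN = (1, 14.4)
SLOW_BURN = (6, 6.0)


def error_budget_seconds(slo_target: float, window_days: int = 30) -> float:
    """Allowed 'bad' time in the window: (1 - SLO_target) * window_seconds."""
    return (1.0 - slo_target) * window_days * 24 * 60 * 60


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    window_days: int = 30) -> float:
    """Fraction of the total budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)


def multi_window_alert(error_ratio_1h: float, error_ratio_6h: float,
                       slo_target: float) -> bool:
    """True when the 1h or the 6h error ratio burns budget faster than allowed."""
    allowed_ratio = 1.0 - slo_target
    fast = error_ratio_1h > FAST_BURN[1] * allowed_ratio
    slow = error_ratio_6h > SLOW_BURN[1] * allowed_ratio
    return fast or slow


if __name__ == "__main__":
    print(error_budget_seconds(0.999))      # budget in seconds for 99.9 % over 30 days
    print(budget_consumed(*FAST_BURN[::-1]))  # share of budget spent by a fast burn
    print(budget_consumed(*SLOW_BURN[::-1]))  # share of budget spent by a slow burn
```

For a 99.9 % target over 30 days the budget comes to roughly 2,592 seconds (about 43 minutes). A sustained 14.4x burn over 1 hour uses about 2 % of that budget, and a 6x burn over 6 hours about 5 %, which is why these two multipliers are paired with these two windows.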

Maturity Levels

| Level | Name | Criteria |
|-------|------|----------|
| 1 | No SLOs | No goals defined; incidents treated reactively. |
| 2 | SLOs Documented | SLO document exists; no automatic monitoring. |
| 3 | SLOs Monitored | SLIs instrumented; error budget burn rate alerts configured; quarterly review. |
| 4 | Error Budget Policy Active | Deployments paused when budget is exhausted; multi-window alerts. |
| 5 | Adaptive SLOs | Automatically adjusted SLOs; customer dashboards; predictive alerts. |
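The level 4 criterion, pausing deployments once the budget is exhausted, could be expressed as a simple gate in a deployment pipeline. The sketch below assumes a request-based SLI (good events out of total events in the current window); the function names are hypothetical and the source does not prescribe how the gate is implemented.

```python
def remaining_budget_fraction(slo_target: float, good_events: int,
                              total_events: int) -> float:
    """Share of the error budget still unspent; negative once it is overspent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1 (exclusive)")
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def deployment_allowed(slo_target: float, good_events: int,
                       total_events: int) -> bool:
    """Gate for the pipeline: block releases when no budget remains."""
    return remaining_budget_fraction(slo_target, good_events, total_events) > 0.0


if __name__ == "__main__":
    # 500 failed requests out of 1,000,000 against a 99.9 % target
    print(deployment_allowed(0.999, 999_500, 1_000_000))
    # 2,000 failed requests out of 1,000,000 against the same target
    print(deployment_allowed(0.999, 998_000, 1_000_000))
```

In the first case about half of the budget is still available, so the release proceeds; in the second the budget is overspent and the gate blocks the deployment.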

Terraform Checks

waf-rel-010.tf.aws.cloudwatch-slo-alarm

Checks: CloudWatch alarm configured for SLO monitoring with alarm_actions and threshold.

Compliant:

resource "aws_cloudwatch_metric_alarm" "slo_error_rate" {
  alarm_name          = "slo-payment-svc"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "5XXError"
  namespace           = "AWS/ApiGateway"
  period              = 60
  statistic           = "Sum"
  threshold           = 10
  alarm_actions       = [aws_sns_topic.oncall.arn]
  ok_actions          = [aws_sns_topic.oncall.arn]
}

Non-Compliant:

resource "aws_cloudwatch_metric_alarm" "errors" {
  alarm_name          = "errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Sum"
  threshold           = 100
  # No alarm_actions: alert fires silently
}

Remediation: Set alarm_actions and ok_actions to an SNS topic connected to the on-call system (PagerDuty/OpsGenie).
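The logic of this check can be approximated outside of the WAF++ tooling. The sketch below is hypothetical and not the actual check implementation: it assumes the alarm resources have already been extracted into dictionaries with `type`, `name` and `values` keys (a shape similar to entries in Terraform's JSON output), and the SNS ARN is a placeholder.

```python
def find_silent_alarms(resources: list[dict]) -> list[str]:
    """Return names of CloudWatch metric alarms lacking alarm_actions or a threshold."""
    findings = []
    for res in resources:
        if res.get("type") != "aws_cloudwatch_metric_alarm":
            continue  # only CloudWatch metric alarms are in scope
        values = res.get("values", {})
        has_actions = bool(values.get("alarm_actions"))
        has_threshold = values.get("threshold") is not None
        if not (has_actions and has_threshold):
            findings.append(res.get("name", "<unnamed>"))
    return findings


# Sample input mirroring the two alarms above; the ARN is a placeholder.
resources = [
    {
        "type": "aws_cloudwatch_metric_alarm",
        "name": "slo_error_rate",
        "values": {
            "threshold": 10,
            "alarm_actions": ["arn:aws:sns:eu-central-1:123456789012:oncall"],
        },
    },
    {
        "type": "aws_cloudwatch_metric_alarm",
        "name": "errors",
        "values": {"threshold": 100, "alarm_actions": []},
    },
]

if __name__ == "__main__":
    print(find_silent_alarms(resources))
```

With the sample input only the `errors` alarm is reported, because its `alarm_actions` list is empty.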

Evidence

| Type | Required | Description |
|------|----------|-------------|
| Governance | ✅ Required | SLO document per workload (versioned): availability, latency, error rate, measurement window. |
| Config | ✅ Required | Monitoring dashboard with SLO compliance and error budget burn rate in real time. |
| Governance | Optional | SLA contract with reference to SLO document and escalation clauses. |
| Process | Optional | Quarterly SLO review minutes with history of adjustments. |