WAF-REL-010 – SLA & SLO Definition Documented

Pillar: Reliability | Severity: Critical | Category: Reliability Governance | Automatable: Medium

Description

Every production workload MUST have documented Service Level Objectives (SLOs) for availability, latency and error rate. SLOs MUST be tracked on monitoring dashboards, with alerting on the error budget burn rate. Service Level Agreements (SLAs) MUST reference the SLOs they are based on.

Without SLOs, reliability is not measurable. All other WAF-REL controls assume that targets have been defined to measure against.

Rationale

SLOs transform reliability from a subjective perception into a measurable, controllable discipline. Error budgets quantify how much unreliability is still tolerable and enable data-driven decisions about release velocity vs. stability. Without SLOs, teams make reliability decisions based on gut feeling and political pressure, which is not a sustainable approach.

Threat Context

Risk	Description
Unmeasurable Degradation	Without an SLO it is unclear when a system is considered degraded; incidents are detected too late.
Missing Error Budget	Without an error budget the operational framework for velocity-vs-stability decisions is absent.
SLA Without Foundation	External SLAs not based on measured SLOs are promises without evidence.
No Escalation Thresholds	On-call teams cannot consistently classify severity without defined thresholds.

Requirement

Every production workload MUST:

Document availability SLO (%), latency SLO (p99 ms) and error rate SLO (%)
Define a measurement window (typically 30 days rolling)
Calculate and automatically track error budget
Configure multi-window burn rate alerts (fast burn: 1h, slow burn: 6h)
Keep the SLO document versioned in a code repository
Conduct and document a quarterly SLO review
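
The documentation requirements above lend themselves to a simple automated completeness check. The following is a minimal sketch, not a prescribed format: the `SLODocument` class, its field names and the example values are all hypothetical, chosen only to mirror the three SLO types and the measurement window listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SLODocument:
    """Hypothetical in-memory form of a versioned SLO document."""
    workload: str
    availability_pct: float   # availability SLO in percent, e.g. 99.9
    latency_p99_ms: float     # p99 latency SLO in milliseconds
    error_rate_pct: float     # error rate SLO in percent
    window_days: int = 30     # rolling measurement window in days

    def problems(self) -> List[str]:
        """Return human-readable validation problems (empty list if none)."""
        issues = []
        if not self.workload:
            issues.append("workload name is missing")
        # 100% is rejected because it would leave no error budget at all.
        if not 0 < self.availability_pct < 100:
            issues.append("availability SLO must be between 0 and 100 percent")
        if self.latency_p99_ms <= 0:
            issues.append("p99 latency SLO must be positive")
        if not 0 <= self.error_rate_pct < 100:
            issues.append("error rate SLO must be in the range 0 to 100 percent")
        if self.window_days <= 0:
            issues.append("measurement window must be at least one day")
        return issues


# Example values are illustrative only.
doc = SLODocument("payment-svc", 99.9, 300, 0.1)
print(doc.problems())  # prints []
```

A check like this could run in the repository's CI so that an incomplete SLO document fails review before it is merged.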

Implementation Guidance

Create SLO document: YAML or Markdown, version-controlled, with availability, latency, error rate
Instrument SLIs: Prometheus or CloudWatch metrics for every SLI
Calculate error budget: (1 - SLO_target) * measurement_window_seconds
Multi-window alerts: Fast Burn (1h, 14.4x) + Slow Burn (6h, 6x)
Create dashboard: Grafana or native CloudWatch Dashboard with SLO Compliance + Error Budget
Reference SLA: Link external SLAs to the SLO document
Review calendar: Quarterly review in the team calendar as a fixed meeting
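
The error budget formula and the burn rate factors in the guidance above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, the SLO target is expressed as a fraction (0.999 for 99.9%), and the window defaults to the 30-day rolling window mentioned in the requirements.

```python
def error_budget_seconds(slo_target: float, window_days: int = 30) -> float:
    """Error budget: (1 - SLO_target) * measurement_window_seconds."""
    return (1.0 - slo_target) * window_days * 24 * 3600


def budget_consumed_fraction(burn_rate: float, alert_window_hours: float,
                             window_days: int = 30) -> float:
    """Share of the total budget used if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)


def is_burning(error_ratio: float, slo_target: float,
               burn_rate_threshold: float) -> bool:
    """True when the observed error ratio exceeds threshold * allowed ratio."""
    return error_ratio > burn_rate_threshold * (1.0 - slo_target)


# Budget for a 99.9% SLO over 30 days, shown in minutes.
print(f"{error_budget_seconds(0.999) / 60:.1f} minutes")

# Share of the budget consumed by each alert tier before it fires.
print(f"fast burn: {budget_consumed_fraction(14.4, 1):.0%}")
print(f"slow burn: {budget_consumed_fraction(6, 6):.0%}")
```

With these inputs a 99.9% SLO leaves roughly 43 minutes of budget per 30 days. A 14.4x burn sustained for 1 hour uses about 2% of it, and a 6x burn sustained for 6 hours uses about 5%, which is why the two tiers are paired.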

Maturity Levels

Level	Name	Criteria
1	No SLOs	No goals defined; incidents treated reactively.
2	SLOs Documented	SLO document exists; no automatic monitoring.
3	SLOs Monitored	SLIs instrumented; error budget burn rate alerts configured; quarterly review.
4	Error Budget Policy Active	Deployments paused when budget is exhausted; multi-window alerts.
5	Adaptive SLOs	Automatically adjusted SLOs; customer dashboards; predictive alerts.

Terraform Checks

waf-rel-010.tf.aws.cloudwatch-slo-alarm

Checks: CloudWatch alarm configured for SLO monitoring with alarm_actions and threshold.

Compliant:

resource "aws_cloudwatch_metric_alarm" "slo_error_rate" {
  alarm_name          = "slo-payment-svc"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "5XXError"
  namespace           = "AWS/ApiGateway"
  period              = 60
  statistic           = "Sum"
  threshold           = 10
  alarm_actions       = [aws_sns_topic.oncall.arn]
  ok_actions          = [aws_sns_topic.oncall.arn]
}

Non-Compliant:

resource "aws_cloudwatch_metric_alarm" "errors" {
  alarm_name          = "errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Sum"
  threshold           = 100
  # No alarm_actions: alert fires silently
}

Remediation: Set alarm_actions and ok_actions to an SNS topic connected to the on-call system (PagerDuty/OpsGenie).
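
The logic of this check can be sketched outside of any specific policy engine. The example below is an illustration, not the actual rule implementation: it assumes the Terraform resources have already been exported as a list of dictionaries with `type`, `name` and `values` keys (roughly the shape of resources in `terraform show -json` output for applied state), and the function name and the SNS ARN are placeholders. Values that are still unknown at plan time would need extra handling.

```python
from typing import Dict, List


def alarms_without_actions(resources: List[Dict]) -> List[str]:
    """Return names of CloudWatch metric alarms lacking alarm_actions or a threshold."""
    failing = []
    for res in resources:
        if res.get("type") != "aws_cloudwatch_metric_alarm":
            continue
        values = res.get("values", {})
        # An empty or missing alarm_actions list means nobody is notified.
        if not values.get("alarm_actions") or values.get("threshold") is None:
            failing.append(res.get("name", "<unnamed>"))
    return failing


# Hypothetical input mirroring the two examples above; the ARN is a placeholder.
resources = [
    {"type": "aws_cloudwatch_metric_alarm", "name": "slo_error_rate",
     "values": {"threshold": 10,
                "alarm_actions": ["arn:aws:sns:eu-central-1:123456789012:oncall"]}},
    {"type": "aws_cloudwatch_metric_alarm", "name": "errors",
     "values": {"threshold": 100, "alarm_actions": []}},
]
print(alarms_without_actions(resources))  # prints ['errors']
```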

Evidence

Type	Required	Description
Governance	✅ Required	SLO document per workload (versioned): availability, latency, error rate, measurement window.
Config	✅ Required	Monitoring dashboard with SLO compliance and error budget burn rate in real time.
Governance	Optional	SLA contract with reference to SLO document and escalation clauses.
Process	Optional	Quarterly SLO review minutes with history of adjustments.

Related Controls

Best Practice

Defining and Measuring SLOs & SLAs