WAF-REL-010 – SLA & SLO Definition Documented
Description
Every production workload MUST have documented Service Level Objectives (SLOs) for availability, latency, and error rate. SLOs MUST be tracked on monitoring dashboards, with alerting on error budget burn rate. Service Level Agreements (SLAs) MUST reference the SLOs they are based on.
Without SLOs, reliability cannot be measured. All other WAF-REL controls assume that targets have been defined against which reliability can be measured.
Rationale
SLOs transform reliability from a subjective perception into a measurable, controllable discipline. Error budgets quantify how much unreliability is still tolerable and enable data-driven decisions about release velocity versus stability. Without SLOs, teams make reliability decisions based on gut feeling and political pressure, which is not a sustainable approach.
Threat Context
| Risk | Description |
|---|---|
| Unmeasurable Degradation | Without an SLO it is unclear when a system is considered degraded; incidents are detected too late. |
| Missing Error Budget | Without an error budget the operational framework for velocity-vs-stability decisions is absent. |
| SLA Without Foundation | External SLAs not based on measured SLOs are promises without evidence. |
| No Escalation Thresholds | On-call teams cannot consistently classify severity without defined thresholds. |
Requirement
Every production workload MUST:
- Document availability SLO (%), latency SLO (p99 ms) and error rate SLO (%)
- Define a measurement window (typically 30 days rolling)
- Calculate and automatically track error budget
- Configure multi-window burn rate alerts (fast burn: 1h, slow burn: 6h)
- Keep the SLO document versioned in a code repository
- Conduct and document a quarterly SLO review
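The documented fields listed above can be captured in a small, machine-checkable record. The following is a minimal sketch only: the class name `SloDefinition`, its field names, and the `checkout-api` workload are hypothetical and not part of any WAF schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SloDefinition:
    """Illustrative SLO record; names are assumptions, not a WAF schema."""

    workload: str
    availability_pct: float   # availability SLO in percent, e.g. 99.9
    latency_p99_ms: float     # latency SLO at p99, in milliseconds
    error_rate_pct: float     # maximum tolerated error rate in percent
    window_days: int = 30     # rolling measurement window

    def problems(self) -> list[str]:
        """Return validation problems; an empty list means the record is usable."""
        issues = []
        if not 0 < self.availability_pct < 100:
            issues.append("availability_pct must be between 0 and 100 (exclusive)")
        if self.latency_p99_ms <= 0:
            issues.append("latency_p99_ms must be positive")
        if not 0 <= self.error_rate_pct < 100:
            issues.append("error_rate_pct must be in [0, 100)")
        if self.window_days <= 0:
            issues.append("window_days must be positive")
        return issues


# Hypothetical workload used purely for illustration.
checkout = SloDefinition("checkout-api", availability_pct=99.9,
                         latency_p99_ms=300, error_rate_pct=0.1)
print(checkout.problems())
```

A record like this could be loaded from the versioned SLO document and validated in CI, so that a missing or implausible target is caught before it reaches dashboards or alerts.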
Implementation Guidance
- Create SLO document: YAML or Markdown, version-controlled, with availability, latency, error rate
- Instrument SLIs: Prometheus metrics or CloudWatch alarms for all SLIs
- Calculate error budget: (1 - SLO_target) * measurement_window_seconds
- Multi-window alerts: Fast Burn (1h, 14.4x) + Slow Burn (6h, 6x)
- Create dashboard: Grafana or native CloudWatch Dashboard with SLO Compliance + Error Budget
- Reference SLA: Link external SLAs to the SLO document
- Review calendar: Quarterly review in the team calendar as a fixed meeting
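The error budget formula and the two burn rate thresholds from the guidance above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the burn rate is taken to be the observed error ratio divided by the allowed error ratio (1 - SLO_target), and each window is evaluated on its own.

```python
from __future__ import annotations

SECONDS_PER_DAY = 86_400

# Thresholds from the guidance above: (window in hours, burn rate multiple).
FAST_BURN = (1, 14.4)
SLOW_BURN = (6, 6.0)


def error_budget_seconds(slo_target: float, window_days: int = 30) -> float:
    """(1 - SLO_target) * measurement_window_seconds."""
    return (1.0 - slo_target) * window_days * SECONDS_PER_DAY


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def firing_alerts(error_ratio_1h: float, error_ratio_6h: float,
                  slo_target: float) -> list[str]:
    """Return the names of the burn rate alerts that would fire."""
    alerts = []
    if burn_rate(error_ratio_1h, slo_target) >= FAST_BURN[1]:
        alerts.append("fast_burn")
    if burn_rate(error_ratio_6h, slo_target) >= SLOW_BURN[1]:
        alerts.append("slow_burn")
    return alerts


budget = error_budget_seconds(0.999)  # 99.9 % availability over 30 days
print(f"error budget: {budget / 60:.1f} minutes")
print(firing_alerts(error_ratio_1h=0.02, error_ratio_6h=0.004, slo_target=0.999))
```

For a 99.9 % target over 30 days the budget comes to roughly 43.2 minutes. With a 30-day window, a sustained 14.4x burn over one hour consumes about 2 % of the budget, and a sustained 6x burn over six hours about 5 %.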
Maturity Levels
| Level | Name | Criteria |
|---|---|---|
| 1 | No SLOs | No goals defined; incidents treated reactively. |
| 2 | SLOs Documented | SLO document exists; no automatic monitoring. |
| 3 | SLOs Monitored | SLIs instrumented; error budget burn rate alerts configured; quarterly review. |
| 4 | Error Budget Policy Active | Deployments paused when budget is exhausted; multi-window alerts. |
| 5 | Adaptive SLOs | Automatically adjusted SLOs; customer dashboards; predictive alerts. |
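The level 4 criterion, pausing deployments once the budget is exhausted, could be enforced by a simple gate in a deployment pipeline. The sketch below assumes that the total and consumed budget (in seconds) are supplied by the monitoring system; both function names are hypothetical.

```python
def budget_remaining_fraction(budget_seconds: float, consumed_seconds: float) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    if budget_seconds <= 0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - consumed_seconds / budget_seconds))


def deployment_allowed(budget_seconds: float, consumed_seconds: float) -> bool:
    """Allow deployments only while some error budget remains."""
    return budget_remaining_fraction(budget_seconds, consumed_seconds) > 0.0


# Example: a 2592 s budget (99.9 % over 30 days) with 648 s already consumed.
print(budget_remaining_fraction(2592, 648))
print(deployment_allowed(2592, 2592))
```

A real policy would likely distinguish reliability fixes from feature releases, since the former may need to ship precisely when the budget is gone.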
Terraform Checks
waf-rel-010.tf.aws.cloudwatch-slo-alarm
Checks: CloudWatch alarm configured for SLO monitoring with alarm_actions and threshold.
| Compliant | Non-Compliant |
|---|---|
| | |
Remediation: Set alarm_actions and ok_actions to an SNS topic connected to the on-call system (PagerDuty/OpsGenie).
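To illustrate what such a check might look for, the sketch below scans the JSON form of a Terraform plan (as produced by `terraform show -json`) for `aws_cloudwatch_metric_alarm` resources that lack `alarm_actions` or a `threshold`. It is not the implementation of the check named above; the function names and the sample plan are hypothetical.

```python
from __future__ import annotations

import json


def iter_resources(module: dict):
    """Yield resources from a plan module and all of its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def non_compliant_alarms(plan: dict) -> list[str]:
    """Addresses of CloudWatch alarms missing alarm_actions or threshold."""
    root = plan.get("planned_values", {}).get("root_module", {})
    failures = []
    for res in iter_resources(root):
        if res.get("type") != "aws_cloudwatch_metric_alarm":
            continue
        values = res.get("values", {})
        if not values.get("alarm_actions") or values.get("threshold") is None:
            failures.append(res.get("address", "<unknown>"))
    return failures


# Hypothetical, heavily trimmed plan used only for illustration.
sample_plan = json.loads("""
{
  "planned_values": {
    "root_module": {
      "resources": [
        {"address": "aws_cloudwatch_metric_alarm.slo_fast_burn",
         "type": "aws_cloudwatch_metric_alarm",
         "values": {"threshold": 14.4,
                    "alarm_actions": ["arn:aws:sns:eu-central-1:111111111111:oncall"]}},
        {"address": "aws_cloudwatch_metric_alarm.no_actions",
         "type": "aws_cloudwatch_metric_alarm",
         "values": {"threshold": 6, "alarm_actions": []}}
      ]
    }
  }
}
""")
print(non_compliant_alarms(sample_plan))
```

One caveat: when `alarm_actions` points at an SNS topic created in the same plan, the value is unknown at plan time and may be absent from `values`, so a production check would also need to consider unknown values rather than treating them as missing.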
Evidence
| Type | Required | Description |
|---|---|---|
| Governance | ✅ Required | SLO document per workload (versioned): availability, latency, error rate, measurement window. |
| Config | ✅ Required | Monitoring dashboard with SLO compliance and error budget burn rate in real time. |
| Governance | Optional | SLA contract with reference to SLO document and escalation clauses. |
| Process | Optional | Quarterly SLO review minutes with history of adjustments. |