WAF-REL-010 – SLA & SLO Definition Documented

Pillar: Reliability | Severity: Critical | Category: Reliability Governance | Automatable: Medium

Description

Every production workload MUST have documented Service Level Objectives (SLOs) for availability, latency, and error rate. SLOs MUST be tracked in monitoring dashboards, with alerting on the error budget burn rate. Service Level Agreements (SLAs) MUST reference the SLOs.

Without SLOs, reliability cannot be measured. All other WAF-REL controls assume that targets have been defined against which measurements can be made.

Rationale

SLOs turn reliability from a subjective impression into a measurable, manageable discipline. Error budgets quantify how much risk is still tolerable and enable data-driven decisions about release velocity versus stability. Without SLOs, teams make reliability decisions based on gut feeling and political pressure, which is not a sustainable approach.

Threat Context

Risk	Description
Unmeasurable degradation	Without an SLO, it is unclear at what point a system counts as degraded; incidents are detected too late.
Missing error budget	Without an error budget, there is no operational framework for velocity-versus-stability decisions.
SLA without foundation	External SLAs that are not based on measured SLOs are promises without evidence.
No escalation thresholds	Without defined thresholds, on-call teams cannot classify severity consistently.


Requirement

Every production workload MUST:

Document an availability SLO (%), a latency SLO (p99 in ms), and an error rate SLO (%)
Define a measurement window (typically 30 days, rolling)
Calculate the error budget and track it automatically
Configure multi-window burn rate alerts (fast burn: 1h, slow burn: 6h)
Keep the SLO document versioned in a code repository
Conduct and document a quarterly SLO review
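
The fields an SLO document has to carry can be sketched as a small data structure with a completeness check. This is only an illustration: the class name `SloDocument`, its field names, and the `validate` helper are assumptions made for this example and are not prescribed by the control.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SloDocument:
    """Hypothetical in-memory form of a versioned SLO document."""
    workload: str
    availability_target_pct: float  # e.g. 99.9
    latency_p99_ms: float           # e.g. 300
    error_rate_target_pct: float    # e.g. 0.1
    window_days: int = 30           # rolling measurement window


def validate(doc: SloDocument) -> List[str]:
    """Return a list of problems; an empty list means the document is usable."""
    problems = []
    if not 0 < doc.availability_target_pct < 100:
        problems.append("availability target must be between 0 and 100 (exclusive)")
    if doc.latency_p99_ms <= 0:
        problems.append("p99 latency target must be positive")
    if not 0 <= doc.error_rate_target_pct < 100:
        problems.append("error rate target must be in [0, 100)")
    if doc.window_days <= 0:
        problems.append("measurement window must be at least one day")
    return problems
```

A check like this could run in CI against the repository that holds the SLO document, so that an incomplete document fails review before it is merged.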

Implementation Guide

Create the SLO document: YAML or Markdown, version-controlled, covering availability, latency, and error rate
Instrument SLIs: Prometheus metrics or CloudWatch metrics and alarms for all SLIs
Calculate the error budget: (1 - SLO_target) * measurement_window_seconds
Multi-window alerts: fast burn (1h, 14.4x) + slow burn (6h, 6x)
Create the dashboard: Grafana or a native CloudWatch dashboard showing SLO compliance and error budget
Reference the SLA: link external SLAs to the SLO document
Review calendar: schedule the quarterly review as a recurring meeting in the team calendar
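
The error budget formula and the two burn rate thresholds from the steps above can be worked through in a few lines. This is a minimal sketch, assuming a 30-day window and a simple "threshold reached" rule per window; all function names are hypothetical, and production setups often pair each window with a shorter confirmation window, which is omitted here.

```python
from typing import List

# (alert window in hours, burn-rate threshold), taken from the steps above
FAST_BURN = (1, 14.4)
SLOW_BURN = (6, 6.0)


def error_budget_seconds(slo_target: float, window_days: int = 30) -> float:
    """(1 - SLO_target) * measurement_window_seconds."""
    return (1 - slo_target) * window_days * 24 * 3600


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def budget_consumed(rate: float, alert_window_hours: float, window_days: int = 30) -> float:
    """Fraction of the whole budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / (window_days * 24)


def fired_alerts(error_ratio_1h: float, error_ratio_6h: float, slo_target: float) -> List[str]:
    """Return which of the two burn-rate alerts would fire."""
    alerts = []
    if burn_rate(error_ratio_1h, slo_target) >= FAST_BURN[1]:
        alerts.append("fast_burn")
    if burn_rate(error_ratio_6h, slo_target) >= SLOW_BURN[1]:
        alerts.append("slow_burn")
    return alerts
```

For a 99.9% target over 30 days, the budget comes out at roughly 2,592 seconds (43.2 minutes). A burn rate of 14.4 sustained for one hour spends about 2% of that budget, and a burn rate of 6 sustained for six hours spends about 5%.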

Maturity Levels

Level	Name	Criteria
1	No SLOs	No targets defined; incidents handled reactively.
2	SLOs documented	SLO document exists; no automated monitoring.
3	SLOs monitored	SLIs instrumented; error budget burn rate alerts configured; quarterly review.
4	Error budget policy active	Deployments are paused when the budget is exhausted; multi-window alerts.
5	Adaptive SLOs	Automatically adjusted SLOs; customer-facing dashboards; predictive alerts.
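
The level 4 criterion, pausing deployments once the budget is exhausted, can be illustrated with a simple gate. This is a sketch under assumptions: the function names are hypothetical, the SLI is treated as a ratio of good events to total events, and a real policy would usually add exceptions such as allowing fixes that restore reliability.

```python
def budget_remaining_fraction(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent in the window.

    A negative value means the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def deployments_allowed(remaining_fraction: float) -> bool:
    """Level 4 policy in its simplest form: pause once the budget is used up."""
    return remaining_fraction > 0
```

A deployment pipeline could call `deployments_allowed` as a pre-deploy step and block the release when it returns `False`.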


Terraform Checks

waf-rel-010.tf.aws.cloudwatch-slo-alarm

Checks: a CloudWatch alarm for SLO monitoring is configured with alarm_actions and threshold.

Compliant

resource "aws_cloudwatch_metric_alarm" "slo_error_rate" {
  alarm_name          = "slo-payment-svc"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "5XXError"
  namespace           = "AWS/ApiGateway"
  period              = 60
  statistic           = "Sum"
  threshold           = 10
  alarm_actions       = [aws_sns_topic.oncall.arn]
  ok_actions          = [aws_sns_topic.oncall.arn]
}

Non-Compliant

resource "aws_cloudwatch_metric_alarm" "errors" {
  alarm_name          = "errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Sum"
  threshold           = 100
  # No alarm_actions: the alarm fires silently
}

Remediation: set alarm_actions and ok_actions to an SNS topic that is connected to the on-call system (PagerDuty/OpsGenie).
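
The rule this check applies can be approximated as follows. This is not the actual implementation of waf-rel-010.tf.aws.cloudwatch-slo-alarm, which is not shown in this document; it is a sketch that assumes each alarm is available as a plain attribute map, and the function name `check_slo_alarm` is hypothetical.

```python
from typing import Any, Dict, List


def check_slo_alarm(attrs: Dict[str, Any]) -> List[str]:
    """Return findings for one aws_cloudwatch_metric_alarm attribute map.

    An empty list means the alarm passes this sketch of the check.
    """
    findings = []
    # An alarm without actions changes state but notifies nobody.
    if not attrs.get("alarm_actions"):
        findings.append("alarm_actions is missing or empty")
    # The check also expects an explicit threshold.
    if attrs.get("threshold") is None:
        findings.append("threshold is not set")
    return findings


# Attribute maps mirroring the two Terraform examples above
# (the SNS reference is kept as a placeholder string).
compliant = {
    "alarm_name": "slo-payment-svc",
    "threshold": 10,
    "alarm_actions": ["aws_sns_topic.oncall.arn"],
    "ok_actions": ["aws_sns_topic.oncall.arn"],
}
non_compliant = {
    "alarm_name": "errors",
    "threshold": 100,
}
```

Applied to the two examples, the first map yields no findings and the second yields a single finding for the missing alarm_actions.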

Evidence

Type	Required	Description
Governance	✅ Required	SLO document per workload (versioned): availability, latency, error rate, measurement window.
Config	✅ Required	Monitoring dashboard showing SLO compliance and error budget burn rate in real time.
Governance	Optional	SLA contract with a reference to the SLO document and escalation clauses.
Process	Optional	Quarterly SLO review minutes with a history of adjustments.


Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents; A.5.27 – Learning from information security incidents; A.5.28 – Collection of evidence; A.8.16 – Technology use identification and monitoring; A.8.21 – Telecommunications and network security
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor; Reliability Pillar – Improve
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring; Chapter 7 – Emergency response
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management; SIM-03 – Emergency response
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools and systems
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity; DSS04.02.01 – Manage incidents
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incidentmanagement; BIO – Bedrijfscontinuïteit
ENS High	op.exp.7 – Gestión de incidentes; op.exp.8 – Gestión de la continuidad del negocio
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response


Best Practice

Define and measure SLOs & SLAs →
