WAF-REL-060 – Incident Response & Runbook Readiness

Pillar: Reliability | Severity: High | Category: Incident Response | Automatable: Medium

Description

All production workloads MUST have a documented incident response (IR) plan with severity definitions, escalation paths, and an on-call rotation. Runbooks MUST exist for all critical alerts and be linked directly from the alert notifications. Post-incident reviews MUST be conducted for SEV1/SEV2 incidents within 5 business days.

Rationale

Without a defined IR process, MTTR rises sharply because on-call engineers, under pressure, end up reconstructing steps that have already been documented. Runbooks encode institutional knowledge and enable consistent incident resolution regardless of which engineer is on duty. Post-mortems prevent recurrence through structured root cause analysis.

Threat Context

Risk	Description
Prolonged MTTR	Without runbooks, engineers spend valuable minutes on diagnosis instead of remediation.
Knowledge loss	A key engineer is on vacation and no backup has the context needed for a critical service.
Inconsistent severity	Without clear criteria, a SEV1 is handled as a SEV3 and is not escalated appropriately.
Incident recurrence	The same root cause triggers a third incident because post-mortem action items were never tracked.

Requirement

Four severity levels (SEV1–SEV4) with objective, measurable criteria
On-call rotation configured with a primary and a secondary contact
Runbooks for all critical alerts, linked directly from the alert body
MTTR, MTTD, and incident frequency tracked as metrics
Post-incident reviews for SEV1/SEV2 within 5 business days
Action items from post-mortems tracked through to their target date
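
The metric requirement above can be illustrated with a minimal sketch. It assumes a hypothetical `Incident` record with three timestamps, and it assumes that MTTD is measured from fault start to detection and MTTR from fault start to resolution. Teams that measure MTTR from detection instead would adjust the subtraction accordingly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; real incident tools expose similar timestamps.
    started_at: datetime   # when the fault began
    detected_at: datetime  # when an alert fired or a human noticed
    resolved_at: datetime  # when service was restored


def _mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect: fault start until detection (assumed definition)."""
    return _mean_minutes([i.detected_at - i.started_at for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve: fault start until resolution (assumed definition)."""
    return _mean_minutes([i.resolved_at - i.started_at for i in incidents])


incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 45)),
    Incident(datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 15), datetime(2024, 1, 2, 10, 15)),
]
print(mttd_minutes(incidents), mttr_minutes(incidents))
```

Incident frequency can be derived from the same records by counting incidents per reporting period.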

Implementation Guidance

Severity definitions: YAML document with measurable criteria (error rate %, user impact %)
Set up on-call: PagerDuty/OpsGenie with a primary and a secondary rotation
Top-5 runbooks: write a runbook for the 5 most frequent alerts of each service
Alert description: alarm_description contains the runbook URL and severity
MTTR dashboard: make incident metrics visible in Grafana or a native tool
Post-mortem culture: introduce a blameless post-mortem template; mandatory for SEV1/SEV2
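
To show what "measurable criteria" might look like in practice, here is a small sketch of a severity classifier. The rule table mirrors the kind of content the YAML document would hold. All threshold values and the names `SeverityRule`, `SEVERITY_RULES`, and `classify` are illustrative assumptions, not values prescribed by this control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityRule:
    name: str
    min_error_rate_pct: float      # error rate threshold, in percent
    min_users_impacted_pct: float  # share of affected users, in percent


# Ordered from most to least severe. Thresholds are placeholders only.
SEVERITY_RULES = [
    SeverityRule("SEV1", 25.0, 50.0),
    SeverityRule("SEV2", 5.0, 10.0),
    SeverityRule("SEV3", 1.0, 1.0),
]


def classify(error_rate_pct: float, users_impacted_pct: float) -> str:
    """Return the most severe level for which at least one threshold is met."""
    for rule in SEVERITY_RULES:
        if (error_rate_pct >= rule.min_error_rate_pct
                or users_impacted_pct >= rule.min_users_impacted_pct):
            return rule.name
    return "SEV4"  # below every threshold


print(classify(error_rate_pct=2.0, users_impacted_pct=12.0))
```

Matching on either threshold errs toward the higher severity, which addresses the risk of a SEV1 being handled as a SEV3.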

Maturity Levels

Level	Name	Criteria
1	Ad hoc	No defined process; incidents are handled by whoever is available.
2	Process documented	Severity and escalation defined; on-call configured; basic runbooks in place.
3	Runbooks linked, MTTR tracked	All critical alerts carry a runbook link; MTTR reviewed monthly; post-mortems for SEV1/SEV2.
4	Automated triage	Diagnostic data collected automatically when an alert fires; runbook steps partly automated.
5	Self-healing	AIOps incident correlation; MTTR < 5 minutes for known failure classes.

Terraform Checks

waf-rel-060.tf.aws.sns-topic-alarm-action

Checks: CloudWatch alarms have alarm_actions and ok_actions for on-call notification.

Compliant:

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  alarm_actions = [aws_sns_topic.oncall.arn]
  ok_actions    = [aws_sns_topic.oncall.arn]
  alarm_description = jsonencode({
    runbook  = "https://wiki/rb/payment"
    severity = "SEV2"
  })
}

Non-Compliant:

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  # No alarm_actions: the alert fires silently.
  # WAF-REL-060 violation
}

Remediation: Point alarm_actions and ok_actions at an SNS topic. Populate alarm_description with the runbook URL and the severity classification.
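
The logic of this check can be sketched outside Terraform as well. The function below is a rough illustration, not the actual check implementation. It assumes each alarm is available as a plain dictionary whose keys match the Terraform argument names, and that alarm_description holds a JSON object with `runbook` and `severity` keys as in the compliant example. The function name `alarm_violations` is hypothetical.

```python
import json
from urllib.parse import urlparse

VALID_SEVERITIES = ("SEV1", "SEV2", "SEV3", "SEV4")


def alarm_violations(alarm: dict) -> list[str]:
    """Return findings for one alarm; an empty list means compliant."""
    findings = []

    # Both actions must point at least at one target (for example an SNS topic ARN).
    for key in ("alarm_actions", "ok_actions"):
        if not alarm.get(key):
            findings.append(f"{key} is missing or empty")

    # The description is expected to be a JSON object, as produced by jsonencode().
    try:
        meta = json.loads(alarm.get("alarm_description") or "")
    except json.JSONDecodeError:
        meta = None
    if not isinstance(meta, dict):
        findings.append("alarm_description is not a JSON object")
        return findings

    url = urlparse(str(meta.get("runbook", "")))
    if url.scheme not in ("http", "https") or not url.netloc:
        findings.append("runbook is not an http(s) URL")
    if meta.get("severity") not in VALID_SEVERITIES:
        findings.append("severity is not one of SEV1..SEV4")
    return findings


silent_alarm = {"alarm_name": "payment-api-errors"}
print(alarm_violations(silent_alarm))
```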

Evidence

Type	Required	Description
Governance	✅ Required	Incident response plan with severity levels, escalation paths, and on-call structure.
Process	✅ Required	Post-incident review records for all SEV1/SEV2 incidents of the last 12 months.
Config	Optional	On-call schedule in PagerDuty/OpsGenie with the current rotation.
Governance	Optional	Runbook catalog with links to all critical-alert runbooks.
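
When assembling the post-incident review evidence, it can help to verify the 5-business-day deadline mechanically. The sketch below assumes the deadline counts from the resolution date, which the control does not state explicitly, and it treats Monday to Friday as business days without considering public holidays. The function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

REVIEW_WINDOW_BUSINESS_DAYS = 5


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` Monday-to-Friday days; public holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def review_overdue(severity: str, resolved_on: date,
                   reviewed_on: Optional[date], today: date) -> bool:
    """True if a SEV1/SEV2 review was held late or is still missing past the deadline."""
    if severity not in ("SEV1", "SEV2"):
        return False  # reviews are only mandatory for SEV1/SEV2
    due = add_business_days(resolved_on, REVIEW_WINDOW_BUSINESS_DAYS)
    return (reviewed_on or today) > due


# Incident resolved on Friday, 2024-03-01: the review is due by Friday, 2024-03-08.
print(add_business_days(date(2024, 3, 1), REVIEW_WINDOW_BUSINESS_DAYS))
```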

Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents; A.5.27 – Learning from information security incidents; A.5.28 – Collection of evidence; A.8.16 – Technology use identification and monitoring; A.8.21 – Telecommunications and network security
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor; Reliability Pillar – Improve
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring; Chapter 7 – Emergency response
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management; SIM-03 – Emergency response
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools and systems
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity; DSS04.02.01 – Manage incidents
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incidentmanagement; BIO – Bedrijfscontinuïteit
ENS High	op.exp.7 – Gestión de incidentes; op.exp.8 – Gestión de la continuidad del negocio
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response

Best Practice

Incident Response & Runbooks →
