WAF-OPS-070 – Post-Incident Review Process

Pillar: Operational Excellence | Severity: Medium | Category: Incident Learning | Automatable: Low

Description

Every production incident with user impact or an SLO violation MUST trigger a blameless postmortem that is completed within 5 business days. Postmortems MUST produce documented action items, each with an owner and a due date, that are tracked to completion. Postmortem findings MUST be shared across teams.

Rationale

Without systematic post-incident learning, organizations repeat the same classes of incident. Postmortems convert operational failures into organizational knowledge. A blameless culture is critical: blame suppresses information and prevents learning. Teams with mature postmortem processes reduce repeat incidents by 30–50% within 12 months.

Threat Context

Risk	Description
Repeat incidents	Without postmortems, root causes are not fixed; the same incident recurs.
Blame culture	Blame-focused reviews suppress information; future incidents get hidden.
Untracked action items	Action items without an owner and tracking are never completed.
Isolated learning	Team A learns from an incident; Team B experiences the same incident 3 months later.

Requirements

All SEV-1/P1 incidents and SLO violations MUST trigger a postmortem
The postmortem document MUST be completed within 5 business days
Postmortems MUST be blameless: focus on systems, not people
All action items MUST have an owner, a due date, and tracking in JIRA/GitHub Issues
Postmortems MUST be shared across teams (Slack channel, wiki)

Implementation Guidance

Define trigger criteria: SEV-1/P1 incidents, all SLO violations, data-loss events, deployment rollbacks
Create a postmortem template: Title, Date, Severity, Impact, Timeline, Root Cause, Contributing Factors, Action Items
Establish a blameless culture: training, leaders as role models, psychological safety as a prerequisite
Hold the review meeting within 48h: reconstruct the timeline together; assign action items in the meeting
Sharing process: publish the postmortem in Slack #postmortems and the wiki within 1 week
Monthly trend analysis: most frequent incident categories; track the action-item completion rate

Maturity Levels

Level	Name	Criteria
1	No postmortem process	Incidents are resolved and forgotten. No timeline documentation. Blame culture or none at all.
2	Informal reviews	Major incidents are discussed informally. No template. Action items are not tracked.
3	Structured & blameless	Template used for all qualifying incidents. Blameless. Action items in JIRA. Completed within 5 business days.
4	Systemic analysis	Monthly trend analysis. Action-item completion >= 80%. Cross-team sharing.
5	Organizational learning	Searchable postmortem database. Repeat-incident rate declining YoY. Integrated into onboarding.

Terraform Checks

waf-ops-070.tf.aws.incident-management-sns-topic

Checks: An SNS topic for incident notifications exists and is configured.

# Compliant
resource "aws_sns_topic" "incidents" {
  name = "production-incident-notifications"
}
resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn = aws_sns_topic.incidents.arn
  protocol  = "https"
  endpoint  = var.pagerduty_endpoint
}
resource "aws_cloudwatch_metric_alarm" "critical" {
  alarm_actions = [aws_sns_topic.incidents.arn]
  /* ... */
}

# Non-Compliant
resource "aws_cloudwatch_metric_alarm" "critical" {
  alarm_name = "payment-critical"
  # No alarm_actions
  # Incidents are not reported
  # WAF-OPS-070 violation
}

waf-ops-070.tf.azurerm.action-group-incident

Checks: An Azure Monitor action group with a configured receiver exists for incident routing.

# Compliant
resource "azurerm_monitor_action_group" "oncall" {
  name                = "production-oncall"
  resource_group_name = azurerm_resource_group.main.name
  short_name          = "oncall"
  email_receiver {
    name          = "oncall-engineer"
    email_address = var.oncall_email
  }
}
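
For contrast with the compliant example, a non-compliant configuration might look like the sketch below. It assumes the check flags an action group that declares no receiver block at all, which is how the check description reads; the exact rule logic is not shown in this document. As in the compliant example, azurerm_resource_group.main is assumed to be defined elsewhere.

```hcl
# Non-Compliant (illustrative sketch, not taken from the check's test fixtures)
resource "azurerm_monitor_action_group" "oncall" {
  name                = "production-oncall"
  resource_group_name = azurerm_resource_group.main.name
  short_name          = "oncall"

  # No email_receiver, webhook_receiver, or any other receiver block:
  # alerts routed to this group notify nobody.
}
```

The resource is still valid Terraform, because receiver blocks are optional in the provider schema. That is why a policy check is needed to catch this gap.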

Remediation: Create an SNS topic with a PagerDuty/OpsGenie subscription. CloudWatch alarms must reference this SNS topic in alarm_actions.
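
One way to apply this remediation to the non-compliant payment-critical alarm is sketched below. The metric, namespace, threshold, evaluation window, and the alb_arn_suffix variable are illustrative assumptions and are not prescribed by this control; only the alarm_actions reference to the incident topic reflects the requirement stated above.

```hcl
# Hypothetical input (assumption): ARN suffix of the load balancer to monitor.
variable "alb_arn_suffix" {
  type = string
}

# Same topic as in the compliant example, repeated so the fragment validates on its own.
resource "aws_sns_topic" "incidents" {
  name = "production-incident-notifications"
}

resource "aws_cloudwatch_metric_alarm" "payment_critical" {
  alarm_name          = "payment-critical"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count" # assumed metric
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 10 # assumed threshold
  treat_missing_data  = "notBreaching"

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }

  # The part this check looks for: the alarm is routed to the incident topic.
  alarm_actions = [aws_sns_topic.incidents.arn]
}
```

The topic only reaches an on-call engineer if it also has a subscription, as the PagerDuty subscription in the compliant example shows.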

Evidence

Type	Required	Description
Process	✅ Required	Post-incident review policy (triggers, timeline, template, publication requirements).
Governance	✅ Required	Postmortem archive for the last 3 months with action items and completion status.
Process	Optional	Monthly incident trend report (categories, frequency, action-item completion rate).
Governance	Optional	Team OKR or KPI for repeat-incident reduction.

Regulatory Mapping

Framework	Controls
ISO/IEC 20000-1:2018	8.2.3 – Change management; 8.3.4 – Release management; 10.2.2 – Financial management
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain
AWS Well-Architected Framework	Operational Excellence Pillar – Prepare; Operational Excellence Pillar – Deploy
DORA	DORA 2024 – Technical practices; DORA 2024 – Organizational culture
SOC 2 Type II	CC4.1 – Monitoring activities; CC7.1 – Infrastructure and software monitoring
Google SRE Book	Chapter 15 – Postmortem Culture: Learning from Failure; Chapter 4 – Service Level Objectives
PCI DSS v4.0	Req 6.4 – Secure development lifecycle; Req 6.5 – Secure coding practices
FinOps Foundation	Core Module – Financial accountability; Management Layer – Cost governance
BSI C5:2020	OPS-01 – Operational monitoring; OPS-02 – Operational control; OPS-03 – Operational capacity
NIST SP 800-53	CM-1 – Configuration management policy; CM-2 – Configuration management
NIST CSF 2.0	GV.PO – Policy; RC.RP – Recovery planning; DE.CM – Continuous monitoring
TISAX	Information security – Change management
ANSSI SecNumCloud	Domain – Change management
BIO	BIO – Veranderingenbeheer
ENS High	op.exp.6 – Gestión de cambios
UK NCSC CAF	A4 – Policy and assurance; A5 – Continual improvement
CMMC 2.0	CM.L2-3.4.1 – Establish baseline configurations
IRAP	ISM – Change management
CCCS PBMM	CM-2 – Baseline configuration; CA-7 – Continuous monitoring
MAS TRM	Ch.3 – Technology risk governance; Ch.9 – Change management
ISMAP	Operational excellence and continuous improvement
FISC	Operational measures – Change management
