WAF-REL-060 – Incident Response & Runbook Readiness

Pillar: Reliability | Severity: High | Category: Incident Response | Automatable: Medium

Description

All production workloads MUST have a documented Incident Response (IR) plan with severity definitions, escalation paths and on-call rotation. Runbooks MUST exist for all critical alerts and be linked directly from alert notifications. Post-incident reviews MUST be conducted for SEV1/SEV2 within 5 business days.
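The 5-business-day review window can be computed mechanically. The sketch below is illustrative only: the function names are hypothetical, the window is assumed to start on the day the incident was resolved, and public holidays are ignored.

```python
from datetime import date, timedelta


def review_required(severity: str) -> bool:
    """Post-incident reviews are mandatory for SEV1 and SEV2 only."""
    return severity.upper() in {"SEV1", "SEV2"}


def review_due_date(resolved_on: date, business_days: int = 5) -> date:
    """Return the date `business_days` working days after `resolved_on`.

    Assumption: Monday to Friday are working days; holidays are not considered.
    """
    due = resolved_on
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return due
```

An organisation with regional holidays would need to extend the weekday test with its own calendar.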

Rationale

Without a defined IR process, mean time to recovery (MTTR) rises sharply because on-call engineers must reconstruct, under pressure, steps that could have been documented in advance. Runbooks encode institutional knowledge and enable consistent incident resolution regardless of which engineer is on duty. Post-mortems reduce recurrence through structured root cause analysis.

Threat Context

Risk	Description
Extended MTTR	Without runbooks, engineers spend valuable minutes diagnosing instead of resolving.
Knowledge Loss	A key engineer is on vacation and no backup has the context needed for a critical service.
Inconsistent Severity	Without clear criteria, a SEV1 is treated as a SEV3 and is not escalated appropriately.
Incident Recurrence	The same root cause triggers a third incident because post-mortem action items were never tracked.

Requirement

Four severity levels (SEV1–SEV4) with objective, measurable criteria
On-call rotation with a primary and a secondary contact configured
Runbooks for all critical alerts, linked directly from the alert body
MTTR, MTTD (mean time to detect) and incident frequency tracked as metrics
Post-incident reviews for SEV1/SEV2 within 5 business days
Action items from post-mortems tracked to completion against a target date
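The metrics named in the list above can be derived from three timestamps per incident. The following is a minimal sketch, not a prescribed implementation: the `Incident` record and its field names are hypothetical, and MTTR is assumed to be measured from the start of impact to resolution (some teams measure it from detection instead).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime   # when the impact began
    detected_at: datetime  # when an alert fired or a human noticed
    resolved_at: datetime  # when service was restored


def _mean(deltas):
    """Average a sequence of timedeltas; None if there are no incidents."""
    deltas = list(deltas)
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: start of impact to detection."""
    return _mean(i.detected_at - i.started_at for i in incidents)


def mttr(incidents):
    """Mean time to recovery: start of impact to resolution (assumed definition)."""
    return _mean(i.resolved_at - i.started_at for i in incidents)


def incidents_per_month(incidents):
    """Incident frequency keyed by (year, month) of the start time."""
    return Counter((i.started_at.year, i.started_at.month) for i in incidents)
```

Whichever MTTR definition is chosen, it should be written down next to the dashboard so that month-over-month comparisons stay meaningful.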

Implementation Guidance

Severity definitions: Maintain a YAML document with measurable criteria (error rate %, user impact %)
On-call setup: Configure PagerDuty/OpsGenie with a primary and a secondary rotation
Top-5 runbooks: Write a runbook for the top 5 alerts of each service
Alert description: Include the runbook URL and severity in alarm_description
MTTR dashboard: Make incident metrics visible in Grafana or a native tool
Post-mortem culture: Introduce a blameless post-mortem template; mandatory for SEV1/SEV2
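To illustrate what "measurable criteria" can look like in practice, the sketch below classifies an incident from error rate and user impact. The thresholds and names are hypothetical and exist only for illustration; in a real setup they would be loaded from the severity definition document rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityRule:
    name: str
    min_error_rate_pct: float
    min_users_affected_pct: float


# Hypothetical thresholds, ordered from most to least severe.
RULES = (
    SeverityRule("SEV1", min_error_rate_pct=25.0, min_users_affected_pct=50.0),
    SeverityRule("SEV2", min_error_rate_pct=5.0, min_users_affected_pct=10.0),
    SeverityRule("SEV3", min_error_rate_pct=1.0, min_users_affected_pct=1.0),
)


def classify(error_rate_pct: float, users_affected_pct: float) -> str:
    """Return the first (most severe) level whose thresholds are met.

    A level matches if either criterion reaches its threshold;
    anything below all thresholds is SEV4.
    """
    for rule in RULES:
        if (error_rate_pct >= rule.min_error_rate_pct
                or users_affected_pct >= rule.min_users_affected_pct):
            return rule.name
    return "SEV4"
```

Because the inputs are numbers rather than judgment calls, two engineers looking at the same dashboard should arrive at the same severity.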

Maturity Levels

Level	Name	Criteria
1	Ad-hoc	No defined process; incidents handled by whoever is available.
2	Process Documented	Severity and escalation defined; on-call configured; basic runbooks available.
3	Runbooks Linked, MTTR Tracked	All critical alerts with runbook link; MTTR reviewed monthly; post-mortems for SEV1/SEV2.
4	Automated Triage	Automated diagnostic data on alert; runbook steps partially automated.
5	Self-Healing	AIOps incident correlation; MTTR < 5 minutes for known error classes.

Terraform Checks

waf-rel-060.tf.aws.sns-topic-alarm-action

Checks: CloudWatch alarms have alarm_actions and ok_actions for on-call notification.

Compliant:

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  alarm_actions = [aws_sns_topic.oncall.arn]
  ok_actions    = [aws_sns_topic.oncall.arn]
  alarm_description = jsonencode({
    runbook  = "https://wiki/rb/payment"
    severity = "SEV2"
  })
}

Non-Compliant:

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  # No alarm_actions: the alert fires silently
  # WAF-REL-060 violation
}

Remediation: Point alarm_actions and ok_actions to the SNS topic. Populate alarm_description with runbook URL and severity classification.
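The same rule can be approximated outside Terraform. The sketch below is an assumption-laden illustration, not the check's actual implementation: it expects the alarm attributes to be available as a plain dict (for example, extracted from applied state by some other tool), and the function name and finding messages are hypothetical.

```python
import json
from urllib.parse import urlparse

VALID_SEVERITIES = {"SEV1", "SEV2", "SEV3", "SEV4"}


def alarm_findings(alarm: dict) -> list[str]:
    """Return a list of WAF-REL-060 findings; an empty list means compliant."""
    findings = []
    if not alarm.get("alarm_actions"):
        findings.append("alarm_actions is empty")
    if not alarm.get("ok_actions"):
        findings.append("ok_actions is empty")

    # alarm_description is expected to hold a JSON object, as in the
    # compliant example's jsonencode({ runbook = ..., severity = ... }).
    try:
        meta = json.loads(alarm.get("alarm_description") or "")
    except (json.JSONDecodeError, TypeError):
        meta = None
    if not isinstance(meta, dict):
        findings.append("alarm_description is not a JSON object")
        return findings

    url = urlparse(str(meta.get("runbook", "")))
    if url.scheme not in ("http", "https") or not url.netloc:
        findings.append("runbook URL is missing or invalid")
    if str(meta.get("severity")) not in VALID_SEVERITIES:
        findings.append("severity is missing or not SEV1-SEV4")
    return findings
```

Applied to the two examples above, the compliant alarm would yield no findings, while the non-compliant one would be flagged for missing actions and a missing description.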

Evidence

Type	Required	Description
Governance	✅ Required	Incident response plan with severity, escalation paths and on-call structure.
Process	✅ Required	Post-incident review records for all SEV1/SEV2 incidents in the last 12 months.
Config	Optional	On-call schedule in PagerDuty/OpsGenie with current rotation.
Governance	Optional	Runbook catalog with links to all critical alert runbooks.

Related Controls

Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.7 – Threat intelligence; A.5.24 – Information security incident management planning and preparation; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – Learning and evolving; Art. 17 – ICT-related incident management process
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incident management; BIO – Business continuity
ENS High	op.exp.7 – Incident management; op.exp.8 – Business continuity management
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response
