WAF-REL-060 – Incident Response & Runbook Readiness
Description
All production workloads MUST have a documented Incident Response (IR) plan with severity definitions, escalation paths and on-call rotation. Runbooks MUST exist for all critical alerts and be linked directly from alert notifications. Post-incident reviews MUST be conducted for SEV1/SEV2 within 5 business days.
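The 5-business-day review window can be checked mechanically. The following is a minimal sketch, assuming that the window starts on the day the incident is resolved, that business days are Monday to Friday, and that public holidays are ignored; the function names `review_due_date` and `review_is_overdue` are illustrative and not part of this control.

```python
from datetime import date, timedelta

REVIEW_SEVERITIES = {"SEV1", "SEV2"}  # severities that require a review


def review_due_date(resolved_on: date, business_days: int = 5) -> date:
    """Return the date `business_days` working days after `resolved_on`.

    Assumption: Monday-Friday are business days; holidays are not considered.
    """
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 correspond to Monday-Friday
            remaining -= 1
    return current


def review_is_overdue(severity: str, resolved_on: date, today: date,
                      reviewed: bool = False) -> bool:
    """True if a required post-incident review has not happened in time."""
    if severity not in REVIEW_SEVERITIES or reviewed:
        return False
    return today > review_due_date(resolved_on)
```

For example, under these assumptions an incident resolved on a Friday would have its review due on the following Friday.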
Rationale
Without a defined IR process, MTTR rises sharply because on-call engineers under pressure must reconstruct steps that should already have been documented. Runbooks encode institutional knowledge and enable consistent incident resolution regardless of which engineer is on duty. Post-mortems prevent recurrence through structured root cause analysis.
Threat Context
| Risk | Description |
|---|---|
| Extended MTTR | Without runbooks, engineers spend valuable minutes diagnosing instead of resolving. |
| Knowledge Loss | A key engineer is on vacation and no backup has the context needed for a critical service. |
| Inconsistent Severity | Without clear criteria, a SEV1 is treated as a SEV3 and is not escalated appropriately. |
| Incident Recurrence | The same root cause triggers a third incident because post-mortem action items were not tracked. |
Requirement
- 4 severity levels (SEV1–SEV4) with objective, measurable criteria
- On-call rotation with primary and secondary contact configured
- Runbooks for all critical alerts; linked directly from the alert body
- MTTR, MTTD and incident frequency tracked as metrics
- Post-incident reviews for SEV1/SEV2 within 5 business days
- Action items from post-mortems tracked to target date
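The MTTD and MTTR metrics can be derived from three timestamps per incident. The sketch below makes several assumptions: MTTD is measured from fault start to detection, MTTR from detection to resolution (some teams measure MTTR from fault start instead), and the `Incident` record and its field names are hypothetical rather than taken from any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Incident:
    started_at: datetime   # when the fault began (assumed known from telemetry)
    detected_at: datetime  # when the alert fired or the issue was reported
    resolved_at: datetime  # when service was restored


def mttd_minutes(incidents: Sequence[Incident]) -> Optional[float]:
    """Mean time to detect in minutes, or None if there are no incidents."""
    if not incidents:
        return None
    return mean((i.detected_at - i.started_at).total_seconds() / 60
                for i in incidents)


def mttr_minutes(incidents: Sequence[Incident]) -> Optional[float]:
    """Mean time to resolve in minutes, measured from detection."""
    if not incidents:
        return None
    return mean((i.resolved_at - i.detected_at).total_seconds() / 60
                for i in incidents)


def incident_count(incidents: Sequence[Incident],
                   since: datetime, until: datetime) -> int:
    """Number of incidents that started within [since, until)."""
    return sum(1 for i in incidents if since <= i.started_at < until)
```

Returning `None` for an empty period keeps "no incidents" distinguishable from "zero minutes" on a dashboard.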
Implementation Guidance
- Severity definitions: YAML document with measurable criteria (error rate %, user impact %)
- Set up on-call: PagerDuty/OpsGenie with primary and secondary rotation
- Top-5 runbooks: Write runbook for the top 5 alerts per service
- Alert description: alarm_description contains runbook URL and severity
- MTTR dashboard: Make incident metrics visible in Grafana or native tool
- Post-mortem culture: Introduce blameless post-mortem template; mandatory for SEV1/SEV2
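To illustrate what "measurable criteria" could look like once the severity document is loaded, here is a minimal sketch. The thresholds are purely hypothetical placeholders, not values prescribed by this control, and the rule that either metric alone can trigger a level is likewise an assumption; each organization would substitute its own definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityRule:
    level: str
    min_error_rate_pct: float   # error rate at or above this triggers the level
    min_user_impact_pct: float  # share of affected users at or above this


# Hypothetical thresholds, ordered from most to least severe.
SEVERITY_RULES = (
    SeverityRule("SEV1", 50.0, 50.0),
    SeverityRule("SEV2", 10.0, 10.0),
    SeverityRule("SEV3", 1.0, 1.0),
)


def classify_severity(error_rate_pct: float, user_impact_pct: float) -> str:
    """Return the first (most severe) level whose criteria are met.

    Assumption: exceeding either threshold is sufficient for a level.
    Anything below all thresholds is classified as SEV4.
    """
    for rule in SEVERITY_RULES:
        if (error_rate_pct >= rule.min_error_rate_pct
                or user_impact_pct >= rule.min_user_impact_pct):
            return rule.level
    return "SEV4"
```

Evaluating rules from most to least severe means an incident that satisfies several levels is always assigned the highest one.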
Maturity Levels
| Level | Name | Criteria |
|---|---|---|
| 1 | Ad-hoc | No defined process; incidents handled by whoever is available. |
| 2 | Process Documented | Severity and escalation defined; on-call configured; basic runbooks available. |
| 3 | Runbooks Linked, MTTR Tracked | All critical alerts with runbook link; MTTR reviewed monthly; post-mortems for SEV1/SEV2. |
| 4 | Automated Triage | Automated diagnostic data on alert; runbook steps partially automated. |
| 5 | Self-Healing | AIOps incident correlation; MTTR < 5 minutes for known error classes. |
Terraform Checks
waf-rel-060.tf.aws.sns-topic-alarm-action
Checks: CloudWatch alarms have alarm_actions and ok_actions for on-call notification.
Remediation: Point alarm_actions and ok_actions to the SNS topic.
Populate alarm_description with runbook URL and severity classification.
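The logic of this check, combined with the remediation guidance, might be sketched as follows. This is not the actual check implementation: it assumes the alarm is available as a dictionary of `aws_cloudwatch_metric_alarm` attributes (for example, extracted from a Terraform plan), and the patterns used to recognize a runbook URL and a severity label are assumptions.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")         # any http(s) link counts as a runbook URL
SEVERITY_PATTERN = re.compile(r"\bSEV[1-4]\b")    # SEV1 through SEV4


def check_alarm(alarm: dict) -> list[str]:
    """Return a list of findings; an empty list means the alarm passes."""
    findings = []
    if not alarm.get("alarm_actions"):
        findings.append("alarm_actions is empty")
    if not alarm.get("ok_actions"):
        findings.append("ok_actions is empty")
    description = alarm.get("alarm_description") or ""
    if not URL_PATTERN.search(description):
        findings.append("alarm_description has no runbook URL")
    if not SEVERITY_PATTERN.search(description):
        findings.append("alarm_description has no severity (SEV1-SEV4)")
    return findings


# Hypothetical example values; the ARN and URL are placeholders.
compliant_alarm = {
    "alarm_actions": ["arn:aws:sns:eu-central-1:123456789012:oncall"],
    "ok_actions": ["arn:aws:sns:eu-central-1:123456789012:oncall"],
    "alarm_description": "SEV2 - runbook: https://runbooks.example.com/api-5xx",
}
non_compliant_alarm = {
    "alarm_actions": ["arn:aws:sns:eu-central-1:123456789012:oncall"],
    "alarm_description": "High error rate",
}
```

Returning findings rather than a boolean makes it easier to tell the engineer which attribute to fix.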
Evidence
| Type | Required | Description |
|---|---|---|
| Governance | ✅ Required | Incident response plan with severity, escalation paths and on-call structure. |
| Process | ✅ Required | Post-incident review records for all SEV1/SEV2 incidents in the last 12 months. |
| Config | Optional | On-call schedule in PagerDuty/OpsGenie with current rotation. |
| Governance | Optional | Runbook catalog with links to all critical alert runbooks. |
Regulatory Mapping
| Framework | Controls |
|---|---|
| ISO/IEC 27001:2022 | A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents |
| ITIL 4 | SVS – Service value system; DP – Design principle; OV – Operation value chain |
| AWS Well-Architected Framework | Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor |
| SRE Book (Google) | Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring |
| CNCF Cloud Native Security | SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials |
| BSI C5:2022 | SIM-01 – Security incident management; SIM-02 – Security information and event management |
| GDPR | Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach |
| NIST SP 800-161 | SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls |
| DORA | Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools |
| COBIT 2019 | DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity |
| TISAX | Information security – Incident response |
| ANSSI SecNumCloud | Domain – Incident response; Domain – Business continuity |
| BIO | BIO – Incidentmanagement; BIO – Bedrijfscontinuïteit |
| ENS High | op.exp.7 – Gestión de incidentes; op.exp.8 – Gestión de la continuidad del negocio |
| UK NCSC CAF | D1 – Response and recovery planning; D2 – Lessons learned |
| CMMC 2.0 | IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents |
| IRAP | ISM – Incident management; ISM – Business continuity |
| CCCS PBMM | IR-4 – Incident handling; IR-8 – Incident response plan |
| MAS TRM | Ch.10 – Security incident management; Ch.11 – Business continuity |
| ISMAP | Reliability and incident management |
| FISC | Operational measures – Incident response |