WAF-REL-060 – Incident Response & Runbook Readiness

Description

All production workloads MUST have a documented Incident Response (IR) plan with severity definitions, escalation paths and an on-call rotation. Runbooks MUST exist for all critical alerts and be linked directly from alert notifications. Post-incident reviews MUST be conducted for SEV1/SEV2 incidents within 5 business days.

Rationale

Without a defined IR process, MTTR rises because on-call engineers must reconstruct, under pressure, steps that should already have been documented. Runbooks encode institutional knowledge and enable consistent incident resolution regardless of which engineer is on duty. Post-mortems prevent recurrence through structured root cause analysis.

Threat Context

| Risk | Description |
| --- | --- |
| Extended MTTR | Without runbooks, engineers spend valuable minutes diagnosing instead of resolving. |
| Knowledge Loss | A key engineer is on vacation and no backup has the context needed for a critical service. |
| Inconsistent Severity | Without clear criteria, a SEV1 is treated as a SEV3 and is not escalated appropriately. |
| Incident Recurrence | The same root cause triggers a third incident because post-mortem action items were not tracked. |

Requirement

  • 4 severity levels (SEV1–SEV4) with objective, measurable criteria

  • On-call rotation with primary and secondary contact configured

  • Runbooks for all critical alerts; linked directly from the alert body

  • MTTR, MTTD and incident frequency tracked as metrics

  • Post-incident reviews for SEV1/SEV2 within 5 business days

  • Action items from post-mortems tracked to target date
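
The on-call requirement above can be kept in code alongside the alarms. The following is a minimal sketch, assuming PagerDuty is the paging tool and is managed through the PagerDuty Terraform provider. The schedule names, time zone, start dates, escalation delays and the two user-ID variables are illustrative assumptions, not values prescribed by this control.

```hcl
terraform {
  required_providers {
    pagerduty = {
      source = "PagerDuty/pagerduty"
    }
  }
}

# Hypothetical inputs: PagerDuty user IDs for each rotation.
variable "primary_user_ids" {
  type = list(string)
}

variable "secondary_user_ids" {
  type = list(string)
}

# Primary rotation: one engineer per week (placeholder dates and time zone).
resource "pagerduty_schedule" "primary" {
  name      = "payment-api-primary"
  time_zone = "Europe/Berlin"

  layer {
    name                         = "weekly"
    start                        = "2025-01-06T09:00:00+01:00"
    rotation_virtual_start       = "2025-01-06T09:00:00+01:00"
    rotation_turn_length_seconds = 604800 # 7 days
    users                        = var.primary_user_ids
  }
}

# Secondary rotation: same cadence, different pool of engineers.
resource "pagerduty_schedule" "secondary" {
  name      = "payment-api-secondary"
  time_zone = "Europe/Berlin"

  layer {
    name                         = "weekly"
    start                        = "2025-01-06T09:00:00+01:00"
    rotation_virtual_start       = "2025-01-06T09:00:00+01:00"
    rotation_turn_length_seconds = 604800 # 7 days
    users                        = var.secondary_user_ids
  }
}

# Page primary first; escalate to secondary if nobody acknowledges.
resource "pagerduty_escalation_policy" "payment_api" {
  name      = "payment-api-escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10 # assumed acknowledgement window
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15 # assumed acknowledgement window
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.secondary.id
    }
  }
}
```

If the primary schedule does not acknowledge within the first delay, the incident moves to the secondary schedule, and `num_loops` repeats the sequence. Attribute names should be checked against the provider version in use.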

Implementation Guidance

  1. Severity definitions: YAML document with measurable criteria (error rate %, user impact %)

  2. Set up on-call: PagerDuty/OpsGenie with primary and secondary rotation

  3. Top-5 runbooks: Write a runbook for each of the top 5 alerts per service

  4. Alert description: alarm_description contains runbook URL and severity

  5. MTTR dashboard: Make incident metrics visible in Grafana or a native tool

  6. Post-mortem culture: Introduce blameless post-mortem template; mandatory for SEV1/SEV2
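
Steps 1 and 4 can be tied together so that the alarm threshold comes from the same severity definitions the team agreed on. This is a sketch under stated assumptions: the YAML is inlined to keep the example self-contained, every threshold is a placeholder, and the metric namespace, metric name and `oncall_topic_arn` variable are hypothetical.

```hcl
locals {
  # Placeholder severity criteria; replace with the values from your IR plan.
  severity_yaml = <<-YAML
    SEV1: { error_rate_pct: 25, users_affected_pct: 50 }
    SEV2: { error_rate_pct: 5, users_affected_pct: 10 }
    SEV3: { error_rate_pct: 1, users_affected_pct: 1 }
    SEV4: { error_rate_pct: 0.1, users_affected_pct: 0 }
  YAML

  severities = yamldecode(local.severity_yaml)
}

# Assumed input: ARN of an existing SNS topic that pages on-call.
variable "oncall_topic_arn" {
  type = string
}

resource "aws_cloudwatch_metric_alarm" "payment_api_sev2" {
  alarm_name          = "payment-api-error-rate-sev2"
  namespace           = "Custom/PaymentApi" # hypothetical custom namespace
  metric_name         = "ErrorRatePercent"  # hypothetical metric, range 0-100
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanOrEqualToThreshold"

  # The threshold is read from the severity definition, not hard-coded.
  threshold = local.severities["SEV2"].error_rate_pct

  alarm_actions = [var.oncall_topic_arn]
  ok_actions    = [var.oncall_topic_arn]

  # Runbook URL and severity travel with the alert notification.
  alarm_description = jsonencode({
    runbook  = "https://wiki/rb/payment"
    severity = "SEV2"
  })
}
```

Keeping the criteria in one decoded structure means a change to the severity definition changes the alarm on the next apply, which avoids drift between the IR plan and the monitoring configuration.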

Maturity Levels

| Level | Name | Criteria |
| --- | --- | --- |
| 1 | Ad-hoc | No defined process; incidents handled by whoever is available. |
| 2 | Process Documented | Severity and escalation defined; on-call configured; basic runbooks available. |
| 3 | Runbooks Linked, MTTR Tracked | All critical alerts with runbook link; MTTR reviewed monthly; post-mortems for SEV1/SEV2. |
| 4 | Automated Triage | Automated diagnostic data on alert; runbook steps partially automated. |
| 5 | Self-Healing | AIOps incident correlation; MTTR < 5 minutes for known error classes. |

Terraform Checks

waf-rel-060.tf.aws.sns-topic-alarm-action

Checks: CloudWatch alarms have alarm_actions and ok_actions for on-call notification.

Compliant:

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  alarm_actions = [aws_sns_topic.oncall.arn]
  ok_actions    = [aws_sns_topic.oncall.arn]
  alarm_description = jsonencode({
    runbook  = "https://wiki/rb/payment"
    severity = "SEV2"
  })
}

Non-Compliant:

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  # No alarm_actions: the alert fires silently
  # WAF-REL-060 violation
}

Remediation: Point alarm_actions and ok_actions to the SNS topic. Populate alarm_description with runbook URL and severity classification.
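
The remediation can be sketched end to end as follows. This is an illustrative configuration, not the check's reference implementation: the topic name, the `oncall_webhook_url` variable, the API name and all metric thresholds are assumptions. The `precondition` block requires Terraform 1.2 or later and fails the plan when the runbook URL or severity is missing or malformed.

```hcl
# Assumed input: HTTPS integration endpoint of the on-call tool.
variable "oncall_webhook_url" {
  type      = string
  sensitive = true
}

locals {
  # Metadata that every critical alarm must carry.
  alarm_meta = {
    runbook  = "https://wiki/rb/payment"
    severity = "SEV2"
  }
}

resource "aws_sns_topic" "oncall" {
  name = "oncall-alerts" # hypothetical topic name
}

# Forward notifications from the topic to the on-call tool.
resource "aws_sns_topic_subscription" "oncall_pager" {
  topic_arn              = aws_sns_topic.oncall.arn
  protocol               = "https"
  endpoint               = var.oncall_webhook_url
  endpoint_auto_confirms = true
}

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name          = "payment-api-errors"
  namespace           = "AWS/ApiGateway"
  metric_name         = "5XXError"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 3
  threshold           = 10 # placeholder threshold
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    ApiName = "payment-api" # hypothetical API name
  }

  # Notify on-call when the alarm fires and when it recovers.
  alarm_actions = [aws_sns_topic.oncall.arn]
  ok_actions    = [aws_sns_topic.oncall.arn]

  alarm_description = jsonencode(local.alarm_meta)

  lifecycle {
    # Reject the plan if the alarm metadata is incomplete.
    precondition {
      condition = (
        can(regex("^https://", local.alarm_meta.runbook)) &&
        contains(["SEV1", "SEV2", "SEV3", "SEV4"], local.alarm_meta.severity)
      )
      error_message = "WAF-REL-060: alarm needs an https runbook URL and a severity of SEV1 to SEV4."
    }
  }
}
```

Routing `ok_actions` to the same topic lets the on-call tool resolve the page automatically once the metric recovers.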

Evidence

| Type | Required | Description |
| --- | --- | --- |
| Governance | ✅ Required | Incident response plan with severity, escalation paths and on-call structure. |
| Process | ✅ Required | Post-incident review records for all SEV1/SEV2 incidents in the last 12 months. |
| Config | Optional | On-call schedule in PagerDuty/OpsGenie with current rotation. |
| Governance | Optional | Runbook catalog with links to all critical alert runbooks. |

Regulatory Mapping

| Framework | Controls |
| --- | --- |
| ISO/IEC 27001:2022 | A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents |
| ITIL 4 | SVS – Service value system; DP – Design principle; OV – Operation value chain |
| AWS Well-Architected Framework | Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor |
| SRE Book (Google) | Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring |
| CNCF Cloud Native Security | SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials |
| BSI C5:2022 | SIM-01 – Security incident management; SIM-02 – Security information and event management |
| GDPR | Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach |
| NIST SP 800-161 | SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls |
| DORA | Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools |
| COBIT 2019 | DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity |
| TISAX | Information security – Incident response |
| ANSSI SecNumCloud | Domain – Incident response; Domain – Business continuity |
| BIO | BIO – Incident management; BIO – Business continuity |
| ENS High | op.exp.7 – Incident management; op.exp.8 – Business continuity management |
| UK NCSC CAF | D1 – Response and recovery planning; D2 – Lessons learned |
| CMMC 2.0 | IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents |
| IRAP | ISM – Incident management; ISM – Business continuity |
| CCCS PBMM | IR-4 – Incident handling; IR-8 – Incident response plan |
| MAS TRM | Ch.10 – Security incident management; Ch.11 – Business continuity |
| ISMAP | Reliability and incident management |
| FISC | Operational measures – Incident response |

Best Practice