WAF-REL-060 – Incident Response & Runbook Readiness

Description

All production workloads MUST have a documented Incident Response (IR) plan with severity definitions, escalation paths and on-call rotation. Runbooks MUST exist for all critical alerts and be linked directly from alert notifications. Post-incident reviews MUST be conducted for SEV1/SEV2 within 5 business days.
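
The 5-business-day window for post-incident reviews can be turned into a concrete due date. The following is a minimal sketch, not part of the control itself. It assumes that Monday to Friday are business days, that public holidays are ignored, and that counting starts on the day after a start date you choose (for example the resolution date; the control does not say which date to use). The function name `review_due_date` is hypothetical.

```python
from datetime import date, timedelta

# Window taken from the control text: reviews within 5 business days.
REVIEW_WINDOW_BUSINESS_DAYS = 5


def review_due_date(start: date,
                    business_days: int = REVIEW_WINDOW_BUSINESS_DAYS) -> date:
    """Return the date `business_days` working days after `start`.

    Assumption: Monday to Friday are working days; holidays are ignored.
    """
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


if __name__ == "__main__":
    # An incident closed on Friday 2024-01-05 is due the following Friday.
    print(review_due_date(date(2024, 1, 5)))
```

A team with a holiday calendar would extend the weekday check rather than change the counting loop.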

Rationale

Without a defined IR process, MTTR rises sharply because on-call engineers under pressure must reconstruct steps that should already have been documented. Runbooks encode institutional knowledge and enable consistent incident resolution regardless of which engineer is on duty. Post-mortems prevent recurrence through structured root cause analysis.

Threat Context

| Risk | Description |
| --- | --- |
| Extended MTTR | Without runbooks, engineers spend valuable minutes diagnosing instead of resolving. |
| Knowledge Loss | A key engineer is on vacation and no backup has the context needed for a critical service. |
| Inconsistent Severity | Without clear criteria, a SEV1 is treated as a SEV3 and is not escalated appropriately. |
| Incident Recurrence | The same root cause triggers a third incident because post-mortem action items were never tracked. |

Requirement

  • 4 severity levels (SEV1–SEV4) with objective, measurable criteria

  • On-call rotation with primary and secondary contact configured

  • Runbooks for all critical alerts; linked directly from the alert body

  • MTTR, MTTD and incident frequency tracked as metrics

  • Post-incident reviews for SEV1/SEV2 within 5 business days

  • Action items from post-mortems tracked to completion, each with a target date
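
To illustrate what "objective, measurable criteria" for the four severity levels could look like, here is a minimal sketch. The thresholds, the `SeverityRule` type, and the `classify` function are hypothetical and only show the shape of such a definition; real values must come from each workload's own IR plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityRule:
    name: str
    min_error_rate_pct: float   # error rate at or above this value matches
    min_user_impact_pct: float  # share of affected users at or above this matches


# Hypothetical thresholds, ordered from most to least severe.
RULES = (
    SeverityRule("SEV1", min_error_rate_pct=50.0, min_user_impact_pct=50.0),
    SeverityRule("SEV2", min_error_rate_pct=10.0, min_user_impact_pct=10.0),
    SeverityRule("SEV3", min_error_rate_pct=1.0, min_user_impact_pct=1.0),
)


def classify(error_rate_pct: float, user_impact_pct: float) -> str:
    """Return the first (most severe) level whose criteria are met."""
    for rule in RULES:
        if (error_rate_pct >= rule.min_error_rate_pct
                or user_impact_pct >= rule.min_user_impact_pct):
            return rule.name
    return "SEV4"  # anything below the SEV3 thresholds


if __name__ == "__main__":
    print(classify(error_rate_pct=12.0, user_impact_pct=3.0))
```

Because the rules are evaluated from most to least severe, an incident that meets several thresholds is always assigned the highest matching level.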

Implementation Guidance

  1. Severity definitions: YAML document with measurable criteria (error rate %, user impact %)

  2. Set up on-call: PagerDuty/OpsGenie with primary and secondary rotation

  3. Top-5 runbooks: Write a runbook for each of the top 5 alerts per service

  4. Alert description: alarm_description contains runbook URL and severity

  5. MTTR dashboard: Make incident metrics visible in Grafana or the platform's native tool

  6. Post-mortem culture: Introduce blameless post-mortem template; mandatory for SEV1/SEV2
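
The incident metrics behind the MTTR dashboard in step 5 can be derived from three timestamps per incident. The sketch below is an illustration under stated assumptions: the `Incident` record and its field names are hypothetical, MTTD is measured from fault start to detection, and MTTR is measured from detection to resolution (some teams measure MTTR from fault start instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started_at: datetime   # when the fault began
    detected_at: datetime  # when the first alert fired
    resolved_at: datetime  # when service was restored


def _mean(deltas: list[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: fault start to detection."""
    return _mean([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: detection to resolution (assumed definition)."""
    return _mean([i.resolved_at - i.detected_at for i in incidents])


if __name__ == "__main__":
    history = [
        Incident(datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 4),
                 datetime(2024, 3, 1, 10, 34)),
        Incident(datetime(2024, 3, 9, 12, 0), datetime(2024, 3, 9, 12, 10),
                 datetime(2024, 3, 9, 13, 0)),
    ]
    print("MTTD:", mttd(history))
    print("MTTR:", mttr(history))
```

Incident frequency follows from the same records by counting incidents per reporting period.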

Maturity Levels

| Level | Name | Criteria |
| --- | --- | --- |
| 1 | Ad-hoc | No defined process; incidents handled by whoever is available. |
| 2 | Process Documented | Severity and escalation defined; on-call configured; basic runbooks available. |
| 3 | Runbooks Linked, MTTR Tracked | All critical alerts with runbook link; MTTR reviewed monthly; post-mortems for SEV1/SEV2. |
| 4 | Automated Triage | Automated diagnostic data on alert; runbook steps partially automated. |
| 5 | Self-Healing | AIOps incident correlation; MTTR < 5 minutes for known error classes. |

Terraform Checks

waf-rel-060.tf.aws.sns-topic-alarm-action

Checks: CloudWatch alarms have alarm_actions and ok_actions for on-call notification.

Compliant:

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  alarm_actions = [aws_sns_topic.oncall.arn]
  ok_actions    = [aws_sns_topic.oncall.arn]
  alarm_description = jsonencode({
    runbook  = "https://wiki/rb/payment"
    severity = "SEV2"
  })
}

Non-Compliant:

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  # No alarm_actions – alert fires silently
  # WAF-REL-060 Violation
}

Remediation: Point alarm_actions and ok_actions to the on-call SNS topic. Populate alarm_description with the runbook URL and the severity classification.
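
The logic of this check can be sketched outside Terraform as well. The example below is not the WAF++ implementation; it is a minimal illustration that assumes the alarm's resolved attributes are available as a plain Python dict (how that dict is obtained is outside this sketch) and that alarm_description holds a JSON object with `runbook` and `severity` keys, as in the compliant example. The function name `alarm_violations` is hypothetical.

```python
import json

# Keys the compliant example stores in alarm_description.
REQUIRED_DESCRIPTION_KEYS = ("runbook", "severity")


def alarm_violations(alarm: dict) -> list[str]:
    """Return human-readable findings for one alarm's attributes."""
    findings = []

    # Both notification lists must be present and non-empty.
    for attr in ("alarm_actions", "ok_actions"):
        if not alarm.get(attr):
            findings.append(f"{attr} is missing or empty")

    # alarm_description must decode to a JSON object with the required keys.
    try:
        description = json.loads(alarm.get("alarm_description") or "")
    except (TypeError, json.JSONDecodeError):
        description = None

    if not isinstance(description, dict):
        findings.append("alarm_description is not a JSON object")
    else:
        for key in REQUIRED_DESCRIPTION_KEYS:
            if not description.get(key):
                findings.append(f"alarm_description lacks '{key}'")
    return findings


if __name__ == "__main__":
    silent_alarm = {"alarm_name": "payment-api-errors"}
    for finding in alarm_violations(silent_alarm):
        print(finding)
```

An empty result means the alarm satisfies the three conditions named in the remediation text; each finding maps to one attribute to fix.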

Evidence

| Type | Required | Description |
| --- | --- | --- |
| Governance | ✅ Required | Incident response plan with severity, escalation paths and on-call structure. |
| Process | ✅ Required | Post-incident review records for all SEV1/SEV2 incidents in the last 12 months. |
| Config | Optional | On-call schedule in PagerDuty/OpsGenie with current rotation. |
| Governance | Optional | Runbook catalog with links to all critical alert runbooks. |