WAF-REL-060 – Incident Response & Runbook Readiness

Pillar: Reliability | Severity: High | Category: Incident Response | Automatable: Medium

Description

All production workloads MUST have a documented Incident Response (IR) plan with severity definitions, escalation paths and an on-call rotation. Runbooks MUST exist for all critical alerts and be linked directly from alert notifications. Post-incident reviews MUST be conducted for every SEV1/SEV2 incident within 5 business days.

Rationale

Without a defined IR process, MTTR rises sharply because on-call engineers must reconstruct, under pressure, steps that should already have been documented. Runbooks encode institutional knowledge and enable consistent incident resolution regardless of which engineer is on duty. Post-mortems prevent recurrence through structured root cause analysis.

Threat Context

Risk	Description
Extended MTTR	Without runbooks, engineers spend valuable minutes diagnosing instead of resolving.
Knowledge Loss	The key engineer is on vacation and no backup has the context needed for a critical service.
Inconsistent Severity	Without clear criteria, a SEV1 is treated as a SEV3 and is not escalated appropriately.
Incident Recurrence	The same root cause triggers a third incident because post-mortem action items were not tracked.

Requirement

Four severity levels (SEV1–SEV4) with objective, measurable criteria
On-call rotation with primary and secondary contact configured
Runbooks for all critical alerts; linked directly from the alert body
MTTR, MTTD and incident frequency tracked as metrics
Post-incident reviews for SEV1/SEV2 within 5 business days
Action items from post-mortems tracked to completion against a target date
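
The primary and secondary on-call requirement could be expressed with the PagerDuty Terraform provider roughly as follows. This is a minimal sketch, not a prescribed setup: the schedule names, start dates, escalation delay and the two user-ID variables are all hypothetical, and an OpsGenie setup would use different resources.

```hcl
terraform {
  required_providers {
    pagerduty = {
      source = "PagerDuty/pagerduty"
    }
  }
}

# Hypothetical inputs: PagerDuty user IDs for each rotation.
variable "primary_user_ids" {
  type = list(string)
}

variable "secondary_user_ids" {
  type = list(string)
}

# Primary rotation: one engineer per week.
resource "pagerduty_schedule" "primary" {
  name      = "payment-primary"
  time_zone = "Etc/UTC"

  layer {
    name                         = "weekly"
    start                        = "2024-01-01T09:00:00+00:00"
    rotation_virtual_start       = "2024-01-01T09:00:00+00:00"
    rotation_turn_length_seconds = 604800 # one week
    users                        = var.primary_user_ids
  }
}

# Secondary rotation: backup contact if the primary does not acknowledge.
resource "pagerduty_schedule" "secondary" {
  name      = "payment-secondary"
  time_zone = "Etc/UTC"

  layer {
    name                         = "weekly"
    start                        = "2024-01-01T09:00:00+00:00"
    rotation_virtual_start       = "2024-01-01T09:00:00+00:00"
    rotation_turn_length_seconds = 604800
    users                        = var.secondary_user_ids
  }
}

# Page the primary first, then the secondary after 10 minutes (assumed delay).
resource "pagerduty_escalation_policy" "payment" {
  name      = "payment-escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.secondary.id
    }
  }
}
```

Keeping the two rotations as separate schedules makes it easy to show an auditor who the current primary and secondary contacts are.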

Implementation Guidance

Severity definitions: YAML document with measurable criteria (error rate %, user impact %)
Set up on-call: PagerDuty/OpsGenie with primary and secondary rotation
Top-5 runbooks: Write a runbook for each of the top 5 alerts per service
Alert description: alarm_description contains the runbook URL and severity
MTTR dashboard: Make incident metrics visible in Grafana or the platform's native tool
Post-mortem culture: Introduce blameless post-mortem template; mandatory for SEV1/SEV2
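
The severity, on-call and alert-description steps above might be wired together in Terraform along these lines. Treat it as a sketch under stated assumptions: the threshold numbers are placeholders rather than values prescribed by this control, the webhook variable, the API name and the choice of the API Gateway 5XXError metric are hypothetical, and the severity map is written as HCL locals here although it could equally be kept in a YAML file and loaded with yamldecode(file(...)).

```hcl
# Placeholder severity criteria (assumed values, adjust per workload).
locals {
  severity = {
    SEV1 = { error_rate_pct = 25, user_impact_pct = 50 }
    SEV2 = { error_rate_pct = 10, user_impact_pct = 20 }
    SEV3 = { error_rate_pct = 5, user_impact_pct = 5 }
    SEV4 = { error_rate_pct = 1, user_impact_pct = 0 }
  }
}

# Hypothetical HTTPS integration endpoint of the on-call tool.
variable "oncall_webhook_url" {
  type      = string
  sensitive = true
}

# Topic that carries alarm and recovery notifications to the on-call tool.
resource "aws_sns_topic" "oncall" {
  name = "oncall-alerts"
}

resource "aws_sns_topic_subscription" "oncall" {
  topic_arn              = aws_sns_topic.oncall.arn
  protocol               = "https"
  endpoint               = var.oncall_webhook_url
  endpoint_auto_confirms = true
}

# Error-rate alarm whose threshold comes from the SEV2 definition.
resource "aws_cloudwatch_metric_alarm" "payment_5xx_rate" {
  alarm_name          = "payment-api-5xx-rate"
  namespace           = "AWS/ApiGateway"
  metric_name         = "5XXError"
  dimensions          = { ApiName = "payment-api" }
  statistic           = "Average" # average of this metric is the error ratio (0 to 1)
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = local.severity.SEV2.error_rate_pct / 100

  alarm_actions = [aws_sns_topic.oncall.arn]
  ok_actions    = [aws_sns_topic.oncall.arn]

  # Runbook link and severity travel with every notification.
  alarm_description = jsonencode({
    runbook  = "https://wiki/rb/payment"
    severity = "SEV2"
  })
}
```

Deriving the alarm threshold from the severity map keeps the documented criteria and the deployed alert from drifting apart.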

Maturity Levels

Level	Name	Criteria
1	Ad-hoc	No defined process; incidents handled by whoever is available.
2	Process Documented	Severity and escalation defined; on-call configured; basic runbooks available.
3	Runbooks Linked, MTTR Tracked	All critical alerts with runbook link; MTTR reviewed monthly; post-mortems for SEV1/SEV2.
4	Automated Triage	Automated diagnostic data on alert; runbook steps partially automated.
5	Self-Healing	AIOps incident correlation; MTTR < 5 minutes for known error classes.

Terraform Checks

waf-rel-060.tf.aws.sns-topic-alarm-action

Checks: CloudWatch alarms have alarm_actions and ok_actions for on-call notification.

Compliant

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  alarm_actions = [aws_sns_topic.oncall.arn]
  ok_actions    = [aws_sns_topic.oncall.arn]
  alarm_description = jsonencode({
    runbook  = "https://wiki/rb/payment"
    severity = "SEV2"
  })
}

Non-Compliant

resource "aws_cloudwatch_metric_alarm" "api_errors" {
  alarm_name = "payment-api-errors"
  # ... metric config ...
  # No alarm_actions: the alert fires silently
  # WAF-REL-060 Violation
}


Remediation: Point alarm_actions and ok_actions to the SNS topic. Populate alarm_description with runbook URL and severity classification.
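
One way to catch a regression after remediation is a Terraform check block next to the alarm. The sketch below assumes Terraform 1.5 or later and the api_errors alarm from the example above; the check name and error messages are illustrative. Failed assertions in a check block surface as warnings during plan and apply, they do not block the run.

```hcl
check "waf_rel_060_api_errors" {
  # The alarm must notify on-call when it fires.
  assert {
    condition     = try(length(aws_cloudwatch_metric_alarm.api_errors.alarm_actions), 0) > 0
    error_message = "WAF-REL-060: alarm has no alarm_actions and would fire silently."
  }

  # The alarm must also notify on recovery.
  assert {
    condition     = try(length(aws_cloudwatch_metric_alarm.api_errors.ok_actions), 0) > 0
    error_message = "WAF-REL-060: alarm has no ok_actions."
  }

  # alarm_description must be JSON carrying an https runbook link.
  assert {
    condition = can(regex(
      "^https://",
      jsondecode(aws_cloudwatch_metric_alarm.api_errors.alarm_description).runbook
    ))
    error_message = "WAF-REL-060: alarm_description must contain a runbook URL."
  }

  # The severity must be one of the four defined levels.
  assert {
    condition = contains(
      ["SEV1", "SEV2", "SEV3", "SEV4"],
      try(jsondecode(aws_cloudwatch_metric_alarm.api_errors.alarm_description).severity, "")
    )
    error_message = "WAF-REL-060: alarm_description must contain a severity of SEV1 to SEV4."
  }
}
```

The try and can wrappers make a missing or non-JSON description fail the assertion instead of raising an evaluation error.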

Evidence

Type	Required	Description
Governance	✅ Required	Incident response plan with severity, escalation paths and on-call structure.
Process	✅ Required	Post-incident review records for all SEV1/SEV2 incidents in the last 12 months.
Config	Optional	On-call schedule in PagerDuty/OpsGenie with current rotation.
Governance	Optional	Runbook catalog with links to all critical alert runbooks.

Related Controls

Best Practice: Incident Response & Runbooks