WAF-REL-060 – Incident Response & Runbook Readiness
Description
All production workloads MUST have a documented Incident Response (IR) plan with severity definitions, escalation paths and on-call rotation. Runbooks MUST exist for all critical alerts and be linked directly from alert notifications. Post-incident reviews MUST be conducted for SEV1/SEV2 within 5 business days.
Rationale
Without a defined IR process, MTTR rises because on-call engineers under pressure must reconstruct steps that should already have been documented. Runbooks encode institutional knowledge and enable consistent incident resolution regardless of the engineer on duty. Post-mortems help prevent recurrence through structured root cause analysis.
Threat Context
| Risk | Description |
|---|---|
| Extended MTTR | Without runbooks, engineers spend valuable minutes diagnosing instead of resolving. |
| Knowledge Loss | Key engineer on vacation; no backup has context knowledge for critical service. |
| Inconsistent Severity | Without clear criteria, SEV1 is treated as SEV3; no appropriate escalation. |
| Incident Recurrence | The same root cause causes a third incident; no post-mortem action items tracked. |
Requirement
- 4 severity levels (SEV1–SEV4) with objective, measurable criteria
- On-call rotation with primary and secondary contact configured
- Runbooks for all critical alerts; linked directly from the alert body
- MTTR, MTTD and incident frequency tracked as metrics
- Post-incident reviews for SEV1/SEV2 within 5 business days
- Action items from post-mortems tracked to target date
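To illustrate what "objective, measurable criteria" could look like in practice, the following minimal sketch classifies an incident from two measurements. The thresholds, the field names and the `classify_severity` helper are hypothetical examples, not values prescribed by this control; each organisation would substitute its own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityRule:
    """One severity level with hypothetical, measurable thresholds."""
    level: str
    min_error_rate_pct: float   # share of failing requests
    min_user_impact_pct: float  # share of users affected


# Ordered from most to least severe. All thresholds are assumed examples.
RULES = [
    SeverityRule("SEV1", min_error_rate_pct=50.0, min_user_impact_pct=50.0),
    SeverityRule("SEV2", min_error_rate_pct=10.0, min_user_impact_pct=10.0),
    SeverityRule("SEV3", min_error_rate_pct=1.0, min_user_impact_pct=1.0),
]


def classify_severity(error_rate_pct: float, user_impact_pct: float) -> str:
    """Return the most severe level whose threshold is met by either metric."""
    for rule in RULES:
        if (error_rate_pct >= rule.min_error_rate_pct
                or user_impact_pct >= rule.min_user_impact_pct):
            return rule.level
    return "SEV4"  # anything below the SEV3 thresholds
```

Because the rules are evaluated from most to least severe, an incident that crosses several thresholds is always assigned the highest matching level, which guards against a SEV1 being handled as a SEV3.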
Implementation Guidance
- Severity definitions: YAML document with measurable criteria (error rate %, user impact %)
- Set up on-call: PagerDuty/OpsGenie with primary and secondary rotation
- Top-5 runbooks: Write runbook for the top 5 alerts per service
- Alert description: alarm_description contains runbook URL and severity
- MTTR dashboard: Make incident metrics visible in Grafana or native tool
- Post-mortem culture: Introduce blameless post-mortem template; mandatory for SEV1/SEV2
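The metrics behind an MTTR dashboard can be derived from three timestamps per incident. The sketch below is one possible calculation; the `Incident` record and its field names are assumptions, and it measures both MTTD and MTTR from the start of the fault, which is only one of several definitions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps needed."""
    started: datetime    # when the fault began
    detected: datetime   # when the alert fired
    resolved: datetime   # when service was restored


def _mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect: fault start until the alert fired."""
    return _mean_minutes([i.detected - i.started for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve: fault start until service was restored."""
    return _mean_minutes([i.resolved - i.started for i in incidents])
```

Both functions raise `statistics.StatisticsError` for an empty list, so a reporting period without incidents needs to be handled by the caller.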
Maturity Levels
| Level | Name | Criteria |
|---|---|---|
| 1 | Ad-hoc | No defined process; incidents handled by whoever is available. |
| 2 | Process Documented | Severity and escalation defined; on-call configured; basic runbooks available. |
| 3 | Runbooks Linked, MTTR Tracked | All critical alerts with runbook link; MTTR reviewed monthly; post-mortems for SEV1/SEV2. |
| 4 | Automated Triage | Automated diagnostic data on alert; runbook steps partially automated. |
| 5 | Self-Healing | AIOps incident correlation; MTTR < 5 minutes for known error classes. |
Terraform Checks
waf-rel-060.tf.aws.sns-topic-alarm-action
Checks: CloudWatch alarms have alarm_actions and ok_actions for on-call notification.
Remediation: Point alarm_actions and ok_actions to the SNS topic.
Populate alarm_description with runbook URL and severity classification.
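As a rough illustration of what this check and its remediation look for, the sketch below inspects the attribute values of a single `aws_cloudwatch_metric_alarm` resource supplied as a plain dictionary. It is not the implementation of `waf-rel-060.tf.aws.sns-topic-alarm-action`; the `check_alarm` helper, the regular expressions, the SNS topic ARN and the runbook URL are all hypothetical.

```python
import re

# Assumed conventions: any http(s) URL counts as a runbook link,
# and the severity appears as the literal token SEV1..SEV4.
RUNBOOK_URL = re.compile(r"https?://\S+")
SEVERITY = re.compile(r"\bSEV[1-4]\b")


def check_alarm(values: dict) -> list[str]:
    """Return a list of findings; an empty list means the alarm passes."""
    findings = []
    if not values.get("alarm_actions"):
        findings.append("alarm_actions is empty")
    if not values.get("ok_actions"):
        findings.append("ok_actions is empty")
    description = values.get("alarm_description") or ""
    if not RUNBOOK_URL.search(description):
        findings.append("alarm_description has no runbook URL")
    if not SEVERITY.search(description):
        findings.append("alarm_description has no severity")
    return findings


# Hypothetical compliant alarm: both action lists point to an SNS topic.
compliant = {
    "alarm_actions": ["arn:aws:sns:eu-west-1:123456789012:on-call"],
    "ok_actions": ["arn:aws:sns:eu-west-1:123456789012:on-call"],
    "alarm_description": "SEV2 | Runbook: https://runbooks.example.com/api-5xx",
}

# Hypothetical non-compliant alarm: no actions, no runbook link, no severity.
non_compliant = {"alarm_actions": [], "alarm_description": "High error rate"}
```

Calling `check_alarm(compliant)` returns an empty list, while `check_alarm(non_compliant)` reports all four findings.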
Evidence
| Type | Required | Description |
|---|---|---|
| Governance | ✅ Required | Incident response plan with severity, escalation paths and on-call structure. |
| Process | ✅ Required | Post-incident review records for all SEV1/SEV2 incidents in the last 12 months. |
| Config | Optional | On-call schedule in PagerDuty/OpsGenie with current rotation. |
| Governance | Optional | Runbook catalog with links to all critical alert runbooks. |
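Post-incident review records can be audited against the five-business-day requirement with a small date calculation. The sketch below assumes the deadline is counted from the resolution date and treats Monday to Friday as business days, ignoring public holidays; both helper names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays (Mon-Fri); holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def review_on_time(resolved: date, reviewed: Optional[date],
                   limit_days: int = 5) -> bool:
    """True if a review exists and took place by the business-day deadline."""
    if reviewed is None:
        return False  # no review record counts as a miss
    return reviewed <= add_business_days(resolved, limit_days)
```

For an incident resolved on Friday 2024-03-01, the deadline under these assumptions is Friday 2024-03-08, so a review held on 2024-03-11 would be flagged as late.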