# Best Practice: Incident Response & Runbooks

## Context
Without a structured incident response process, on-call engineers spend valuable minutes debating severity classifications, looking up contacts, and reconstructing diagnostic steps that have already been performed. Every minute of uncertainty during an incident adds to MTTR, and therefore costs availability.
Common problems without a structured IR process:

- SEV1 incident treated as minor because severity criteria are unclear
- Key engineer unreachable; no backup defined
- Same incident occurs for the third time because post-mortem actions were not tracked
- Runbook exists but is 18 months old and references deleted services
## Related Controls

- WAF-REL-060 – Incident Response & Runbook Readiness
## Target State

- Clearly defined severity levels with objective criteria
- On-call rotation with primary and secondary contact
- Runbooks for the top 5 alerts per service, linked from the alert body
- Blameless post-mortems for SEV1/SEV2 within 5 business days
- MTTR as a tracked reliability metric
## Technical Implementation

### Step 1: Severity Definitions
```yaml
# docs/incident-response/severity-definitions.yml
severity_levels:
  SEV1:
    name: "Critical"
    description: "Complete service outage or data loss in production"
    criteria:
      - "Service unavailable for > 5% of users"
      - "Data loss confirmed or suspected"
      - "SLO error budget fully exhausted"
      - "Revenue-generating functionality completely unavailable"
    response_time_sla: "15 minutes"
    escalation:
      primary: "On-call Engineer"
      secondary: "Engineering Manager (after 20min)"
      executive: "VP Engineering (after 45min)"
    communication:
      internal: "Slack #incidents every 30min"
      external: "Status page update within 30min"
  SEV2:
    name: "High"
    description: "Major degradation, partial outage or SLO burn"
    criteria:
      - "Error rate > 5x normal"
      - "Latency > 3x p99 SLO"
      - "Error budget burn rate > 14x"
      - "Critical feature unavailable for < 50% of users"
    response_time_sla: "30 minutes"
    escalation:
      primary: "On-call Engineer"
      secondary: "Team Lead (after 45min)"
  SEV3:
    name: "Medium"
    description: "Non-critical feature degradation, slow burn"
    criteria:
      - "Non-critical feature unavailable"
      - "Error budget burn rate 6x–14x"
      - "Performance degradation noticed but SLO not at risk"
    response_time_sla: "4 hours"
    escalation:
      primary: "On-call Engineer (next business day if outside hours)"
  SEV4:
    name: "Low"
    description: "Cosmetic issue, monitoring noise, minor inconvenience"
    response_time_sla: "Next sprint"
    escalation:
      primary: "Development team via ticket"
```
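The burn-rate criteria can also be evaluated mechanically when an alert fires. A minimal sketch (the function name and the 99.9% SLO value are illustrative, not part of the definitions file); burn rate is the observed error rate divided by the error rate the SLO allows:

```python
def classify_burn_rate(observed_error_rate: float, slo_target: float) -> str:
    """Map an error-budget burn rate to a severity level.

    Thresholds mirror the criteria above: > 14x -> SEV2, 6x-14x -> SEV3.
    """
    allowed = 1.0 - slo_target          # e.g. a 99.9% SLO allows 0.1% errors
    burn_rate = observed_error_rate / allowed
    if burn_rate > 14:
        return "SEV2"
    if burn_rate >= 6:
        return "SEV3"
    return "SEV4"

# 2% errors against a 99.9% SLO burns the budget 20x too fast -> SEV2
print(classify_burn_rate(0.02, 0.999))
```

A burn rate of 1x means the budget is consumed exactly over the SLO window; the 14x threshold corresponds to a fast burn that exhausts a monthly budget in roughly two days.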
### Step 2: PagerDuty Configuration via Terraform
```hcl
# terraform/monitoring/pagerduty.tf
resource "pagerduty_schedule" "primary" {
  name      = "payment-service-primary"
  time_zone = "Europe/Berlin"

  layer {
    name                         = "weekly-rotation"
    start                        = "2026-01-01T08:00:00+01:00"
    rotation_virtual_start       = "2026-01-06T08:00:00+01:00"
    rotation_turn_length_seconds = 604800 # 7 days

    users = [
      pagerduty_user.engineer1.id,
      pagerduty_user.engineer2.id,
      pagerduty_user.engineer3.id,
    ]
  }
}

resource "pagerduty_escalation_policy" "payment" {
  name      = "payment-service-escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 30
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_manager.id
    }
  }
}

resource "pagerduty_service" "payment_api" {
  name              = "Payment API – Production"
  escalation_policy = pagerduty_escalation_policy.payment.id

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }

  # Required when incident_urgency_rule uses "use_support_hours";
  # times and days are examples.
  support_hours {
    type         = "fixed_time_per_day"
    time_zone    = "Europe/Berlin"
    start_time   = "08:00:00"
    end_time     = "18:00:00"
    days_of_week = [1, 2, 3, 4, 5]
  }
}
```
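PagerDuty is the source of truth for who is currently on call, but the weekly rotation above reduces to simple arithmetic, which is handy for sanity checks. A sketch (function name and user labels are illustrative):

```python
from datetime import datetime, timedelta, timezone

def current_oncall(users, rotation_start, turn_length_s, now):
    """Return the user on call 'now' for a simple rotating schedule.

    Mirrors the layer above: fixed-length turns (rotation_turn_length_seconds)
    counted from rotation_virtual_start, cycling through the user list.
    """
    elapsed = (now - rotation_start).total_seconds()
    turn = int(elapsed // turn_length_s)
    return users[turn % len(users)]

start = datetime(2026, 1, 6, 8, 0, tzinfo=timezone(timedelta(hours=1)))
users = ["engineer1", "engineer2", "engineer3"]
# 10 days after the virtual start falls in the second 7-day turn
print(current_oncall(users, start, 604800, start + timedelta(days=10)))
```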
### Step 3: Runbook Structure
````markdown
# Runbook: Payment API – High Error Rate

**Alert ID:** payment-api-error-rate-sev2
**Severity:** SEV2
**Owner:** payments-team
**Last Updated:** 2026-03-01

## Symptom

CloudWatch alarm `slo-payment-api-fast-burn` triggered.
Error rate > 2% over 5 minutes.

## Hypotheses (sorted by frequency)

1. Downstream payment gateway degraded
2. Database connection pool exhausted
3. Invalid deployment rolled out

## Diagnosis

### 1. Open dashboard

https://grafana.example.com/d/payment-api-slo

### 2. Check error types

```
# CloudWatch Logs Insights
fields @timestamp, @message
| filter statusCode >= 500
| stats count(*) by bin(1m), statusCode
| sort @timestamp desc
| limit 20
```

### 3. Check payment gateway status

- Status page: https://status.payment-gateway.example.com
- Circuit breaker state: `curl https://api.payment.internal/actuator/circuitbreaker`

### 4. Check DB connection pool

```sql
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
SELECT * FROM pg_stat_database WHERE datname = 'payment_db';
```

## Remediation

### A: Payment gateway degraded

1. Manually open the circuit breaker if it did not trip automatically: `POST /admin/circuitbreaker/open`
2. Activate queued-payment mode: `POST /admin/features/queued-payments/enable`
3. Update the status page

### B: DB connection pool exhausted

1. Find hanging pods: `kubectl get pods -n payment | grep -E 'Terminating|Error'`
2. Delete hanging pods: `kubectl delete pod <pod-name> -n payment --grace-period=0`
3. Check the DB connection limit: `SHOW max_connections;`

## Escalation

After 20 minutes without progress: @engineering-manager via Slack
After 45 minutes: consider SEV1, inform VP Engineering

## Post-Mortem Template

https://wiki.example.com/post-mortem-template
````
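The **Last Updated:** field makes runbook staleness checkable in CI, which guards against the "18 months old" anti-pattern. A minimal sketch (the 180-day threshold is an illustrative choice):

```python
import re
from datetime import date

STALE_AFTER_DAYS = 180  # illustrative threshold

def is_stale(runbook_text: str, today: date) -> bool:
    """Flag a runbook whose **Last Updated:** date is missing or too old."""
    m = re.search(r"\*\*Last Updated:\*\*\s*(\d{4})-(\d{2})-(\d{2})", runbook_text)
    if not m:
        return True  # a runbook without a date counts as stale
    updated = date(int(m[1]), int(m[2]), int(m[3]))
    return (today - updated).days > STALE_AFTER_DAYS

print(is_stale("**Last Updated:** 2026-03-01", date(2026, 4, 1)))   # recent
print(is_stale("**Last Updated:** 2024-09-01", date(2026, 4, 1)))   # stale
```

Running this over all runbook files in a nightly pipeline turns staleness into a ticket instead of a mid-incident surprise.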
### Step 4: CloudWatch Alarm with Runbook Link
```hcl
resource "aws_cloudwatch_metric_alarm" "api_error_rate" {
  alarm_name          = "payment-api-error-rate-sev2"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  datapoints_to_alarm = 4
  metric_name         = "5XXError"
  namespace           = "AWS/ApiGateway"
  period              = 60
  statistic           = "Sum"
  threshold           = 10

  # Runbook URL in the alert body: directly navigable for on-call
  alarm_description = jsonencode({
    severity  = "SEV2"
    service   = "payment-api"
    runbook   = "https://wiki.example.com/runbooks/payment-api-error-rate"
    dashboard = "https://grafana.example.com/d/payment-api-slo"
  })

  alarm_actions = [aws_sns_topic.oncall_sev2.arn]
  ok_actions    = [aws_sns_topic.oncall_sev2.arn]

  tags = var.mandatory_tags
}
```
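Because the alarm description is JSON, a notification bot can extract the runbook link mechanically instead of parsing free text. A sketch, assuming the standard CloudWatch alarm payload delivered via SNS (where the message body is JSON containing an `AlarmDescription` field):

```python
import json

def runbook_from_sns(sns_message: str) -> str:
    """Pull the runbook URL out of a CloudWatch alarm SNS notification.

    The SNS message body is JSON with an AlarmDescription field, which is
    itself JSON here because the alarm uses jsonencode().
    """
    alarm = json.loads(sns_message)
    meta = json.loads(alarm["AlarmDescription"])
    return meta["runbook"]

# Simulated SNS message body for the alarm defined above
msg = json.dumps({
    "AlarmName": "payment-api-error-rate-sev2",
    "NewStateValue": "ALARM",
    "AlarmDescription": json.dumps({
        "severity": "SEV2",
        "service": "payment-api",
        "runbook": "https://wiki.example.com/runbooks/payment-api-error-rate",
        "dashboard": "https://grafana.example.com/d/payment-api-slo",
    }),
})
print(runbook_from_sns(msg))
```

A Slack-posting Lambda subscribed to the SNS topic could use this to render the runbook and dashboard as clickable links in the alert message.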
## Post-Mortem Template

```markdown
# Post-Mortem: [Service] – [Short description]

**Date:** YYYY-MM-DD
**Severity:** SEV1/SEV2
**Duration:** X hours Y minutes
**Impacted Users:** ~N users / % traffic
**Author:** @name

## Timeline

| Time  | Event                  |
|-------|------------------------|
| HH:MM | Alarm triggered        |
| HH:MM | On-call responds       |
| HH:MM | Root cause identified  |
| HH:MM | Mitigation deployed    |
| HH:MM | Service fully restored |

## Root Cause

[One-sentence statement: "The outage was caused by..."]

## Contributing Factors

- [Factor 1]
- [Factor 2]

## What Went Well

- [Positive 1]
- [Positive 2]

## Action Items

| Priority | Action | Owner | Due Date   |
|----------|--------|-------|------------|
| P1       | ...    | @name | YYYY-MM-DD |
| P2       | ...    | @name | YYYY-MM-DD |
```
## Typical Anti-Patterns

- **Outdated runbook:** the runbook references service names or endpoints that no longer exist
- **No OK action:** the alarm fires when things degrade but sends no signal when they recover, so on-call never learns the incident is over
- **Post-mortem without action items:** reviews without concrete tasks do not prevent recurrence
- **Premature severity escalation:** executives are paged unnecessarily for SEV3 incidents
## Metrics

- **MTTR:** mean time from alarm to service restored (target: < 30 minutes for SEV2)
- **MTTD:** time from first error to alarm triggered (target: < 5 minutes)
- **Post-mortem compliance rate:** % of SEV1/SEV2 incidents with a documented post-mortem (target: 100%)
- **Action item closure rate:** % of post-mortem action items completed by their due date
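MTTR is simple to compute once alarm and restore timestamps are recorded per incident. A minimal sketch (function name and sample data are illustrative):

```python
from datetime import datetime, timedelta
from statistics import mean

def mttr_minutes(incidents):
    """Mean time to restore, in minutes, over (alarm_time, restored_time) pairs."""
    durations = [(restored - alarm).total_seconds() / 60
                 for alarm, restored in incidents]
    return mean(durations)

t = datetime(2026, 3, 1, 10, 0)
incidents = [
    (t, t + timedelta(minutes=22)),   # restored after 22 min
    (t, t + timedelta(minutes=34)),   # restored after 34 min
]
print(mttr_minutes(incidents))  # 28.0 -> under the 30-minute SEV2 target
```

The same pair of timestamps, with "first error observed" added, yields MTTD; tracking both separates detection gaps from response gaps.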
## Maturity Level

- **Level 1** – Ad-hoc incident response, no process
- **Level 2** – Severity defined, on-call configured, basic runbooks
- **Level 3** – All critical alerts with runbook link; MTTR tracked; post-mortems for SEV1/SEV2
- **Level 4** – Automated diagnostic data collection; runbook steps partially automated
- **Level 5** – AIOps incident correlation; MTTR < 5 minutes for known error classes