# Best Practice: Incident Response & Runbooks

## Context
Without a structured incident response process, on-call engineers spend valuable minutes debating severity classifications, looking up contacts, and reconstructing diagnostic steps that have already been performed. Every minute of uncertainty during an incident adds to MTTR, and therefore costs availability.
Common problems without a structured IR process:

- SEV1 incident treated as minor because severity criteria are unclear
- Key engineer unreachable; no backup defined
- Same incident occurs for the third time because post-mortem actions were not tracked
- Runbook exists but is 18 months old and references deleted services
## Related Controls

- WAF-REL-060 – Incident Response & Runbook Readiness
## Target State

- Clearly defined severity levels with objective criteria
- On-call rotation with primary and secondary contact
- Runbooks for the top 5 alerts per service, linked from the alert body
- Blameless post-mortems for SEV1/SEV2 within 5 business days
- MTTR as a tracked reliability metric
## Technical Implementation

### Step 1: Severity Definitions
```yaml
# docs/incident-response/severity-definitions.yml
severity_levels:
  SEV1:
    name: "Critical"
    description: "Complete service outage or data loss in production"
    criteria:
      - "Service unavailable for > 5% of users"
      - "Data loss confirmed or suspected"
      - "SLO error budget fully exhausted"
      - "Revenue-generating functionality completely unavailable"
    response_time_sla: "15 minutes"
    escalation:
      primary: "On-call Engineer"
      secondary: "Engineering Manager (after 20min)"
      executive: "VP Engineering (after 45min)"
    communication:
      internal: "Slack #incidents every 30min"
      external: "Status page update within 30min"
  SEV2:
    name: "High"
    description: "Major degradation, partial outage or SLO burn"
    criteria:
      - "Error rate > 5x normal"
      - "Latency > 3x p99 SLO"
      - "Error budget burn rate > 14x"
      - "Critical feature unavailable for < 50% of users"
    response_time_sla: "30 minutes"
    escalation:
      primary: "On-call Engineer"
      secondary: "Team Lead (after 45min)"
  SEV3:
    name: "Medium"
    description: "Non-critical feature degradation, slow burn"
    criteria:
      - "Non-critical feature unavailable"
      - "Error budget burn rate 6x–14x"
      - "Performance degradation noticed but SLO not at risk"
    response_time_sla: "4 hours"
    escalation:
      primary: "On-call Engineer (next business day if outside hours)"
  SEV4:
    name: "Low"
    description: "Cosmetic issue, monitoring noise, minor inconvenience"
    response_time_sla: "Next sprint"
    escalation:
      primary: "Development team via ticket"
```
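The burn-rate criteria can also be evaluated mechanically when an alert fires. A minimal sketch (the function name and the 99.9% SLO value are illustrative, not part of the definitions file); burn rate is the observed error rate divided by the error rate the SLO allows:

```python
def classify_burn_rate(observed_error_rate: float, slo_target: float) -> str:
    """Map an error-budget burn rate to a severity level.

    Thresholds mirror the criteria above: > 14x -> SEV2, 6x-14x -> SEV3.
    """
    allowed = 1.0 - slo_target          # e.g. a 99.9% SLO allows 0.1% errors
    burn_rate = observed_error_rate / allowed
    if burn_rate > 14:
        return "SEV2"
    if burn_rate >= 6:
        return "SEV3"
    return "SEV4"

# 2% errors against a 99.9% SLO burns the budget 20x too fast -> SEV2
print(classify_burn_rate(0.02, 0.999))
```

A burn rate of 1x means the budget is consumed exactly over the SLO window; the 14x threshold corresponds to a fast burn that exhausts a monthly budget in roughly two days.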
### Step 2: PagerDuty Configuration via Terraform
```hcl
# terraform/monitoring/pagerduty.tf
resource "pagerduty_schedule" "primary" {
  name      = "payment-service-primary"
  time_zone = "Europe/Berlin"

  layer {
    name                         = "weekly-rotation"
    start                        = "2026-01-01T08:00:00+01:00"
    rotation_virtual_start       = "2026-01-06T08:00:00+01:00"
    rotation_turn_length_seconds = 604800 # 7 days

    users = [
      pagerduty_user.engineer1.id,
      pagerduty_user.engineer2.id,
      pagerduty_user.engineer3.id,
    ]
  }
}

resource "pagerduty_escalation_policy" "payment" {
  name      = "payment-service-escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 30
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_manager.id
    }
  }
}

resource "pagerduty_service" "payment_api" {
  name              = "Payment API – Production"
  escalation_policy = pagerduty_escalation_policy.payment.id

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }

  # Required when incident_urgency_rule uses "use_support_hours";
  # times and days are examples.
  support_hours {
    type         = "fixed_time_per_day"
    time_zone    = "Europe/Berlin"
    start_time   = "08:00:00"
    end_time     = "18:00:00"
    days_of_week = [1, 2, 3, 4, 5]
  }
}
```
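PagerDuty is the source of truth for who is currently on call, but the weekly rotation above reduces to simple arithmetic, which is handy for sanity checks. A sketch (function name and user labels are illustrative):

```python
from datetime import datetime, timedelta, timezone

def current_oncall(users, rotation_start, turn_length_s, now):
    """Return the user on call 'now' for a simple rotating schedule.

    Mirrors the layer above: fixed-length turns (rotation_turn_length_seconds)
    counted from rotation_virtual_start, cycling through the user list.
    """
    elapsed = (now - rotation_start).total_seconds()
    turn = int(elapsed // turn_length_s)
    return users[turn % len(users)]

start = datetime(2026, 1, 6, 8, 0, tzinfo=timezone(timedelta(hours=1)))
users = ["engineer1", "engineer2", "engineer3"]
# 10 days after the virtual start falls in the second 7-day turn
print(current_oncall(users, start, 604800, start + timedelta(days=10)))
```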
### Step 3: Runbook Structure
````markdown
# Runbook: Payment API – High Error Rate

**Alert ID:** payment-api-error-rate-sev2
**Severity:** SEV2
**Owner:** payments-team
**Last Updated:** 2026-03-01

## Symptom

CloudWatch alarm `slo-payment-api-fast-burn` triggered.
Error rate > 2% over 5 minutes.

## Hypotheses (sorted by frequency)

1. Downstream payment gateway degraded
2. Database connection pool exhausted
3. Invalid deployment rolled out

## Diagnosis

### 1. Open dashboard

https://grafana.example.com/d/payment-api-slo

### 2. Check error types

```
# CloudWatch Logs Insights
fields @timestamp, @message
| filter statusCode >= 500
| stats count(*) by bin(1m), statusCode
| sort @timestamp desc
| limit 20
```

### 3. Check payment gateway status

- Status page: https://status.payment-gateway.example.com
- Circuit breaker state: `curl https://api.payment.internal/actuator/circuitbreaker`

### 4. Check DB connection pool

```sql
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
SELECT * FROM pg_stat_database WHERE datname = 'payment_db';
```

## Remediation

### A: Payment gateway degraded

1. Manually open the circuit breaker if it did not trip automatically: `POST /admin/circuitbreaker/open`
2. Activate queued-payment mode: `POST /admin/features/queued-payments/enable`
3. Update the status page

### B: DB connection pool exhausted

1. Find hanging pods: `kubectl get pods -n payment | grep -E 'Terminating|Error'`
2. Delete hanging pods: `kubectl delete pod <pod-name> -n payment --grace-period=0`
3. Check the DB connection limit: `SHOW max_connections;`

## Escalation

After 20 minutes without progress: @engineering-manager via Slack
After 45 minutes: consider SEV1, inform VP Engineering

## Post-Mortem Template

https://wiki.example.com/post-mortem-template
````
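The **Last Updated:** field makes runbook staleness checkable in CI, which guards against the "18 months old" anti-pattern. A minimal sketch (the 180-day threshold is an illustrative choice):

```python
import re
from datetime import date

STALE_AFTER_DAYS = 180  # illustrative threshold

def is_stale(runbook_text: str, today: date) -> bool:
    """Flag a runbook whose **Last Updated:** date is missing or too old."""
    m = re.search(r"\*\*Last Updated:\*\*\s*(\d{4})-(\d{2})-(\d{2})", runbook_text)
    if not m:
        return True  # a runbook without a date counts as stale
    updated = date(int(m[1]), int(m[2]), int(m[3]))
    return (today - updated).days > STALE_AFTER_DAYS

print(is_stale("**Last Updated:** 2026-03-01", date(2026, 4, 1)))   # recent
print(is_stale("**Last Updated:** 2024-09-01", date(2026, 4, 1)))   # stale
```

Running this over all runbook files in a nightly pipeline turns staleness into a ticket instead of a mid-incident surprise.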
### Step 4: CloudWatch Alarm with Runbook Link
```hcl
resource "aws_cloudwatch_metric_alarm" "api_error_rate" {
  alarm_name          = "payment-api-error-rate-sev2"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  datapoints_to_alarm = 4
  metric_name         = "5XXError"
  namespace           = "AWS/ApiGateway"
  period              = 60
  statistic           = "Sum"
  threshold           = 10

  # Runbook URL in the alert body: directly navigable for on-call
  alarm_description = jsonencode({
    severity  = "SEV2"
    service   = "payment-api"
    runbook   = "https://wiki.example.com/runbooks/payment-api-error-rate"
    dashboard = "https://grafana.example.com/d/payment-api-slo"
  })

  alarm_actions = [aws_sns_topic.oncall_sev2.arn]
  ok_actions    = [aws_sns_topic.oncall_sev2.arn]

  tags = var.mandatory_tags
}
```
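Because the alarm description is JSON, a notification bot can extract the runbook link mechanically instead of parsing free text. A sketch, assuming the standard CloudWatch alarm payload delivered via SNS (where the message body is JSON containing an `AlarmDescription` field):

```python
import json

def runbook_from_sns(sns_message: str) -> str:
    """Pull the runbook URL out of a CloudWatch alarm SNS notification.

    The SNS message body is JSON with an AlarmDescription field, which is
    itself JSON here because the alarm uses jsonencode().
    """
    alarm = json.loads(sns_message)
    meta = json.loads(alarm["AlarmDescription"])
    return meta["runbook"]

# Simulated SNS message body for the alarm defined above
msg = json.dumps({
    "AlarmName": "payment-api-error-rate-sev2",
    "NewStateValue": "ALARM",
    "AlarmDescription": json.dumps({
        "severity": "SEV2",
        "service": "payment-api",
        "runbook": "https://wiki.example.com/runbooks/payment-api-error-rate",
        "dashboard": "https://grafana.example.com/d/payment-api-slo",
    }),
})
print(runbook_from_sns(msg))
```

A Slack-posting Lambda subscribed to the SNS topic could use this to render the runbook and dashboard as clickable links in the alert message.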
## Post-Mortem Template

```markdown
# Post-Mortem: [Service] – [Short description]

**Date:** YYYY-MM-DD
**Severity:** SEV1/SEV2
**Duration:** X hours Y minutes
**Impacted Users:** ~N users / % traffic
**Author:** @name

## Timeline

| Time  | Event                  |
|-------|------------------------|
| HH:MM | Alarm triggered        |
| HH:MM | On-call responds       |
| HH:MM | Root cause identified  |
| HH:MM | Mitigation deployed    |
| HH:MM | Service fully restored |

## Root Cause

[One-sentence statement: "The outage was caused by..."]

## Contributing Factors

- [Factor 1]
- [Factor 2]

## What Went Well

- [Positive 1]
- [Positive 2]

## Action Items

| Priority | Action | Owner | Due Date   |
|----------|--------|-------|------------|
| P1       | ...    | @name | YYYY-MM-DD |
| P2       | ...    | @name | YYYY-MM-DD |
```
## Typical Anti-Patterns

- **Outdated runbook:** the runbook references service names or endpoints that no longer exist
- **No OK action:** the alarm fires when things degrade but sends no signal when they recover, so on-call never learns the incident is over
- **Post-mortem without action items:** reviews without concrete tasks do not prevent recurrence
- **Premature severity escalation:** executives are paged unnecessarily for SEV3 incidents
## Metrics

- **MTTR:** mean time from alarm to service restored (target: < 30 minutes for SEV2)
- **MTTD:** time from first error to alarm triggered (target: < 5 minutes)
- **Post-mortem compliance rate:** % of SEV1/SEV2 incidents with a documented post-mortem (target: 100%)
- **Action item closure rate:** % of post-mortem action items completed by their due date
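MTTR is simple to compute once alarm and restore timestamps are recorded per incident. A minimal sketch (function name and sample data are illustrative):

```python
from datetime import datetime, timedelta
from statistics import mean

def mttr_minutes(incidents):
    """Mean time to restore, in minutes, over (alarm_time, restored_time) pairs."""
    durations = [(restored - alarm).total_seconds() / 60
                 for alarm, restored in incidents]
    return mean(durations)

t = datetime(2026, 3, 1, 10, 0)
incidents = [
    (t, t + timedelta(minutes=22)),   # restored after 22 min
    (t, t + timedelta(minutes=34)),   # restored after 34 min
]
print(mttr_minutes(incidents))  # 28.0 -> under the 30-minute SEV2 target
```

The same pair of timestamps, with "first error observed" added, yields MTTD; tracking both separates detection gaps from response gaps.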
## Maturity Level

- **Level 1** – Ad-hoc incident response, no process
- **Level 2** – Severity defined, on-call configured, basic runbooks
- **Level 3** – All critical alerts with runbook link; MTTR tracked; post-mortems for SEV1/SEV2
- **Level 4** – Automated diagnostic data collection; runbook steps partially automated
- **Level 5** – AIOps incident correlation; MTTR < 5 minutes for known error classes