Best Practice: Incident Response & Runbooks

Context

Without a structured incident response process, on-call engineers spend valuable minutes debating severity classifications, looking up contacts, and reconstructing diagnostic steps that were already worked out once before. Every minute of uncertainty during an incident adds to MTTR and therefore reduces availability.

Common problems without a structured IR process:

A SEV1 incident is handled as minor because the severity criteria are unclear
A key engineer cannot be reached and no backup is defined
The same incident occurs for the third time because post-mortem actions were not tracked
A runbook exists but is 18 months old and references deleted services

Related Controls

WAF-REL-060 – Incident Response & Runbook Readiness

Target State

Clearly defined severity levels with objective criteria
On-call rotation with a primary and a secondary contact
Runbooks for the 5 most frequent alerts per service, linked from the alert body
Blameless post-mortems for SEV1/SEV2 within 5 business days
MTTR tracked as a reliability metric

Technical Implementation

Step 1: Severity Definitions

# docs/incident-response/severity-definitions.yml
severity_levels:
  SEV1:
    name: "Critical"
    description: "Complete service outage or data loss in production"
    criteria:
      - "Service unavailable for > 5% of users"
      - "Data loss confirmed or suspected"
      - "SLO error budget fully exhausted"
      - "Revenue-generating functionality completely unavailable"
    response_time_sla: "15 minutes"
    escalation:
      primary: "On-call Engineer"
      secondary: "Engineering Manager (after 20min)"
      executive: "VP Engineering (after 45min)"
    communication:
      internal: "Slack #incidents every 30min"
      external: "Status page update within 30min"

  SEV2:
    name: "High"
    description: "Major degradation, partial outage or SLO burn"
    criteria:
      - "Error rate > 5x normal"
      - "Latency > 3x p99 SLO"
      - "Error budget burn rate > 14x"
      - "Critical feature unavailable for < 50% of users"
    response_time_sla: "30 minutes"
    escalation:
      primary: "On-call Engineer"
      secondary: "Team Lead (after 45min)"

  SEV3:
    name: "Medium"
    description: "Non-critical feature degradation, slow burn"
    criteria:
      - "Non-critical feature unavailable"
      - "Error budget burn rate 6x–14x"
      - "Performance degradation noticed but SLO not at risk"
    response_time_sla: "4 hours"
    escalation:
      primary: "On-call Engineer (next business day if outside hours)"

  SEV4:
    name: "Low"
    description: "Cosmetic issue, monitoring noise, minor inconvenience"
    response_time_sla: "Next sprint"
    escalation:
      primary: "Development team via ticket"
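
Objective criteria are easiest to apply under pressure when they can be evaluated mechanically. The following is a minimal sketch, assuming a hypothetical `IncidentSignal` structure that carries a subset of the measurements named in the criteria above. The field names and the top-down rule order are illustrative assumptions, not part of the severity file.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignal:
    """Hypothetical snapshot of an incident's measurable impact."""
    users_affected_pct: float          # share of users for whom the service is unavailable
    burn_rate: float                   # error budget burn rate (multiple of the sustainable rate)
    data_loss_suspected: bool = False
    critical_feature_down: bool = False


def classify_severity(signal: IncidentSignal) -> str:
    """Return the first matching level, checking the most severe level first."""
    # SEV1: outage for more than 5% of users, or confirmed/suspected data loss
    if signal.data_loss_suspected or signal.users_affected_pct > 5:
        return "SEV1"
    # SEV2: fast error budget burn or a critical feature unavailable
    if signal.burn_rate > 14 or signal.critical_feature_down:
        return "SEV2"
    # SEV3: slow burn in the 6x to 14x range
    if signal.burn_rate >= 6:
        return "SEV3"
    return "SEV4"
```

Checking the levels from most to least severe means an incident that matches several levels is assigned the highest one, which errs on the side of a faster response.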

Step 2: PagerDuty Configuration via Terraform

# terraform/monitoring/pagerduty.tf

resource "pagerduty_schedule" "primary" {
  name      = "payment-service-primary"
  time_zone = "Europe/Berlin"

  layer {
    name                         = "weekly-rotation"
    start                        = "2026-01-01T08:00:00+01:00"
    rotation_virtual_start       = "2026-01-06T08:00:00+01:00"
    rotation_turn_length_seconds = 604800  # 7 days

    users = [
      pagerduty_user.engineer1.id,
      pagerduty_user.engineer2.id,
      pagerduty_user.engineer3.id,
    ]
  }
}

resource "pagerduty_escalation_policy" "payment" {
  name      = "payment-service-escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 30

    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_manager.id
    }
  }
}

resource "pagerduty_service" "payment_api" {
  name              = "Payment API – Production"
  escalation_policy = pagerduty_escalation_policy.payment.id

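  incident_urgency_rule {
    type = "use_support_hours"

    # Urgency is set per time window below; a top-level `urgency`
    # applies only to rules of type "constant".
    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }

  # Note: "use_support_hours" also requires a `support_hours` block
  # on this service (not shown here).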
}
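
To sanity-check an escalation policy against the severity definitions, it helps to spell out who is notified at which minute if nobody acknowledges. The sketch below assumes that each rule's delay is the time before an unacknowledged incident moves on to the next rule, and that `num_loops` counts repetitions after the first pass. The function name and the target labels are hypothetical.

```python
from typing import List, Tuple


def escalation_timeline(rules: List[Tuple[int, str]], num_loops: int) -> List[Tuple[int, str]]:
    """Return (minute, target) pairs for an incident that is never acknowledged.

    rules: (escalation_delay_in_minutes, target label) in policy order.
    num_loops: number of times the policy repeats after the first pass (assumption).
    """
    timeline = []
    minute = 0
    for _ in range(num_loops + 1):      # first pass plus the repeats
        for delay, target in rules:
            timeline.append((minute, target))
            minute += delay             # wait this long before escalating further
    return timeline


# Values taken from the Terraform policy above.
payment_rules = [(15, "primary on-call schedule"), (30, "engineering manager")]
for minute, target in escalation_timeline(payment_rules, num_loops=2):
    print(f"t+{minute:>3} min: notify {target}")
```

Under these assumptions the engineering manager is first notified at minute 15, which is worth comparing against the 20-minute mark in the SEV1 definition.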

Step 3: Runbook Structure

# Runbook: Payment API – High Error Rate

**Alert ID:** payment-api-error-rate-sev2
**Severity:** SEV2
**Owner:** payments-team
**Last Updated:** 2026-03-01

## Symptom
CloudWatch alarm `slo-payment-api-fast-burn` has fired.
Error rate > 2% over 5 minutes.

## Hypotheses (ordered by frequency)
1. Downstream payment gateway degraded
2. Database connection pool exhausted
3. Faulty deployment rolled out

## Diagnosis

### 1. Open the dashboard
https://grafana.example.com/d/payment-api-slo

### 2. Check error types
```
# CloudWatch Insights
fields @timestamp, @message
| filter statusCode >= 500
| stats count(*) by bin(1m), statusCode
| sort @timestamp desc
| limit 20
```

### 3. Check payment gateway status
- Status Page: https://status.payment-gateway.example.com
- Circuit Breaker State: `curl https://api.payment.internal/actuator/circuitbreaker`

### 4. Check DB connection pool
```
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
SELECT * FROM pg_stat_database WHERE datname = 'payment_db';
```

## Remediation

### A: Payment gateway degraded
1. Open the circuit breaker manually if it has not opened automatically: `POST /admin/circuitbreaker/open`
2. Enable queued-payment mode: `POST /admin/features/queued-payments/enable`
3. Update the status page

### B: DB connection pool exhausted
1. Find hanging pods: `kubectl get pods -n payment | grep -E 'Terminating|Error'`
2. Force-delete hanging pods: `kubectl delete pod <pod-name> -n payment --grace-period=0 --force`
3. Check the DB connection limit: run `SHOW max_connections;` in the database

## Escalation
After 20 minutes without progress: notify @engineering-manager via Slack
After 45 minutes: consider upgrading to SEV1 and inform VP Engineering

## Post-Mortem Template
https://wiki.example.com/post-mortem-template
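
Because the runbook header carries a `Last Updated` field, staleness can be checked automatically, for example in a scheduled CI job. This is a minimal sketch under stated assumptions: the runbook is available as a markdown string in the header format shown above, and the 180-day threshold is an illustrative value rather than a requirement from this document.

```python
import re
from datetime import date
from typing import Optional

# Matches a header line such as "**Last Updated:** 2026-03-01"
LAST_UPDATED = re.compile(r"^\*\*Last Updated:\*\*\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def runbook_age_days(markdown: str, today: date) -> Optional[int]:
    """Return the runbook's age in days, or None if no Last Updated field is found."""
    match = LAST_UPDATED.search(markdown)
    if match is None:
        return None
    return (today - date.fromisoformat(match.group(1))).days


def is_stale(markdown: str, today: date, max_age_days: int = 180) -> bool:
    """Treat runbooks without a date, or older than max_age_days, as stale."""
    age = runbook_age_days(markdown, today)
    return age is None or age > max_age_days
```

Treating a missing date as stale is a deliberate choice in this sketch: a runbook that nobody has dated is unlikely to have been reviewed.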

Step 4: CloudWatch Alarm with Runbook Link

resource "aws_cloudwatch_metric_alarm" "api_error_rate" {
  alarm_name          = "payment-api-error-rate-sev2"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  datapoints_to_alarm = 4

  metric_name = "5XXError"
  namespace   = "AWS/ApiGateway"
  period      = 60
  statistic   = "Sum"
  threshold   = 10

  # Runbook URL in the alert body: one click away for the on-call engineer
  alarm_description = jsonencode({
    severity  = "SEV2"
    service   = "payment-api"
    runbook   = "https://wiki.example.com/runbooks/payment-api-error-rate"
    dashboard = "https://grafana.example.com/d/payment-api-slo"
  })

  alarm_actions = [aws_sns_topic.oncall_sev2.arn]
  ok_actions    = [aws_sns_topic.oncall_sev2.arn]

  tags = var.mandatory_tags
}
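
Encoding the description as JSON lets whatever consumes the SNS notification extract the runbook link mechanically. The sketch below assumes the consumer receives the CloudWatch alarm notification body as a JSON string with the `AlarmName`, `NewStateValue` and `AlarmDescription` fields. The function `format_page` and its output layout are hypothetical.

```python
import json


def format_page(sns_message: str) -> str:
    """Build a short page text from a CloudWatch alarm notification body."""
    alarm = json.loads(sns_message)

    # The description is itself JSON (see jsonencode above); fall back to an
    # empty dict if it is missing or plain text.
    try:
        details = json.loads(alarm.get("AlarmDescription") or "")
    except json.JSONDecodeError:
        details = {}
    if not isinstance(details, dict):
        details = {}

    severity = details.get("severity", "UNKNOWN")
    lines = [f"[{severity}] {alarm['AlarmName']} is {alarm['NewStateValue']}"]
    for key in ("runbook", "dashboard"):
        if key in details:
            lines.append(f"{key}: {details[key]}")
    return "\n".join(lines)
```

The fallback keeps alarms with free-text descriptions pageable, at the cost of losing the links, so such alarms show up as `[UNKNOWN]` and can be fixed afterwards.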

Post-Mortem Template

# Post-Mortem: [Service] – [Short description]

**Date:** YYYY-MM-DD
**Severity:** SEV1/SEV2
**Duration:** X hours Y minutes
**Impacted Users:** ~N users / % traffic
**Author:** @name

## Timeline

| Time | Event |
|------|-------|
| HH:MM | Alarm fired |
| HH:MM | On-call responds |
| HH:MM | Root cause identified |
| HH:MM | Mitigation deployed |
| HH:MM | Service fully restored |

## Root Cause
[One-sentence summary: "The outage was caused by..."]

## Contributing Factors
- [Faktor 1]
- [Faktor 2]

## What Went Well
- [Positiv 1]
- [Positiv 2]

## Action Items

| Priority | Action | Owner | Due Date |
|-----------|--------|-------|----------|
| P1 | ... | @name | YYYY-MM-DD |
| P2 | ... | @name | YYYY-MM-DD |

Common Anti-Patterns

Stale runbook: the runbook references service names or endpoints that no longer exist
No OK action: the alarm fires when things get bad but sends no signal when they improve, so recovery is assumed rather than confirmed (false recovery)
Post-mortem without action items: reviews without concrete tasks do not prevent recurrence
Severity escalation too early: managers are paged unnecessarily often for SEV3 incidents
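
Two of these patterns, the missing OK action and the missing runbook link, can be caught by a simple lint over alarm definitions. The sketch below assumes each alarm is available as a dictionary whose keys mirror the Terraform attribute names used in Step 4. How the dictionaries are obtained (for example from a plan export) is left open, and `lint_alarm` is a hypothetical helper.

```python
import json
from typing import Dict, List


def lint_alarm(alarm: Dict) -> List[str]:
    """Return a list of findings for one alarm definition (empty if clean)."""
    findings = []

    # Without ok_actions the on-call engineer never receives a recovery signal.
    if not alarm.get("ok_actions"):
        findings.append("no ok_actions: recovery will not be signalled")

    # The description is expected to be JSON containing a 'runbook' key.
    try:
        description = json.loads(alarm.get("alarm_description") or "")
    except json.JSONDecodeError:
        description = None
    if not isinstance(description, dict) or not description.get("runbook"):
        findings.append("no runbook link in alarm_description")

    return findings
```

Running such a check in CI turns these two patterns from review findings into failing builds.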

Metrics

MTTR: mean time from alarm to service restoration (target: < 30 minutes for SEV2)
MTTD: mean time from the first error until the alarm fires (target: < 5 minutes)
Post-mortem compliance rate: % of SEV1/SEV2 incidents with a documented post-mortem (target: 100%)
Action item closure rate: % of post-mortem action items completed by their due date
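
The time-based metrics follow directly from three timestamps per incident, which the post-mortem timeline already records. This is a minimal sketch, assuming a hypothetical `Incident` record and action items represented as (due date, closed date or None) pairs. None of these names come from an existing tool.

```python
from dataclasses import dataclass
from datetime import date, datetime
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Incident:
    first_error: datetime   # first failing request or log line
    alarm_fired: datetime   # alarm transitioned to ALARM
    restored: datetime      # service fully restored


def mttd_minutes(incidents: List[Incident]) -> float:
    """Mean time from first error to alarm, in minutes."""
    return mean((i.alarm_fired - i.first_error).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean time from alarm to service restoration, in minutes."""
    return mean((i.restored - i.alarm_fired).total_seconds() for i in incidents) / 60


def closure_rate_pct(items: List[Tuple[date, Optional[date]]]) -> float:
    """Share of action items closed on or before their due date, in percent."""
    if not items:
        return 0.0
    on_time = sum(1 for due, closed in items if closed is not None and closed <= due)
    return 100 * on_time / len(items)
```

Measuring MTTR from the alarm rather than from the first error keeps detection delay (MTTD) and response time separate, so each can be improved and reported independently.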

Maturity Levels

Level 1: Ad-hoc incident response, no process
Level 2: Severity levels defined, on-call configured, basic runbooks
Level 3: All critical alerts carry a runbook link; MTTR is tracked; post-mortems for SEV1/SEV2
Level 4: Automated collection of diagnostic data; runbooks partially automated
Level 5: AIOps incident correlation; MTTR < 5 minutes for known failure classes