WAF-OPS-060 – Runbook & Operational Documentation Coverage

Pillar: Operational Excellence | Severity: Medium | Category: Documentation | Automatable: Low

Description

Every production workload MUST have runbooks that cover all known failure modes and routine tasks. Runbooks MUST be versioned, reviewed regularly (at least quarterly), and accessible to on-call engineers without an authentication barrier. Every paging alert MUST include a runbook URL.

Rationale

Undocumented operational knowledge creates single points of failure in people. If the engineer who "knows how to restart the payment processor" is unreachable, the whole team is blocked. Runbooks codify this knowledge, reduce MTTR, and enable junior engineers to handle incidents on their own. Runbook quality correlates directly with MTTR: teams with comprehensive runbooks resolve incidents 40–60% faster.

Threat Context

Risk	Description
Knowledge silo	Critical operational knowledge lives in individual heads: vacation or resignation puts operations at risk.
Extended MTTR	Without a runbook, diagnosis and remediation rely on intuition and take hours instead of minutes.
Outdated information	Runbooks that are not kept up to date contain wrong steps that can make an incident worse.
No on-call readiness	Without runbooks, junior engineers cannot take on-call shifts on their own.

Requirements

All paging alerts MUST include a runbook URL in their description
Runbooks MUST be stored in version control (e.g. a wiki with version history)
A quarterly review MUST ensure that runbooks are current and complete
Runbook coverage MUST be measured (target: >= 90% for critical services)
All on-call engineers MUST have access to runbooks without an authentication barrier
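
The quarterly review requirement can be supported by a small staleness check. The following is a minimal sketch, not part of the control itself: it assumes that each runbook records a last-reviewed date, approximates "quarterly" as 92 days, and uses hypothetical file paths and dates.

```python
from datetime import date, timedelta
from typing import Dict, List

# Assumption: "quarterly" is approximated as 92 days; the control defines no exact window.
REVIEW_WINDOW = timedelta(days=92)


def stale_runbooks(last_reviewed: Dict[str, date], today: date) -> List[str]:
    """Return the runbook paths whose last review is older than the review window."""
    return sorted(
        path
        for path, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_WINDOW
    )


# Hypothetical review dates, e.g. parsed from a "last reviewed" field in each runbook.
reviews = {
    "docs/runbooks/payment/5xx-errors.md": date(2024, 1, 10),
    "docs/runbooks/payment/restart.md": date(2024, 5, 2),
}

print(stale_runbooks(reviews, today=date(2024, 6, 1)))
# prints ['docs/runbooks/payment/5xx-errors.md']
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the check deterministic and easy to test.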

Implementation Guide

Create a runbook template: Overview, Trigger, Impact, Diagnosis, Remediation, Escalation, Author
Store runbooks in the repository: docs/runbooks/<service>/ with revision history
Link alerts to runbooks: alarm_description (CloudWatch), runbook_url (Prometheus), description (Azure)
Track the coverage metric: (services with runbooks / total services) × 100; target >= 90%
Post-incident updates: update runbooks within 48h after an incident
Quarterly review: remove outdated runbooks for decommissioned components
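
The coverage metric from the list above could be computed as in the following sketch. The service names are hypothetical, and how the two sets are obtained (service catalog, directory listing of docs/runbooks/) is an assumption left open here.

```python
from typing import Set


def runbook_coverage(services: Set[str], services_with_runbooks: Set[str]) -> float:
    """Coverage in percent: (services with runbooks / total services) * 100."""
    if not services:
        raise ValueError("no services to measure")
    # Runbooks for services outside the measured set do not count towards coverage.
    covered = services & services_with_runbooks
    return len(covered) / len(services) * 100


# Hypothetical inventory of critical services and of services that have a runbook.
critical = {"payment", "checkout", "auth", "search"}
documented = {"payment", "checkout", "auth", "legacy-batch"}

coverage = runbook_coverage(critical, documented)
status = "OK" if coverage >= 90 else "below target"
print(f"{coverage:.1f}% (target >= 90%): {status}")
# prints 75.0% (target >= 90%): below target
```

Intersecting the two sets prevents a leftover runbook for a decommissioned service from inflating the result.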

Maturity Levels

Level	Name	Criteria
1	No runbooks	No documented operational knowledge. On-call depends on specific individuals.
2	Basic runbooks	Runbooks for worst-case scenarios only. Not linked to alerts. No formal review.
3	All alerts linked	All paging alerts have runbook URLs. Runbooks in version control. Quarterly review.
4	Metrics & living docs	Coverage metric >= 90%. Runbooks updated within 48h after an incident.
5	Self-service automation	Critical runbook steps automated. Coverage 100%. Toil trend positive.

Terraform Checks

waf-ops-060.tf.aws.cloudwatch-alarm-runbook-annotation

Checks: The CloudWatch alarm's alarm_description contains an HTTP(S) URL (the runbook link).

# Compliant
resource "aws_cloudwatch_metric_alarm" "errors" {
  alarm_name        = "payment-5xx-rate"
  alarm_description = "5xx rate high. Runbook: https://wiki/payment/5xx-errors"
  alarm_actions     = [aws_sns_topic.oncall.arn]
  /* ... */
}

# Non-Compliant
resource "aws_cloudwatch_metric_alarm" "errors" {
  alarm_name        = "payment-5xx-rate"
  alarm_description = "High error rate"
  # No runbook URL
  # WAF-OPS-060 violation
  alarm_actions     = [aws_sns_topic.oncall.arn]
}
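
The rule applied to alarm_description can be illustrated with a short sketch. This is not the check's actual implementation, which the document does not show. It assumes that any http:// or https:// URL in the description counts as a runbook link; the function name is hypothetical.

```python
import re
from typing import Optional

# Assumption: any http:// or https:// URL in the description counts as a runbook link.
URL_PATTERN = re.compile(r"https?://[^\s\"']+")


def has_runbook_url(alarm_description: Optional[str]) -> bool:
    """Return True if the description is set and contains an HTTP(S) URL."""
    return bool(alarm_description and URL_PATTERN.search(alarm_description))


print(has_runbook_url("5xx rate high. Runbook: https://wiki/payment/5xx-errors"))  # prints True
print(has_runbook_url("High error rate"))  # prints False
```

A stricter variant could additionally require the URL to point at a known runbook host, which would catch dashboards or ticket links that merely happen to be URLs.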

waf-ops-060.tf.aws.prometheus-alert-runbook-label

Checks: Prometheus alert rules have a runbook_url annotation.

# Compliant
- alert: HighErrorRate
  expr: rate(http_requests_total{code=~"5.."}[5m]) > 0.01
  annotations:
    runbook_url: "https://wiki/runbooks/payment-errors"
    # WAF-OPS-060 compliant

Remediation: Add a runbook URL to alarm_description. Add a runbook_url: "https://…" annotation to Prometheus alerts.
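
To find the Prometheus alerts that still need remediation, a check along the following lines could be used. This is a sketch under assumptions: the rules are already parsed into Python dictionaries (for example by a YAML parser), the function name is hypothetical, and the InstanceDown rule is invented for illustration.

```python
from typing import Dict, List


def alerts_missing_runbook(rules: List[Dict]) -> List[str]:
    """Return the names of alerting rules without an http(s) runbook_url annotation."""
    missing = []
    for rule in rules:
        if "alert" not in rule:
            continue  # recording rules have no "alert" key and are skipped
        annotations = rule.get("annotations") or {}
        url = str(annotations.get("runbook_url", ""))
        if not url.startswith(("http://", "https://")):
            missing.append(rule["alert"])
    return missing


rules = [
    {
        "alert": "HighErrorRate",
        "expr": 'rate(http_requests_total{code=~"5.."}[5m]) > 0.01',
        "annotations": {"runbook_url": "https://wiki/runbooks/payment-errors"},
    },
    # Hypothetical rule without annotations, added to show a violation.
    {"alert": "InstanceDown", "expr": "up == 0"},
]

print(alerts_missing_runbook(rules))
# prints ['InstanceDown']
```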

Evidence

Type	Required	Description
Process	✅ Required	Runbook repository with version history and last-reviewed date.
Governance	✅ Required	Runbook coverage report: number of services with runbooks vs. total.
Config	Optional	Alert configuration with runbook URL annotation.
Process	Optional	Quarterly review minutes or JIRA tickets for runbook review tasks.

Regulatory Mapping

Framework	Controls
ISO/IEC 20000-1:2018	8.2.3 – Change management; 8.3.4 – Release management; 8.4.1 – Service delivery; 8.4.2 – Service reporting
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Operational Excellence Pillar – Prepare; Operational Excellence Pillar – Deploy; Operational Excellence Pillar – Monitor; Operational Excellence Pillar – Improve
DORA	DORA 2024 – Technical practices; DORA 2024 – Organizational culture
SOC 2 Type II	CC4.1 – Monitoring activities; CC7.1 – Infrastructure and software monitoring; CC7.2 – Evaluation of system changes
Google SRE Book	Chapter 2 – SRE: The role of an SRE; Chapter 3 – Service Level Objectives; Chapter 4 – Eliminating toil
PCI DSS v4.0	Req 6.4 – Secure development lifecycle; Req 6.5 – Secure coding practices; Req 6.6 – Application security
FinOps Foundation	Core Module – Financial accountability; Management Layer – Cost governance
BSI C5:2020	OPS-01 – Operational monitoring; OPS-02 – Operational control; OPS-03 – Operational capacity; OPS-04 – Change management
NIST SP 800-53	CM-1 – Configuration management policy; CM-2 – Configuration management; CM-6 – Configuration settings; CM-8 – Information system integration; CA-7 – Continuous monitoring
NIST CSF 2.0	GV.PO – Policy; RC.RP – Recovery planning; DE.CM – Continuous monitoring
TISAX	Information security – Change management
ANSSI SecNumCloud	Domain – Change management
BIO	BIO – Veranderingenbeheer
ENS High	op.exp.6 – Gestión de cambios
UK NCSC CAF	A4 – Policy and assurance; A5 – Continual improvement
CMMC 2.0	CM.L2-3.4.1 – Establish baseline configurations; CA.L2-3.12.1 – Periodically assess security controls
IRAP	ISM – Change management
CCCS PBMM	CM-2 – Baseline configuration; CA-7 – Continuous monitoring
MAS TRM	Ch.3 – Technology risk governance; Ch.9 – Change management
ISMAP	Operational excellence and continuous improvement
FISC	Operational measures – Change management

Best Practice

Maintain runbooks and operational documentation →
