WAF-REL-070 – Disaster Recovery Testing

Pillar: Reliability | Severity: High | Category: Disaster Recovery | Automatable: Partially

Description

All production workloads classified as critical or high-availability MUST undergo documented disaster recovery tests at least twice a year. DR tests MUST measure the RTO and RPO actually achieved and compare them against the targets. Deviations MUST be addressed within 30 days. DR plans MUST be updated after significant architecture changes.

Rationale

An untested DR plan is an unvalidated hypothesis. DR plans that have never been tested fail systematically in real disasters because of outdated runbooks, missing automation, infrastructure drift, and undocumented manual steps. Only regular testing validates that RTO targets are actually achievable.

Threat Context

Risk	Description
Untested DR plan	DR plan is 18 months old; the architecture has changed; recovery takes 4x longer than the RTO.
Missing DB restore steps	Restore procedure fails at an undocumented manual step.
Encryption key unavailable	KMS key is stored in the failed account; the backup cannot be decrypted.
Missing IaC dependencies	Terraform assumes resources that do not exist in the target account.

Requirements

DR plan with RTO/RPO targets per workload, versioned
Semi-annual DR tests with measured RTO/RPO and documented results
Deviations from RTO/RPO targets: remediation plan within 30 days
DR plan updated after every significant architecture change
Automated DR failover procedures via IaC where possible
Route 53 or Traffic Manager failover routing for critical DNS records

Implementation Guidance

Define the DR scope: which scenarios are covered (AZ failure, region failure, account compromise)?
Measure RTO/RPO: determine the current baseline RTO through a test; derive the target from it
Prepare DNS failover: Route 53 health check + failover record, or Traffic Manager
IaC for DR: automate the failover procedure via Terraform
Document the test: template with start time, end time, actual RTO, issues, sign-off
Test calendar: schedule semi-annual tests in the calendar, plus a test after every major architecture change
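
For the Traffic Manager variant of the DNS failover step, the following is a minimal sketch rather than a reference implementation. It assumes the `azurerm` provider is already configured elsewhere. The variable names, the profile and endpoint names, the relative DNS name, and the endpoint targets are hypothetical. The health path `/health/ready` and the probe values (interval 10 s, 3 tolerated failures) are borrowed from the Route 53 check in this control, not from any Azure requirement.

```hcl
# Sketch only: priority-based (active/passive) failover with Azure Traffic Manager.
# All variable names and resource names below are illustrative assumptions.

variable "resource_group_name" {
  type = string
}

variable "primary_fqdn" {
  type        = string
  description = "Hypothetical FQDN of the primary endpoint"
}

variable "secondary_fqdn" {
  type        = string
  description = "Hypothetical FQDN of the standby endpoint in the DR region"
}

resource "azurerm_traffic_manager_profile" "api" {
  name                   = "api-dr-failover"
  resource_group_name    = var.resource_group_name
  traffic_routing_method = "Priority" # lowest priority number wins while healthy

  dns_config {
    relative_name = "api-dr-failover-example" # must be globally unique
    ttl           = 30                        # short TTL for fast failover
  }

  monitor_config {
    protocol                     = "HTTPS"
    port                         = 443
    path                         = "/health/ready"
    interval_in_seconds          = 10
    timeout_in_seconds           = 5
    tolerated_number_of_failures = 3
  }
}

resource "azurerm_traffic_manager_external_endpoint" "primary" {
  name       = "primary"
  profile_id = azurerm_traffic_manager_profile.api.id
  target     = var.primary_fqdn
  priority   = 1
}

resource "azurerm_traffic_manager_external_endpoint" "secondary" {
  name       = "secondary"
  profile_id = azurerm_traffic_manager_profile.api.id
  target     = var.secondary_fqdn
  priority   = 2 # receives traffic only when the primary is unhealthy
}
```

Keeping the failover path in Terraform means a DR test can exercise the same configuration that would be used in a real incident, which also makes drift visible in `terraform plan`.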

Maturity Levels

Level	Name	Criteria
1	No DR plan	No documented recovery strategy.
2	Plan documented, tested annually	DR plan exists; annual test; results not systematically documented.
3	Semi-annual + documented	DR tests at least 2x/year; results documented with RTO/RPO; deviations tracked.
4	Quarterly, automated	DR procedures automated via IaC; quarterly tests < 2 h; annual GameDay.
5	Continuous validation	RTO < 15 minutes demonstrated; automated monthly component tests.

Terraform Checks

waf-rel-070.tf.aws.route53-health-check-failover

Checks: Route 53 health check with explicit failure_threshold and request_interval.

Compliant:

resource "aws_route53_health_check" "primary" {
  fqdn              = "api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health/ready"
  failure_threshold = 3
  request_interval  = 10
  tags = {
    Name = "payment-api-hc"
  }
}

Non-Compliant:

resource "aws_route53_record" "api" {
  zone_id = var.hosted_zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 300
  records = [var.primary_ip]
  # No failover routing,
  # no health check
  # WAF-REL-070 violation
}

Remediation: Configure a Route 53 failover routing policy with a health check. Reduce the DNS TTL to 30 s for fast failover.
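
The remediation could look roughly like the sketch below: a PRIMARY/SECONDARY failover record pair where the primary record is tied to a health check and both records use a 30 s TTL. This is an illustration under stated assumptions, not the check's reference configuration. `var.secondary_ip` and the health-checked hostname `primary.api.example.com` are hypothetical and do not appear in the examples above; the health check deliberately targets the primary endpoint directly rather than the failover name itself.

```hcl
# Sketch only: Route 53 active/passive failover for api.example.com.

variable "hosted_zone_id" {
  type = string
}

variable "primary_ip" {
  type = string
}

variable "secondary_ip" {
  type        = string
  description = "Hypothetical IP of the standby endpoint in the DR region"
}

resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.api.example.com" # hypothetical direct name of the primary
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health/ready"
  failure_threshold = 3
  request_interval  = 10
}

resource "aws_route53_record" "api_primary" {
  zone_id         = var.hosted_zone_id
  name            = "api.example.com"
  type            = "A"
  ttl             = 30
  records         = [var.primary_ip]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "api_secondary" {
  zone_id        = var.hosted_zone_id
  name           = "api.example.com"
  type           = "A"
  ttl            = 30
  records        = [var.secondary_ip]
  set_identifier = "secondary"

  # Served only while the primary's health check is failing.
  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

With `request_interval = 10` and `failure_threshold = 3`, the primary is marked unhealthy after roughly 30 s of failed checks; resolvers that cached the old answer may keep using it for up to one further TTL.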

Evidence

Type	Required	Description
Process	✅ Required	DR test reports for the last 12 months: RTO/RPO achieved, issues, sign-off.
Governance	✅ Required	DR plan with RTO/RPO targets, test plan, and date of the last update.
IaC	Optional	Automation scripts or Terraform for DR failover procedures.
Process	Optional	DR test calendar for the next 12 months.

Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents; A.5.27 – Learning from information security incidents; A.5.28 – Collection of evidence; A.8.16 – Technology use identification and monitoring; A.8.21 – Telecommunications and network security
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor; Reliability Pillar – Improve
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring; Chapter 7 – Emergency response
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management; SIM-03 – Emergency response
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools and systems
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity; DSS04.02.01 – Manage incidents
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incidentmanagement; BIO – Bedrijfscontinuïteit
ENS High	op.exp.7 – Gestión de incidentes; op.exp.8 – Gestión de la continuidad del negocio
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response

Best Practice

Backup, Recovery & Disaster Recovery →
