WAF-REL-090 – Chaos Engineering & Fault Injection

Pillar: Reliability | Severity: Medium | Category: Chaos Engineering | Automatable: Medium

Description

Production and staging workloads MUST run structured chaos experiments with documented hypotheses every quarter. Experiments MUST define stop conditions. Each experiment MUST be validated in staging before it runs in production. Results MUST be documented and turned into reliability improvements.

Rationale

Chaos engineering validates reliability claims empirically. SLOs, circuit breakers, multi-AZ configurations, and health checks are all claims about how the system behaves under failure conditions. Without chaos testing, these claims remain unvalidated. Only controlled fault injection lets an organization know that its resilience measures actually work.

Threat Context

Risk	Description
Unknown failure modes	Resilience gaps are discovered only during a real disaster.
Unvalidated reliability claims	A circuit breaker is configured but has never been triggered, so it is unclear whether it works correctly.
Untested recovery	Multi-AZ is deployed, but the AZ failover time is unknown, so the RTO is not validated.
Hidden coupling	Dependencies that become visible only under specific failure patterns.

Requirement

Hypothesis-driven experiments: "If X fails, we expect Y within Z seconds"
Stop conditions: automatic abort when an SLO alarm fires
Staging first: every experiment runs in staging first, then is rolled out to production step by step
Blast radius limit: at most 25% of instances for the first experiment of a given kind
Result documentation: hypothesis, expected outcome, actual outcome, actions
At least 3 documented experiments per team per quarter
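
The stop condition and blast radius requirements map fairly directly onto an AWS FIS experiment template. The following is a minimal sketch under stated assumptions, not a reference implementation: the resource name first_run, the target name quarter-of-fleet, the Service = checkout tag, and the 120-second recovery target are hypothetical, and aws_iam_role.fis and aws_cloudwatch_metric_alarm.slo_burn are assumed to be defined elsewhere.

```hcl
# Hypothetical first experiment of its kind, limited to 25% of instances.
resource "aws_fis_experiment_template" "first_run" {
  # The hypothesis is stated in the "if X fails, we expect Y within Z seconds" form.
  description = "Hypothesis: if 25% of instances are terminated, capacity recovers within 120s"
  role_arn    = aws_iam_role.fis.arn # assumed to exist

  # Stop condition: abort automatically when the SLO alarm fires.
  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.slo_burn.arn # assumed to exist
  }

  action {
    name      = "terminate-instances"
    action_id = "aws:ec2:terminate-instances"

    target {
      key   = "Instances"
      value = "quarter-of-fleet"
    }
  }

  # Blast radius limit: act on 25% of the instances that carry the tag.
  target {
    name           = "quarter-of-fleet"
    resource_type  = "aws:ec2:instance"
    selection_mode = "PERCENT(25)"

    resource_tag {
      key   = "Service"  # hypothetical tag
      value = "checkout" # hypothetical service name
    }
  }
}
```

With selection_mode = "PERCENT(25)", FIS picks the affected instances at random from the matching set. COUNT(1) would be an even smaller starting point for a first run in staging.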

Implementation Guide

Create a hypothesis list: which failure modes are relevant for this service?
Start staging experiments: begin with the smallest blast radius (1 pod, 10% of instances)
Stop conditions: configure a CloudWatch alarm or a Prometheus alert as the abort condition
Use AWS FIS: configure the experiment template with a stop_condition block
Document results: YAML experiment document with hypothesis and outcome
Track actions: record findings in the Reliability Debt Register (WAF-REL-100)
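
For the stop condition step, a CloudWatch alarm that an FIS template can reference might look like the sketch below. It assumes the service sits behind an Application Load Balancer and uses a plain 5xx count as a stand-in for an SLO signal. The alarm name, the threshold of 50, and the LoadBalancer dimension value are placeholders, and a real SLO burn-rate alarm would likely be built differently.

```hcl
# Hypothetical alarm used as the abort condition for chaos experiments.
resource "aws_cloudwatch_metric_alarm" "slo_burn" {
  alarm_name        = "checkout-slo-burn" # placeholder name
  alarm_description = "Stop condition for chaos experiments: target 5xx count above threshold"

  namespace   = "AWS/ApplicationELB"
  metric_name = "HTTPCode_Target_5XX_Count"
  statistic   = "Sum"

  # Fire after a single 60-second period above the threshold,
  # so the experiment is aborted quickly.
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 50 # placeholder; derive from the service's SLO

  # No traffic means no errors, so missing data does not trip the alarm.
  treat_missing_data = "notBreaching"

  dimensions = {
    LoadBalancer = "app/checkout/0123456789abcdef" # placeholder ALB identifier
  }
}
```

An experiment template would then reference the alarm through aws_cloudwatch_metric_alarm.slo_burn.arn in its stop_condition block.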

Maturity Levels

Level	Name	Criteria
1	No chaos tests	Resilience is known only from production incidents.
2	Ad hoc tests	Occasional manual tests without documentation.
3	Structured and documented	Hypothesis-driven; quarterly; AWS FIS / Azure Chaos Studio; results documented.
4	Controlled production chaos	Experiments in production with stop conditions; annual GameDay; experiments in the release pipeline.
5	Continuous + ML	Automated low-blast-radius experiments; ML-based anomaly detection.

Terraform Checks

waf-rel-090.tf.aws.fis-experiment-template

Checks: the AWS FIS experiment template has a stop_condition block and a description.

Compliant:

```hcl
resource "aws_fis_experiment_template" "az_failure" {
  # The description carries the hypothesis.
  description = "Hypothesis: Service recovers from 25% instance termination in < 60s"
  role_arn    = aws_iam_role.fis.arn

  # Abort the experiment when the SLO burn alarm fires.
  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.slo_burn.arn
  }

  action {
    name      = "terminate-instances"
    action_id = "aws:ec2:terminate-instances"

    target {
      key   = "Instances"
      value = "az1-instances"
    }
  }
  # ... target block ...
}
```

Non-Compliant:

```hcl
resource "aws_fis_experiment_template" "az_failure" {
  description = "AZ test"
  role_arn    = aws_iam_role.fis.arn

  # No stop_condition: the experiment runs uncontrolled.
  # WAF-REL-090 violation.
  action {
    name      = "terminate-all"
    action_id = "aws:ec2:terminate-instances"

    target {
      key   = "Instances"
      value = "all-instances"
    }
  }

  target {
    name           = "all-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "ALL"
  }
}
```

Remediation: add a stop_condition block that references a CloudWatch alarm, and put the hypothesis text in description.
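
Both templates in this check reference aws_iam_role.fis, which this control does not define. One way such a role could be declared is sketched below. The role name, the policy name, and the exact permission set are assumptions chosen to match the aws:ec2:terminate-instances action and a CloudWatch alarm stop condition; a real deployment would likely scope them more tightly.

```hcl
# Sketch of the role referenced as aws_iam_role.fis (names are hypothetical).
resource "aws_iam_role" "fis" {
  name = "fis-experiment-role"

  # Allow the FIS service to assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "fis.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "fis_ec2" {
  name = "fis-terminate-instances"
  role = aws_iam_role.fis.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Needed by the aws:ec2:terminate-instances action.
        Effect   = "Allow"
        Action   = ["ec2:TerminateInstances"]
        Resource = "arn:aws:ec2:*:*:instance/*"
      },
      {
        # Read-only calls used to resolve targets and evaluate the stop condition.
        Effect   = "Allow"
        Action   = ["ec2:DescribeInstances", "cloudwatch:DescribeAlarms"]
        Resource = "*"
      }
    ]
  })
}
```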

Evidence

Type	Required	Description
Process	✅ Required	Quarterly chaos experiment reports: hypothesis, expectation, result, actions.
Governance	✅ Required	Chaos engineering charter with approval process and blast radius limits.
IaC	Optional	AWS FIS experiment templates or Azure Chaos Studio workflow configurations.
Process	Optional	GameDay report from the past year.

Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents; A.5.27 – Learning from information security incidents; A.5.28 – Collection of evidence; A.8.16 – Technology use identification and monitoring; A.8.21 – Telecommunications and network security
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor; Reliability Pillar – Improve
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring; Chapter 7 – Emergency response
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management; SIM-03 – Emergency response
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools and systems
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity; DSS04.02.01 – Manage incidents
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incidentmanagement; BIO – Bedrijfscontinuïteit
ENS High	op.exp.7 – Gestión de incidentes; op.exp.8 – Gestión de la continuidad del negocio
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response

Best Practice

Chaos Engineering & Fault Injection →
