WAF-REL-030 – Multi-AZ High Availability Deployment

Pillar: Reliability | Severity: High | Category: High Availability | Automatable: High

Description

All production workloads MUST be distributed across at least 2 Availability Zones (AZs). Single-AZ deployments in production are not permitted without a written risk acceptance. Databases MUST be configured as Multi-AZ with automatic failover. Kubernetes workloads MUST use topology spread constraints to distribute pods across AZs.

Rationale

AZ outages are among the most common types of cloud infrastructure failure. A system that runs in a single AZ is completely unavailable during an AZ event. The additional cost of Multi-AZ is typically negligible compared with the cost of a single production outage. Multi-AZ is the minimum standard for production high-availability systems.

Threat context

Risk	Description
AZ outage = total outage	Single-AZ deployment: any AZ disruption causes a complete service outage.
Database as single point of failure	Single-AZ RDS: during an AZ outage the database can be unreachable for hours.
Kubernetes pod concentration	Without topology spread constraints, all pods of a workload can land in one AZ, which makes that workload a single point of failure (SPOF).
No automatic failover	Multi-AZ is configured but failover is not automatic, so an AZ outage requires manual intervention.

Requirements

All production compute resources: at least 2 AZs
Auto Scaling Groups: min_size >= 2, subnets in at least 2 AZs
All production databases: Multi-AZ with automatic failover
Kubernetes: topologySpreadConstraints configured with the zone topology key
Load balancers: subnets in at least 2 AZs
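
The compute and load balancer requirements above could be expressed in Terraform roughly as follows. This is a minimal sketch, not a reference implementation: the variable names, resource names, and the launch template input are hypothetical, and the subnets are assumed to already exist in different AZs.

```hcl
# Hypothetical inputs. The validation only counts subnets; it cannot prove
# that they sit in different AZs, which still has to be reviewed.
variable "private_subnet_ids" {
  type        = list(string)
  description = "Private subnet IDs, assumed to span at least two AZs."

  validation {
    condition     = length(var.private_subnet_ids) >= 2
    error_message = "Provide subnets from at least two Availability Zones."
  }
}

variable "public_subnet_ids" {
  type        = list(string)
  description = "Public subnet IDs, assumed to span at least two AZs."

  validation {
    condition     = length(var.public_subnet_ids) >= 2
    error_message = "Provide subnets from at least two Availability Zones."
  }
}

variable "launch_template_id" {
  type        = string
  description = "ID of an existing launch template (assumption)."
}

resource "aws_autoscaling_group" "app" {
  name                = "app-prod"
  min_size            = 2 # one instance cannot survive an AZ outage
  max_size            = 6
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }
}

resource "aws_lb" "app" {
  name               = "app-prod"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
}
```

With min_size = 2 and subnets in two AZs, the Auto Scaling Group balances instances across the zones by default, so the loss of one AZ should leave at least one instance serving traffic while replacements launch.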

Implementation guidance

ASG subnets: set vpc_zone_identifier to subnets from at least 2 AZs
ASG minimum size: min_size = 2, because a single instance cannot survive an AZ outage
RDS Multi-AZ: multi_az = true provides synchronous replication to a standby and automatic failover, typically within 1 to 2 minutes
ElastiCache: Multi-AZ replication group with automatic_failover_enabled = true
Kubernetes: topologySpreadConstraints.topologyKey = topology.kubernetes.io/zone
Test AZ failover: terminate the instances in one AZ and observe recovery
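
For the ElastiCache item, a replication group with automatic failover might look like the sketch below. It assumes AWS provider version 4 or later (where the attributes are named description and num_cache_clusters); the group ID, node type, and subnet group variable are illustrative placeholders.

```hcl
variable "cache_subnet_group_name" {
  type        = string
  description = "Existing ElastiCache subnet group covering at least two AZs (assumption)."
}

resource "aws_elasticache_replication_group" "cache" {
  replication_group_id = "session-cache-prod"
  description          = "Multi-AZ Redis replication group"
  engine               = "redis"
  node_type            = "cache.t3.medium"

  # One primary plus one replica; automatic failover needs at least two nodes.
  num_cache_clusters         = 2
  automatic_failover_enabled = true
  multi_az_enabled           = true

  subnet_group_name = var.cache_subnet_group_name
}
```

Setting multi_az_enabled in addition to automatic_failover_enabled asks ElastiCache to place the replica in a different AZ from the primary, which is the property this control is after.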

Maturity levels

Level	Name	Criteria
1	Single-AZ	All resources in one AZ; no redundancy.
2	DB Multi-AZ	Databases are Multi-AZ; compute is still Single-AZ.
3	Fully Multi-AZ	Everything in at least 2 AZs; LB and ASG configured for multiple AZs; AZ test quarterly.
4	Auto-failover tested	Automatic failover documented and measured; Kubernetes topology spread enforced.
5	Multi-Region	Critical workloads are multi-regional; global load balancing with automatic region failover.
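
Level 4 asks for Kubernetes topology spread to be enforced. One possible reading is a hard constraint (whenUnsatisfiable set to DoNotSchedule) rather than a best-effort one. The sketch below shows this through the Terraform kubernetes provider so that it stays in the same language as the checks in this control; the deployment name, labels, replica count, and image are hypothetical, and the block names assume version 2 of the hashicorp/kubernetes provider.

```hcl
resource "kubernetes_deployment" "api" {
  metadata {
    name = "api"
  }

  spec {
    replicas = 4

    selector {
      match_labels = { app = "api" }
    }

    template {
      metadata {
        labels = { app = "api" }
      }

      spec {
        # Hard constraint: the pod count per zone may differ by at most 1.
        topology_spread_constraint {
          max_skew           = 1
          topology_key       = "topology.kubernetes.io/zone"
          when_unsatisfiable = "DoNotSchedule"

          label_selector {
            match_labels = { app = "api" }
          }
        }

        container {
          name  = "api"
          image = "registry.example.com/api:1.0.0" # placeholder image
        }
      }
    }
  }
}
```

The trade-off is that DoNotSchedule leaves pods pending when a zone has no capacity, whereas ScheduleAnyway keeps them running but no longer guarantees the spread.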

Terraform Checks

waf-rel-030.tf.aws.rds-multi-az

Checks: the RDS instance has multi_az = true and deletion_protection = true.

Compliant:

resource "aws_db_instance" "main" {
  identifier           = "payment-db-prod"
  engine               = "postgres"
  instance_class       = "db.t3.medium"
  allocated_storage    = 100
  multi_az             = true # standby in a second AZ with automatic failover
  deletion_protection  = true
  db_subnet_group_name = aws_db_subnet_group.main.name
}

Non-Compliant:

resource "aws_db_instance" "main" {
  identifier        = "payment-db-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  multi_az          = false # WAF-REL-030 violation: single-AZ database
  # deletion_protection is unset (defaults to false), which also fails this check
}

Remediation: set multi_az = true and deletion_protection = true on the aws_db_instance resource.
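
The same condition can also be asserted inside the configuration itself. The fragment below is a sketch that assumes Terraform 1.5 or later (which introduced check blocks) and a resource named aws_db_instance.main as in the example for this check; it is not how the waf-rel-030.tf.aws.rds-multi-az check is implemented, and the check name is hypothetical.

```hcl
check "waf_rel_030_rds_multi_az" {
  assert {
    condition     = aws_db_instance.main.multi_az
    error_message = "WAF-REL-030: aws_db_instance.main must set multi_az = true."
  }

  assert {
    condition     = aws_db_instance.main.deletion_protection
    error_message = "WAF-REL-030: aws_db_instance.main must set deletion_protection = true."
  }
}
```

A failed check block is reported as a warning during plan and apply and does not stop the run, so it complements a blocking policy gate in the pipeline rather than replacing it.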

Evidence

Type	Required	Description
IaC	✅ Required	Terraform with Multi-AZ configuration for compute, database, and load balancer.
Config	✅ Required	Cloud console or IaC shows at least 2 AZs for each production resource.
Process	Optional	AZ failover test report with measured recovery time.

Regulatory mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents; A.5.27 – Learning from information security incidents; A.5.28 – Collection of evidence; A.8.16 – Technology use identification and monitoring; A.8.21 – Telecommunications and network security
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor; Reliability Pillar – Improve
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring; Chapter 7 – Emergency response
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management; SIM-03 – Emergency response
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools and systems
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity; DSS04.02.01 – Manage incidents
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incidentmanagement; BIO – Bedrijfscontinuïteit
ENS High	op.exp.7 – Gestión de incidentes; op.exp.8 – Gestión de la continuidad del negocio
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response

Best Practice

Multi-AZ & High-Availability Architecture →
