WAF-REL-030 – Multi-AZ High Availability Deployment

Pillar: Reliability | Severity: High | Category: High Availability | Automatable: High

Description

All production workloads MUST be distributed across at least 2 Availability Zones. Single-AZ deployments in production are not permitted without written risk acceptance. Databases MUST configure Multi-AZ with automatic failover. Kubernetes MUST use Topology Spread Constraints for AZ distribution.

Rationale

AZ failures are among the more common cloud infrastructure disruptions. A system deployed in a single AZ suffers a complete outage for the duration of an AZ event. The additional cost of Multi-AZ is small compared to the cost of a single production outage. Multi-AZ is the minimum standard for production high-availability systems.

Threat Context

Risk	Description
AZ Failure = Total Outage	Single-AZ deployment: every AZ disruption results in a complete service outage.
Database Single Point of Failure	Single-AZ RDS: the database is unreachable for the duration of an AZ failure, potentially hours.
Kubernetes Pod Concentration	Without Topology Spread Constraints, the scheduler can place all pods of a workload in one AZ, which makes that AZ a single point of failure (SPOF).
Automatic Failover Missing	Multi-AZ is configured, but failover is not automatic, so manual intervention is required during an AZ failure.

Requirement

All production compute resources: at least 2 AZs
Auto Scaling Groups: min_size >= 2, subnets in min. 2 AZs
All production databases: Multi-AZ with automatic failover
Kubernetes: topologySpreadConstraints with zone key configured
Load balancers: subnets in min. 2 AZs
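
As a minimal sketch of the Kubernetes requirement above, assuming the Deployment is managed through the hashicorp/kubernetes Terraform provider: the workload name `app`, the replica count, and the variable `app_image` are hypothetical placeholders and not part of this control.

```hcl
# Hypothetical input: container image for the example workload.
variable "app_image" {
  type        = string
  description = "Container image to run (placeholder for illustration)."
}

resource "kubernetes_deployment" "app" {
  metadata {
    name   = "app"
    labels = { app = "app" }
  }

  spec {
    replicas = 4

    selector {
      match_labels = { app = "app" }
    }

    template {
      metadata {
        labels = { app = "app" }
      }

      spec {
        # Spread pods of this workload evenly across zones.
        topology_spread_constraint {
          max_skew           = 1
          topology_key       = "topology.kubernetes.io/zone"
          when_unsatisfiable = "DoNotSchedule" # hard constraint; "ScheduleAnyway" would make it a preference

          label_selector {
            match_labels = { app = "app" }
          }
        }

        container {
          name  = "app"
          image = var.app_image
        }
      }
    }
  }
}
```

With `max_skew = 1`, the pod count per zone may differ by at most one, so four replicas in a two-zone cluster would be scheduled two per zone.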

Implementation Guidance

ASG Subnets: vpc_zone_identifier with subnets from min. 2 AZs
ASG Min Size: min_size = 2 – one instance cannot survive an AZ failure
RDS Multi-AZ: multi_az = true – synchronous replication to a standby in another AZ; automatic failover typically completes within 1–2 minutes
ElastiCache: Multi-AZ replication group with automatic_failover_enabled = true and multi_az_enabled = true
Kubernetes: topologySpreadConstraints[].topologyKey = topology.kubernetes.io/zone
Test AZ failover: Terminate instances in one AZ and observe recovery
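
The AWS items in this list could be sketched in Terraform roughly as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes AWS provider v4 or later, and all variable names, resource names, instance types and sizes are hypothetical. The subnets are assumed to already exist, each in a different AZ.

```hcl
# Hypothetical inputs; the subnets are assumed to exist in different AZs.
variable "private_subnet_ids" {
  type        = list(string)
  description = "Private subnet IDs for compute, one per AZ."

  validation {
    # Counts subnets only; it does not verify that they sit in distinct AZs.
    condition     = length(var.private_subnet_ids) >= 2
    error_message = "WAF-REL-030: provide subnets from at least 2 AZs."
  }
}

variable "public_subnet_ids" {
  type        = list(string)
  description = "Public subnet IDs for the load balancer, one per AZ."
}

variable "ami_id" {
  type        = string
  description = "AMI for the application instances."
}

variable "cache_subnet_group_name" {
  type        = string
  description = "Existing ElastiCache subnet group spanning at least 2 AZs."
}

resource "aws_launch_template" "app" {
  name_prefix   = "app-prod-"
  image_id      = var.ami_id
  instance_type = "t3.medium"
}

# ASG: at least two instances, spread over subnets in at least 2 AZs.
resource "aws_autoscaling_group" "app" {
  name                = "app-prod"
  min_size            = 2
  max_size            = 6
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

# Load balancer: attached to subnets in at least 2 AZs.
resource "aws_lb" "app" {
  name               = "app-prod"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
}

# ElastiCache: one primary plus one replica, with automatic failover.
resource "aws_elasticache_replication_group" "cache" {
  replication_group_id       = "cache-prod"
  description                = "Production cache replicated across AZs"
  engine                     = "redis"
  node_type                  = "cache.t3.medium"
  num_cache_clusters         = 2 # automatic failover needs at least one replica
  automatic_failover_enabled = true
  multi_az_enabled           = true
  subnet_group_name          = var.cache_subnet_group_name
}
```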

Maturity Levels

Level	Name	Criteria
1	Single-AZ	All resources in one AZ; no redundancy.
2	DB Multi-AZ	Databases Multi-AZ; compute still Single-AZ.
3	Fully Multi-AZ	Everything in min. 2 AZs; LB and ASG multi-AZ configured; AZ test quarterly.
4	Auto-Failover Tested	Automatic failover documented and measured; Kubernetes Topology Spread enforced.
5	Multi-Region	Critical workloads multi-regional; global load balancing with auto region failover.

Terraform Checks

waf-rel-030.tf.aws.rds-multi-az

Checks: RDS Instance has multi_az = true and deletion_protection = true.

Compliant

resource "aws_db_instance" "main" {
  identifier           = "payment-db-prod"
  engine               = "postgres"
  instance_class       = "db.t3.medium"
  allocated_storage    = 100
  multi_az             = true # standby in a second AZ with automatic failover
  deletion_protection  = true
  db_subnet_group_name = aws_db_subnet_group.main.name
}

Non-Compliant

resource "aws_db_instance" "main" {
  identifier        = "payment-db-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  multi_az          = false # WAF-REL-030 violation: no standby in a second AZ
  # deletion_protection is not set and defaults to false
}

Remediation: Set multi_az = true and deletion_protection = true on the aws_db_instance resource.
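
One possible way to surface this check inside Terraform itself is a `check` block. The sketch below assumes Terraform 1.5 or later and refers to the `aws_db_instance.main` resource from the example above; the check name `rds_multi_az` is hypothetical. A failed `check` assertion is reported as a warning and does not block `terraform apply`, so it complements a policy scanner rather than replacing one.

```hcl
# Requires Terraform >= 1.5. Reports a warning when the assertion fails.
check "rds_multi_az" {
  assert {
    condition = (
      aws_db_instance.main.multi_az &&
      aws_db_instance.main.deletion_protection
    )
    error_message = "WAF-REL-030: aws_db_instance.main must set multi_az = true and deletion_protection = true."
  }
}
```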

Evidence

Type	Required	Description
IaC	✅ Required	Terraform with Multi-AZ configuration for compute, DB and load balancer.
Config	✅ Required	Cloud console or IaC shows min. 2 AZs per production resource.
Process	Optional	AZ failover test report with measured recovery time.

Related Controls

Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incident management; BIO – Business continuity
ENS High	op.exp.7 – Incident management; op.exp.8 – Business continuity management
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response
