WAF-REL-030 – Multi-AZ High Availability Deployment
Description
All production workloads MUST be distributed across at least 2 Availability Zones (AZs). Single-AZ deployments in production are not permitted without written risk acceptance. Databases MUST be configured for Multi-AZ with automatic failover. Kubernetes workloads MUST use topology spread constraints to distribute pods across AZs.
Rationale
AZ failures are among the most common types of cloud infrastructure disruption. A system deployed in a single AZ suffers a complete outage during an AZ event. The cost increase for Multi-AZ is negligible compared to the cost of a single production outage. Multi-AZ is the minimum standard for highly available production systems.
Threat Context
| Risk | Description |
|---|---|
| AZ Failure = Total Outage | Single-AZ deployment: every AZ disruption results in a complete service outage. |
| Database Single Point of Failure | Single-AZ RDS: database unreachable for hours during an AZ failure. |
| Kubernetes Pod Concentration | Without topology spread constraints, all pods of a workload can end up in one AZ, which makes that AZ a single point of failure. |
| Automatic Failover Missing | Multi-AZ is configured, but failover is not automatic, so manual intervention is required during an AZ failure. |
Requirement
- All production compute resources: at least 2 AZs
- Auto Scaling Groups: `min_size >= 2`, subnets in min. 2 AZs
- All production databases: Multi-AZ with automatic failover
- Kubernetes: `topologySpreadConstraints` with zone key configured
- Load balancers: subnets in min. 2 AZs
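The Auto Scaling Group requirement above can be sketched as a small check. This is only an illustration: `AsgConfig`, `asg_violations`, and the `subnet_to_az` mapping are hypothetical names introduced here, and a real check would read these values from IaC or the cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsgConfig:
    """Hypothetical, simplified view of an Auto Scaling Group."""
    name: str
    min_size: int
    subnet_ids: tuple[str, ...]


def asg_violations(asg: AsgConfig, subnet_to_az: dict[str, str]) -> list[str]:
    """Return human-readable violations of the ASG requirement."""
    violations = []
    # Requirement: min_size >= 2
    if asg.min_size < 2:
        violations.append(f"{asg.name}: min_size is {asg.min_size}, expected >= 2")
    # Requirement: subnets in at least 2 distinct AZs
    # (subnets missing from the mapping are ignored in this sketch)
    zones = {subnet_to_az[s] for s in asg.subnet_ids if s in subnet_to_az}
    if len(zones) < 2:
        violations.append(f"{asg.name}: subnets cover {len(zones)} AZ(s), expected >= 2")
    return violations
```

An empty list means the group satisfies both conditions. Two subnets in the same AZ still count as one zone, which is the case this requirement is meant to catch.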
Implementation Guidance
- ASG Subnets: `vpc_zone_identifier` with subnets from min. 2 AZs
- ASG Min Size: `min_size = 2` (a single instance cannot survive an AZ failure)
- RDS Multi-AZ: `multi_az = true` (synchronous replication, automatic failover typically in under 2 minutes)
- ElastiCache: Multi-AZ replication group with `automatic_failover_enabled = true`
- Kubernetes: `topologySpreadConstraints.topologyKey = topology.kubernetes.io/zone`
- Test AZ failover: Terminate instances in one AZ and observe recovery
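For the Kubernetes item, the following sketch shows what a zone-level spread constraint looks like in a Deployment (written as a Python dict) and how its presence could be verified. The helper `has_zone_spread`, the workload name `web`, the image name, and the values `maxSkew: 1` and `whenUnsatisfiable: DoNotSchedule` are illustrative assumptions, not values prescribed by this control.

```python
ZONE_KEY = "topology.kubernetes.io/zone"


def has_zone_spread(manifest: dict) -> bool:
    """True if the pod template declares a zone-level topology spread constraint."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    constraints = pod_spec.get("topologySpreadConstraints") or []
    return any(c.get("topologyKey") == ZONE_KEY for c in constraints)


# Illustrative Deployment; names and values are assumptions.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 4,
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {
                "topologySpreadConstraints": [
                    {
                        "maxSkew": 1,  # allowed pod-count difference between zones
                        "topologyKey": ZONE_KEY,
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": {"app": "web"}},
                    }
                ],
                "containers": [{"name": "web", "image": "example/web:1.0"}],
            },
        },
    },
}
```

The check only confirms that a zone-keyed constraint exists. It does not verify that the cluster actually has nodes in two or more AZs, which still has to be confirmed separately.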
Maturity Levels
| Level | Name | Criteria |
|---|---|---|
| 1 | Single-AZ | All resources in one AZ; no redundancy. |
| 2 | DB Multi-AZ | Databases Multi-AZ; compute still Single-AZ. |
| 3 | Fully Multi-AZ | Everything in min. 2 AZs; LB and ASG multi-AZ configured; AZ test quarterly. |
| 4 | Auto-Failover Tested | Automatic failover documented and measured; Kubernetes Topology Spread enforced. |
| 5 | Multi-Region | Critical workloads multi-regional; global load balancing with auto region failover. |
Terraform Checks
waf-rel-030.tf.aws.rds-multi-az
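Checks: RDS instance has `multi_az = true` and `deletion_protection = true`.
| Compliant | Non-Compliant |
|---|---|
| `multi_az = true` and `deletion_protection = true` | `multi_az` or `deletion_protection` unset or `false` |
Remediation: Set `multi_az = true` and `deletion_protection = true` on the `aws_db_instance` resource.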
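As a rough illustration of the logic behind this check, the sketch below evaluates the JSON form of a Terraform plan (as produced by `terraform show -json`). It is not the actual implementation of `waf-rel-030.tf.aws.rds-multi-az`. The function name `rds_multi_az_findings` is hypothetical, and attributes whose values are not known at plan time are treated as non-compliant.

```python
def rds_multi_az_findings(plan: dict) -> list[str]:
    """List aws_db_instance resources that miss multi_az or deletion_protection."""
    findings = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_db_instance":
            continue
        after = (rc.get("change") or {}).get("after")
        if after is None:  # resource is being destroyed; nothing to check
            continue
        # Both attributes must be explicitly true in the planned state.
        missing = [
            attr
            for attr in ("multi_az", "deletion_protection")
            if after.get(attr) is not True
        ]
        if missing:
            detail = ", ".join(f"{attr} must be true" for attr in missing)
            findings.append(f"{rc.get('address')}: {detail}")
    return findings
```

An empty result means every planned `aws_db_instance` satisfies both conditions. Resources of other types are ignored.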
Evidence
| Type | Required | Description |
|---|---|---|
| IaC | ✅ Required | Terraform with Multi-AZ configuration for compute, DB and load balancer. |
| Config | ✅ Required | Cloud console or IaC shows min. 2 AZs per production resource. |
| Process | Optional | AZ failover test report with measured recovery time. |
Regulatory Mapping
| Framework | Controls |
|---|---|
| ISO/IEC 27001:2022 | A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents |
| ITIL 4 | SVS – Service value system; DP – Design principle; OV – Operation value chain |
| AWS Well-Architected Framework | Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor |
| SRE Book (Google) | Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring |
| CNCF Cloud Native Security | SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials |
| BSI C5:2022 | SIM-01 – Security incident management; SIM-02 – Security information and event management |
| GDPR | Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach |
| NIST SP 800-161 | SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls |
| DORA | Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools |
| COBIT 2019 | DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity |
| TISAX | Information security – Incident response |
| ANSSI SecNumCloud | Domain – Incident response; Domain – Business continuity |
| BIO | BIO – Incident management; BIO – Business continuity |
| ENS High | op.exp.7 – Incident management; op.exp.8 – Business continuity management |
| UK NCSC CAF | D1 – Response and recovery planning; D2 – Lessons learned |
| CMMC 2.0 | IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents |
| IRAP | ISM – Incident management; ISM – Business continuity |
| CCCS PBMM | IR-4 – Incident handling; IR-8 – Incident response plan |
| MAS TRM | Ch.10 – Security incident management; Ch.11 – Business continuity |
| ISMAP | Reliability and incident management |
| FISC | Operational measures – Incident response |