
WAF-REL-030 – Multi-AZ High Availability Deployment

Description

All production workloads MUST be distributed across at least 2 Availability Zones. Single-AZ deployments in production are not permitted without written risk acceptance. Databases MUST configure Multi-AZ with automatic failover. Kubernetes MUST use Topology Spread Constraints for AZ distribution.

Rationale

AZ failures are among the most frequent types of cloud infrastructure disruption. A system deployed in a single AZ suffers a 100% outage during an AZ event. The additional cost of Multi-AZ is small compared to the cost of a single production outage. Multi-AZ is the minimum standard for highly available production systems.

Threat Context

Risk | Description
AZ Failure = Total Outage | Single-AZ deployment: every AZ disruption results in a complete service outage.
Database Single Point of Failure | Single-AZ RDS: the database can be unreachable for hours during an AZ failure.
Kubernetes Pod Concentration | Without topology spread constraints, all pods of a workload can end up in one AZ, which makes that workload a single point of failure (SPOF).
Automatic Failover Missing | Multi-AZ is configured, but failover is not automatic, so manual intervention is required during an AZ failure.

Requirement

  • All production compute resources: at least 2 AZs

  • Auto Scaling Groups: min_size >= 2, subnets in min. 2 AZs

  • All production databases: Multi-AZ with automatic failover

  • Kubernetes: topologySpreadConstraints with zone key configured

  • Load balancers: subnets in min. 2 AZs
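
The compute and load balancer requirements above can be sketched in Terraform roughly as follows. This is a minimal illustration, not a reference implementation: the subnet, launch template, and security group resources (`aws_subnet.private_a`, `aws_subnet.private_b`, `aws_subnet.public_a`, `aws_subnet.public_b`, `aws_launch_template.app`, `aws_security_group.alb`) are hypothetical names assumed to be defined elsewhere, with each subnet pair placed in different AZs.

```hcl
# Sketch only: all referenced subnets, the launch template and the security
# group are hypothetical and assumed to exist elsewhere in the configuration.

resource "aws_autoscaling_group" "app" {
  name             = "app-prod"
  min_size         = 2 # at least one instance per AZ can survive an AZ failure
  max_size         = 6
  desired_capacity = 2

  # Subnets from two different AZs
  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
  ]

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

resource "aws_lb" "app" {
  name               = "app-prod"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]

  # Load balancer nodes in two different AZs
  subnets = [
    aws_subnet.public_a.id,
    aws_subnet.public_b.id,
  ]
}
```

With `min_size = 2` and two subnets, the Auto Scaling Group tries to balance instances across both AZs, so losing one AZ leaves capacity in the other while replacements launch.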

Implementation Guidance

  1. ASG Subnets: vpc_zone_identifier with subnets from min. 2 AZs

  2. ASG Min Size: min_size >= 2 – a single instance cannot survive an AZ failure

  3. RDS Multi-AZ: multi_az = true – synchronous replication to a standby in another AZ, automatic failover typically in under 2 minutes

  4. ElastiCache: Multi-AZ replication group with automatic_failover_enabled = true

  5. Kubernetes: topologySpreadConstraints.topologyKey = topology.kubernetes.io/zone

  6. Test AZ failover: Terminate instances in one AZ and observe recovery
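
Steps 4 and 5 could look roughly like the following in Terraform. This is a sketch under assumptions: `aws_elasticache_subnet_group.cache` is a hypothetical subnet group spanning at least two AZs, the Kubernetes workload is managed through the Terraform `kubernetes` provider (clusters managed with plain manifests would set the same fields in YAML), and the names, sizes, and container image are placeholders.

```hcl
# Step 4 (sketch): ElastiCache replication group with automatic failover.
# aws_elasticache_subnet_group.cache is hypothetical and assumed to span >= 2 AZs.
resource "aws_elasticache_replication_group" "cache" {
  replication_group_id       = "cache-prod"
  description                = "Production cache, Multi-AZ" # AWS provider v4+ attribute name
  engine                     = "redis"
  node_type                  = "cache.t3.medium"
  num_cache_clusters         = 2 # primary plus one replica; required for failover
  automatic_failover_enabled = true
  multi_az_enabled           = true
  subnet_group_name          = aws_elasticache_subnet_group.cache.name
}

# Step 5 (sketch): spread pods of one workload evenly across zones.
resource "kubernetes_deployment" "api" {
  metadata {
    name = "api"
  }

  spec {
    replicas = 4

    selector {
      match_labels = { app = "api" }
    }

    template {
      metadata {
        labels = { app = "api" }
      }

      spec {
        topology_spread_constraint {
          max_skew           = 1
          topology_key       = "topology.kubernetes.io/zone"
          when_unsatisfiable = "DoNotSchedule" # refuse placement that breaks the spread
          label_selector {
            match_labels = { app = "api" }
          }
        }

        container {
          name  = "api"
          image = "example.com/api:1.0.0" # placeholder image
        }
      }
    }
  }
}
```

`DoNotSchedule` enforces the spread strictly; `ScheduleAnyway` is the softer alternative if pending pods during a zone shortage are a bigger concern than temporary imbalance.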

Maturity Levels

Level | Name | Criteria
1 | Single-AZ | All resources in one AZ; no redundancy.
2 | DB Multi-AZ | Databases Multi-AZ; compute still Single-AZ.
3 | Fully Multi-AZ | Everything in min. 2 AZs; LB and ASG multi-AZ configured; AZ test quarterly.
4 | Auto-Failover Tested | Automatic failover documented and measured; Kubernetes Topology Spread enforced.
5 | Multi-Region | Critical workloads multi-regional; global load balancing with auto region failover.

Terraform Checks

waf-rel-030.tf.aws.rds-multi-az

Checks: RDS Instance has multi_az = true and deletion_protection = true.

Compliant:

resource "aws_db_instance" "main" {
  identifier           = "payment-db-prod"
  engine               = "postgres"
  instance_class       = "db.t3.medium"
  allocated_storage    = 100
  multi_az             = true
  deletion_protection  = true
  db_subnet_group_name = aws_db_subnet_group.main.name
}

Non-Compliant:

resource "aws_db_instance" "main" {
  identifier        = "payment-db-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  multi_az          = false
  # WAF-REL-030 violation: Single-AZ, and deletion_protection is not set
}

Remediation: Set multi_az = true and deletion_protection = true on the aws_db_instance resource.
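
As a complement to the external check, a module can guard the same two settings itself. The sketch below uses a Terraform `precondition` (available from Terraform 1.2); the variable names `multi_az` and `deletion_protection` are assumptions introduced for illustration, and credentials are omitted as in the example above.

```hcl
# Sketch only: variable names are hypothetical, chosen to mirror the
# resource arguments they feed.
variable "multi_az" {
  type    = bool
  default = true
}

variable "deletion_protection" {
  type    = bool
  default = true
}

resource "aws_db_instance" "main" {
  identifier           = "payment-db-prod"
  engine               = "postgres"
  instance_class       = "db.t3.medium"
  allocated_storage    = 100
  multi_az             = var.multi_az
  deletion_protection  = var.deletion_protection
  db_subnet_group_name = aws_db_subnet_group.main.name

  lifecycle {
    # Fail at plan time if a caller overrides either setting to false.
    precondition {
      condition     = var.multi_az && var.deletion_protection
      error_message = "WAF-REL-030: production RDS instances need multi_az = true and deletion_protection = true."
    }
  }
}
```

This does not replace the waf-rel-030.tf.aws.rds-multi-az check; it only surfaces the violation earlier, when `terraform plan` runs.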

Evidence

Type | Required | Description
IaC | ✅ Required | Terraform with Multi-AZ configuration for compute, DB and load balancer.
Config | ✅ Required | Cloud console or IaC shows min. 2 AZs per production resource.
Process | Optional | AZ failover test report with measured recovery time.