Best Practice: Multi-AZ & High Availability Architecture
Context
An Availability Zone (AZ) consists of one or more physically isolated data centers within a region, with independent power, cooling, and networking. AZ-level disruptions are among the more common large-scale cloud infrastructure events, so production systems should expect to encounter them. A service deployed in a single AZ suffers a complete outage during an AZ event.
Common problems without Multi-AZ configuration:
- RDS Single-AZ: During an AZ failure, the database is unreachable for several minutes to hours
- Single-AZ Compute: During an AZ failure, all instances go down simultaneously
- Kubernetes without Pod Topology Spread: All pods can end up in the same AZ
Related Controls
- WAF-REL-030 – Multi-AZ High Availability Deployment
Target State
Every production stack is deployed across at least 2 AZs:
- Compute: Auto Scaling Group or Kubernetes with AZ distribution
- Database: Multi-AZ with automatic failover
- Cache: Multi-AZ replication group
- Load Balancer: Subnets in at least 2 AZs
Technical Implementation
AWS: Multi-AZ RDS + Auto Scaling Group
# Subnet group across 3 AZs
resource "aws_db_subnet_group" "main" {
  name = "payment-db-subnet-group"
  subnet_ids = [
    aws_subnet.private_az1.id,
    aws_subnet.private_az2.id,
    aws_subnet.private_az3.id,
  ]
  tags = var.mandatory_tags
}
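# Multi-AZ RDS
resource "aws_db_instance" "main" {
  identifier        = "payment-db-prod"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  storage_type      = "gp3"

  # Credentials are required to create the instance.
  # var.db_username is an assumed variable, not part of the original example.
  username                    = var.db_username
  manage_master_user_password = true # Password is generated and stored in Secrets Manager

  multi_az                = true # Synchronous standby in another AZ; failover typically takes 60-120 seconds
  db_subnet_group_name    = aws_db_subnet_group.main.name
  backup_retention_period = 14
  deletion_protection     = true
  tags                    = var.mandatory_tags
}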
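# Auto Scaling Group across 2 AZs
resource "aws_autoscaling_group" "api" {
  name             = "payment-api-asg"
  min_size         = 2 # At least one instance per AZ when balanced
  max_size         = 10
  desired_capacity = 2

  # One subnet per AZ. The ASG balances instances across these AZs by default.
  vpc_zone_identifier = [
    aws_subnet.private_az1.id,
    aws_subnet.private_az2.id,
  ]
  target_group_arns = [aws_lb_target_group.api.arn]

  # A launch template is required; aws_launch_template.api is assumed to be defined elsewhere.
  launch_template {
    id      = aws_launch_template.api.id
    version = "$Latest"
  }

  # Proactively replaces Spot instances at risk of interruption.
  # It does not control AZ distribution, which is built into the ASG.
  capacity_rebalance = true

  tag {
    key                 = "Name"
    value               = "payment-api"
    propagate_at_launch = true
  }
}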
# ALB across 2 AZs
resource "aws_lb" "main" {
  name               = "payment-api-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets = [
    aws_subnet.public_az1.id,
    aws_subnet.public_az2.id, # Minimum 2 AZs
  ]
  tags = var.mandatory_tags
}
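The Target State also calls for a Multi-AZ cache replication group, which the AWS resources above do not cover. A minimal sketch with ElastiCache for Redis might look as follows. The resource names, the node type, and the reuse of the private subnets are assumptions for illustration, and the argument names follow AWS provider v4 or later.

```hcl
# Assumption: the cache lives in the same private subnets as the database
resource "aws_elasticache_subnet_group" "main" {
  name = "payment-cache-subnet-group"
  subnet_ids = [
    aws_subnet.private_az1.id,
    aws_subnet.private_az2.id,
  ]
}

resource "aws_elasticache_replication_group" "main" {
  replication_group_id = "payment-cache-prod" # Hypothetical name
  description          = "Payment cache with Multi-AZ failover"
  engine               = "redis"
  node_type            = "cache.t3.medium" # Hypothetical size
  num_cache_clusters   = 2                 # One primary plus one replica

  # Multi-AZ requires automatic failover: the replica is placed in another AZ
  # and promoted if the primary becomes unavailable.
  automatic_failover_enabled = true
  multi_az_enabled           = true

  subnet_group_name = aws_elasticache_subnet_group.main.name
  tags              = var.mandatory_tags
}
```

With only one cache node there is nothing to fail over to, so `num_cache_clusters` must be at least 2 for these flags to be accepted.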
Kubernetes: Pod Topology Spread Constraints
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      # Required: distribute pods evenly across AZs
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule # Pods stay Pending rather than piling into one AZ
          labelSelector:
            matchLabels:
              app: payment-api
        # Optional: pods on different nodes within an AZ
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway # Soft constraint
          labelSelector:
            matchLabels:
              app: payment-api
      # Anti-affinity as a soft fallback. A required rule on the zone key would
      # allow only one pod per AZ and leave the remaining replicas Pending.
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["payment-api"]
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: payment-api
          image: payment-api:latest # Placeholder image; not specified in the original
Azure: Zone-Redundant PostgreSQL
resource "azurerm_postgresql_flexible_server" "main" {
  name                = "payment-db-prod"
  resource_group_name = azurerm_resource_group.main.name
  location            = "westeurope"
  version             = "15"
  sku_name            = "GP_Standard_D4s_v3"
  storage_mb          = 32768
  availability_zone   = "1"

  # Zone-Redundant High Availability
  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "2" # Separate physical data center
  }

  backup_retention_days        = 14
  geo_redundant_backup_enabled = true
  tags                         = var.mandatory_tags
}
GCP: Multi-Zone GKE + Cloud SQL
resource "google_container_cluster" "main" {
  name     = "payment-cluster"
  location = var.region # Regional cluster: control plane and nodes span multiple zones

  node_pool {
    name       = "default-pool"
    node_count = 1 # Per zone

    node_config {
      machine_type = "e2-standard-4"
    }
    # Nodes are distributed automatically across multiple zones of the region
  }

  # No cluster-level setting spreads pods across zones (workload_metadata_config
  # only concerns Workload Identity). Use topologySpreadConstraints or pod
  # anti-affinity in the workload manifests instead.
}
resource "google_sql_database_instance" "main" {
  name             = "payment-db-prod"
  database_version = "POSTGRES_15"
  region           = var.region

  settings {
    tier              = "db-custom-4-15360"
    availability_type = "REGIONAL" # Automatic zone failover

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "02:00"
    }
  }
}
Typical Anti-Patterns
- Multi-AZ only for DB, Single-AZ for Compute: During an AZ failure, the DB fails over but the app servers are gone
- Auto Scaling Group without Min Size 2: With min_size = 1, the only instance runs in a single AZ, so an AZ failure leaves no capacity until a replacement launches
- Kubernetes Pods without Topology Spread: The scheduler's default spreading is only best-effort, so all pods can land in one AZ
- Staging Single-AZ, Production Multi-AZ: Differences from production leave failover paths untested
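Some of these anti-patterns can be caught before deployment. The following sketch shows one possible guardrail using Terraform variable validation and a `check` block (Terraform 1.5 or later). The variable names `asg_min_size` and `compute_subnet_ids` are hypothetical and would need to be wired into the actual stack.

```hcl
# Hypothetical input: minimum size passed to the Auto Scaling Group
variable "asg_min_size" {
  type = number

  validation {
    condition     = var.asg_min_size >= 2
    error_message = "Production ASGs need min_size >= 2 so that capacity survives the loss of one AZ."
  }
}

# Hypothetical input: subnets used by the compute layer
variable "compute_subnet_ids" {
  type = list(string)

  validation {
    condition     = length(var.compute_subnet_ids) >= 2
    error_message = "Provide at least two subnets for the compute layer."
  }
}

# Two subnets can still share one AZ, so look up where each subnet actually lives
data "aws_subnet" "compute" {
  for_each = toset(var.compute_subnet_ids)
  id       = each.value
}

check "compute_spans_multiple_azs" {
  assert {
    condition     = length(distinct([for s in data.aws_subnet.compute : s.availability_zone])) >= 2
    error_message = "Compute subnets must cover at least two distinct Availability Zones."
  }
}
```

Variable validation fails the plan outright, whereas a failed `check` block is reported as a warning. If a hard stop is preferred, the same condition can be moved into a `precondition` inside the resource's `lifecycle` block.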
Metrics
- AZ Distribution: Share of instances/pods per AZ (target: within ±20% of an even split)
- Single-AZ Resource Count: Number of production resources in only one AZ (target: 0)
- Failover Test Frequency: Number of tested AZ failovers per quarter (target: >= 1)
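On AWS, the Single-AZ Resource Count could be approximated by counting resources that are non-compliant with AWS Config managed rules. The sketch below is one way to set this up, assuming a Config configuration recorder is already enabled in the account and region. The rule names are hypothetical, and the managed rule identifiers should be verified against the current AWS Config rule list before use.

```hcl
locals {
  # Rule name (hypothetical) => AWS managed rule identifier (verify before use)
  multi_az_rules = {
    "rds-multi-az-support"    = "RDS_MULTI_AZ_SUPPORT"
    "autoscaling-multiple-az" = "AUTOSCALING_MULTIPLE_AZ"
  }
}

# One Config rule per managed identifier; non-compliant resources are single-AZ
resource "aws_config_config_rule" "multi_az" {
  for_each = local.multi_az_rules
  name     = each.key

  source {
    owner             = "AWS"
    source_identifier = each.value
  }
}
```

The number of non-compliant resources across these rules then serves as the metric, with a target of 0.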
Maturity Level
Level 1 – All resources in one AZ
Level 2 – Databases Multi-AZ, Compute Single-AZ
Level 3 – Everything Multi-AZ, ASG min_size >= 2, Kubernetes Topology Spread
Level 4 – Automated AZ failover test, AZ distribution metrics tracked
Level 5 – Multi-region Active-Active or Active-Passive with global load balancing