Best Practice: Multi-AZ & High Availability Architecture

Context

An Availability Zone (AZ) is one or more physically isolated data centers within a region, with independent power, cooling, and networking. AZ failures are among the most frequent large-scale cloud infrastructure events and can affect production systems several times a year. A service deployed in a single AZ suffers a complete outage when that AZ fails.

Common problems without Multi-AZ configuration:

RDS Single-AZ: During an AZ failure, the database is unreachable for several minutes to hours
Single-AZ Compute: During an AZ failure, all instances go down simultaneously
Kubernetes without Pod Topology Spread: All pods can end up in the same AZ

Related Controls

WAF-REL-030 – Multi-AZ High Availability Deployment

Target State

Every production stack is deployed across at least 2 AZs:

Compute: Auto Scaling Group or Kubernetes with AZ distribution
Database: Multi-AZ with automatic failover
Cache: Multi-AZ replication group
Load Balancer: Subnets in at least 2 AZs

Technical Implementation

AWS: Multi-AZ RDS + Auto Scaling Group

# Subnet group across 3 AZs
resource "aws_db_subnet_group" "main" {
  name       = "payment-db-subnet-group"
  subnet_ids = [
    aws_subnet.private_az1.id,
    aws_subnet.private_az2.id,
    aws_subnet.private_az3.id,
  ]
  tags = var.mandatory_tags
}

# Multi-AZ RDS
resource "aws_db_instance" "main" {
  identifier             = "payment-db-prod"
  engine                 = "postgres"
  engine_version         = "15.4"
  instance_class         = "db.t3.medium"
  allocated_storage      = 100
  storage_type           = "gp3"
  multi_az               = true        # Automatic failover to a standby in another AZ, typically within 60-120 seconds
  db_subnet_group_name   = aws_db_subnet_group.main.name
  backup_retention_period = 14
  deletion_protection    = true
  tags                   = var.mandatory_tags
}

# Auto Scaling Group across 2 AZs
resource "aws_autoscaling_group" "api" {
  name               = "payment-api-asg"
  min_size           = 2    # At least 2 so the group can keep one instance in each AZ
  max_size           = 10
  desired_capacity   = 2
  vpc_zone_identifier = [
    aws_subnet.private_az1.id,
    aws_subnet.private_az2.id,
  ]
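  target_group_arns = [aws_lb_target_group.api.arn]

  # Launch template is assumed to be defined elsewhere in the stack
  launch_template {
    id      = aws_launch_template.api.id
    version = "$Latest"
  }

  # The ASG balances instances across the listed AZs by default;
  # capacity_rebalance applies only to Spot Instance replacement, not AZ distribution.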

  tag {
    key                 = "Name"
    value               = "payment-api"
    propagate_at_launch = true
  }
}

# ALB across 2 AZs
resource "aws_lb" "main" {
  name               = "payment-api-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets = [
    aws_subnet.public_az1.id,
    aws_subnet.public_az2.id,  # Minimum 2 AZs
  ]
  tags = var.mandatory_tags
}
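
The target state also calls for a Multi-AZ cache replication group, which the AWS example above does not cover. The following is a minimal sketch, assuming ElastiCache for Redis and AWS provider v4 or later (older versions use different argument names). The resource names, identifiers, and node type are placeholders and not taken from the original stack.

```hcl
# Assumption: the private subnets are the same ones used by the RDS subnet group.
resource "aws_elasticache_subnet_group" "main" {
  name = "payment-cache-subnet-group" # placeholder name
  subnet_ids = [
    aws_subnet.private_az1.id,
    aws_subnet.private_az2.id,
  ]
}

resource "aws_elasticache_replication_group" "main" {
  replication_group_id = "payment-cache-prod" # placeholder identifier
  description          = "Multi-AZ Redis replication group"
  engine               = "redis"
  node_type            = "cache.m6g.large" # placeholder size

  # One primary plus one replica; the replica can be promoted on failure.
  num_cache_clusters = 2

  # multi_az_enabled requires automatic failover to be switched on.
  automatic_failover_enabled = true
  multi_az_enabled           = true

  subnet_group_name = aws_elasticache_subnet_group.main.name
  tags              = var.mandatory_tags
}
```

With both flags enabled, ElastiCache places the primary and the replica in different AZs and promotes the replica if the primary's AZ becomes unavailable.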

Kubernetes: Pod Topology Spread Constraints

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
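spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api   # Must match the labelSelector used in the constraints
    spec:
      # Container definitions are omitted; this excerpt covers scheduling only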
      # Required: distribute pods evenly across AZs
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # Pods stay Pending rather than concentrating in one AZ
          labelSelector:
            matchLabels:
              app: payment-api

        # Optional: pods on different nodes within an AZ
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway  # Soft constraint
          labelSelector:
            matchLabels:
              app: payment-api

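      # Soft anti-affinity as fallback. A required rule on the zone key would
      # allow only one pod per AZ and leave the remaining replicas Pending.
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["payment-api"]
                topologyKey: topology.kubernetes.io/zone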

Azure: Zone-Redundant PostgreSQL

resource "azurerm_postgresql_flexible_server" "main" {
  name                = "payment-db-prod"
  resource_group_name = azurerm_resource_group.main.name
  location            = "westeurope"
  version             = "15"
  sku_name            = "GP_Standard_D4s_v3"
  storage_mb          = 32768
  availability_zone   = "1"

  # Zone-Redundant High Availability
  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "2"    # Separate physical data center
  }

  backup_retention_days        = 14
  geo_redundant_backup_enabled = true

  tags = var.mandatory_tags
}

GCP: Multi-Zone GKE + Cloud SQL

resource "google_container_cluster" "main" {
  name     = "payment-cluster"
  location = var.region  # Regional = Multi-Zone

  node_pool {
    name       = "default-pool"
    node_count = 1  # Per zone

    node_config {
      machine_type = "e2-standard-4"
    }

    # Nodes are created in each of the cluster's zones (three zones of the region by default)
  }

  # Pod spreading across zones is configured per workload
  # (topologySpreadConstraints), not on the cluster resource.
}

resource "google_sql_database_instance" "main" {
  name             = "payment-db-prod"
  database_version = "POSTGRES_15"
  region           = var.region

  settings {
    tier              = "db-custom-4-15360"
    availability_type = "REGIONAL"   # Automatic zone failover

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "02:00"
    }
  }
}

Typical Anti-Patterns

Multi-AZ only for DB, Single-AZ for Compute: During an AZ failure, the DB fails over but no application servers remain to use it
Auto Scaling Group without Min Size 2: With min_size = 1, the group can shrink to a single instance in one AZ, and an AZ failure then takes out all capacity
Kubernetes Pods without Topology Spread: The scheduler's default spreading is best-effort only, so all pods can still land in one AZ
Staging Single-AZ, Production Multi-AZ: Differences between staging and production leave failover paths untested
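
Two of these anti-patterns (an Auto Scaling Group with min_size below 2, and compute subnets that all sit in one AZ) can be caught before deployment. The following is a minimal sketch, assuming Terraform 1.2 or later and a configured AWS provider. The variable names asg_min_size and private_subnet_ids, the data source label, and the output name are hypothetical and not part of the configuration shown earlier.

```hcl
# Hypothetical inputs; the names are illustrative only.
variable "asg_min_size" {
  type        = number
  description = "Minimum number of instances in the production Auto Scaling Group."

  validation {
    condition     = var.asg_min_size >= 2
    error_message = "Production Auto Scaling Groups need min_size >= 2 so capacity can span two AZs."
  }
}

variable "private_subnet_ids" {
  type        = list(string)
  description = "Subnets passed to vpc_zone_identifier."
}

# Look up each subnet to learn which Availability Zone it belongs to.
data "aws_subnet" "asg" {
  for_each = toset(var.private_subnet_ids)
  id       = each.value
}

locals {
  # Distinct AZ names behind the subnet list.
  asg_zones = distinct([for s in data.aws_subnet.asg : s.availability_zone])
}

output "asg_zones" {
  description = "Availability Zones covered by the Auto Scaling Group subnets."
  value       = local.asg_zones

  # Terraform reports an error when all subnets resolve to a single AZ.
  precondition {
    condition     = length(local.asg_zones) >= 2
    error_message = "The Auto Scaling Group subnets must cover at least 2 Availability Zones."
  }
}
```

Passing var.asg_min_size to min_size and var.private_subnet_ids to vpc_zone_identifier in the Auto Scaling Group would tie these checks to the deployed resource.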

Metrics

AZ Distribution: % of instances/pods per AZ (target: each AZ within ±20% of an even share)
Single-AZ Resource Count: Number of production resources in only one AZ (target: 0)
Failover Test Frequency: Number of tested AZ failovers per quarter (target: >= 1)

Maturity Level

Level 1 – All resources in one AZ
Level 2 – Databases Multi-AZ, Compute Single-AZ
Level 3 – Everything Multi-AZ, ASG min_size >= 2, Kubernetes Topology Spread
Level 4 – Automated AZ failover test, AZ distribution metrics tracked
Level 5 – Multi-region Active-Active or Active-Passive with global load balancing