Best Practice: Backup, Recovery & Restore Tests

Context

A backup without a tested recovery procedure gives a false sense of security. In practice the most common failure is not a missing backup but an untested recovery procedure that breaks in an emergency, because manual steps are undocumented, keys are unavailable, or the IaC-defined environment does not exist in the target account.

Common problems without a structured backup recovery practice:

  • Backups are configured, but the restore procedure has never been executed

  • Backup encryption keys are stored in the same account as the data, so ransomware that compromises the account can lock out both

  • RTO target = 1 hour, but the last restore test took 4 hours

  • PITR enabled, but the restore procedure is not documented

Target State

  • Automated backups with RPO-aligned retention periods

  • Cross-account storage prevents single-account compromise

  • PITR for granular recovery at the transaction level

  • Restore procedure tested quarterly and documented

Technical Implementation

AWS: Cross-Account RDS Backup

# Production database with automated backups and PITR (the cross-account copy is handled by the backup plan below)
resource "aws_db_instance" "main" {
  identifier              = "payment-db-prod"
  engine                  = "postgres"
  engine_version          = "15.4"
  instance_class          = "db.t3.medium"
  allocated_storage       = 100
  backup_retention_period = 14          # 14 days PITR
  backup_window           = "02:00-03:00"
  deletion_protection     = true
  copy_tags_to_snapshot   = true

  tags = var.mandatory_tags
}

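# Source vault referenced by the plan (the name is an example)
resource "aws_backup_vault" "main" {
  name = "payment-db-backup-vault"
}

# AWS Backup Plan for cross-account replication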
resource "aws_backup_plan" "main" {
  name = "payment-db-backup-plan"

  rule {
    rule_name         = "daily-backup"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 2 * * ? *)"

    lifecycle {
      delete_after = 90  # 90 days retention
    }

    # Cross-account copy into backup account
    copy_action {
      destination_vault_arn = var.backup_account_vault_arn
      lifecycle {
        delete_after = 90
      }
    }
  }
}

# Assign the database to the plan; without a selection the plan backs up nothing.
# aws_iam_role.backup is assumed to be defined elsewhere with the AWS Backup service role policy.
resource "aws_backup_selection" "main" {
  name         = "payment-db-selection"
  plan_id      = aws_backup_plan.main.id
  iam_role_arn = aws_iam_role.backup.arn
  resources    = [aws_db_instance.main.arn]
}

# Backup Vault with WORM protection
resource "aws_backup_vault_lock_configuration" "main" {
  backup_vault_name   = aws_backup_vault.main.name
  min_retention_days  = 7
  max_retention_days  = 90
  changeable_for_days = 3  # Compliance mode after 3 days
}

S3 Versioning + Object Lock

resource "aws_s3_bucket" "data" {
  bucket = "payment-production-data-${var.account_id}"
  tags   = var.mandatory_tags
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_object_lock_configuration" "data" {
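  bucket = aws_s3_bucket.data.id

  # Object Lock requires versioning to be enabled first
  depends_on = [aws_s3_bucket_versioning.data]

  rule {
    default_retention {
      mode = "GOVERNANCE"  # Bypassable with s3:BypassGovernanceRetention; COMPLIANCE is not
      days = 30            # 30 days deletion protection
    }
  }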
}

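# Replication into backup account (source and destination buckets must both be versioned)
resource "aws_s3_bucket_replication_configuration" "data" {
  depends_on = [aws_s3_bucket_versioning.data]

  bucket = aws_s3_bucket.data.id
  role   = aws_iam_role.replication.arn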

  rule {
    id     = "backup-account-replication"
    status = "Enabled"

    destination {
      bucket        = var.backup_account_bucket_arn
      storage_class = "GLACIER_IR"
    }
  }
}

Azure: Geo-Redundant Database Backup

resource "azurerm_postgresql_flexible_server" "main" {
  name                         = "payment-db-prod"
  resource_group_name          = azurerm_resource_group.main.name
  location                     = "westeurope"
  version                      = "15"
  sku_name                     = "GP_Standard_D4s_v3"
  backup_retention_days        = 35    # Maximum Azure value
  geo_redundant_backup_enabled = true  # Cross-region backup

  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "2"
  }

  tags = var.mandatory_tags
}

Restore Test Automation (Bash/AWS CLI)

#!/bin/bash
# scripts/restore-test.sh
# Quarterly backup restore test

set -euo pipefail

START_TIME=$(date +%s)  # Start of the RTO measurement
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
RESTORE_DB_ID="payment-db-restore-test-${TIMESTAMP}"
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier payment-db-prod \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

echo "Using snapshot: ${SNAPSHOT_ID}"

# Restore into isolated test subnet
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "${RESTORE_DB_ID}" \
  --db-snapshot-identifier "${SNAPSHOT_ID}" \
  --db-instance-class db.t3.micro \
  --db-subnet-group-name restore-test-subnet-group \
  --no-publicly-accessible

# Wait until available
aws rds wait db-instance-available \
  --db-instance-identifier "${RESTORE_DB_ID}"

# Connection test
DB_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "${RESTORE_DB_ID}" \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

echo "DB Endpoint: ${DB_ENDPOINT}"

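# Data integrity test (credentials are expected via PGPASSWORD or ~/.pgpass)
PSQL_CMD=(psql -h "${DB_ENDPOINT}" -U admin -d payment_db)
ROW_COUNT=$("${PSQL_CMD[@]}" -At -c "SELECT COUNT(*) FROM transactions WHERE created_at > NOW() - INTERVAL '24h'")
echo "Rows in last 24h: ${ROW_COUNT}"
if [ "${ROW_COUNT}" -eq 0 ]; then
  echo "WARNING: no rows from the last 24h; the restored data may be stale" >&2
fi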

# Measure RTO
END_TIME=$(date +%s)
echo "Restore completed. Elapsed: $((END_TIME - START_TIME))s"

# Cleanup
aws rds delete-db-instance \
  --db-instance-identifier "${RESTORE_DB_ID}" \
  --skip-final-snapshot
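
The elapsed time printed by the script only becomes meaningful when it is compared against the RTO target. A minimal sketch of such a gate follows; the helper name `check_rto` and the convention of passing both values in whole seconds are assumptions, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): compares a measured
# restore duration against the RTO target. Both arguments are whole seconds.
check_rto() {
  local elapsed_s="$1"
  local target_s="$2"
  if [ "${elapsed_s}" -le "${target_s}" ]; then
    echo "RTO met: ${elapsed_s}s <= ${target_s}s"
    return 0
  fi
  echo "RTO missed: ${elapsed_s}s > ${target_s}s"
  return 1
}
```

For a one-hour target this could be called as `check_rto "$((END_TIME - START_TIME))" 3600`. Because the script runs with `set -e`, a missed target would abort it before the cleanup step, so the call should probably be guarded (for example with `|| RTO_FAILED=1`) if the test instance must still be deleted.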

Typical Anti-Patterns

  • Backup in the same account: Ransomware encrypts backups together with production data

  • Retention period = 1 day: Data errors discovered after 48h cannot be remediated

  • Restore test in the same environment as production: The test silently depends on production configuration and resources that will be missing in an emergency

  • Outdated manual restore guide: Service URLs, secrets and IAM roles have changed

Metrics

  • Backup Success Rate: % of scheduled backup jobs that succeeded (target: 100%)

  • Restore Test RTO: Actual time to recovery in the last restore test (target: ≤ RTO)

  • Data Integrity Score: % of validated data points after restore (target: 100%)

  • Backup Age: Age of the newest available backup (target: < RPO)
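
Two of these metrics reduce to simple arithmetic. The sketch below shows one way to compute them; the function names, the integer rounding, and the use of Unix epoch seconds are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helpers for two of the metrics above; names and units are assumptions.

# Backup Success Rate: percentage of scheduled jobs that succeeded,
# rounded down to a whole percent.
backup_success_rate() {
  local succeeded="$1"
  local scheduled="$2"
  if [ "${scheduled}" -eq 0 ]; then
    echo "no scheduled jobs" >&2
    return 1
  fi
  echo $(( succeeded * 100 / scheduled ))
}

# Backup Age: succeeds only if the newest backup is younger than the RPO.
# Timestamps are Unix epoch seconds; the RPO is given in seconds.
backup_age_within_rpo() {
  local newest_backup_epoch="$1"
  local now_epoch="$2"
  local rpo_s="$3"
  local age_s=$(( now_epoch - newest_backup_epoch ))
  echo "backup age: ${age_s}s (RPO: ${rpo_s}s)"
  [ "${age_s}" -lt "${rpo_s}" ]
}
```

Rounding down means the success rate only reads 100 when every scheduled job succeeded, which matches the 100% target. The age check uses a strict comparison to mirror the "< RPO" target.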

Maturity Level

Level 1 – No backups or ad-hoc snapshots
Level 2 – Automated backups, never tested
Level 3 – PITR, cross-account, restore tested quarterly and documented
Level 4 – Automated monthly restore test in pipeline
Level 5 – WORM backups, CDP, continuous backup integrity validation