Best Practice: Backup, Recovery & Restore Tests
Context
A backup without a tested recovery path is only an illusion of safety. In practice, the most common failure is not a missing backup but an untested recovery procedure that breaks in an emergency because manual steps are undocumented, keys are unavailable, or the IaC environment for the target account does not exist.
Common problems without a structured backup and recovery practice:
- Backups are configured, but the restore procedure has never been executed
- Backup encryption keys live in the same account as the data, so ransomware that compromises the account takes out both
- RTO target = 1 hour, but the last restore test took 4 hours
- Point-in-time recovery (PITR) is enabled, but the restore procedure is not documented
Related Controls
- WAF-REL-040 – Backup & Recovery Validation
- WAF-REL-070 – Disaster Recovery Testing
Target State
- Automated backups with RPO-aligned retention periods
- Cross-account storage, so that a single compromised account cannot destroy every copy
- PITR for granular recovery at the transaction level
- Restore procedure that is documented and tested quarterly
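The quarterly cadence is easy to let slip, so it helps to check it mechanically. The following is a minimal sketch, assuming the timestamp of the last successful restore test is recorded somewhere as a Unix epoch; the function names and the 90-day default are illustrative, not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Sketch: flag an overdue restore test. All names are hypothetical.

# days_since THEN_EPOCH NOW_EPOCH
# Prints the number of whole days between the two timestamps.
days_since() {
  local then=$1 now=$2
  echo $(( (now - then) / 86400 ))
}

# restore_test_overdue LAST_TEST_EPOCH NOW_EPOCH [MAX_AGE_DAYS]
# Returns 0 (true) when the last test is older than MAX_AGE_DAYS (default 90).
restore_test_overdue() {
  local last=$1 now=$2 max_days=${3:-90}
  (( $(days_since "$last" "$now") > max_days ))
}
```

A scheduled job could call `restore_test_overdue "$LAST_TEST_EPOCH" "$(date +%s)"` and raise an alert when it returns true, where `LAST_TEST_EPOCH` is whatever the team records after each successful test.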
Technical Implementation
AWS: Cross-Account RDS Backup
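# Backup target: Separate AWS account
# Master credentials and networking are omitted for brevity.
resource "aws_db_instance" "main" {
  identifier              = "payment-db-prod"
  engine                  = "postgres"
  engine_version          = "15.4"
  instance_class          = "db.t3.medium"
  allocated_storage       = 100
  backup_retention_period = 14 # 14 days PITR
  backup_window           = "02:00-03:00"
  deletion_protection     = true
  copy_tags_to_snapshot   = true
  tags                    = var.mandatory_tags
}

# Local vault referenced by the plan and the vault lock (the name is an assumption)
resource "aws_backup_vault" "main" {
  name = "payment-db-backup-vault"
  tags = var.mandatory_tags
}

# AWS Backup Plan for cross-account replication
resource "aws_backup_plan" "main" {
  name = "payment-db-backup-plan"

  rule {
    rule_name         = "daily-backup"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 2 * * ? *)"

    lifecycle {
      delete_after = 90 # 90 days retention
    }

    # Cross-account copy into backup account
    # (the destination vault's access policy must allow copies from this account)
    copy_action {
      destination_vault_arn = var.backup_account_vault_arn

      lifecycle {
        delete_after = 90
      }
    }
  }
}

# Assign the database to the plan; without a selection the plan backs up nothing.
# var.backup_role_arn is an assumed input: an IAM role that AWS Backup can assume.
resource "aws_backup_selection" "main" {
  name         = "payment-db-selection"
  plan_id      = aws_backup_plan.main.id
  iam_role_arn = var.backup_role_arn
  resources    = [aws_db_instance.main.arn]
}

# Backup Vault with WORM protection
resource "aws_backup_vault_lock_configuration" "main" {
  backup_vault_name   = aws_backup_vault.main.name
  min_retention_days  = 7
  max_retention_days  = 90
  changeable_for_days = 3 # Compliance mode after 3 days
}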
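With a 14-day PITR window configured, the restore step itself should be written down rather than improvised. The sketch below prints the AWS CLI call for a point-in-time restore instead of executing it, so the runbook step can be reviewed before use. The function name is hypothetical and the instance identifiers passed to it are placeholders; `restore-db-instance-to-point-in-time` and the flags shown are the standard AWS CLI ones.

```shell
#!/usr/bin/env bash
# Sketch: build a documented PITR restore command for RDS.

# build_pitr_restore_cmd SOURCE_DB TARGET_DB RESTORE_TIME_UTC
# Prints the command rather than running it; fails on a malformed timestamp.
build_pitr_restore_cmd() {
  local source_db=$1 target_db=$2 restore_time=$3
  # Accept only ISO 8601 UTC timestamps such as 2024-05-01T03:15:00Z
  local pattern='^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
  if [[ ! $restore_time =~ $pattern ]]; then
    echo "restore time must look like 2024-05-01T03:15:00Z" >&2
    return 1
  fi
  printf 'aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier %s --target-db-instance-identifier %s --restore-time %s --no-publicly-accessible\n' \
    "$source_db" "$target_db" "$restore_time"
}
```

For example, `build_pitr_restore_cmd payment-db-prod payment-db-pitr-test 2024-05-01T03:15:00Z` prints a command that restores into a new, non-public instance; the restored instance still needs its subnet group and security groups set as in the restore test further down.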
S3 Versioning + Object Lock
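# Object Lock is enabled at bucket creation so the lock configuration below applies
resource "aws_s3_bucket" "data" {
  bucket              = "payment-production-data-${var.account_id}"
  object_lock_enabled = true
  tags                = var.mandatory_tags
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_object_lock_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  # Object Lock requires versioning
  depends_on = [aws_s3_bucket_versioning.data]

  rule {
    default_retention {
      # GOVERNANCE can be bypassed by principals holding
      # s3:BypassGovernanceRetention; COMPLIANCE cannot.
      mode = "GOVERNANCE"
      days = 30 # 30 days deletion protection
    }
  }
}

# Replication into backup account
# (the IAM role aws_iam_role.replication is defined elsewhere; the destination
# bucket must have versioning and Object Lock enabled as well)
resource "aws_s3_bucket_replication_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  role   = aws_iam_role.replication.arn

  # Replication requires versioning on the source bucket
  depends_on = [aws_s3_bucket_versioning.data]

  rule {
    id     = "backup-account-replication"
    status = "Enabled"

    destination {
      bucket        = var.backup_account_bucket_arn
      storage_class = "GLACIER_IR"
    }
  }
}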
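A replication rule that is configured but silently failing gives the same false comfort as an untested backup. As a minimal sketch, the helper below interprets the replication status that S3 reports for an object; the function name is hypothetical, and the status values are those S3 documents for the `x-amz-replication-status` header (`PENDING`, `COMPLETED`, `FAILED` on the source, `REPLICA` on the destination).

```shell
#!/usr/bin/env bash
# Sketch: interpret an S3 object's replication status.

# replication_ok STATUS
# Returns 0 when the object is replicated (source side) or is itself a replica.
replication_ok() {
  case "$1" in
    COMPLETED|REPLICA) return 0 ;;
    *) return 1 ;; # PENDING, FAILED, or None (no replication rule applied)
  esac
}
```

The status for a sample object can be fetched with `aws s3api head-object --bucket "$BUCKET" --key "$KEY" --query ReplicationStatus --output text` and passed to `replication_ok`; `BUCKET` and `KEY` are placeholders for a bucket and a recently written object.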
Azure: Geo-Redundant Database Backup
resource "azurerm_postgresql_flexible_server" "main" {
  name                         = "payment-db-prod"
  resource_group_name          = azurerm_resource_group.main.name
  location                     = "westeurope"
  version                      = "15"
  sku_name                     = "GP_Standard_D4s_v3"
  backup_retention_days        = 35   # Maximum Azure value
  geo_redundant_backup_enabled = true # Cross-region backup

  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "2"
  }

  tags = var.mandatory_tags
}
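The Azure side needs a documented restore step just as much as the AWS side. The following sketch prints the Azure CLI call for a point-in-time restore of a flexible server into a new server. The function name is hypothetical, and the resource group and target server names used in the example are placeholders not taken from the configuration above.

```shell
#!/usr/bin/env bash
# Sketch: build a documented PITR restore command for Azure PostgreSQL Flexible Server.

# build_azure_pitr_restore_cmd RESOURCE_GROUP SOURCE_SERVER TARGET_SERVER RESTORE_TIME_UTC
# Prints the command rather than running it, so it can be reviewed first.
build_azure_pitr_restore_cmd() {
  local resource_group=$1 source_server=$2 target_server=$3 restore_time=$4
  printf 'az postgres flexible-server restore --resource-group %s --name %s --source-server %s --restore-time %s\n' \
    "$resource_group" "$target_server" "$source_server" "$restore_time"
}
```

For example, `build_azure_pitr_restore_cmd rg-payment payment-db-prod payment-db-restore-test 2024-05-01T03:15:00Z` prints a command that creates `payment-db-restore-test` from the state of `payment-db-prod` at the given time.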
Restore Test Automation (Bash/AWS CLI)
#!/bin/bash
# scripts/restore-test.sh
# Quarterly backup restore test
set -euo pipefail

START_TIME=$(date +%s) # Start of the RTO measurement
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
RESTORE_DB_ID="payment-db-restore-test-${TIMESTAMP}"

# Cleanup: always delete the test instance, even if a later step fails
cleanup() {
  aws rds delete-db-instance \
    --db-instance-identifier "${RESTORE_DB_ID}" \
    --skip-final-snapshot >/dev/null 2>&1 || true
}
trap cleanup EXIT

# Pick the most recent snapshot of the production instance
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier payment-db-prod \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)
if [[ -z "${SNAPSHOT_ID}" || "${SNAPSHOT_ID}" == "None" ]]; then
  echo "No snapshot found for payment-db-prod" >&2
  exit 1
fi
echo "Using snapshot: ${SNAPSHOT_ID}"

# Restore into isolated test subnet
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "${RESTORE_DB_ID}" \
  --db-snapshot-identifier "${SNAPSHOT_ID}" \
  --db-instance-class db.t3.micro \
  --db-subnet-group-name restore-test-subnet-group \
  --no-publicly-accessible

# Wait until available
aws rds wait db-instance-available \
  --db-instance-identifier "${RESTORE_DB_ID}"

# Connection test
DB_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "${RESTORE_DB_ID}" \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)
echo "DB Endpoint: ${DB_ENDPOINT}"

# Data integrity test
# The password is expected via PGPASSWORD or ~/.pgpass
PSQL_CMD=(psql -h "${DB_ENDPOINT}" -U admin -d payment_db)
ROW_COUNT=$("${PSQL_CMD[@]}" -tA -c "SELECT COUNT(*) FROM transactions WHERE created_at > NOW() - INTERVAL '24h'")
echo "Rows in last 24h: ${ROW_COUNT}"

# Measure RTO
END_TIME=$(date +%s)
echo "Restore completed. Elapsed: $((END_TIME - START_TIME))s"
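Measuring the elapsed time only helps if it is compared against the RTO target. As a minimal sketch, the helper below turns the measurement into a pass/fail result; the function name is hypothetical, and the 3600-second target in the example corresponds to the 1-hour RTO mentioned in the Context section.

```shell
#!/usr/bin/env bash
# Sketch: compare a measured restore duration against the RTO target.

# check_rto ELAPSED_SECONDS RTO_SECONDS
# Prints a PASS/FAIL line and returns non-zero when the RTO is exceeded.
check_rto() {
  local elapsed=$1 rto=$2
  if (( elapsed <= rto )); then
    echo "PASS: restore took ${elapsed}s (RTO ${rto}s)"
  else
    echo "FAIL: restore took ${elapsed}s, exceeding RTO ${rto}s by $((elapsed - rto))s"
    return 1
  fi
}
```

Called at the end of a restore test as `check_rto "$((END_TIME - START_TIME))" 3600`, a 4-hour restore makes the script exit non-zero under `set -e`, so the pipeline run is marked as failed instead of only logging the duration.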
Typical Anti-Patterns
- Backup in the same account: Ransomware encrypts backups together with production data
- Retention period = 1 day: Data errors discovered after 48h cannot be remediated
- Restore test in the same environment as production: The test silently relies on production configuration, so in a real emergency the required resources are missing
- Outdated manual restore guide: Service URLs, secrets and IAM roles have changed since it was written
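The first anti-pattern can be checked automatically by comparing the account ID embedded in the backup vault ARN with the account that owns the data. The sketch below uses hypothetical function names and relies only on the standard ARN layout, in which the account ID is the fifth colon-separated field.

```shell
#!/usr/bin/env bash
# Sketch: verify that a backup vault lives in a different account than the data.

# arn_account_id ARN
# Prints the account ID field of an ARN (empty for ARNs that carry none, such as S3 bucket ARNs).
arn_account_id() {
  echo "$1" | cut -d: -f5
}

# is_cross_account VAULT_ARN SOURCE_ACCOUNT_ID
# Returns 0 only when the vault ARN names an account different from the source account.
is_cross_account() {
  local vault_arn=$1 source_account=$2
  local vault_account
  vault_account=$(arn_account_id "$vault_arn")
  [[ -n $vault_account && $vault_account != "$source_account" ]]
}
```

The source account can be obtained with `aws sts get-caller-identity --query Account --output text`; a CI step could then fail when `is_cross_account "$VAULT_ARN" "$SOURCE_ACCOUNT"` returns false, with `VAULT_ARN` set to the destination vault used in the copy action.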