Controls (WAF-REL)

The Reliability pillar is operationalized by 10 measurable controls. Each control has a unique ID in the format WAF-REL-NNN, a severity rating, machine-readable YAML checks and a maturity level breakdown.

The YAML source files are located under modules/controls/controls/WAF-REL-*.yml and can be executed directly by the WAF++ Checker Tool.

Controls Overview

Control ID	Title	Severity	Category
WAF-REL-010	SLA & SLO Definition Documented	Critical	Reliability Governance
WAF-REL-020	Health Checks & Readiness Probes Configured	High	Health Monitoring
WAF-REL-030	Multi-AZ High Availability Deployment	High	High Availability
WAF-REL-040	Backup & Recovery Validation	Critical	Backup Recovery
WAF-REL-050	Circuit Breaker & Timeout Configuration	High	Resilience Patterns
WAF-REL-060	Incident Response & Runbook Readiness	High	Incident Response
WAF-REL-070	Disaster Recovery Testing	High	Disaster Recovery
WAF-REL-080	Dependency & Upstream Resilience Management	Medium	Dependency Management
WAF-REL-090	Chaos Engineering & Fault Injection	Medium	Chaos Engineering
WAF-REL-100	Reliability Debt Register & Quarterly Review	Medium	Reliability Governance

WAF-REL-010 – SLA & SLO Definition Documented

Severity: Critical | Category: Reliability Governance | Automatable: Medium

Every production workload MUST have documented SLOs (availability, latency, error rate). SLOs MUST be tracked in monitoring dashboards, with alerting on the error budget burn rate.

Requirement: SLO document versioned; error budget calculated; burn rate alerts configured; quarterly review evidenced.
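The error budget and burn rate named in the requirement can be illustrated with a small calculation. This is a minimal sketch, not part of the control: the function names, the sample numbers and the paging threshold of 14.4 (a commonly cited fast-burn value for a one-hour window on a 30-day SLO) are assumptions chosen for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction, e.g. roughly 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 consumes the budget exactly over the SLO window;
    higher values exhaust it proportionally sooner.
    """
    if total <= 0:
        return 0.0  # no traffic in the window, nothing burned
    return (failed / total) / error_budget(slo_target)


# Hypothetical one-hour window: 120 failed out of 10,000 requests, 99.9% SLO.
rate = burn_rate(failed=120, total=10_000, slo_target=0.999)
should_page = rate >= 14.4  # assumed threshold, not defined by WAF-REL-010
```

In this example the burn rate is about 12, so the assumed paging threshold is not reached, although the budget is still being consumed roughly twelve times faster than the SLO allows.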

Terraform Checks (Excerpt):

waf-rel-010.tf.aws.cloudwatch-slo-alarm – CloudWatch alarm for SLO availability monitoring
waf-rel-010.tf.azurerm.monitor-metric-alert-slo – Azure Monitor alert for SLO tracking
waf-rel-010.tf.google.monitoring-alert-policy-slo – GCP Monitoring alert with notification channels

Evidence: SLO document per workload (required); monitoring dashboard (required)

Best Practice: SLO & SLA Definition →

WAF-REL-020 – Health Checks & Readiness Probes Configured

Severity: High | Category: Health Monitoring | Automatable: High

All production services MUST expose health check endpoints and configure readiness/liveness probes. Load balancers MUST use health checks with explicit thresholds, intervals and timeouts.

Requirement: No deployment without health check; probes with measured initialDelaySeconds; health check paths validate real dependencies.
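A health check path that "validates real dependencies" could aggregate one check per dependency and report not-ready when any of them fails. The sketch below is framework-neutral and hypothetical: the `readiness` function and the dependency names are assumptions, and the returned status code would still need to be wired into whatever HTTP handler the service uses.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-dependency detail)."""
    detail: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            detail[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as a failed dependency
            detail[name] = f"error: {exc}"
    # 200 only when every dependency reported ok; otherwise 503 (not ready).
    status = 200 if all(v == "ok" for v in detail.values()) else 503
    return status, detail


# Hypothetical checks; real ones would ping the database, cache, etc.
status, detail = readiness({"database": lambda: True, "cache": lambda: False})
```

Here `status` is 503 because the cache check failed, which lets a load balancer or readiness probe take the instance out of rotation.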

Terraform Checks (Excerpt):

waf-rel-020.tf.aws.alb-target-group-health-check – ALB target group health check configured
waf-rel-020.tf.azurerm.lb-probe – Azure LB probe with request_path
waf-rel-020.tf.google.compute-health-check – GCP compute health check with HTTP path

Evidence: IaC with probe configuration (required); LB health check (required)

Best Practice: Health Checks & Probes →

WAF-REL-030 – Multi-AZ High Availability Deployment

Severity: High | Category: High Availability | Automatable: High

All production workloads MUST be distributed across at least 2 Availability Zones. Single-AZ deployments in production are not permitted without written risk acceptance.

Requirement: Min. 2 AZs for compute; Multi-AZ for all databases; LB across 2+ AZs; AZ failover tested at least quarterly.
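The two-zone minimum can also be verified against an inventory export, independent of the Terraform checks. The following is a minimal sketch assuming a hypothetical list of `(workload, zone)` pairs; the function name and zone names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def single_az_workloads(instances: Iterable[Tuple[str, str]], minimum: int = 2) -> List[str]:
    """Return workloads whose instances span fewer than `minimum` distinct zones."""
    zones: Dict[str, Set[str]] = defaultdict(set)
    for workload, zone in instances:
        zones[workload].add(zone)
    return sorted(w for w, z in zones.items() if len(z) < minimum)


# Hypothetical inventory: "batch" runs two instances, but both in the same zone.
inventory = [
    ("api", "eu-central-1a"),
    ("api", "eu-central-1b"),
    ("batch", "eu-central-1a"),
    ("batch", "eu-central-1a"),
]
violations = single_az_workloads(inventory)
```

Counting distinct zones rather than instances matters: two instances in one zone satisfy a `min_size >= 2` check but not the Multi-AZ requirement.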

Terraform Checks (Excerpt):

waf-rel-030.tf.aws.rds-multi-az – RDS multi_az = true
waf-rel-030.tf.aws.autoscaling-multi-az – ASG with min_size >= 2 and multi-AZ subnets
waf-rel-030.tf.azurerm.db-availability-zone – Azure DB with ZoneRedundant HA
waf-rel-030.tf.google.sql-availability-type – Cloud SQL REGIONAL availability

Evidence: IaC with Multi-AZ configuration (required); cloud console screenshot (required)

Best Practice: Multi-AZ & High Availability →

WAF-REL-040 – Backup & Recovery Validation

Severity: Critical | Category: Backup Recovery | Automatable: High

All production databases and stateful services MUST have automated backups with retention >= 7 days and PITR. Backups MUST be stored in a separate account/region. Recovery MUST be tested and documented at least quarterly.

Requirement: PITR enabled; cross-account backup; restore test with results documentation; backup failure alerts configured.
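As an illustration, the requirement list can be expressed as a single evaluation over a backup configuration record. This is a sketch under stated assumptions: `BackupConfig` and its fields are hypothetical, and "quarterly" is interpreted here as at most 92 days since the last restore test.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BackupConfig:
    # Hypothetical fields; map them from your own IaC or cloud inventory.
    retention_days: int
    pitr_enabled: bool
    cross_account_copy: bool
    failure_alert: bool
    days_since_restore_test: int


def backup_findings(cfg: BackupConfig) -> List[str]:
    """List every WAF-REL-040 requirement the configuration misses."""
    findings: List[str] = []
    if cfg.retention_days < 7:
        findings.append("retention below 7 days")
    if not cfg.pitr_enabled:
        findings.append("PITR disabled")
    if not cfg.cross_account_copy:
        findings.append("no cross-account/region copy")
    if not cfg.failure_alert:
        findings.append("no backup failure alert")
    if cfg.days_since_restore_test > 92:  # assumed reading of "quarterly"
        findings.append("restore test older than one quarter")
    return findings


findings = backup_findings(BackupConfig(3, True, False, True, 30))
```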

Terraform Checks (Excerpt):

waf-rel-040.tf.aws.rds-backup-retention – RDS backup_retention_period >= 7, deletion_protection = true
waf-rel-040.tf.aws.s3-versioning – S3 versioning = Enabled
waf-rel-040.tf.azurerm.postgresql-backup – Azure DB backup_retention_days >= 7, geo_redundant = true
waf-rel-040.tf.google.sql-backup-config – Cloud SQL backup + PITR enabled

Evidence: IaC backup configuration (required); restore test report (required)

Best Practice: Backup & Recovery →

WAF-REL-050 – Circuit Breaker & Timeout Configuration

Severity: High | Category: Resilience Patterns | Automatable: High

All inter-service and downstream HTTP calls MUST define explicit timeouts. Critical dependencies MUST implement circuit breakers. Retry logic MUST use exponential backoff with jitter.

Requirement: No reliance on implicit default timeouts for external calls; circuit breaker for critical dependencies; retry maximum 3 attempts with backoff; bulkhead isolation for different classes.
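Retry with exponential backoff and jitter, capped at three attempts, might look like the sketch below. It uses the "full jitter" variant (a random delay between zero and the exponential ceiling); the base delay, the cap and the function names are assumptions, not values prescribed by the control. The wrapped call is expected to carry its own explicit timeout.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 5.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn: Callable[[], T], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn up to max_attempts times, sleeping with jittered backoff between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()  # fn should set its own explicit timeout
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. A production version would typically retry only on errors known to be transient rather than on every exception.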

Terraform Checks (Excerpt):

waf-rel-050.tf.aws.alb-idle-timeout – ALB idle_timeout explicitly set
waf-rel-050.tf.azurerm.app-gateway-timeout – App Gateway request_timeout_in_seconds
waf-rel-050.tf.google.cloud-run-timeout – Cloud Run timeout_seconds configured

Evidence: Terraform/service mesh configuration (required); app config files (required)

Best Practice: Circuit Breaker & Timeouts →

WAF-REL-060 – Incident Response & Runbook Readiness

Severity: High | Category: Incident Response | Automatable: Medium

All production workloads MUST have a documented incident response plan with severity definitions, escalation paths and on-call rotation. Runbooks MUST be directly linked from alert notifications. Post-incident reviews MUST be completed within 5 business days.

Requirement: 4 severity levels defined; on-call configured; all critical alerts with runbook link; MTTR tracked; post-mortems for SEV1/SEV2 documented.
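Two of these requirements lend themselves to simple automation: tracking MTTR and finding critical alerts without a runbook link. The sketch below assumes a hypothetical incident log of `(detected, resolved)` timestamps and alert definitions as dictionaries with `name`, `severity` and `runbook_url` keys; none of these structures are defined by the control.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def mean_time_to_restore(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((resolved - detected for detected, resolved in incidents), timedelta(0))
    return total / len(incidents)


def alerts_missing_runbook(alerts: List[Dict[str, str]]) -> List[str]:
    """Names of critical alerts that carry no runbook link."""
    return [
        a["name"]
        for a in alerts
        if a.get("severity") == "critical" and not a.get("runbook_url")
    ]


# Hypothetical data: a 30-minute and a 90-minute incident.
mttr = mean_time_to_restore([
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 9, 12, 0), datetime(2024, 2, 9, 13, 30)),
])
```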

Terraform Checks (Excerpt):

waf-rel-060.tf.aws.sns-topic-alarm-action – CloudWatch alarms with alarm_actions and ok_actions
waf-rel-060.tf.azurerm.action-group-configured – Azure Monitor action group with recipients
waf-rel-060.tf.google.monitoring-notification-channel – GCP alert policy with notification_channels

Evidence: Incident response plan (required); post-incident review records (required)

Best Practice: Incident Response & Runbooks →

WAF-REL-070 – Disaster Recovery Testing

Severity: High | Category: Disaster Recovery | Automatable: Partial

All critical production workloads MUST conduct DR tests at least twice a year with documented RTO/RPO results. DR plans MUST be updated after significant architecture changes.

Requirement: DR plan with RTO/RPO targets; semi-annual test; results documentation; any deviation from targets requires a remediation plan within 30 days.
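The deviation rule can be sketched as a comparison of measured against target RTO/RPO, yielding a remediation deadline when either target is missed. The `DrTestResult` record and its field names are hypothetical, and the 30 days are counted from the test date as an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class DrTestResult:
    # Hypothetical record of one DR test; all durations in minutes.
    test_date: date
    rto_target_min: int
    rto_measured_min: int
    rpo_target_min: int
    rpo_measured_min: int


def remediation_due(result: DrTestResult) -> Optional[date]:
    """Return the remediation-plan deadline if a target was missed, else None."""
    missed = (
        result.rto_measured_min > result.rto_target_min
        or result.rpo_measured_min > result.rpo_target_min
    )
    return result.test_date + timedelta(days=30) if missed else None


# Hypothetical test: RTO target 60 min, measured 95 min, so a plan is due.
due = remediation_due(DrTestResult(date(2024, 3, 1), 60, 95, 15, 10))
```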

Terraform Checks (Excerpt):

waf-rel-070.tf.aws.route53-health-check-failover – Route 53 health check failover routing
waf-rel-070.tf.azurerm.traffic-manager-profile – Azure Traffic Manager with failover routing and monitor
waf-rel-070.tf.google.dns-managed-zone-failover – GCP DNS with TTL <= 60s for fast failover

Evidence: DR test reports from last 12 months (required); DR plan (required)

Best Practice: Disaster Recovery →

WAF-REL-080 – Dependency & Upstream Resilience Management

Severity: Medium | Category: Dependency Management | Automatable: Medium

All production workloads MUST maintain an inventoried and classified dependency register. Critical dependencies MUST have circuit breakers. Fallback behavior MUST be defined for all optional dependencies.

Requirement: Dependency register with criticality and SLA; CB for critical dependencies; fallback for optional dependencies; quarterly review.
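A circuit breaker combined with a fallback could be sketched as follows. This is a deliberately minimal, single-threaded illustration: the class, its thresholds and the injectable clock are assumptions, and production systems usually rely on a resilience library or service mesh rather than hand-written code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast without calling the dependency
            # cool-down elapsed: half-open, allow one trial call below
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0  # success closes the circuit
        self.opened_at = None
        return result
```

For an optional dependency the fallback might return a cached or default value; for a critical dependency it might raise a fast, well-defined error instead of waiting on a failing upstream.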

Terraform Checks (Excerpt):

waf-rel-080.tf.aws.vpc-endpoint-dependency-isolation – AWS VPC endpoints for AWS API isolation
waf-rel-080.tf.azurerm.private-endpoint-dependency – Azure private endpoints for managed services
waf-rel-080.tf.google.vpc-sc-dependency-isolation – GCP private_ip_google_access = true

Evidence: Dependency register (required); circuit breaker configuration (required)

WAF-REL-090 – Chaos Engineering & Fault Injection

Severity: Medium | Category: Chaos Engineering | Automatable: Medium

Production and staging workloads MUST undergo structured chaos experiments every quarter, each with a hypothesis framework, stop conditions and results documentation.

Requirement: Hypothesis-driven tests; stop conditions configured; experiments in staging before production; blast radius limited; results converted into remediation.
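The relationship between hypothesis, stop condition and rollback can be shown in a few lines. The sketch below is a hypothetical harness, not the API of any chaos tool: `Experiment`, `run_experiment` and the error-rate samples are assumptions, with the fault injection and rollback passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Experiment:
    hypothesis: str        # expected steady-state behavior under the fault
    max_error_rate: float  # stop condition: abort when exceeded


def run_experiment(exp: Experiment, samples: Iterable[float],
                   inject: Callable[[], None], rollback: Callable[[], None]) -> bool:
    """Inject the fault, watch error-rate samples, and always roll back.

    Returns True if the hypothesis held (stop condition never triggered).
    """
    inject()
    try:
        for error_rate in samples:
            if error_rate > exp.max_error_rate:
                return False  # stop condition hit: abort the experiment
        return True
    finally:
        rollback()  # runs on success, abort, or unexpected error


exp = Experiment("p99 latency and error rate stay within SLO when one node is lost", 0.05)
log = []
held = run_experiment(exp, [0.01, 0.02, 0.09, 0.01],
                      inject=lambda: log.append("inject"),
                      rollback=lambda: log.append("rollback"))
```

In this run the third sample exceeds the stop condition, so the experiment aborts, the fault is rolled back, and the result would feed a remediation item.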

Terraform Checks (Excerpt):

waf-rel-090.tf.aws.fis-experiment-template – AWS FIS with stop_condition configured
waf-rel-090.tf.azurerm.chaos-studio-experiment – Azure Chaos Studio experiment with identity
waf-rel-090.tf.google.fault-injection-test – GCP URL map with fault_injection_policy

Evidence: Chaos experiment reports (required); chaos engineering charter (required)

Best Practice: Chaos Engineering →

WAF-REL-100 – Reliability Debt Register & Quarterly Review

Severity: Medium | Category: Reliability Governance | Automatable: Low–Medium

All known reliability risks and deferred improvements MUST be captured in a versioned Reliability Debt Register with owner, severity and target date. Quarterly review is mandatory.

Requirement: Register version-controlled; all entries with owner, P1–P4 priority and target date; quarterly review with minutes; P1 items addressed within one sprint.
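Because the register is version-controlled, entry completeness can be validated automatically, for example in a CI job. The following sketch assumes a hypothetical `DebtEntry` record; the field names and problem messages are illustrative and not part of the control definition.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional

VALID_PRIORITIES = {"P1", "P2", "P3", "P4"}


@dataclass
class DebtEntry:
    # Hypothetical register entry.
    title: str
    owner: Optional[str]
    priority: str
    target_date: Optional[date]


def entry_problems(entry: DebtEntry) -> List[str]:
    """Return the required fields that are missing or invalid for one entry."""
    problems: List[str] = []
    if not entry.owner:
        problems.append("missing owner")
    if entry.priority not in VALID_PRIORITIES:
        problems.append("priority must be P1-P4")
    if entry.target_date is None:
        problems.append("missing target date")
    return problems


def overdue(entries: Iterable[DebtEntry], today: date) -> List[str]:
    """Titles of entries whose target date has already passed."""
    return [e.title for e in entries
            if e.target_date is not None and e.target_date < today]


entry = DebtEntry("Single-AZ cache cluster", owner=None, priority="High", target_date=None)
problems = entry_problems(entry)
```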

Terraform Checks (Excerpt):

waf-rel-100.tf.aws.config-conformance-pack – AWS Config conformance pack for reliability
waf-rel-100.tf.azurerm.policy-assignment-reliability – Azure policy assignment for reliability controls
waf-rel-100.tf.google.org-policy-reliability – GCP org policy for reliability constraints

Evidence: Reliability Debt Register (required); quarterly review minutes (required)