Controls (WAF-REL)
The Reliability pillar is operationalized by 10 measurable controls.
Each control has a unique ID in the format WAF-REL-NNN, a severity rating,
machine-readable YAML checks and a maturity level breakdown.
The YAML source files are located under modules/controls/controls/WAF-REL-*.yml
and can be executed directly by the WAF++ Checker Tool.
Controls Overview
| Control ID | Title | Severity | Category |
|---|---|---|---|
| WAF-REL-010 | SLA & SLO Definition Documented | Critical | Reliability Governance |
| WAF-REL-020 | Health Checks & Readiness Probes Configured | High | Health Monitoring |
| WAF-REL-030 | Multi-AZ High Availability Deployment | High | High Availability |
| WAF-REL-040 | Backup & Recovery Validation | Critical | Backup Recovery |
| WAF-REL-050 | Circuit Breaker & Timeout Configuration | High | Resilience Patterns |
| WAF-REL-060 | Incident Response & Runbook Readiness | High | Incident Response |
| WAF-REL-070 | Disaster Recovery Testing | High | Disaster Recovery |
| WAF-REL-080 | Dependency & Upstream Resilience Management | Medium | Dependency Management |
| WAF-REL-090 | Chaos Engineering & Fault Injection | Medium | Chaos Engineering |
| WAF-REL-100 | Reliability Debt Register & Quarterly Review | Medium | Reliability Governance |
WAF-REL-010 – SLA & SLO Definition Documented
Severity: Critical | Category: Reliability Governance | Automatable: Medium
Every production workload MUST have documented SLOs (availability, latency, error rate). SLOs MUST be tracked on monitoring dashboards, with alerting on the error budget burn rate.
Requirement: SLO document versioned; error budget calculated; burn rate alerts configured; quarterly review evidenced.
Terraform Checks (Excerpt):
- waf-rel-010.tf.aws.cloudwatch-slo-alarm – CloudWatch alarm for SLO availability monitoring
- waf-rel-010.tf.azurerm.monitor-metric-alert-slo – Azure Monitor alert for SLO tracking
- waf-rel-010.tf.google.monitoring-alert-policy-slo – GCP Monitoring alert with notification channels
Evidence: SLO document per workload (required); monitoring dashboard (required)
Best Practice: SLO & SLA Definition →
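The error budget and burn rate that WAF-REL-010 asks for can be sketched in a few lines. This is a minimal illustration, not the WAF++ Checker's logic: the function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from a commonly cited fast-burn alert for a 30-day SLO window (roughly 2% of the budget consumed in one hour), not a value defined by this control.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 consumes the budget exactly over the SLO window;
    higher values exhaust it proportionally faster.
    """
    if total <= 0:
        return 0.0  # no traffic observed, so no budget was burned
    return (failed / total) / error_budget(slo_target)


def should_alert(failed: int, total: int, slo_target: float,
                 threshold: float = 14.4) -> bool:
    """True when the burn rate reaches the (assumed) alerting threshold."""
    return burn_rate(failed, total, slo_target) >= threshold
```

For example, 200 failed requests out of 10,000 against a 99.9% SLO gives a burn rate of about 20, which would trip the assumed threshold; 50 failures gives about 5, which would not.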
WAF-REL-020 – Health Checks & Readiness Probes Configured
Severity: High | Category: Health Monitoring | Automatable: High
All production services MUST expose health check endpoints and configure readiness/liveness probes. Load balancers MUST use health checks with explicit thresholds, intervals and timeouts.
Requirement: No deployment without health check; probes with initialDelaySeconds based on measured startup time; health check paths validate real dependencies.
Terraform Checks (Excerpt):
- waf-rel-020.tf.aws.alb-target-group-health-check – ALB target group health check configured
- waf-rel-020.tf.azurerm.lb-probe – Azure LB probe with request_path
- waf-rel-020.tf.google.compute-health-check – GCP compute health check with HTTP path
Evidence: IaC with probe configuration (required); LB health check (required)
Best Practice: Health Checks & Probes →
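A readiness endpoint that "validates real dependencies" might aggregate per-dependency checks as sketched below. The `readiness` function and the check names are hypothetical; a real service would wire the returned status code into its HTTP framework and give each check its own short timeout.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-dependency detail).

    The service reports ready (200) only if all dependencies respond;
    any failure or exception yields 503 so the load balancer stops routing to it.
    """
    detail: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            detail[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as a failed dependency
            detail[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in detail.values()) else 503
    return status, detail
```

Returning the per-dependency detail alongside the status keeps the probe response useful during incidents, since it shows which dependency caused the 503.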
WAF-REL-030 – Multi-AZ High Availability Deployment
Severity: High | Category: High Availability | Automatable: High
All production workloads MUST be distributed across at least 2 Availability Zones. Single-AZ deployments in production are not permitted without written risk acceptance.
Requirement: Min. 2 AZs for compute; Multi-AZ for all databases; LB across 2+ AZs; AZ failover tested at least quarterly.
Terraform Checks (Excerpt):
- waf-rel-030.tf.aws.rds-multi-az – RDS multi_az = true
- waf-rel-030.tf.aws.autoscaling-multi-az – ASG with min_size >= 2 and multi-AZ subnets
- waf-rel-030.tf.azurerm.db-availability-zone – Azure DB with ZoneRedundant HA
- waf-rel-030.tf.google.sql-availability-type – Cloud SQL REGIONAL availability
Evidence: IaC with Multi-AZ configuration (required); cloud console screenshot (required)
Best Practice: Multi-AZ & High Availability →
WAF-REL-040 – Backup & Recovery Validation
Severity: Critical | Category: Backup Recovery | Automatable: High
All production databases and stateful services MUST have automated backups with a retention period of at least 7 days and point-in-time recovery (PITR). Backups MUST be stored in a separate account or region. Recovery MUST be tested and documented at least quarterly.
Requirement: PITR enabled; cross-account backup; restore test with results documentation; backup failure alerts configured.
Terraform Checks (Excerpt):
- waf-rel-040.tf.aws.rds-backup-retention – RDS backup_retention_period >= 7, deletion_protection = true
- waf-rel-040.tf.aws.s3-versioning – S3 versioning = Enabled
- waf-rel-040.tf.azurerm.postgresql-backup – Azure DB backup_retention_days >= 7, geo_redundant = true
- waf-rel-040.tf.google.sql-backup-config – Cloud SQL backup + PITR enabled
Evidence: IaC backup configuration (required); restore test report (required)
Best Practice: Backup & Recovery →
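The intent of the waf-rel-040.tf.aws.rds-backup-retention check can be illustrated with a small evaluation function. This is a sketch only: the actual YAML check format is not shown here, and the `attrs` dictionary is a hypothetical stand-in for the attributes of one RDS instance, for example as extracted from a parsed Terraform plan.

```python
from typing import Any, Dict, List

MIN_RETENTION_DAYS = 7  # threshold stated by WAF-REL-040


def check_rds_backup(attrs: Dict[str, Any]) -> List[str]:
    """Return a list of findings; an empty list means the resource passes."""
    findings: List[str] = []

    # A missing or null retention period is treated as 0 days.
    retention = attrs.get("backup_retention_period") or 0
    if retention < MIN_RETENTION_DAYS:
        findings.append(
            f"backup_retention_period is {retention}, expected >= {MIN_RETENTION_DAYS}"
        )

    # Only an explicit `true` passes; unset or false is a finding.
    if attrs.get("deletion_protection") is not True:
        findings.append("deletion_protection is not enabled")

    return findings
```

Treating unset attributes as failures is a deliberate choice in this sketch, so that a resource relying on provider defaults is flagged rather than silently passed.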
WAF-REL-050 – Circuit Breaker & Timeout Configuration
Severity: High | Category: Resilience Patterns | Automatable: High
All inter-service and downstream HTTP calls MUST define explicit timeouts. Critical dependencies MUST implement circuit breakers. Retry logic MUST use exponential backoff with jitter.
Requirement: No reliance on default timeouts for external calls; circuit breaker for critical dependencies; at most 3 retry attempts with backoff; bulkhead isolation for different classes.
Terraform Checks (Excerpt):
- waf-rel-050.tf.aws.alb-idle-timeout – ALB idle_timeout explicitly set
- waf-rel-050.tf.azurerm.app-gateway-timeout – App Gateway request_timeout_in_seconds
- waf-rel-050.tf.google.cloud-run-timeout – Cloud Run timeout_seconds configured
Evidence: Terraform/service mesh configuration (required); app config files (required)
Best Practice: Circuit Breaker & Timeouts →
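Retrying with exponential backoff and jitter, capped at 3 attempts as WAF-REL-050 requires, might look like the following sketch. The function names, the base delay of 0.2 s, the 5 s cap and the set of retryable exceptions are all assumptions for illustration; the "full jitter" variant shown here is one of several common jitter strategies. The wrapped call is expected to enforce its own explicit timeout.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 5.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt))."""
    return random.random() * min(cap, base * (2 ** attempt))


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base: float = 0.2,
    cap: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()  # fn must apply its own explicit timeout
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

Restricting `retry_on` to transient error types avoids retrying failures that cannot succeed on a second attempt, and the jitter spreads retries out so that many clients do not hit a recovering dependency at the same instant.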
WAF-REL-060 – Incident Response & Runbook Readiness
Severity: High | Category: Incident Response | Automatable: Medium
All production workloads MUST have a documented incident response plan with severity definitions, escalation paths and on-call rotation. Runbooks MUST be directly linked from alert notifications. Post-incident reviews MUST be completed within 5 business days.
Requirement: 4 severity levels defined; on-call configured; all critical alerts with runbook link; MTTR tracked; post-mortems for SEV1/SEV2 documented.
Terraform Checks (Excerpt):
- waf-rel-060.tf.aws.sns-topic-alarm-action – CloudWatch alarms with alarm_actions and ok_actions
- waf-rel-060.tf.azurerm.action-group-configured – Azure Monitor action group with recipients
- waf-rel-060.tf.google.monitoring-notification-channel – GCP alert policy with notification_channels
Evidence: Incident response plan (required); post-incident review records (required)
Best Practice: Incident Response & Runbooks →
WAF-REL-070 – Disaster Recovery Testing
Severity: High | Category: Disaster Recovery | Automatable: Partial
All critical production workloads MUST conduct DR tests at least twice a year with documented RTO/RPO results. DR plans MUST be updated after significant architecture changes.
Requirement: DR plan with RTO/RPO targets; semi-annual test; results documentation; deviations require a remediation plan within 30 days.
Terraform Checks (Excerpt):
- waf-rel-070.tf.aws.route53-health-check-failover – Route 53 health check failover routing
- waf-rel-070.tf.azurerm.traffic-manager-profile – Azure Traffic Manager with failover routing and monitor
- waf-rel-070.tf.google.dns-managed-zone-failover – GCP DNS with TTL <= 60s for fast failover
Evidence: DR test reports from last 12 months (required); DR plan (required)
Best Practice: Disaster Recovery →
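Documenting DR test results against RTO/RPO targets, and deriving the 30-day remediation deadline for deviations, could be captured in a small record type. The `DrTestResult` class and its field names are hypothetical and only illustrate the comparison that WAF-REL-070 describes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

REMEDIATION_WINDOW = timedelta(days=30)  # deadline stated by WAF-REL-070


@dataclass(frozen=True)
class DrTestResult:
    test_date: date
    rto_target: timedelta    # maximum tolerated time to restore service
    rto_measured: timedelta  # time actually needed during the test
    rpo_target: timedelta    # maximum tolerated data loss window
    rpo_measured: timedelta  # data loss window observed during the test


def deviations(result: DrTestResult) -> List[str]:
    """List every target that the measured values exceeded."""
    found: List[str] = []
    if result.rto_measured > result.rto_target:
        found.append(f"RTO exceeded: {result.rto_measured} > {result.rto_target}")
    if result.rpo_measured > result.rpo_target:
        found.append(f"RPO exceeded: {result.rpo_measured} > {result.rpo_target}")
    return found


def remediation_due(result: DrTestResult) -> Optional[date]:
    """Date by which a remediation plan is due, or None if all targets were met."""
    return result.test_date + REMEDIATION_WINDOW if deviations(result) else None
```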
WAF-REL-080 – Dependency & Upstream Resilience Management
Severity: Medium | Category: Dependency Management | Automatable: Medium
All production workloads MUST maintain an inventoried and classified dependency register. Critical dependencies MUST have circuit breakers. Fallback behavior MUST be defined for all optional dependencies.
Requirement: Dependency register with criticality and SLA; circuit breakers for critical dependencies; fallback for optional dependencies; quarterly review.
Terraform Checks (Excerpt):
- waf-rel-080.tf.aws.vpc-endpoint-dependency-isolation – AWS VPC endpoints for AWS API isolation
- waf-rel-080.tf.azurerm.private-endpoint-dependency – Azure private endpoints for managed services
- waf-rel-080.tf.google.vpc-sc-dependency-isolation – GCP private_ip_google_access = true
Evidence: Dependency register (required); circuit breaker configuration (required)
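The combination of a circuit breaker and a defined fallback that WAF-REL-080 calls for can be sketched as follows. This is a deliberately small, single-threaded illustration: the `CircuitBreaker` class, its thresholds and the cool-down period are assumptions, and production services would more likely rely on a resilience library or a service mesh.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after consecutive failures and serves the fallback while open."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cool-down has elapsed, the next call is let through (half-open).
        return self._clock() - self._opened_at < self.reset_after

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            return fallback()  # fail fast without touching the dependency
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers receive the fallback immediately, which protects both the caller's latency and the struggling dependency.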
WAF-REL-090 – Chaos Engineering & Fault Injection
Severity: Medium | Category: Chaos Engineering | Automatable: Medium
Production and staging workloads MUST conduct quarterly structured chaos experiments with a hypothesis framework, stop conditions and results documentation.
Requirement: Hypothesis-driven tests; stop conditions configured; experiments in staging before production; blast radius limited; results converted into remediation.
Terraform Checks (Excerpt):
- waf-rel-090.tf.aws.fis-experiment-template – AWS FIS with stop_condition configured
- waf-rel-090.tf.azurerm.chaos-studio-experiment – Azure Chaos Studio experiment with identity
- waf-rel-090.tf.google.fault-injection-test – GCP URL map with fault_injection_policy
Evidence: Chaos experiment reports (required); chaos engineering charter (required)
Best Practice: Chaos Engineering →
WAF-REL-100 – Reliability Debt Register & Quarterly Review
Severity: Medium | Category: Reliability Governance | Automatable: Low–Medium
All known reliability risks and deferred improvements MUST be captured in a versioned Reliability Debt Register with owner, severity and target date. Quarterly review is mandatory.
Requirement: Register version-controlled; all entries with owner, P1–P4 priority and target date; quarterly review with minutes; P1 items addressed within one sprint.
Terraform Checks (Excerpt):
- waf-rel-100.tf.aws.config-conformance-pack – AWS Config conformance pack for reliability
- waf-rel-100.tf.azurerm.policy-assignment-reliability – Azure policy assignment for reliability controls
- waf-rel-100.tf.google.org-policy-reliability – GCP org policy for reliability constraints
Evidence: Reliability Debt Register (required); quarterly review minutes (required)
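The entry requirements for the Reliability Debt Register (owner, P1–P4 priority, target date) lend themselves to a simple validation step, for example in a CI job over the version-controlled register. The `DebtEntry` structure below is a hypothetical schema for illustration; the control does not prescribe a file format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

VALID_PRIORITIES = {"P1", "P2", "P3", "P4"}


@dataclass
class DebtEntry:
    title: str
    owner: Optional[str]
    priority: Optional[str]
    target_date: Optional[date]


def validate_entry(entry: DebtEntry) -> List[str]:
    """Return the list of problems; an empty list means the entry is complete."""
    problems: List[str] = []
    if not entry.owner:
        problems.append("missing owner")
    if entry.priority not in VALID_PRIORITIES:
        problems.append("priority must be one of P1-P4")
    if entry.target_date is None:
        problems.append("missing target date")
    return problems
```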