Controls (WAF-OPS)
This page provides a narrative overview of all 10 controls of the Operational Excellence pillar. For complete Terraform examples and detailed implementation guidance, see the Best Practices.
WAF-OPS-010 – CI/CD Pipeline Defined & Automated
Severity: High | Category: Deployment Automation | Automatable: High
Intent
Every production workload must have a defined, versioned CI/CD pipeline. Manual deployments to production are not permitted.
Requirements
- Pipeline definitions MUST be stored in version control
- All deployments to all environments MUST be automated
- Branch protection MUST prevent direct commits to production branches
- Approval gates MUST be configured before production deployments
Terraform Checks (Excerpt)
- waf-ops-010.tf.aws.codepipeline-exists – AWS CodePipeline with Source, Build, and Deploy stages
- waf-ops-010.tf.azurerm.devops-pipeline-exists – Azure DevOps Build Definition with repository reference
- waf-ops-010.tf.google.cloudbuild-trigger-exists – GCP Cloud Build Trigger with build configuration file
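As a rough illustration of what the first check looks for, the sketch below reports which of the Source, Build, and Deploy stages are absent from a list of stage names. It is not the actual check implementation; the helper name `missing_stages` and the case-insensitive matching are assumptions made for this example.

```python
# Stage names taken from the CodePipeline check description above.
REQUIRED_STAGES = ("Source", "Build", "Deploy")


def missing_stages(stage_names):
    """Return the required stages absent from stage_names (hypothetical helper)."""
    # Assumption: stage names are compared case-insensitively.
    present = {name.strip().lower() for name in stage_names}
    return [stage for stage in REQUIRED_STAGES if stage.lower() not in present]
```

For example, `missing_stages(["Source", "Approval", "Deploy"])` returns `["Build"]`, while a pipeline with all three stages yields an empty list.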
Evidence
- Pipeline definition files in version control
- Branch protection configuration
Related Best Practice: Building and Securing the CI/CD Pipeline
WAF-OPS-020 – Infrastructure as Code Enforced
Severity: High | Category: Infrastructure as Code | Automatable: High
Intent
All cloud infrastructure must be defined as code. Manual resource creation via the cloud console is forbidden in production and staging.
Requirements
- All production and staging infrastructure MUST be defined as IaC
- Manual changes MUST be restricted via IAM/SCP policy
- Remote state backend with locking MUST be configured
- All IaC changes MUST go through pull request review
Terraform Checks (Excerpt)
- waf-ops-020.tf.aws.s3-terraform-state-backend – Terraform remote state in S3 backend
- waf-ops-020.tf.aws.s3-state-bucket-versioning – S3 state bucket with versioning enabled
- waf-ops-020.tf.azurerm.terraform-state-storage – Azure Storage Account for state with blob versioning
Evidence
- Terraform repository with remote state configuration
- S3/Azure Storage state backend with versioning
Related Best Practice: Implementing Infrastructure as Code Consistently
WAF-OPS-030 – Observability Stack Configured
Severity: High | Category: Observability | Automatable: High
Intent
Every production workload must emit structured logs, expose metrics, and support distributed tracing. A centralized observability stack must be configured.
Requirements
- All services MUST emit structured JSON logs with trace ID
- Distributed tracing MUST be configured and instrumented (OpenTelemetry, X-Ray)
- RED metrics MUST be exported for every service
- Log retention MUST be at least 30 days (recommended: 90 days application, 365 days audit)
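The first requirement can be sketched with Python's standard `logging` module: a formatter that renders each record as one JSON object carrying a trace ID. This is a minimal sketch, not a prescribed implementation. The field names (`timestamp`, `level`, `logger`, `message`, `trace_id`), the helper `build_logger`, and the sample trace ID are assumptions; in a real service the trace ID would normally come from the tracing library rather than being passed by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is passed per call via `extra`; None if it was not set.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to stream (hypothetical helper)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: write one record to an in-memory stream and parse it back.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("order created", extra={"trace_id": "4bf92f3577b34da6"})
record = json.loads(buffer.getvalue())
```

Because every line is valid JSON with a `trace_id` field, a centralized log store can join log lines to the corresponding trace without parsing free text.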
Terraform Checks (Excerpt)
- waf-ops-030.tf.aws.cloudwatch-log-group-retention – CloudWatch Log Groups with retention policy
- waf-ops-030.tf.aws.xray-tracing-enabled – Lambda Functions with active X-Ray tracing
- waf-ops-030.tf.azurerm.log-analytics-workspace – Azure Log Analytics Workspace with retention
Evidence
- Log group configuration with retention settings
- Tracing configuration in application code
Related Best Practice: Building the Observability Stack
WAF-OPS-040 – Alerting on Symptoms, Not Causes
Severity: High | Category: Alerting | Automatable: High
Intent
All production alerts must be based on user-visible symptoms (error rate, latency, availability), not internal causes (CPU, memory). Every paging alert requires a runbook.
Requirements
- Alerts MUST be symptom-based (error rate, latency, availability)
- Every paging alert MUST have a runbook URL in its description
- SLOs MUST be defined for all critical services
- Alert noise metric MUST be tracked (goal: < 10 pages/week/engineer)
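The alert noise goal above is a simple ratio, and one way to compute it is sketched below. The function names and the choice to average over the whole on-call rotation are assumptions for illustration; the page itself does not define how the metric is measured.

```python
# Goal from the requirement above: fewer than 10 pages per week per engineer.
NOISE_GOAL = 10


def pages_per_engineer_week(page_count, engineers_on_call, weeks):
    """Average number of pages per engineer per week (hypothetical helper)."""
    if engineers_on_call <= 0 or weeks <= 0:
        raise ValueError("engineers_on_call and weeks must be positive")
    return page_count / (engineers_on_call * weeks)


def within_noise_goal(page_count, engineers_on_call, weeks):
    """True if the observed rate is strictly below the goal."""
    return pages_per_engineer_week(page_count, engineers_on_call, weeks) < NOISE_GOAL
```

With 72 pages over two weeks and four engineers on call, the rate is 9.0 and the goal is met; 80 pages over the same period gives exactly 10.0, which does not satisfy the strict `< 10` target.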
Terraform Checks (Excerpt)
- waf-ops-040.tf.aws.cloudwatch-alarm-symptom-based – CloudWatch Alarm with description and alarm action
- waf-ops-040.tf.azurerm.monitor-alert-symptom – Azure Monitor Alert with description and action group
- waf-ops-040.tf.google.monitoring-alert-symptom – GCP Monitoring Alert Policy with notification channels
Evidence
- Alert rule definitions with symptom-based metrics
- SLO definitions for critical services
Related Best Practice: Alerting on Symptoms Instead of Causes
WAF-OPS-050 – Change Management & Deployment Risk Assessment
Severity: Medium | Category: Change Management | Automatable: Medium
Intent
All production changes must go through a defined change management process with risk assessment, approval workflow, and post-deployment verification.
Requirements
- Change categories MUST be defined (Standard, Normal, Emergency)
- High-risk changes MUST require multi-person approval
- Deployment freeze policies MUST be configured for critical periods
- Change records MUST be linked to deployment artifacts
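The approval and freeze requirements above can be combined into a single deployment gate, sketched below. This is only one possible reading: treating "multi-person" as at least two distinct approvers, requiring one approver for other changes, and the example freeze window are all assumptions, and the sketch does not model the Standard/Normal/Emergency categories.

```python
from datetime import date


def in_freeze(day, freeze_windows):
    """True if day falls inside any (start, end) window, bounds inclusive."""
    return any(start <= day <= end for start, end in freeze_windows)


def may_deploy(day, high_risk, approvers, freeze_windows):
    """Gate a production change on freeze windows and approval count."""
    if in_freeze(day, freeze_windows):
        return False
    # Assumption: "multi-person" means at least two distinct approvers;
    # all other changes need at least one.
    required = 2 if high_risk else 1
    return len(set(approvers)) >= required


# Hypothetical year-end freeze window used for illustration.
EXAMPLE_FREEZE = [(date(2024, 12, 20), date(2025, 1, 2))]
```

Counting distinct approvers with `set` means the same person approving twice does not satisfy the multi-person rule.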
Terraform Checks (Excerpt)
- waf-ops-050.tf.aws.codepipeline-manual-approval – CodePipeline with manual approval stage before production
- waf-ops-050.tf.azurerm.devops-environment-approval – Azure DevOps Environment with approval checks
Evidence
- Change management policy
- Branch protection and approval configuration
Related Best Practice: Building and Securing the CI/CD Pipeline
WAF-OPS-060 – Runbook & Operational Documentation Coverage
Severity: Medium | Category: Documentation | Automatable: Low
Intent
Every production workload must have runbooks for all known failure scenarios. Runbooks must be versioned, regularly reviewed, and linked to alerts.
Requirements
- All paging alerts MUST be linked to runbooks
- Runbooks MUST be stored in version control and reviewed regularly (quarterly)
- Runbook coverage metric MUST be tracked (goal: >= 90% for critical services)
- Runbooks MUST be accessible to on-call engineers without authentication barriers
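One plausible way to compute the runbook coverage metric is the share of paging alerts that carry a runbook link. The sketch below assumes each alert is a dictionary with a boolean `paging` key and an optional `runbook_url` key; both key names and the decision to report full coverage when there are no paging alerts are assumptions for this example.

```python
# Goal from the requirement above: at least 90% for critical services.
COVERAGE_GOAL = 0.90


def runbook_coverage(alerts):
    """Fraction of paging alerts that have a non-empty runbook_url."""
    paging = [alert for alert in alerts if alert.get("paging")]
    if not paging:
        # Assumption: with no paging alerts there is nothing left uncovered.
        return 1.0
    covered = sum(1 for alert in paging if alert.get("runbook_url"))
    return covered / len(paging)


def meets_coverage_goal(alerts):
    """True if coverage reaches the 90% goal."""
    return runbook_coverage(alerts) >= COVERAGE_GOAL
```

Non-paging alerts are ignored, so adding ticket-only alerts without runbooks does not lower the metric.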
Terraform Checks (Excerpt)
- waf-ops-060.tf.aws.cloudwatch-alarm-runbook-annotation – CloudWatch Alarms with runbook URL in description
- waf-ops-060.tf.aws.prometheus-alert-runbook-label – Prometheus Alert Rules with runbook_url annotation
Evidence
- Runbook directory with version history
- Runbook coverage report
Related Best Practice: Maintaining Runbooks and Operational Documentation
WAF-OPS-070 – Post-Incident Review Process
Severity: Medium | Category: Incident Learning | Automatable: Low
Intent
Every production incident with user impact or SLO violation must trigger a blameless postmortem within 5 business days. Action items are tracked and resolved.
Requirements
- All SEV-1/P1 incidents and SLO violations MUST trigger a postmortem
- Postmortems MUST be blameless and produce action items with owners
- Action items MUST be tracked in JIRA or equivalent
- Postmortems MUST be completed within 5 business days
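The five-business-day deadline can be derived from the incident date as sketched below. The sketch assumes that business days are Monday through Friday, that public holidays are ignored, and that the incident day itself is not counted; none of these details are specified on this page.

```python
from datetime import date, timedelta

POSTMORTEM_DEADLINE_DAYS = 5  # from the requirement above


def add_business_days(start, days):
    """Return the date that lies `days` business days after start."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        # weekday() is 0 for Monday through 4 for Friday.
        if current.weekday() < 5:
            remaining -= 1
    return current


def postmortem_due(incident_date):
    """Latest completion date for the postmortem (hypothetical helper)."""
    return add_business_days(incident_date, POSTMORTEM_DEADLINE_DAYS)
```

For an incident on Thursday, 7 March 2024, `postmortem_due` returns 14 March 2024, because the intervening weekend is skipped.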
Terraform Checks (Excerpt)
- waf-ops-070.tf.aws.incident-management-sns-topic – SNS Topic for incident notifications
- waf-ops-070.tf.azurerm.action-group-incident – Azure Monitor Action Group with configured recipients
Evidence
- Post-Incident Review policy
- Postmortem archive (last 3 months)
Related Best Practice: Blameless Postmortems and Continuous Learning
WAF-OPS-080 – Feature Flag & Safe Deployment Patterns
Severity: Medium | Category: Deployment Safety | Automatable: High
Intent
Production deployments must use Progressive Delivery patterns (Canary, Blue/Green, Feature Flags). Rollback must be possible within 5 minutes without a new deployment.
Requirements
- All production deployments MUST use Canary, Blue/Green, or Feature Flags
- Rollback MUST be possible in < 5 minutes without a new deployment
- New features MUST be deployed behind feature flags
- Auto-rollback MUST be configured upon error rate increase
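The auto-rollback requirement implies some comparison between the error rate of the new version and that of the stable one. A minimal sketch of such a decision follows. The one-percentage-point threshold, the function names, and the `(errors, requests)` input shape are assumptions; the page does not specify how large an increase should trigger a rollback.

```python
def error_rate(errors, requests):
    """Errors divided by requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def should_roll_back(baseline, canary, max_increase=0.01):
    """Decide whether the canary should be rolled back.

    baseline and canary are (errors, requests) tuples.
    max_increase is a hypothetical threshold: the largest tolerated
    absolute increase in error rate over the baseline.
    """
    return error_rate(*canary) - error_rate(*baseline) > max_increase
```

With a baseline of 10 errors in 10,000 requests and a canary at 30 errors in 1,000 requests, the increase is about 2.9 percentage points, so the function returns `True`.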
Terraform Checks (Excerpt)
- waf-ops-080.tf.aws.codedeploy-deployment-config – CodeDeploy with Canary or Linear configuration (not AllAtOnce)
- waf-ops-080.tf.aws.appconfig-feature-flag – AWS AppConfig Application for feature flags
- waf-ops-080.tf.azurerm.traffic-manager-canary – Azure Traffic Manager with weighted routing for canary
Evidence
- Load balancer / deployment configuration with Progressive Delivery
- Feature flag service configuration
Related Best Practice: Safe Deployments
WAF-OPS-090 – Configuration Drift Detection & Remediation
Severity: High | Category: Configuration Management | Automatable: High
Intent
All production infrastructure must be continuously compared against its IaC definition. Drift is automatically detected, reported, and remediated within defined SLAs.
Requirements
- Automatic drift detection MUST run at least daily
- Drift alerts MUST notify the responsible team within 1 hour
- Emergency console changes MUST be transferred into IaC within 24 hours
- Drift SLAs MUST be defined (critical: 4h, major: 24h, minor: 1 sprint)
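The drift SLAs listed above translate directly into remediation deadlines. The sketch below maps each severity to its SLA and checks whether a finding is overdue. The two-week sprint length and the function names are assumptions; the page says only "1 sprint" for minor drift.

```python
from datetime import datetime, timedelta

# Assumption: a sprint lasts two weeks.
SPRINT_LENGTH = timedelta(days=14)

# SLA values from the requirement above.
DRIFT_SLA = {
    "critical": timedelta(hours=4),
    "major": timedelta(hours=24),
    "minor": SPRINT_LENGTH,
}


def remediation_deadline(detected_at, severity):
    """Latest time by which drift of this severity must be remediated."""
    return detected_at + DRIFT_SLA[severity]


def is_overdue(detected_at, severity, now):
    """True once `now` is past the remediation deadline."""
    return now > remediation_deadline(detected_at, severity)
```

An unknown severity raises `KeyError`, which surfaces unclassified drift instead of silently assigning it a default SLA.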
Terraform Checks (Excerpt)
- waf-ops-090.tf.aws.config-rule-enabled – AWS Config Recorder with compliance rules
- waf-ops-090.tf.aws.cloudtrail-enabled – CloudTrail as multi-region trail with log validation
- waf-ops-090.tf.azurerm.policy-assignment-drift – Azure Policy Initiative Assignment on subscription
Evidence
- Drift detection configuration (EventBridge Schedule, AWS Config)
- Drift SLA policy
Related Best Practice: Implementing Infrastructure as Code Consistently
WAF-OPS-100 – Operational Debt Register & Review
Severity: Medium | Category: Operational Governance | Automatable: Low
Intent
All known Operational Debt items must be documented in a version-controlled register, reviewed quarterly, and scheduled for reduction with dedicated sprint capacity.
Requirements
- Operational Debt Register MUST be stored in version control
- Every entry MUST have severity, toil hours, owner, and target date
- Quarterly review MUST take place with prioritization and capacity allocation
- Sprint capacity for debt reduction MUST be explicitly allocated (at least 10%)
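Two of the requirements above lend themselves to simple automated checks: that every register entry carries the mandatory fields, and that at least 10% of sprint capacity goes to debt reduction. The sketch below assumes entries are dictionaries with the keys `severity`, `toil_hours`, `owner`, and `target_date`, and that capacity is measured in story points; the key names, the unit, and the function names are illustrative assumptions.

```python
# Mandatory fields from the requirement above (key names are assumptions).
REQUIRED_FIELDS = ("severity", "toil_hours", "owner", "target_date")


def missing_fields(entry):
    """Return the required fields that are absent or empty in a register entry."""
    # A value of 0 (for example zero toil hours) counts as present.
    return [field for field in REQUIRED_FIELDS if entry.get(field) in (None, "")]


def debt_capacity_ok(debt_points, sprint_points):
    """True if at least 10% of sprint capacity is allocated to debt reduction."""
    if sprint_points <= 0:
        raise ValueError("sprint_points must be positive")
    # Integer comparison avoids floating-point rounding at exactly 10%.
    return debt_points * 10 >= sprint_points
```

For example, an entry without an owner yields `["owner"]`, and allocating 4 of 40 story points meets the 10% minimum exactly.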
Terraform Checks (Excerpt)
- waf-ops-100.tf.aws.eventbridge-ops-review-schedule – EventBridge Scheduled Rule for quarterly review reminders
- waf-ops-100.tf.aws.ssm-automation-runbook – SSM Automation Documents for repetitive operational tasks
Evidence
- Operational Debt Register (version-controlled)
- Quarterly review minutes
Related Best Practice: Maintaining Runbooks and Operational Documentation