Controls (WAF-OPS)

This page provides a narrative overview of all 10 controls of the Operational Excellence pillar. For complete Terraform examples and detailed implementation guidance, see the Best Practices.

WAF-OPS-010 – CI/CD Pipeline Defined & Automated

Severity: High | Category: Deployment Automation | Automatable: High

Intent

Every production workload must have a defined, versioned CI/CD pipeline. Manual deployments to production are not permitted.

Requirements

Pipeline definitions MUST be stored in version control
All deployments to all environments MUST be automated
Branch protection MUST prevent direct commits to production branches
Approval gates MUST be configured before production deployments

Terraform Checks (Excerpt)

waf-ops-010.tf.aws.codepipeline-exists – AWS CodePipeline with Source, Build, and Deploy stages
waf-ops-010.tf.azurerm.devops-pipeline-exists – Azure DevOps Build Definition with repository reference
waf-ops-010.tf.google.cloudbuild-trigger-exists – GCP Cloud Build Trigger with build configuration file

Evidence

Pipeline definition files in version control
Branch protection configuration

Related Best Practice: Building and Securing the CI/CD Pipeline

WAF-OPS-020 – Infrastructure as Code Enforced

Severity: High | Category: Infrastructure as Code | Automatable: High

Intent

All cloud infrastructure must be defined as code. Manual resource creation via the cloud console is forbidden in production and staging.

Requirements

All production and staging infrastructure MUST be defined as IaC
Manual changes MUST be restricted via IAM/SCP policy
Remote state backend with locking MUST be configured
All IaC changes MUST go through pull request review

Terraform Checks (Excerpt)

waf-ops-020.tf.aws.s3-terraform-state-backend – Terraform remote state in S3 backend
waf-ops-020.tf.aws.s3-state-bucket-versioning – S3 state bucket with versioning enabled
waf-ops-020.tf.azurerm.terraform-state-storage – Azure Storage Account for state with blob versioning

Evidence

Terraform repository with remote state configuration
S3/Azure Storage state backend with versioning

Related Best Practice: Implementing Infrastructure as Code Consistently

WAF-OPS-030 – Observability Stack Configured

Severity: High | Category: Observability | Automatable: High

Intent

Every production workload must emit structured logs, expose metrics, and support distributed tracing. A centralized observability stack must be configured.

Requirements

All services MUST emit structured JSON logs with trace ID
Distributed tracing MUST be configured and instrumented (OpenTelemetry, X-Ray)
RED metrics (rate, errors, duration) MUST be exported for every service
Log retention MUST be at least 30 days (recommended: 90 days application, 365 days audit)
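
The structured-logging requirement above can be illustrated with a minimal sketch using only the Python standard library. The class name JsonFormatter, the helper build_logger, the service name "checkout", and the JSON field names are assumptions chosen for illustration; they are not prescribed by this control. A real service would normally take the trace ID from its tracing library (for example OpenTelemetry) rather than pass it by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (field names are illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is attached per call via `extra=`; None if the caller omitted it.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: write one log line to an in-memory buffer and parse it back.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("order placed", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
entry = json.loads(buffer.getvalue())
```

Because every line is a self-contained JSON object carrying the trace ID, a log backend can join log entries with the matching trace without parsing free-form text.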

Terraform Checks (Excerpt)

waf-ops-030.tf.aws.cloudwatch-log-group-retention – CloudWatch Log Groups with retention policy
waf-ops-030.tf.aws.xray-tracing-enabled – Lambda Functions with active X-Ray tracing
waf-ops-030.tf.azurerm.log-analytics-workspace – Azure Log Analytics Workspace with retention

Evidence

Log group configuration with retention settings
Tracing configuration in application code

Related Best Practice: Building the Observability Stack

WAF-OPS-040 – Alerting on Symptoms, Not Causes

Severity: High | Category: Alerting | Automatable: High

Intent

All production alerts must be based on user-visible symptoms (error rate, latency, availability), not internal causes (CPU, memory). Every paging alert requires a runbook.

Requirements

Alerts MUST be symptom-based (error rate, latency, availability)
Every paging alert MUST have a runbook URL in its description
SLOs MUST be defined for all critical services
Alert noise metric MUST be tracked (goal: < 10 pages/week/engineer)
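
As a rough sketch of how the alert noise metric could be computed, the function below normalizes a page count to pages per week per engineer and compares it with the goal stated above. The function names and the interpretation of "engineer" as the number of people in the on-call rotation are assumptions, not part of the control.

```python
NOISE_GOAL = 10.0  # goal from the requirement: fewer than 10 pages/week/engineer


def pages_per_week_per_engineer(total_pages: int, days: int, engineers: int) -> float:
    """Normalize a raw page count over an observation window."""
    if days <= 0 or engineers <= 0:
        raise ValueError("days and engineers must be positive")
    weeks = days / 7
    return total_pages / weeks / engineers


def within_noise_goal(total_pages: int, days: int, engineers: int) -> bool:
    """True when the normalized page rate is strictly below the goal."""
    return pages_per_week_per_engineer(total_pages, days, engineers) < NOISE_GOAL


# Usage: 84 pages over 28 days, shared by a rotation of 3 engineers.
rate = pages_per_week_per_engineer(84, 28, 3)  # 7.0
ok = within_noise_goal(84, 28, 3)              # True
```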

Terraform Checks (Excerpt)

waf-ops-040.tf.aws.cloudwatch-alarm-symptom-based – CloudWatch Alarm with description and alarm action
waf-ops-040.tf.azurerm.monitor-alert-symptom – Azure Monitor Alert with description and action group
waf-ops-040.tf.google.monitoring-alert-symptom – GCP Monitoring Alert Policy with notification channels

Evidence

Alert rule definitions with symptom-based metrics
SLO definitions for critical services

Related Best Practice: Alerting on Symptoms Instead of Causes

WAF-OPS-050 – Change Management & Deployment Risk Assessment

Severity: Medium | Category: Change Management | Automatable: Medium

Intent

All production changes must go through a defined change management process with risk assessment, approval workflow, and post-deployment verification.

Requirements

Change categories MUST be defined (Standard, Normal, Emergency)
High-risk changes MUST require multi-person approval
Deployment freeze policies MUST be configured for critical periods
Change records MUST be linked to deployment artifacts

Terraform Checks (Excerpt)

waf-ops-050.tf.aws.codepipeline-manual-approval – CodePipeline with manual approval stage before production
waf-ops-050.tf.azurerm.devops-environment-approval – Azure DevOps Environment with approval checks

Evidence

Change management policy
Branch protection and approval configuration

Related Best Practice: Building and Securing the CI/CD Pipeline

WAF-OPS-060 – Runbook & Operational Documentation Coverage

Severity: Medium | Category: Documentation | Automatable: Low

Intent

Every production workload must have runbooks for all known failure scenarios. Runbooks must be versioned, regularly reviewed, and linked to alerts.

Requirements

All paging alerts MUST be linked to runbooks
Runbooks MUST be stored in version control and reviewed regularly (quarterly)
Runbook coverage metric MUST be tracked (goal: >= 90% for critical services)
Runbooks MUST be accessible to on-call engineers without authentication barriers
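
One possible way to track the runbook coverage metric is sketched below: coverage is the share of paging alerts that carry a runbook link. The PagingAlert type, its fields, and the decision to treat an empty alert list as fully covered are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

COVERAGE_GOAL = 0.90  # goal from the requirement: >= 90% for critical services


@dataclass(frozen=True)
class PagingAlert:
    name: str
    runbook_url: Optional[str]  # None or empty string means no runbook is linked


def runbook_coverage(alerts: List[PagingAlert]) -> float:
    """Fraction of paging alerts that link to a runbook."""
    if not alerts:
        return 1.0  # assumption: no alerts means nothing is uncovered
    covered = sum(1 for alert in alerts if alert.runbook_url)
    return covered / len(alerts)


def alerts_without_runbook(alerts: List[PagingAlert]) -> List[str]:
    """Names of the alerts that still need a runbook."""
    return [alert.name for alert in alerts if not alert.runbook_url]


# Usage: ten alerts, one of them without a runbook (URLs are placeholders).
alerts = [PagingAlert(f"alert-{i}", f"https://runbooks.example/alert-{i}") for i in range(9)]
alerts.append(PagingAlert("alert-9", None))
coverage = runbook_coverage(alerts)       # 0.9
meets_goal = coverage >= COVERAGE_GOAL    # True
```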

Terraform Checks (Excerpt)

waf-ops-060.tf.aws.cloudwatch-alarm-runbook-annotation – CloudWatch Alarms with runbook URL in description
waf-ops-060.tf.aws.prometheus-alert-runbook-label – Prometheus Alert Rules with runbook_url annotation

Evidence

Runbook directory with version history
Runbook coverage report

Related Best Practice: Maintaining Runbooks and Operational Documentation

WAF-OPS-070 – Post-Incident Review Process

Severity: Medium | Category: Incident Learning | Automatable: Low

Intent

Every production incident with user impact or SLO violation must trigger a blameless postmortem within 5 business days. Action items must be tracked until they are resolved.

Requirements

All SEV-1/P1 incidents and SLO violations MUST trigger a postmortem
Postmortems MUST be blameless and produce action items with owners
Action items MUST be tracked in JIRA or equivalent
Postmortems MUST be completed within 5 business days
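
The 5-business-day deadline can be computed as in the sketch below. It assumes that counting starts on the day after the incident, that business days are Monday to Friday, and that public holidays are ignored; the control itself does not define these details, and the function names are illustrative.

```python
from datetime import date, timedelta

POSTMORTEM_BUSINESS_DAYS = 5  # from the requirement


def add_business_days(start: date, days: int) -> date:
    """Advance by the given number of Monday-to-Friday days (holidays ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def postmortem_due(incident_day: date) -> date:
    """Latest completion date for the postmortem under the stated assumptions."""
    return add_business_days(incident_day, POSTMORTEM_BUSINESS_DAYS)


# Usage: an incident on Thursday 2024-03-07 is due on Thursday 2024-03-14.
due = postmortem_due(date(2024, 3, 7))
```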

Terraform Checks (Excerpt)

waf-ops-070.tf.aws.incident-management-sns-topic – SNS Topic for incident notifications
waf-ops-070.tf.azurerm.action-group-incident – Azure Monitor Action Group with configured recipients

Evidence

Post-Incident Review policy
Postmortem archive (last 3 months)

Related Best Practice: Blameless Postmortems and Continuous Learning

WAF-OPS-080 – Feature Flag & Safe Deployment Patterns

Severity: Medium | Category: Deployment Safety | Automatable: High

Intent

Production deployments must use Progressive Delivery patterns (Canary, Blue/Green, Feature Flags). Rollback must be possible within 5 minutes without a new deployment.

Requirements

All production deployments MUST use Canary, Blue/Green, or Feature Flags
Rollback MUST be possible in < 5 minutes without a new deployment
New features MUST be deployed behind feature flags
Automatic rollback MUST be configured to trigger when the error rate increases
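
The rollback-on-error-rate requirement leaves the exact trigger open. The sketch below shows one possible decision rule that compares a canary's error rate with the baseline. The threshold of one percentage point, the minimum sample size of 100 requests, and all names are assumptions for illustration; in practice this logic usually lives in the deployment tool rather than in application code.

```python
def error_rate(errors: int, requests: int) -> float:
    """Errors divided by requests; 0.0 when there was no traffic."""
    return errors / requests if requests > 0 else 0.0


def should_rollback(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_increase: float = 0.01,  # assumed tolerance: +1 percentage point
    min_requests: int = 100,     # assumed minimum canary sample size
) -> bool:
    """True when the canary error rate exceeds the baseline by more than max_increase."""
    if canary_requests < min_requests:
        return False  # not enough canary traffic to judge yet
    increase = error_rate(canary_errors, canary_requests) - error_rate(
        baseline_errors, baseline_requests
    )
    return increase > max_increase


# Usage: baseline at 0.5% errors, canary at 3% errors.
rollback = should_rollback(50, 10_000, 30, 1_000)  # True
```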

Terraform Checks (Excerpt)

waf-ops-080.tf.aws.codedeploy-deployment-config – CodeDeploy with Canary or Linear configuration (not AllAtOnce)
waf-ops-080.tf.aws.appconfig-feature-flag – AWS AppConfig Application for feature flags
waf-ops-080.tf.azurerm.traffic-manager-canary – Azure Traffic Manager with weighted routing for canary

Evidence

Load balancer / deployment configuration with Progressive Delivery
Feature flag service configuration

Related Best Practice: Safe Deployments

WAF-OPS-090 – Configuration Drift Detection & Remediation

Severity: High | Category: Configuration Management | Automatable: High

Intent

All production infrastructure must be continuously compared against its IaC definition. Drift is automatically detected, reported, and remediated within defined SLAs.

Requirements

Automatic drift detection MUST run at least daily
Drift alerts MUST notify the responsible team within 1 hour
Emergency console changes MUST be codified in IaC within 24 hours
Drift SLAs MUST be defined (critical: 4h, major: 24h, minor: 1 sprint)
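
The drift SLAs listed above translate into remediation deadlines as sketched below. The sprint length of 14 days is an assumption, since the control only says "1 sprint"; the dictionary and function names are likewise illustrative.

```python
from datetime import datetime, timedelta

SPRINT_LENGTH = timedelta(days=14)  # assumption: two-week sprints

# Severity-to-SLA mapping taken from the requirement (critical: 4h, major: 24h, minor: 1 sprint).
DRIFT_SLA = {
    "critical": timedelta(hours=4),
    "major": timedelta(hours=24),
    "minor": SPRINT_LENGTH,
}


def remediation_deadline(detected_at: datetime, severity: str) -> datetime:
    """Point in time by which the drift must be remediated."""
    return detected_at + DRIFT_SLA[severity]


def sla_breached(detected_at: datetime, severity: str, now: datetime) -> bool:
    """True when the remediation deadline has already passed."""
    return now > remediation_deadline(detected_at, severity)


# Usage: critical drift detected at 09:00 must be remediated by 13:00 the same day.
detected = datetime(2024, 3, 7, 9, 0)
deadline = remediation_deadline(detected, "critical")
```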

Terraform Checks (Excerpt)

waf-ops-090.tf.aws.config-rule-enabled – AWS Config Recorder with compliance rules
waf-ops-090.tf.aws.cloudtrail-enabled – CloudTrail as multi-region trail with log validation
waf-ops-090.tf.azurerm.policy-assignment-drift – Azure Policy Initiative Assignment on subscription

Evidence

Drift detection configuration (EventBridge Schedule, AWS Config)
Drift SLA policy

Related Best Practice: Implementing Infrastructure as Code Consistently

WAF-OPS-100 – Operational Debt Register & Review

Severity: Medium | Category: Operational Governance | Automatable: Low

Intent

All known Operational Debt items must be documented in a version-controlled register, reviewed quarterly, and scheduled for reduction with dedicated sprint capacity.

Requirements

Operational Debt Register MUST be stored in version control
Every entry MUST have severity, toil hours, owner, and target date
Quarterly review MUST take place with prioritization and capacity allocation
Sprint capacity for debt reduction MUST be explicitly allocated (at least 10%)
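
A register check along the lines of the requirements above might look like the following sketch. The field names (severity, toil_hours, owner, target_date), the dictionary representation of a register entry, and the sample values are assumptions; the control does not prescribe a file format.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("severity", "toil_hours", "owner", "target_date")
MIN_DEBT_SHARE = 0.10  # from the requirement: at least 10% of sprint capacity


def missing_fields(entry: Dict[str, object]) -> List[str]:
    """Required fields that are absent or empty in a register entry."""
    return [name for name in REQUIRED_FIELDS if entry.get(name) in (None, "")]


def min_debt_capacity(sprint_capacity_hours: float) -> float:
    """Minimum hours per sprint to reserve for debt reduction."""
    return sprint_capacity_hours * MIN_DEBT_SHARE


# Usage: an entry without an owner fails the check (sample data is made up).
entry = {
    "severity": "medium",
    "toil_hours": 6,
    "owner": "",
    "target_date": "2024-06-30",
}
problems = missing_fields(entry)    # ["owner"]
reserved = min_debt_capacity(400)   # about 40 hours of a 400-hour sprint
```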

Terraform Checks (Excerpt)

waf-ops-100.tf.aws.eventbridge-ops-review-schedule – EventBridge Scheduled Rule for quarterly review reminders
waf-ops-100.tf.aws.ssm-automation-runbook – SSM Automation Documents for repetitive operational tasks

Evidence

Operational Debt Register (version-controlled)
Quarterly review minutes

Related Best Practice: Maintaining Runbooks and Operational Documentation