Controls (WAF-OPS)

This page provides a narrative overview of all 10 controls of the Operational Excellence pillar. For complete Terraform examples and detailed implementation guidance, see the Best Practices.

WAF-OPS-010 – CI/CD Pipeline Defined & Automated

Severity: High | Category: Deployment Automation | Automatable: High

Intent

Every production workload must have a defined, versioned CI/CD pipeline. Manual deployments to production are not permitted.

Requirements

  • Pipeline definitions MUST be stored in version control

  • All deployments to all environments MUST be automated

  • Branch protection MUST prevent direct commits to production branches

  • Approval gates MUST be configured before production deployments
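
The first requirement can be spot-checked against a repository checkout. The following is a minimal sketch, not part of the WAF++ tooling: the function name `find_pipeline_definitions` and the pattern list are assumptions, chosen from the default file names that common CI systems look for.

```python
from pathlib import Path

# Assumption: default pipeline file locations for common CI systems.
# Extend or trim this list to match the systems actually in use.
PIPELINE_PATTERNS = (
    ".github/workflows/*.yml",
    ".github/workflows/*.yaml",
    ".gitlab-ci.yml",
    "azure-pipelines.yml",
    "cloudbuild.yaml",
    "buildspec.yml",
    "Jenkinsfile",
)


def find_pipeline_definitions(repo_root: Path) -> list[Path]:
    """Return pipeline definition files found in a repository checkout."""
    found: list[Path] = []
    for pattern in PIPELINE_PATTERNS:
        found.extend(sorted(repo_root.glob(pattern)))
    return found
```

An empty result for a production workload's repository would suggest that the pipeline is defined outside version control or does not exist.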

Terraform Checks (Excerpt)

  • waf-ops-010.tf.aws.codepipeline-exists – AWS CodePipeline with Source, Build and Deploy stage

  • waf-ops-010.tf.azurerm.devops-pipeline-exists – Azure DevOps Build Definition with repository reference

  • waf-ops-010.tf.google.cloudbuild-trigger-exists – GCP Cloud Build Trigger with build configuration file

Evidence

  • Pipeline definition files in version control

  • Branch protection configuration


WAF-OPS-020 – Infrastructure as Code Enforced

Severity: High | Category: Infrastructure as Code | Automatable: High

Intent

All cloud infrastructure must be defined as code. Manual resource creation via the cloud console is forbidden in production and staging.

Requirements

  • All production and staging infrastructure MUST be defined as IaC

  • Manual changes MUST be restricted via IAM/SCP policy

  • Remote state backend with locking MUST be configured

  • All IaC changes MUST go through pull request review

Terraform Checks (Excerpt)

  • waf-ops-020.tf.aws.s3-terraform-state-backend – Terraform remote state in S3 backend

  • waf-ops-020.tf.aws.s3-state-bucket-versioning – S3 state bucket with versioning enabled

  • waf-ops-020.tf.azurerm.terraform-state-storage – Azure Storage Account for state with blob versioning

Evidence

  • Terraform repository with remote state configuration

  • S3/Azure Storage state backend with versioning


WAF-OPS-030 – Observability Stack Configured

Severity: High | Category: Observability | Automatable: High

Intent

Every production workload must emit structured logs, expose metrics, and support distributed tracing. A centralized observability stack must be configured.

Requirements

  • All services MUST emit structured JSON logs with trace ID

  • Distributed tracing MUST be configured and instrumented (OpenTelemetry, X-Ray)

  • RED metrics MUST be exported for every service

  • Log retention MUST be at least 30 days (recommended: 90 days application, 365 days audit)
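
To illustrate the structured-logging requirement, here is a minimal sketch using only the Python standard library. The field names (`timestamp`, `level`, `logger`, `message`, `trace_id`) and the helper `build_logger` are assumptions for illustration; in practice the trace ID would typically come from the tracing SDK rather than being passed by hand.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is expected via `extra=`; None marks an untraced record.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("checkout").info("order placed", extra={"trace_id": "abc123"})` emits one JSON object per line, which a centralized log backend can index by `trace_id`.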

Terraform Checks (Excerpt)

  • waf-ops-030.tf.aws.cloudwatch-log-group-retention – CloudWatch Log Groups with retention policy

  • waf-ops-030.tf.aws.xray-tracing-enabled – Lambda Functions with active X-Ray tracing

  • waf-ops-030.tf.azurerm.log-analytics-workspace – Azure Log Analytics Workspace with retention

Evidence

  • Log group configuration with retention settings

  • Tracing configuration in application code

Related Best Practice: Building the Observability Stack


WAF-OPS-040 – Alerting on Symptoms, Not Causes

Severity: High | Category: Alerting | Automatable: High

Intent

All production alerts must be based on user-visible symptoms (error rate, latency, availability), not internal causes (CPU, memory). Every paging alert requires a runbook.

Requirements

  • Alerts MUST be symptom-based (error rate, latency, availability)

  • Every paging alert MUST have a runbook URL in its description

  • SLOs MUST be defined for all critical services

  • Alert noise metric MUST be tracked (goal: < 10 pages/week/engineer)
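
The requirements above lend themselves to a simple lint over alert definitions. The sketch below assumes a hypothetical `AlertRule` shape and a hypothetical set of metric names; real alert rules would need to be mapped into this form first.

```python
import re
from dataclasses import dataclass

# Assumption: metric identifiers that count as user-visible symptoms.
SYMPTOM_METRICS = {"error_rate", "latency", "availability"}
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class AlertRule:
    name: str
    metric: str
    description: str
    paging: bool


def alert_findings(rule: AlertRule) -> list[str]:
    """Return violations of the symptom-based and runbook-URL requirements."""
    findings = []
    if rule.metric not in SYMPTOM_METRICS:
        findings.append(f"{rule.name}: metric '{rule.metric}' is not symptom-based")
    if rule.paging and not URL_PATTERN.search(rule.description):
        findings.append(f"{rule.name}: paging alert has no runbook URL in description")
    return findings


def pages_per_engineer_per_week(total_pages: int, weeks: int, engineers: int) -> float:
    """Alert noise metric; the stated goal is fewer than 10."""
    return total_pages / (weeks * engineers)
```

For example, 120 pages over 4 weeks across 5 on-call engineers gives 6 pages per week per engineer, which is within the goal.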

Terraform Checks (Excerpt)

  • waf-ops-040.tf.aws.cloudwatch-alarm-symptom-based – CloudWatch Alarm with description and alarm action

  • waf-ops-040.tf.azurerm.monitor-alert-symptom – Azure Monitor Alert with description and action group

  • waf-ops-040.tf.google.monitoring-alert-symptom – GCP Monitoring Alert Policy with notification channels

Evidence

  • Alert rule definitions with symptom-based metrics

  • SLO definitions for critical services


WAF-OPS-050 – Change Management & Deployment Risk Assessment

Severity: Medium | Category: Change Management | Automatable: Medium

Intent

All production changes must go through a defined change management process with risk assessment, approval workflow, and post-deployment verification.

Requirements

  • Change categories MUST be defined (Standard, Normal, Emergency)

  • High-risk changes MUST require multi-person approval

  • Deployment freeze policies MUST be configured for critical periods

  • Change records MUST be linked to deployment artifacts
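
As a rough sketch of how these requirements could be evaluated before a deployment, the code below models a change request and returns the reasons it should be blocked. All names are hypothetical. Two assumptions go beyond the text: "multi-person approval" is read as at least two distinct approvers, and Emergency changes are exempt from deployment freezes.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class ChangeCategory(Enum):
    STANDARD = "standard"
    NORMAL = "normal"
    EMERGENCY = "emergency"


@dataclass
class ChangeRequest:
    category: ChangeCategory
    high_risk: bool
    approvers: set[str]
    deploy_on: date
    artifact_ref: str  # link to the deployment artifact, empty if missing


def blockers(change: ChangeRequest, freeze_windows: list[tuple[date, date]]) -> list[str]:
    """Return the reasons a change must not proceed (empty list means OK)."""
    problems = []
    # Assumption: multi-person approval means at least two distinct approvers.
    if change.high_risk and len(change.approvers) < 2:
        problems.append("high-risk change needs at least two approvers")
    # Assumption: emergency changes may bypass deployment freezes.
    in_freeze = any(start <= change.deploy_on <= end for start, end in freeze_windows)
    if in_freeze and change.category is not ChangeCategory.EMERGENCY:
        problems.append("deployment date falls inside a freeze window")
    if not change.artifact_ref:
        problems.append("change record is not linked to a deployment artifact")
    return problems
```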

Terraform Checks (Excerpt)

  • waf-ops-050.tf.aws.codepipeline-manual-approval – CodePipeline with manual approval stage before production

  • waf-ops-050.tf.azurerm.devops-environment-approval – Azure DevOps Environment with approval checks

Evidence

  • Change management policy

  • Branch protection and approval configuration


WAF-OPS-060 – Runbook & Operational Documentation Coverage

Severity: Medium | Category: Documentation | Automatable: Low

Intent

Every production workload must have runbooks for all known failure scenarios. Runbooks must be versioned, regularly reviewed, and linked to alerts.

Requirements

  • All paging alerts MUST be linked to runbooks

  • Runbooks MUST be stored in version control and reviewed regularly (quarterly)

  • Runbook coverage metric MUST be tracked (goal: >= 90% for critical services)

  • Runbooks MUST be accessible to on-call engineers without authentication barriers
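
The runbook coverage metric can be computed from a mapping of paging alerts to runbook links. This is a minimal sketch under assumed inputs: the mapping shape and the function names are hypothetical, and an empty alert set is treated as fully covered.

```python
from typing import Mapping, Optional

COVERAGE_GOAL = 0.90  # goal from the requirements: >= 90% for critical services


def runbook_coverage(alert_runbooks: Mapping[str, Optional[str]]) -> float:
    """Fraction of paging alerts that link to a runbook (1.0 if there are none)."""
    if not alert_runbooks:
        return 1.0
    linked = sum(1 for url in alert_runbooks.values() if url)
    return linked / len(alert_runbooks)


def meets_coverage_goal(alert_runbooks: Mapping[str, Optional[str]]) -> bool:
    """True when coverage reaches the 90% goal."""
    return runbook_coverage(alert_runbooks) >= COVERAGE_GOAL
```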

Terraform Checks (Excerpt)

  • waf-ops-060.tf.aws.cloudwatch-alarm-runbook-annotation – CloudWatch Alarms with runbook URL in description

  • waf-ops-060.tf.aws.prometheus-alert-runbook-label – Prometheus Alert Rules with runbook_url annotation

Evidence

  • Runbook directory with version history

  • Runbook coverage report


WAF-OPS-070 – Post-Incident Review Process

Severity: Medium | Category: Incident Learning | Automatable: Low

Intent

Every production incident with user impact or SLO violation must trigger a blameless postmortem within 5 business days. Action items must be tracked to resolution.

Requirements

  • All SEV-1/P1 incidents and SLO violations MUST trigger a postmortem

  • Postmortems MUST be blameless and produce action items with owners

  • Action items MUST be tracked in Jira or an equivalent issue tracker

  • Postmortems MUST be completed within 5 business days
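
The 5-business-day deadline can be computed as follows. This sketch makes two assumptions not stated in the text: counting starts on the day after the incident, and only weekends are skipped (public holidays are ignored). The function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def postmortem_due(incident_date: date) -> date:
    """Latest completion date for the postmortem (5 business days)."""
    return add_business_days(incident_date, 5)
```

For an incident on Friday, 5 January 2024, this yields Friday, 12 January 2024.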

Terraform Checks (Excerpt)

  • waf-ops-070.tf.aws.incident-management-sns-topic – SNS Topic for incident notifications

  • waf-ops-070.tf.azurerm.action-group-incident – Azure Monitor Action Group with configured recipients

Evidence

  • Post-Incident Review policy

  • Postmortem archive (last 3 months)


WAF-OPS-080 – Feature Flag & Safe Deployment Patterns

Severity: Medium | Category: Deployment Safety | Automatable: High

Intent

Production deployments must use Progressive Delivery patterns (Canary, Blue/Green, Feature Flags). Rollback must be possible within 5 minutes without a new deployment.

Requirements

  • All production deployments MUST use Canary, Blue/Green, or Feature Flags

  • Rollback MUST be possible in < 5 minutes without a new deployment

  • New features MUST be deployed behind feature flags

  • Auto-rollback MUST be configured upon error rate increase
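
The combination of feature flags and auto-rollback can be sketched as a decision function that disables a flag when the canary's error rate rises too far above the baseline. Everything here is illustrative: `FlagStore` is a hypothetical in-memory stand-in for a real feature flag service, and the 1 percentage point threshold is an assumed default.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory stand-in for a feature flag service."""
    flags: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        return self.flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled


def evaluate_canary(
    store: FlagStore,
    flag: str,
    baseline_error_rate: float,
    canary_error_rate: float,
    max_increase: float = 0.01,  # assumption: 1 percentage point tolerance
) -> bool:
    """Disable `flag` if the error rate rose too much; return True on rollback."""
    if canary_error_rate - baseline_error_rate > max_increase:
        store.set(flag, False)  # rollback is a flag flip, not a new deployment
        return True
    return False
```

Because the rollback is a flag change rather than a redeployment, it can take effect well within the 5-minute limit, provided the flag service propagates changes quickly.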

Terraform Checks (Excerpt)

  • waf-ops-080.tf.aws.codedeploy-deployment-config – CodeDeploy with Canary or Linear configuration (not AllAtOnce)

  • waf-ops-080.tf.aws.appconfig-feature-flag – AWS AppConfig Application for feature flags

  • waf-ops-080.tf.azurerm.traffic-manager-canary – Azure Traffic Manager with weighted routing for canary

Evidence

  • Load balancer / deployment configuration with Progressive Delivery

  • Feature flag service configuration

Related Best Practice: Safe Deployments


WAF-OPS-090 – Configuration Drift Detection & Remediation

Severity: High | Category: Configuration Management | Automatable: High

Intent

All production infrastructure must be continuously compared against its IaC definition. Drift must be automatically detected, reported, and remediated within defined SLAs.

Requirements

  • Automatic drift detection MUST run at least daily

  • Drift alerts MUST notify the responsible team within 1 hour

  • Emergency console changes MUST be transferred into IaC within 24 hours

  • Drift SLAs MUST be defined (critical: 4h, major: 24h, minor: 1 sprint)
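
The SLA values above translate directly into deadlines. The following sketch uses hypothetical names and assumes a sprint length of 14 days for the "minor" class, which the text does not specify.

```python
from datetime import datetime, timedelta

DRIFT_SLA = {
    "critical": timedelta(hours=4),
    "major": timedelta(hours=24),
    "minor": timedelta(days=14),  # assumption: one sprint is 14 days
}
NOTIFY_WITHIN = timedelta(hours=1)  # the responsible team must be notified by then


def remediation_deadline(detected_at: datetime, severity: str) -> datetime:
    """Time by which drift of the given severity must be remediated."""
    return detected_at + DRIFT_SLA[severity]


def is_sla_breached(detected_at: datetime, severity: str, now: datetime) -> bool:
    """True when unremediated drift has exceeded its SLA."""
    return now > remediation_deadline(detected_at, severity)
```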

Terraform Checks (Excerpt)

  • waf-ops-090.tf.aws.config-rule-enabled – AWS Config Recorder with compliance rules

  • waf-ops-090.tf.aws.cloudtrail-enabled – CloudTrail as multi-region trail with log validation

  • waf-ops-090.tf.azurerm.policy-assignment-drift – Azure Policy Initiative Assignment on subscription

Evidence

  • Drift detection configuration (EventBridge Schedule, AWS Config)

  • Drift SLA policy


WAF-OPS-100 – Operational Debt Register & Review

Severity: Medium | Category: Operational Governance | Automatable: Low

Intent

All known Operational Debt items must be documented in a version-controlled register, reviewed quarterly, and reduced using explicitly allocated sprint capacity.

Requirements

  • Operational Debt Register MUST be stored in version control

  • Every entry MUST have severity, toil hours, owner, and target date

  • Quarterly review MUST take place with prioritization and capacity allocation

  • Sprint capacity for debt reduction MUST be explicitly allocated (at least 10%)
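
A register kept in version control can be validated automatically. The sketch below assumes each entry is a dictionary with the hypothetical keys `severity`, `toil_hours`, `owner`, and `target_date`, and that sprint capacity is measured in points; both are illustrative choices.

```python
# Assumption: key names for the four mandatory fields of a register entry.
REQUIRED_FIELDS = ("severity", "toil_hours", "owner", "target_date")


def missing_fields(entry: dict) -> list[str]:
    """Return the mandatory fields that are absent or empty in an entry."""
    return [name for name in REQUIRED_FIELDS if entry.get(name) in (None, "")]


def debt_capacity_ok(debt_points: float, sprint_points: float,
                     minimum_share: float = 0.10) -> bool:
    """True when at least `minimum_share` of sprint capacity targets debt."""
    if sprint_points <= 0:
        return False
    return debt_points / sprint_points >= minimum_share
```

Running `missing_fields` over every entry in a pull request check would keep incomplete entries out of the register.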

Terraform Checks (Excerpt)

  • waf-ops-100.tf.aws.eventbridge-ops-review-schedule – EventBridge Scheduled Rule for quarterly review reminders

  • waf-ops-100.tf.aws.ssm-automation-runbook – SSM Automation Documents for repetitive operational tasks

Evidence

  • Operational Debt Register (version-controlled)

  • Quarterly review minutes