Scope: Operational Excellence

What Is in Scope?

The Operational Excellence pillar addresses all aspects of the technical operation of cloud workloads:

CI/CD & Deployment Automation

  • Definition and versioning of deployment pipelines as code

  • Automation of all deployments to all environments (dev, staging, production)

  • Branch protection, pull request reviews, pipeline gate configuration

  • Artifact versioning and immutable deployment artifacts

  • Deployment frequency and lead time as DORA metrics
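
The two DORA metrics named above can be derived from deployment records. The following is a minimal sketch, assuming a hypothetical `Deployment` record that carries a commit timestamp and a production deploy timestamp; the record shape and function names are illustrative and not prescribed by WAF++.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    commit_time: datetime  # when the change was committed
    deploy_time: datetime  # when the change reached production


def deployment_frequency(deploys: list[Deployment], window_days: int) -> float:
    """Average number of deployments per day over the observation window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deploys) / window_days


def median_lead_time(deploys: list[Deployment]) -> timedelta:
    """Median time from commit to production deployment."""
    if not deploys:
        raise ValueError("no deployments to measure")
    return median(d.deploy_time - d.commit_time for d in deploys)


# Illustrative sample data, not real measurements.
SAMPLE = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 11)),  # 2 h
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 13)),  # 4 h
    Deployment(datetime(2024, 1, 5, 9), datetime(2024, 1, 5, 15)),  # 6 h
]
```

The median is used here rather than the mean so that a single slow deployment does not dominate the lead-time figure; that choice is an assumption of this sketch.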

Infrastructure as Code

  • Declarative definition of all cloud resources as IaC (Terraform, Pulumi, CDK)

  • Remote state management with locking

  • Module libraries and code reuse

  • IaC review process (pull request, policy-as-code)

  • Brownfield migration to IaC
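
For brownfield migration, a useful first question is which live resources are not yet under IaC control. The sketch below assumes two hypothetical inputs: a set of resource IDs taken from a cloud inventory and a set of resource IDs recorded in the IaC state. How those sets are obtained depends on the provider and tooling and is not shown.

```python
def unmanaged_resources(live_ids: set[str], state_ids: set[str]) -> list[str]:
    """Resources that exist in the cloud account but not in IaC state."""
    return sorted(live_ids - state_ids)


def orphaned_state_entries(live_ids: set[str], state_ids: set[str]) -> list[str]:
    """State entries whose resource no longer exists (a form of drift)."""
    return sorted(state_ids - live_ids)


# Hypothetical IDs for illustration only.
LIVE = {"vm-001", "vm-002", "bucket-logs", "db-orders"}
STATE = {"vm-001", "db-orders", "vm-retired"}
```

Resources returned by `unmanaged_resources` are candidates for import into state; entries returned by `orphaned_state_entries` point at state that should be cleaned up.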

Observability

  • Structured logging (JSON, with trace ID, request ID, service name)

  • Distributed tracing (OpenTelemetry, AWS X-Ray, Jaeger)

  • Metrics: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors)

  • Dashboards and visualization

  • Log retention policies and cost governance
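
Structured logging with the fields listed above can be sketched with the Python standard library alone. The example assumes that the trace ID and request ID are passed per call through the `extra` argument of the logging API; the field names and the `build_logger` helper are illustrative choices, not a WAF++ schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service_name: str) -> None:
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "message": record.getMessage(),
            # Set via `extra=...`; None when the caller did not provide them.
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def build_logger(service_name: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service_name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service_name))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order created", extra={"trace_id": "abc123", "request_id": "r-1"})
```

In a service instrumented with OpenTelemetry, the trace ID would normally come from the active span context rather than being passed by hand; that wiring is left out to keep the sketch dependency-free.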

Change Management

  • Change categorization and risk assessment

  • Approval workflow for high-risk changes

  • Deployment freeze policies for critical business periods

  • Post-deployment verification

  • Change records and audit trail
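
The interplay of risk categorization, approval, and deployment freezes can be expressed as a small pipeline gate. This is a sketch under stated assumptions: the three risk inputs, the classification rules, and the rule that a freeze blocks every change are hypothetical and would need to be replaced by an organization's own policy.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Change:
    """Hypothetical risk inputs for one change."""
    touches_production: bool
    has_rollback_plan: bool
    modifies_data_schema: bool


def categorize(change: Change) -> Risk:
    """Assumed rules: schema changes or a missing rollback plan mean high risk."""
    if not change.touches_production:
        return Risk.LOW
    if change.modifies_data_schema or not change.has_rollback_plan:
        return Risk.HIGH
    return Risk.MEDIUM


def in_freeze(day: date, freeze_windows: list[tuple[date, date]]) -> bool:
    """True if `day` falls inside any inclusive (start, end) freeze window."""
    return any(start <= day <= end for start, end in freeze_windows)


def may_deploy(
    change: Change,
    day: date,
    freeze_windows: list[tuple[date, date]],
    approved: bool,
) -> bool:
    """Gate: no deployments during a freeze; high-risk changes need approval."""
    if in_freeze(day, freeze_windows):
        return False
    if categorize(change) is Risk.HIGH and not approved:
        return False
    return True
```

Running such a gate inside the pipeline also produces the change record as a side effect, since every decision and its inputs can be logged.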

Runbooks & Operational Documentation

  • Runbooks for all known failure scenarios

  • Operational procedures for routine tasks

  • Runbook-alert linking

  • Review cadence and update process
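
Runbook-alert linking and the review cadence can both be checked automatically. The sketch below assumes hypothetical `Alert` and `Runbook` records and a 180-day review interval; the interval and the record fields are illustrative, not WAF++ requirements.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Runbook:
    url: str
    last_reviewed: date


@dataclass(frozen=True)
class Alert:
    name: str
    runbook_url: Optional[str]  # None when no runbook is linked


def alerts_without_runbook(alerts: list[Alert], runbooks: list[Runbook]) -> list[str]:
    """Names of alerts with a missing link or a link to an unknown runbook."""
    known = {rb.url for rb in runbooks}
    return [a.name for a in alerts if a.runbook_url not in known]


def stale_runbooks(
    runbooks: list[Runbook], today: date, max_age_days: int = 180
) -> list[str]:
    """URLs of runbooks whose last review is older than the allowed age."""
    return [
        rb.url for rb in runbooks if (today - rb.last_reviewed).days > max_age_days
    ]
```

A check like this could run in CI against the alert definitions so that an alert without a valid runbook link fails the build.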

Post-Incident Reviews

  • Incident definition and trigger criteria

  • Blameless postmortem process

  • Action item tracking and closure

  • Incident trend analysis and organizational learning

Operational Debt

  • Toil identification and measurement

  • Operational Debt Register

  • Prioritization and sprint capacity allocation

  • Automation of routine processes
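
An Operational Debt Register becomes actionable once each entry carries a toil measurement and an effort estimate. The following sketch assumes a hypothetical `DebtItem` shape and ranks items by payback time (fix effort divided by monthly toil), then fills a sprint greedily; both the ranking rule and the greedy selection are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    """One register entry (hypothetical fields)."""
    title: str
    toil_hours_per_month: float  # measured recurring manual effort
    fix_effort_hours: float      # estimated one-off effort to remove it


def payback_months(item: DebtItem) -> float:
    """Months until the fix has paid for itself; infinite if there is no toil."""
    if item.toil_hours_per_month <= 0:
        return float("inf")
    return item.fix_effort_hours / item.toil_hours_per_month


def prioritize(items: list[DebtItem]) -> list[DebtItem]:
    """Shortest payback first."""
    return sorted(items, key=payback_months)


def plan_sprint(items: list[DebtItem], capacity_hours: float) -> list[DebtItem]:
    """Greedily pick items in priority order that still fit the capacity."""
    selected: list[DebtItem] = []
    remaining = capacity_hours
    for item in prioritize(items):
        if item.fix_effort_hours <= remaining:
            selected.append(item)
            remaining -= item.fix_effort_hours
    return selected


# Illustrative register entries, not real data.
REGISTER = [
    DebtItem("Manual certificate rotation", 10, 20),
    DebtItem("Hand-edited DNS records", 2, 40),
    DebtItem("Manual log cleanup", 8, 8),
]
```

With a sprint capacity of 30 hours, this register yields the log cleanup and the certificate rotation; the DNS item does not fit and stays in the register.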

What Is NOT in Scope?

The following areas fall under other WAF++ pillars:

  • HR processes and team structure → Governance pillar

  • Non-technical operational processes (procurement, contract management) → Governance pillar

  • SLO/SLA definition and fault tolerance → Reliability pillar

  • Security controls, encryption, IAM → Security pillar

  • Performance optimization, compute sizing → Performance pillar

  • Data protection and data residency → Sovereign pillar

  • Cost governance, FinOps, budgets → Cost pillar

Brownfield vs. Greenfield

Greenfield Workloads

For new workloads, embed Operational Excellence (OpsEx) standards from the start.

| Dimension | Greenfield Approach |
|---|---|
| CI/CD | The pipeline is the first artifact and exists before the first deployment ("Pipeline-First" principle). |
| IaC | No resource exists outside of Terraform. Remote state from day one. |
| Observability | OpenTelemetry instrumentation in the application template. Structured logging as the default. |
| Runbooks | Runbook template at the first deployment. Minimum: deployment runbook and rollback runbook. |
| Change Management | Branch protection and approval requirements from the first commit. |

Brownfield Workloads

For existing workloads, a risk-based migration plan is required:

| Step | Action | Priority |
|---|---|---|
| 1 – Assess | Inventory: which workloads have no pipeline, no IaC, no runbooks? | Immediate |
| 2 – Quick Wins | Enable structured logging and alerting. Document existing deployments. | Sprint 1–2 |
| 3 – IaC Migration | Import existing resources into Terraform state. No rebuild, just codification. | Quarter 1 |
| 4 – Pipeline Build | Build CI/CD pipeline for existing deployments. Restrict manual access. | Quarter 1–2 |
| 5 – Runbook Creation | Document runbooks for top-5 failure scenarios per service. | Ongoing |
| 6 – Debt Reduction | Populate Operational Debt Register, prioritize, allocate sprint capacity. | Quarterly |
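
The assessment in step 1 amounts to a gap report per workload. A minimal sketch, assuming a hypothetical `Workload` record with one flag per practice; how the flags are collected (repository scans, state inspection, interviews) is outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Inventory entry (hypothetical fields)."""
    name: str
    has_pipeline: bool
    has_iac: bool
    has_runbooks: bool


def gaps(workload: Workload) -> list[str]:
    """Practices this workload is still missing."""
    checks = {
        "pipeline": workload.has_pipeline,
        "iac": workload.has_iac,
        "runbooks": workload.has_runbooks,
    }
    return [practice for practice, present in checks.items() if not present]


def assessment(workloads: list[Workload]) -> dict[str, list[str]]:
    """Workloads with at least one gap, largest number of gaps first."""
    with_gaps = [(w.name, gaps(w)) for w in workloads if gaps(w)]
    with_gaps.sort(key=lambda pair: len(pair[1]), reverse=True)
    return dict(with_gaps)


# Illustrative inventory, not real data.
INVENTORY = [
    Workload("billing", True, True, True),
    Workload("legacy-crm", False, False, False),
    Workload("reports", True, False, True),
]
```

Sorting by number of gaps is only one possible ordering; a risk-based plan would also weigh business criticality, which this sketch does not model.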

Operational Debt – Common Sources

| Debt Category | Description | Typical Impact |
|---|---|---|
| Manual Deployments | Deployments via SSH or console without pipeline | Inconsistent environments, missing audit trails |
| Console-configured Resources | Resources do not exist in IaC | Drift, not reproducible in DR scenario |
| Unstructured Logging | Text logs without schema, without trace ID | Long MTTR, costly incident diagnosis |
| Missing Runbooks | No documented process for known failure scenarios | On-call burnout, long MTTR, escalations |
| No Postmortems | Incidents resolved without structured learning | Recurring incidents of the same class |
| Alert Fatigue | Too many non-actionable alerts | Real alerts ignored; on-call burnout |
| Manual Monitoring | Dashboard observation instead of automated alerting | Incidents reported by users, not detected |
| Outdated Runbooks | Runbooks no longer reflect the current system state | Dangerous misinformation during incidents |
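
Of the debt sources above, alert fatigue is one that can be quantified directly from alert history. The sketch below assumes a hypothetical list of firings, each tagged by the on-call engineer as actionable or not, and an illustrative threshold of 50 % actionable firings; neither the data shape nor the threshold comes from WAF++.

```python
from collections import Counter


def noisy_alerts(
    firings: list[tuple[str, bool]], min_actionable_ratio: float = 0.5
) -> list[str]:
    """Alert names whose share of actionable firings is below the threshold.

    Each firing is (alert_name, was_actionable).
    """
    total: Counter[str] = Counter()
    actionable: Counter[str] = Counter()
    for name, was_actionable in firings:
        total[name] += 1
        if was_actionable:
            actionable[name] += 1
    return sorted(
        name
        for name in total
        if actionable[name] / total[name] < min_actionable_ratio
    )


# Illustrative history, not real data.
HISTORY = (
    [("CpuHigh", False)] * 4
    + [("CpuHigh", True)]
    + [("DbDown", True)] * 2
)
```

Alerts flagged this way are candidates for retuning or removal, and the result can feed the Operational Debt Register as a measured entry.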