WAF++

Scope: Operational Excellence

What Is in Scope?

The Operational Excellence pillar addresses all aspects of the technical operation of cloud workloads:

CI/CD & Deployment Automation

Definition and versioning of deployment pipelines as code
Automation of all deployments to all environments (dev, staging, production)
Branch protection, pull request reviews, pipeline gate configuration
Artifact versioning and immutable deployment artifacts
Deployment frequency and lead time as DORA metrics
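The two DORA metrics named above can be derived from deployment records. The following is a minimal sketch, assuming each deployment is recorded with a commit timestamp and a production deploy timestamp; the `Deployment` record and both function names are hypothetical and not part of WAF++.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    commit_time: datetime  # when the change was committed
    deploy_time: datetime  # when the change reached production


def deployment_frequency(deployments: list[Deployment], window_days: int) -> float:
    """Average number of deployments per day over the observation window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deployments) / window_days


def median_lead_time(deployments: list[Deployment]) -> timedelta:
    """Median time from commit to production deployment."""
    return median(d.deploy_time - d.commit_time for d in deployments)
```

The median is used here instead of the mean so that a single slow change does not dominate the figure; that choice is an assumption of this sketch.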

Infrastructure as Code

Declarative definition of all cloud resources as IaC (Terraform, Pulumi, CDK)
Remote state management with locking
Module libraries and code reuse
IaC review process (pull request, policy-as-code)
Brownfield migration to IaC

Observability

Structured logging (JSON, with trace ID, request ID, service name)
Distributed tracing (OpenTelemetry, AWS X-Ray, Jaeger)
Metrics: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors)
Dashboards and visualization
Log retention policies and cost governance
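As an illustration of structured logging with the fields listed above (trace ID, request ID, service name), the following sketch uses only the Python standard library. The field names `trace_id`, `request_id`, and `service`, as well as `JsonFormatter` and `build_logger`, are assumptions chosen for this example, not a schema prescribed by WAF++.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def __init__(self, service_name: str) -> None:
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "message": record.getMessage(),
            # Correlation IDs are passed via `extra=`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def build_logger(service_name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service_name))
        logger.addHandler(handler)
    return logger
```

A call such as `build_logger("checkout").info("order created", extra={"trace_id": "abc", "request_id": "r-1"})` would then emit one JSON object carrying both correlation IDs. In practice the trace ID would typically come from the tracing library rather than being passed by hand.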

Change Management

Change categorization and risk assessment
Approval workflow for high-risk changes
Deployment freeze policies for critical business periods
Post-deployment verification
Change records and audit trail
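To make change categorization, approval, and deployment freezes concrete, here is a small sketch of a deployment gate. The risk rules, the `Change` attributes, and the decision that a freeze blocks everything except low-risk changes are all illustrative assumptions; an organization would substitute its own criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Change:
    """Hypothetical attributes used for risk assessment."""
    touches_production: bool
    has_tested_rollback: bool
    modifies_data_schema: bool


def classify(change: Change) -> Risk:
    """Assign a risk category using simple, illustrative rules."""
    if not change.touches_production:
        return Risk.LOW
    if change.modifies_data_schema or not change.has_tested_rollback:
        return Risk.HIGH
    return Risk.MEDIUM


def may_deploy(change: Change, approved: bool, freeze_active: bool) -> bool:
    """Gate: freezes block non-low-risk changes; high risk needs approval."""
    risk = classify(change)
    if freeze_active and risk is not Risk.LOW:
        return False
    if risk is Risk.HIGH:
        return approved
    return True
```

Encoding the rules as code means the same gate can run in the pipeline and produce a change record for the audit trail.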

Runbooks & Operational Documentation

Runbooks for all known failure scenarios
Operational procedures for routine tasks
Runbook-alert linking
Review cadence and update process
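Runbook-alert linking can be enforced with a simple check in CI. The sketch below assumes alert definitions have already been loaded into dictionaries with a `name` and an `annotations` mapping, and that the runbook link lives in a `runbook_url` annotation; both the shape and the field name are assumptions modeled on a common alerting convention.

```python
def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts that have no runbook link."""
    missing = []
    for alert in alerts:
        url = alert.get("annotations", {}).get("runbook_url", "")
        if not url.strip():  # absent or blank counts as missing
            missing.append(alert["name"])
    return missing
```

A pipeline step could fail the build whenever this function returns a non-empty list, so that no alert reaches on-call without a linked runbook.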

Post-Incident Reviews

Incident definition and trigger criteria
Blameless postmortem process
Action item tracking and closure
Incident trend analysis and organizational learning

Operational Debt

Toil identification and measurement
Operational Debt Register
Prioritization and sprint capacity allocation
Automation of routine processes
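One possible shape for an Operational Debt Register, with toil measurement feeding prioritization and sprint capacity allocation, is sketched below. The `DebtItem` fields, the payback-period ordering, and the greedy capacity allocation are assumptions for illustration, not a method defined by WAF++.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    """One register entry (hypothetical fields)."""
    title: str
    toil_hours_per_month: float  # measured recurring manual effort
    fix_effort_hours: float      # estimated one-off effort to remove it

    @property
    def payback_months(self) -> float:
        """Months until the fix has paid for itself."""
        if self.toil_hours_per_month <= 0:
            return float("inf")
        return self.fix_effort_hours / self.toil_hours_per_month


def prioritize(register: list[DebtItem]) -> list[DebtItem]:
    """Order items by shortest payback period first."""
    return sorted(register, key=lambda item: item.payback_months)


def plan_sprint(register: list[DebtItem], capacity_hours: float) -> list[DebtItem]:
    """Greedily pick prioritized items that fit the reserved capacity."""
    planned = []
    remaining = capacity_hours
    for item in prioritize(register):
        if item.fix_effort_hours <= remaining:
            planned.append(item)
            remaining -= item.fix_effort_hours
    return planned
```

Ranking by payback period favors cheap fixes for expensive toil; teams that weight risk or incident history more heavily would swap in a different sort key.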

What Is NOT in Scope?

The following areas fall under other WAF++ pillars:

HR processes and team structure → Governance pillar
Non-technical operational processes (procurement, contract management) → Governance pillar
SLO/SLA definition and fault tolerance → Reliability pillar
Security controls, encryption, IAM → Security pillar
Performance optimization, compute sizing → Performance pillar
Data protection and data residency → Sovereign pillar
Cost governance, FinOps, budgets → Cost pillar

Brownfield vs. Greenfield

Greenfield Workloads

For new workloads: embed OpsEx standards from the start.

Dimension	Greenfield Approach
CI/CD	Pipeline as the first artifact – before the first deployment. "Pipeline-First" principle.
IaC	No resource exists outside of IaC (for example, Terraform). Remote state from day one.
Observability	OpenTelemetry instrumentation in the application template. Structured logging as the default.
Runbooks	Runbook template at the first deployment. Minimum: deployment runbook and rollback runbook.
Change Management	Branch protection and approval requirements from the first commit.

Brownfield Workloads

For existing workloads, a risk-based migration plan is required:

Step	Action	Priority
1 – Assess	Inventory: which workloads have no pipeline, no IaC, no runbooks?	Immediate
2 – Quick Wins	Enable structured logging and alerting. Document existing deployments.	Sprint 1–2
3 – IaC Migration	Import existing resources into Terraform state. No rebuild, just codification.	Quarter 1
4 – Pipeline Build	Build CI/CD pipeline for existing deployments. Restrict manual access.	Quarter 1–2
5 – Runbook Creation	Document runbooks for top-5 failure scenarios per service.	Ongoing
6 – Debt Reduction	Populate Operational Debt Register, prioritize, allocate sprint capacity.	Quarterly
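The Assess step in the table above amounts to an inventory query. The following sketch assumes the inventory is available as a list of workload records with three boolean flags; the `Workload` type and the ordering by number of gaps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Inventory entry for one existing workload (hypothetical shape)."""
    name: str
    has_pipeline: bool
    has_iac: bool
    has_runbooks: bool


def gaps(workload: Workload) -> list[str]:
    """List which OpsEx foundations the workload is missing."""
    checks = {
        "pipeline": workload.has_pipeline,
        "iac": workload.has_iac,
        "runbooks": workload.has_runbooks,
    }
    return [name for name, present in checks.items() if not present]


def assess(workloads: list[Workload]) -> list[tuple[str, list[str]]]:
    """Return workloads with at least one gap, most gaps first."""
    report = [(w.name, gaps(w)) for w in workloads]
    report = [entry for entry in report if entry[1]]
    return sorted(report, key=lambda entry: len(entry[1]), reverse=True)
```

The resulting list gives a starting order for the later steps: workloads with the most gaps surface first.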

Operational Debt – Common Sources

Debt Category	Description	Typical Impact
Manual Deployments	Deployments via SSH or console without pipeline	Inconsistent environments, missing audit trails
Console-configured Resources	Resources do not exist in IaC	Configuration drift; resources not reproducible in a disaster recovery (DR) scenario
Unstructured Logging	Text logs without schema, without trace ID	Long MTTR, costly incident diagnosis
Missing Runbooks	No documented process for known failure scenarios	On-call burnout, long MTTR, escalations
No Postmortems	Incidents resolved without structured learning	Recurring incidents of the same class
Alert Fatigue	Too many non-actionable alerts	Real alerts ignored; on-call burnout
Manual Monitoring	Dashboard observation instead of automated alerting	Incidents are reported by users instead of being detected by monitoring
Outdated Runbooks	Runbooks no longer reflect the current system state	Dangerous misinformation during incidents
