Scope: Operational Excellence
What Is in Scope?
The Operational Excellence pillar addresses all aspects of the technical operation of cloud workloads:
CI/CD & Deployment Automation
- Definition and versioning of deployment pipelines as code
- Automation of all deployments to all environments (dev, staging, production)
- Branch protection, pull request reviews, pipeline gate configuration
- Artifact versioning and immutable deployment artifacts
- Deployment frequency and lead time as DORA metrics
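The two DORA metrics named above can be derived from deployment records. The following is a minimal sketch, assuming each record is a hypothetical pair of commit timestamp and production deployment timestamp; the sample data and function names are illustrative and not part of WAF++.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (commit timestamp, production deployment timestamp).
deployments = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 15, 0)),
    (datetime(2024, 3, 5, 10, 0), datetime(2024, 3, 6, 10, 0)),
    (datetime(2024, 3, 7, 8, 0), datetime(2024, 3, 7, 10, 0)),
]


def deployment_frequency(records, window_days):
    """Average number of production deployments per day over the window."""
    return len(records) / window_days


def median_lead_time(records):
    """Median time from commit to production deployment."""
    return median(deployed - committed for committed, deployed in records)


print(deployment_frequency(deployments, window_days=7))
print(median_lead_time(deployments))
```

The median is used here instead of the mean so that a single slow deployment does not dominate the lead time; this is a design choice of the sketch, not a requirement.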
Infrastructure as Code
- Declarative definition of all cloud resources as IaC (Terraform, Pulumi, CDK)
- Remote state management with locking
- Module libraries and code reuse
- IaC review process (pull request, policy-as-code)
- Brownfield migration to IaC
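A policy-as-code check in the IaC review process could look roughly like the sketch below. It assumes Terraform and a plan exported with `terraform show -json`, whose output lists planned changes under `resource_changes` with an `address` and `change.actions`. The policy itself (flag every deletion or replacement) and the sample plan are assumptions for illustration.

```python
# Assumed policy: any action list containing "delete" needs explicit review.
# A replacement appears in the plan as both "delete" and "create".
DESTRUCTIVE_ACTIONS = {"delete"}


def destructive_changes(plan):
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = set(resource_change.get("change", {}).get("actions", []))
        if actions & DESTRUCTIVE_ACTIONS:
            flagged.append(resource_change["address"])
    return flagged


# Hypothetical, heavily trimmed plan document.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_sqs_queue.jobs", "change": {"actions": ["create"]}},
        {"address": "aws_iam_role.app", "change": {"actions": ["no-op"]}},
    ]
}

print(destructive_changes(sample_plan))
```

In a pipeline, a non-empty result would typically fail the gate or require an additional reviewer; the plan would be read from a file with `json.load` rather than defined inline.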
Observability
- Structured logging (JSON, with trace ID, request ID, service name)
- Distributed tracing (OpenTelemetry, AWS X-Ray, Jaeger)
- Metrics: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors)
- Dashboards and visualization
- Log retention policies and cost governance
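Structured logging as described above (JSON with trace ID, request ID, and service name) can be sketched with the Python standard library. The field names and the `build_logger` helper are assumptions for illustration; a real setup would usually take the trace ID from the tracing context instead of passing it by hand.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service_name,
            "message": record.getMessage(),
            # trace_id and request_id are expected via the `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(service_name, stream=None):
    """Hypothetical helper: a logger that emits JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter(service_name))
    logger = logging.getLogger(service_name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout")
log.info("order created", extra={"trace_id": "abc123", "request_id": "req-42"})
```

Each call produces one JSON object per line, so a log pipeline can filter by `trace_id` or `service` without parsing free text.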
Change Management
- Change categorization and risk assessment
- Approval workflow for high-risk changes
- Deployment freeze policies for critical business periods
- Post-deployment verification
- Change records and audit trail
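Change categorization, approval for high-risk changes, and deployment freezes can be combined into a single pipeline gate. The sketch below is one possible shape; the risk criteria, the freeze window dates, and all names are assumptions for illustration, not rules defined by WAF++.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Change:
    description: str
    touches_production: bool
    has_rollback: bool
    affects_data_schema: bool


# Hypothetical freeze windows (inclusive start and end dates).
FREEZE_WINDOWS = [(date(2024, 11, 25), date(2024, 12, 2))]


def risk_level(change):
    """Assumed criteria: schema changes or missing rollback are high risk."""
    if not change.touches_production:
        return "low"
    if change.affects_data_schema or not change.has_rollback:
        return "high"
    return "medium"


def in_freeze(day):
    """True if `day` falls inside any configured freeze window."""
    return any(start <= day <= end for start, end in FREEZE_WINDOWS)


def deployment_decision(change, day):
    """Return the gate outcome for a change on a given day."""
    if change.touches_production and in_freeze(day):
        return "blocked: deployment freeze"
    if risk_level(change) == "high":
        return "approval required"
    return "proceed"


migration = Change("add column to orders", True, True, True)
print(deployment_decision(migration, date(2024, 10, 1)))
print(deployment_decision(migration, date(2024, 11, 28)))
```

Recording the returned decision together with the change description would also cover the audit trail item in the list above.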
Runbooks & Operational Documentation
- Runbooks for all known failure scenarios
- Operational procedures for routine tasks
- Runbook-alert linking
- Review cadence and update process
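Runbook-alert linking can be verified automatically, for example as a CI check over the alert definitions. This is a minimal sketch, assuming alerts are available as dictionaries with a hypothetical `runbook_url` field; the alert names and URLs are placeholders.

```python
def alerts_without_runbook(alerts):
    """Return names of alerts that lack a non-empty runbook link."""
    return [
        alert["name"]
        for alert in alerts
        if not (alert.get("runbook_url") or "").strip()
    ]


# Hypothetical alert definitions.
alerts = [
    {"name": "HighErrorRate", "runbook_url": "https://runbooks.example.com/high-error-rate"},
    {"name": "QueueBacklog", "runbook_url": ""},
    {"name": "DiskAlmostFull"},
]

print(alerts_without_runbook(alerts))
```

Failing the build when the list is non-empty keeps new alerts from shipping without a runbook.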
What Is NOT in Scope?
The following areas fall under other WAF++ pillars:
- HR processes and team structure → Governance pillar
- Non-technical operational processes (procurement, contract management) → Governance pillar
- SLO/SLA definition and fault tolerance → Reliability pillar
- Security controls, encryption, IAM → Security pillar
- Performance optimization, compute sizing → Performance pillar
- Data protection and data residency → Sovereign pillar
- Cost governance, FinOps, budgets → Cost pillar
Brownfield vs. Greenfield
Greenfield Workloads
For new workloads: embed OpsEx standards from the start.
| Dimension | Greenfield Approach |
|---|---|
| CI/CD | Pipeline as the first artifact – before the first deployment. "Pipeline-First" principle. |
| IaC | No resource exists outside of Terraform. Remote state from day one. |
| Observability | OpenTelemetry instrumentation in the application template. Structured Logging as default. |
| Runbooks | Runbook template at the first deployment. Minimum: deployment runbook and rollback runbook. |
| Change Management | Branch protection and approval requirements from the first commit. |
Brownfield Workloads
For existing workloads, a risk-based migration plan is required:
| Step | Action | Priority |
|---|---|---|
| 1 – Assess | Inventory: which workloads have no pipeline, no IaC, no runbooks? | Immediate |
| 2 – Quick Wins | Enable structured logging and alerting. Document existing deployments. | Sprint 1–2 |
| 3 – IaC Migration | Import existing resources into Terraform state. No rebuild, just codification. | Quarter 1 |
| 4 – Pipeline Build | Build CI/CD pipeline for existing deployments. Restrict manual access. | Quarter 1–2 |
| 5 – Runbook Creation | Document runbooks for top-5 failure scenarios per service. | Ongoing |
| 6 – Debt Reduction | Populate Operational Debt Register, prioritize, allocate sprint capacity. | Quarterly |
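The assessment in step 1 amounts to an inventory of missing practices per workload. A minimal sketch, assuming a hypothetical inventory format with one boolean per practice; workload names are placeholders.

```python
PRACTICES = ("pipeline", "iac", "runbooks")

# Hypothetical inventory, e.g. collected from a survey or a CMDB export.
inventory = [
    {"workload": "billing-api", "pipeline": True, "iac": False, "runbooks": False},
    {"workload": "web-frontend", "pipeline": True, "iac": True, "runbooks": True},
    {"workload": "batch-export", "pipeline": False, "iac": False, "runbooks": False},
]


def assess(inventory):
    """Return (workload, missing practices), largest gaps first."""
    findings = []
    for entry in inventory:
        missing = [p for p in PRACTICES if not entry.get(p, False)]
        if missing:
            findings.append((entry["workload"], missing))
    return sorted(findings, key=lambda finding: len(finding[1]), reverse=True)


for workload, missing in assess(inventory):
    print(workload, "is missing:", ", ".join(missing))
```

Sorting by the number of gaps is only a starting point; a risk-based plan would also weigh business criticality, which this sketch does not model.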
Operational Debt – Common Sources
| Debt Category | Description | Typical Impact |
|---|---|---|
| Manual Deployments | Deployments via SSH or console without pipeline | Inconsistent environments, missing audit trails |
| Console-configured Resources | Resources do not exist in IaC | Drift, not reproducible in DR scenario |
| Unstructured Logging | Text logs without schema, without trace ID | Long MTTR, costly incident diagnosis |
| Missing Runbooks | No documented process for known failure scenarios | On-call burnout, long MTTR, escalations |
| No Postmortems | Incidents resolved without structured learning | Recurring incidents of the same class |
| Alert Fatigue | Too many non-actionable alerts | Real alerts ignored; on-call burnout |
| Manual Monitoring | Dashboard observation instead of automated alerting | Incidents reported by users, not detected |
| Outdated Runbooks | Runbooks no longer reflect the current system state | Dangerous misinformation during incidents |
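Entries like these can be collected in an Operational Debt Register and ordered for sprint planning. The sketch below assumes a simple impact-to-effort ratio on a 1 to 5 scale; the scoring model, field names, and sample entries are illustrative assumptions, not a WAF++ definition.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    category: str
    service: str
    impact: int  # assumed scale: 1 (low) to 5 (high)
    effort: int  # assumed scale: 1 (small) to 5 (large)


def prioritize(register):
    """Order items by impact-to-effort ratio, highest first."""
    return sorted(register, key=lambda item: item.impact / item.effort, reverse=True)


# Hypothetical register entries.
register = [
    DebtItem("Manual Deployments", "billing-api", impact=5, effort=4),
    DebtItem("Missing Runbooks", "billing-api", impact=4, effort=1),
    DebtItem("Unstructured Logging", "batch-export", impact=3, effort=2),
]

for item in prioritize(register):
    print(f"{item.category} ({item.service}): {item.impact / item.effort:.2f}")
```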