Maturity Model: Operational Excellence
The 5-Level OpsEx Maturity Model
| Level | Name | Characteristics |
|---|---|---|
| 1 | Reactive & Heroic | Deployments manual. No IaC. Logging unstructured. Alerts on CPU/RAM. Incidents resolved by heroes. No systematic learning. MTTR: hours to days. Deployment Frequency: weekly to monthly. |
| 2 | Documented | Basic CI/CD present. Parts of infrastructure as IaC. Runbooks for the worst scenarios. Informal incident reviews. MTTR: 1–4 hours. Deployment Frequency: daily to weekly. |
| 3 | Automated | Complete CI/CD pipeline. All infrastructure as IaC. Structured logging. Symptom-based alerting with runbooks. Blameless postmortems. MTTR: 30–60 minutes. Deployment Frequency: daily. |
| 4 | Measured | DORA metrics are measured and improved. SLO-based alerting. Drift detection automated. Feature Flags in use. Operational Debt Register maintained. MTTR: < 30 minutes. Deployment Frequency: multiple times daily. |
| 5 | Continuously Improved | Deployment possible hundreds of times daily. Change Failure Rate < 5%. Toil < 20% of engineering time. Full observability correlation. Automated drift remediation. Learning from incidents preventively. MTTR: < 1 hour. Deployment Frequency: on-demand. |
Per-Control Maturity Table
| Control | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|
| WAF-OPS-010 (CI/CD) | No pipeline | Basic CI | Complete CI/CD | Metrics & Canary | Continuous Deploy |
| WAF-OPS-020 (IaC) | No IaC | Inconsistent | Fully enforced | Drift Detection | GitOps |
| WAF-OPS-030 (Logging & observability) | Unstructured | Centralized | All 3 pillars | SLO-based | OpenTelemetry |
| WAF-OPS-040 (Alerting) | None/Noise | Basic alerting | Symptom-based | Burn Rate Alerts | Auto-optimization |
| WAF-OPS-050 (Change management) | No control | Basic review | Change process | Auto risk scoring | Continuous Deploy |
| WAF-OPS-060 (Runbooks) | No runbooks | Basic runbooks | All linked | Metrics & coverage | Self-service |
| WAF-OPS-070 (Postmortems) | No process | Informal | Structured | Systemic analysis | Org. learning |
| WAF-OPS-080 (Progressive delivery) | Big-bang | Basic safety | Progressive Delivery | Auto-rollback | Experiment platform |
| WAF-OPS-090 (Drift detection) | No detection | Ad-hoc | Auto-detection | SLA enforcement | Auto-remediation |
| WAF-OPS-100 (Operational debt) | No tracking | Informal | Register maintained | Debt program | Continuous Improvement |
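
The "Burn Rate Alerts" entry at Level 4 of the alerting control can be illustrated with a small calculation. The following is a minimal sketch, not a prescribed implementation: the names `Window`, `burn_rate`, and `should_page` are hypothetical, and the 99.9% SLO target and the threshold of 14.4 are assumed example values (a commonly cited paging threshold for a 30-day SLO period), not figures taken from this model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO period; higher values mean it runs out sooner.
    """
    if window.total == 0:
        return 0.0  # no traffic, so no budget consumed
    error_budget = 1.0 - slo_target            # e.g. 0.001 for a 99.9% SLO
    error_rate = window.errors / window.total
    return error_rate / error_budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening. Both defaults are assumed example values.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


if __name__ == "__main__":
    last_hour = Window(total=10_000, errors=200)     # 2% errors
    last_5_min = Window(total=1_000, errors=25)      # 2.5% errors
    print(should_page(last_hour, last_5_min))
```

Requiring both windows to exceed the threshold is one way to keep such an alert symptom-based while avoiding pages for problems that have already stopped.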
Self-Assessment Checklist Level 2
The following questions help determine whether Level 2 has been reached:
CI/CD & Deployments
- Is there a CI pipeline that runs tests on pull requests?
- Are deployment scripts versioned and documented?
- Are deployments to staging and production handled separately?
Infrastructure
- Are the most important production resources defined as IaC?
- Is there a remote state backend (not local state)?
- Are IaC changes reviewed via pull request?
Self-Assessment Checklist Level 3
CI/CD & Deployments
- Are ALL production deployments automated (no manual path)?
- Are branch protection and approval requirements configured?
- Are there approval gates before production deployments?
- Are pipeline definitions in version control (YAML, HCL)?
Infrastructure
- Is 100% of production infrastructure defined as IaC?
- Are manual console changes restricted via IAM/SCP?
- Is there automated drift detection (at least daily)?
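
What the drift detection question asks for can be sketched as a comparison between the state declared in IaC and the state actually found in the environment. This is a simplified illustration under stated assumptions: `detect_drift` is a hypothetical helper, and both inputs are assumed to be plain dictionaries keyed by resource ID that some other process has already exported. Real IaC tools provide their own mechanisms for this.

```python
from typing import Any


def detect_drift(declared: dict[str, dict[str, Any]],
                 actual: dict[str, dict[str, Any]]) -> dict[str, list[str]]:
    """Compare declared (IaC) resources against observed resources.

    Returns resource IDs grouped into three kinds of drift:
    - missing:   declared in IaC but not found in the environment
    - changed:   present in both, but attributes differ
    - unmanaged: found in the environment but not declared in IaC
    """
    report: dict[str, list[str]] = {"missing": [], "changed": [], "unmanaged": []}
    for resource_id, wanted in declared.items():
        found = actual.get(resource_id)
        if found is None:
            report["missing"].append(resource_id)
        elif found != wanted:
            report["changed"].append(resource_id)
    report["unmanaged"] = [rid for rid in actual if rid not in declared]
    for ids in report.values():
        ids.sort()  # stable output for reports and diffs
    return report


if __name__ == "__main__":
    declared = {
        "bucket/logs": {"versioning": True},
        "sg/web": {"ingress": [443]},
    }
    actual = {
        "bucket/logs": {"versioning": True},
        "sg/web": {"ingress": [22, 443]},    # changed by hand in the console
        "vm/debug-box": {"size": "small"},   # created outside IaC
    }
    print(detect_drift(declared, actual))
```

Running a check like this on a schedule (at least daily, per the checklist) and failing or alerting when any list is non-empty is one way to make drift visible. With Terraform, for example, `terraform plan -detailed-exitcode` serves a comparable purpose by exiting with code 2 when the plan contains changes.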
Observability
- Do all services emit structured JSON logs with trace ID?
- Is distributed tracing configured and instrumented?
- Are RED metrics (Rate, Errors, Duration) exported for all services?
- Are alerts symptom-based (error rate, latency, availability)?
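
As an illustration of the first observability question, structured JSON logs with a trace ID can be produced with Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the `build_logger` helper, the field names, and the example trace ID are assumptions for illustration. In an instrumented service the trace ID would normally come from the tracing library rather than being passed by hand.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives via the `extra` argument; None if not supplied
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("payment authorized",
             extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Emitting one JSON object per line keeps the logs easy to aggregate, and the shared `trace_id` field is what allows a log line to be correlated with the corresponding distributed trace.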
Self-Assessment Checklist Level 4
DORA Metrics
- Is deployment frequency measured and reported?
- Is lead time for changes (commit to production) measured?
- Is MTTR (Mean Time to Restore) captured per incident?
- Is Change Failure Rate (deployments with rollback/incident) captured?
- Are DORA trends reviewed quarterly?
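
The four metrics in this checklist can be derived from a simple record of deployments. The sketch below is one possible way to compute them, not a reference implementation: the `Deployment` record and `dora_summary` function are hypothetical, the median is an assumed choice for summarizing lead time, and time to restore is assumed to run from the failed deployment to the recorded restore time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record layout)."""
    commit_at: datetime                     # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # led to a rollback or incident
    restored_at: Optional[datetime] = None  # when service was restored


def dora_summary(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics for a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_times = [d.deployed_at - d.commit_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at
                     for d in failures if d.restored_at is not None]

    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment has a recorded restore time
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    history = [
        Deployment(t0, t0 + timedelta(hours=2)),
        Deployment(t0, t0 + timedelta(hours=4), failed=True,
                   restored_at=t0 + timedelta(hours=4, minutes=30)),
        Deployment(t0, t0 + timedelta(hours=6)),
    ]
    for name, value in dora_summary(history, period_days=1).items():
        print(f"{name}: {value}")
```

Storing the summary per period (for example per month) gives the trend data needed for the quarterly review mentioned in the last question.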
Recommended Entry Path (Prioritized Action Table)
| Priority | Action | Why First | Control |
|---|---|---|---|
| 1 | Build CI/CD pipeline for all production workloads | Blocks all other OpsEx improvements; without a pipeline there is no automation path | WAF-OPS-010 |
| 2 | Enable structured logging + log aggregation | Without logs every incident diagnosis is guesswork; quickly implementable | WAF-OPS-030 |
| 3 | Symptom-based alerts + runbooks for top-5 incidents | Immediately reduces MTTR; prevents alert fatigue | WAF-OPS-040, WAF-OPS-060 |
| 4 | IaC for all production resources | Enables drift detection, reproducible environments, safe changes | WAF-OPS-020 |
| 5 | Introduce postmortem process | Breaks repeat-incident cycles; cultural change starts here | WAF-OPS-070 |
| 6 | Populate Operational Debt Register | Makes accumulating debt visible; foundation for prioritization | WAF-OPS-100 |
| 7 | Formalize Change Management | Reduces Change Failure Rate; foundation for deployment freeze and risk assessment | WAF-OPS-050 |
| 8 | Progressive Delivery (Canary/Feature Flags) | Reduces blast radius; enables safe deployment without fear | WAF-OPS-080 |
| 9 | Automate drift detection | Closes the gap between IaC and actual infrastructure | WAF-OPS-090 |