Maturity Model: Operational Excellence

The 5-Level OpsEx Maturity Model

Level	Name	Characteristics
1	Reactive & Heroic	Deployments manual. No IaC. Logging unstructured. Alerts on CPU/RAM. Incidents resolved by heroes. No systematic learning. MTTR: hours to days. Deployment Frequency: weekly to monthly.
2	Documented	Basic CI/CD present. Parts of infrastructure as IaC. Runbooks for the worst scenarios. Informal incident reviews. MTTR: 1–4 hours. Deployment Frequency: daily to weekly.
3	Automated	Complete CI/CD pipeline. All infrastructure as IaC. Structured logging. Symptom-based alerting with runbooks. Blameless postmortems. MTTR: 30–60 minutes. Deployment Frequency: daily.
4	Measured	DORA metrics are measured and improved. SLO-based alerting. Drift detection automated. Feature Flags in use. Operational Debt Register maintained. MTTR: < 30 minutes. Deployment Frequency: multiple times daily.
5	Continuously Improved	Hundreds of deployments per day possible. Change Failure Rate < 5%. Toil < 20% of engineering time. Full observability correlation. Automated drift remediation. Preventive learning from incidents. MTTR: < 1 hour. Deployment Frequency: on-demand.

Per-Control Maturity Table

Control	Level 1	Level 2	Level 3	Level 4	Level 5
WAF-OPS-010 – CI/CD Pipeline	No pipeline	Basic CI	Complete CI/CD	Metrics & Canary	Continuous Deploy
WAF-OPS-020 – IaC	No IaC	Inconsistent	Fully enforced	Drift Detection	GitOps
WAF-OPS-030 – Observability	Unstructured	Centralized	All 3 pillars	SLO-based	OpenTelemetry
WAF-OPS-040 – Alerting	None/Noise	Basic alerting	Symptom-based	Burn Rate Alerts	Auto-optimization
WAF-OPS-050 – Change Mgmt	No control	Basic review	Change process	Auto risk scoring	Continuous Deploy
WAF-OPS-060 – Runbooks	No runbooks	Basic runbooks	All linked	Metrics & coverage	Self-service
WAF-OPS-070 – Postmortems	No process	Informal	Structured	Systemic analysis	Org. learning
WAF-OPS-080 – Safe Deploy	Big-bang	Basic safety	Progressive Delivery	Auto-rollback	Experiment platform
WAF-OPS-090 – Drift	No detection	Ad-hoc	Auto-detection	SLA enforcement	Auto-remediation
WAF-OPS-100 – Ops Debt	No tracking	Informal	Register maintained	Debt program	Continuous Improvement

Self-Assessment Checklist Level 2

The following questions help determine whether Level 2 has been reached:

CI/CD & Deployments

Is there a CI pipeline that runs tests on pull requests?
Are deployment scripts versioned and documented?
Are deployments to staging and production handled separately?

Infrastructure

Are the most important production resources defined as IaC?
Is there a remote state backend (not local state)?
Are IaC changes reviewed via pull request?

Observability

Are logs centrally aggregated (e.g., CloudWatch, Azure Monitor, Elasticsearch)?
Are there basic dashboards showing CPU, memory, and request counts?
Do critical errors trigger notifications via email or Slack?

Runbooks & Documentation

Are there runbooks for the 3 most common incident types?
Is there a deployment runbook with rollback procedure?
Are on-call escalation paths documented?

Self-Assessment Checklist Level 3

CI/CD & Deployments

Are ALL production deployments automated (no manual path)?
Are branch protection and approval requirements configured?
Are there approval gates before production deployments?
Are pipeline definitions in version control (YAML, HCL)?

Infrastructure

Is 100% of production infrastructure defined as IaC?
Are manual console changes restricted via IAM/SCP?
Is there automated drift detection (at least daily)?
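
A daily drift check could be sketched as follows. The checklist does not name an IaC tool, so this sketch assumes Terraform, whose `terraform plan -detailed-exitcode` command exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The names `DriftStatus`, `classify_plan_exit_code`, and `check_drift` are hypothetical. Note that a non-empty plan can mean either real drift or configuration changes that have not been applied yet.

```python
import subprocess
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = "in_sync"
    DRIFTED = "drifted"
    ERROR = "error"


def classify_plan_exit_code(exit_code: int) -> DriftStatus:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status.

    0 = no changes, 2 = changes present, anything else = the plan failed.
    """
    if exit_code == 0:
        return DriftStatus.IN_SYNC
    if exit_code == 2:
        return DriftStatus.DRIFTED
    return DriftStatus.ERROR


def check_drift(workdir: str) -> DriftStatus:
    """Run a read-only plan in `workdir` (assumes `terraform init` already ran there)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduler (cron, a nightly pipeline job) could call `check_drift` per environment and open a ticket or page whenever the result is not `DriftStatus.IN_SYNC`.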

Observability

Do all services emit structured JSON logs with trace ID?
Is distributed tracing configured and instrumented?
Are RED metrics (Rate, Errors, Duration) exported for all services?
Are alerts symptom-based (error rate, latency, availability)?
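
As an illustration of the first question above, here is a minimal sketch of structured JSON logging with a trace ID, using only the Python standard library. The field names (`timestamp`, `level`, `logger`, `message`, `trace_id`) and the helper `build_logger` are assumptions made for this example, not a prescribed schema; in practice the trace ID would usually come from the tracing library rather than being passed by hand.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is supplied by the caller via `extra`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Usage: `build_logger("checkout").info("payment failed", extra={"trace_id": "abc123"})` emits one JSON line whose `trace_id` field lets the log entry be correlated with the matching distributed trace.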

Runbooks & Change Management

Are all paging alerts linked to runbooks?
Is a change management process with risk assessment defined?
Are deployment freeze policies in place for critical periods?

Postmortems

Is there a defined postmortem process for SEV-1 incidents?
Are at least 3 postmortems from the last 6 months documented?
Are action items from postmortems tracked?

Self-Assessment Checklist Level 4

DORA Metrics

Is deployment frequency measured and reported?
Is lead time for changes (commit to production) measured?
Is MTTR (Mean Time to Restore) captured per incident?
Is Change Failure Rate (share of deployments that lead to a rollback or incident) captured?
Are DORA trends reviewed quarterly?
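
The four metrics asked about above can be derived from a simple deployment log. The following is a sketch under stated assumptions: the `Deployment` record and `dora_summary` function are hypothetical, lead time is taken as the median of commit-to-deploy durations, and time to restore is measured from the failed deployment to recovery (real setups often measure from incident detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    commit_time: datetime
    deploy_time: datetime
    failed: bool = False                       # led to a rollback or incident
    restored_time: Optional[datetime] = None   # when service was restored


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics for deployments within a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_time - d.deploy_time for d in failures if d.restored_time
    ]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deploy_time - d.commit_time for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times
            else None
        ),
    }
```

Running this per quarter and storing the results gives the trend data needed for the quarterly review.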

Progressive Delivery

Are canary or blue/green deployments used for all services?
Is automatic rollback configured for when the error rate increases?
Are feature flags used for new features?
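
The automatic rollback asked about above boils down to comparing the canary's error rate against the stable baseline. The sketch below shows one possible decision rule; the names `WindowStats` and `canary_decision` and the default thresholds (`max_ratio`, `min_requests`, `abs_floor`) are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request and error counts observed in one evaluation window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: WindowStats,
    canary: WindowStats,
    max_ratio: float = 2.0,     # canary may be at most 2x worse than baseline
    min_requests: int = 100,    # below this, the sample is too small to judge
    abs_floor: float = 0.01,    # never roll back below a 1% error rate
) -> str:
    """Return "wait", "rollback", or "promote" for the current canary window."""
    if canary.requests < min_requests:
        return "wait"
    threshold = max(baseline.error_rate * max_ratio, abs_floor)
    return "rollback" if canary.error_rate > threshold else "promote"
```

The absolute floor keeps a near-zero baseline from triggering rollbacks on a handful of errors, and the minimum request count avoids deciding on statistical noise.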

Operational Debt

Is an Operational Debt Register version-controlled and current?
Does a quarterly debt review take place?
Is sprint capacity for debt reduction explicitly allocated (at least 10%)?
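
Taken together, the Level 2 to Level 4 checklists above can be turned into a simple score. The sketch below assumes a strict rule that the source does not state explicitly: a level counts as reached only if every question for that level and for all lower levels is answered "yes". The function name `assessed_level` and the input shape are hypothetical.

```python
def assessed_level(answers: dict[int, list[bool]]) -> int:
    """Return the highest maturity level whose checklist is fully satisfied.

    `answers` maps a level (2, 3, 4) to the yes/no answers for that level's
    checklist. Level 1 is the starting point and needs no answers.
    """
    level = 1
    for candidate in sorted(answers):
        checklist = answers[candidate]
        # Levels must be reached in order, and every question must be a "yes".
        if candidate != level + 1 or not checklist or not all(checklist):
            break
        level = candidate
    return level
```

For example, a team that answers every Level 2 question with "yes" but misses one Level 3 question is assessed at Level 2, regardless of its Level 4 answers.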

Recommended Entry Path (Prioritized Action Table)

Priority	Action	Why First	Control
1	Build CI/CD pipeline for all production workloads	Prerequisite for all other OpsEx improvements; without a pipeline there is no automation path	WAF-OPS-010
2	Enable structured logging + log aggregation	Without logs every incident diagnosis is guesswork; quick to implement	WAF-OPS-030
3	Symptom-based alerts + runbooks for top-5 incidents	Immediately reduces MTTR; prevents alert fatigue	WAF-OPS-040, WAF-OPS-060
4	IaC for all production resources	Enables drift detection, reproducible environments, safe changes	WAF-OPS-020
5	Introduce postmortem process	Breaks repeat-incident cycles; cultural change starts here	WAF-OPS-070
6	Populate Operational Debt Register	Makes accumulating debt visible; foundation for prioritization	WAF-OPS-100
7	Formalize Change Management	Reduces Change Failure Rate; foundation for deployment freeze and risk assessment	WAF-OPS-050
8	Progressive Delivery (Canary/Feature Flags)	Reduces blast radius; enables safe deployment without fear	WAF-OPS-080
9	Automate drift detection	Closes the gap between IaC and actual infrastructure	WAF-OPS-090
