Maturity Model: Operational Excellence

The 5-Level OpsEx Maturity Model

| Level | Name | Characteristics |
|---|---|---|
| 1 | Reactive & Heroic | Deployments manual. No IaC. Logging unstructured. Alerts on CPU/RAM. Incidents resolved by heroes. No systematic learning. MTTR: hours to days. Deployment Frequency: weekly to monthly. |
| 2 | Documented | Basic CI/CD present. Parts of infrastructure as IaC. Runbooks for the worst scenarios. Informal incident reviews. MTTR: 1–4 hours. Deployment Frequency: daily to weekly. |
| 3 | Automated | Complete CI/CD pipeline. All infrastructure as IaC. Structured logging. Symptom-based alerting with runbooks. Blameless postmortems. MTTR: 30–60 minutes. Deployment Frequency: daily. |
| 4 | Measured | DORA metrics are measured and improved. SLO-based alerting. Drift detection automated. Feature Flags in use. Operational Debt Register maintained. MTTR: < 30 minutes. Deployment Frequency: multiple times daily. |
| 5 | Continuously Improved | Deployment possible hundreds of times daily. Change Failure Rate < 5%. Toil < 20% of engineering time. Full observability correlation. Automated drift remediation. Learning from incidents preventively. MTTR: < 1 hour. Deployment Frequency: on-demand. |
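
Level 4 names SLO-based alerting without spelling out the arithmetic. One common form is burn-rate alerting, where the observed error ratio is divided by the error budget (1 minus the SLO target). The sketch below is an illustration under assumptions, not part of the model: the function names, the 99.9% default target, and the 14.4 paging threshold (a value often quoted for a one-hour window against a 30-day budget) are all hypothetical choices.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is consumed (1.0 means exactly on budget)."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(short_window_ratio: float, long_window_ratio: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both windows burn faster than the threshold.

    Requiring a short and a long window to agree is an assumed design choice
    that filters out brief spikes; the threshold value is illustrative.
    """
    return (burn_rate(short_window_ratio, slo_target) >= threshold
            and burn_rate(long_window_ratio, slo_target) >= threshold)
```

With a 99.9% target, a sustained 2% error ratio in both windows would page, while a spike that only shows up in the short window would not.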

Per-Control Maturity Table

| Control | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|
| WAF-OPS-010 – CI/CD Pipeline | No pipeline | Basic CI | Complete CI/CD | Metrics & Canary | Continuous Deploy |
| WAF-OPS-020 – IaC | No IaC | Inconsistent | Fully enforced | Drift Detection | GitOps |
| WAF-OPS-030 – Observability | Unstructured | Centralized | All 3 pillars | SLO-based | OpenTelemetry |
| WAF-OPS-040 – Alerting | None/Noise | Basic alerting | Symptom-based | Burn Rate Alerts | Auto-optimization |
| WAF-OPS-050 – Change Mgmt | No control | Basic review | Change process | Auto risk scoring | Continuous Deploy |
| WAF-OPS-060 – Runbooks | No runbooks | Basic runbooks | All linked | Metrics & coverage | Self-service |
| WAF-OPS-070 – Postmortems | No process | Informal | Structured | Systemic analysis | Org. learning |
| WAF-OPS-080 – Safe Deploy | Big-bang | Basic safety | Progressive Delivery | Auto-rollback | Experiment platform |
| WAF-OPS-090 – Drift | No detection | Ad-hoc | Auto-detection | SLA enforcement | Auto-remediation |
| WAF-OPS-100 – Ops Debt | No tracking | Informal | Register maintained | Debt program | Continuous Improvement |
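
The per-control table can be used as a lookup: given a team's current level per control, read off the next target. The sketch below encodes the table as data and does exactly that. The function names and the idea of highlighting the lowest-rated controls are assumptions made for illustration; the model itself does not prescribe how to aggregate levels.

```python
# Level labels per control, copied from the per-control maturity table (index 0 = Level 1).
MATURITY = {
    "WAF-OPS-010": ["No pipeline", "Basic CI", "Complete CI/CD", "Metrics & Canary", "Continuous Deploy"],
    "WAF-OPS-020": ["No IaC", "Inconsistent", "Fully enforced", "Drift Detection", "GitOps"],
    "WAF-OPS-030": ["Unstructured", "Centralized", "All 3 pillars", "SLO-based", "OpenTelemetry"],
    "WAF-OPS-040": ["None/Noise", "Basic alerting", "Symptom-based", "Burn Rate Alerts", "Auto-optimization"],
    "WAF-OPS-050": ["No control", "Basic review", "Change process", "Auto risk scoring", "Continuous Deploy"],
    "WAF-OPS-060": ["No runbooks", "Basic runbooks", "All linked", "Metrics & coverage", "Self-service"],
    "WAF-OPS-070": ["No process", "Informal", "Structured", "Systemic analysis", "Org. learning"],
    "WAF-OPS-080": ["Big-bang", "Basic safety", "Progressive Delivery", "Auto-rollback", "Experiment platform"],
    "WAF-OPS-090": ["No detection", "Ad-hoc", "Auto-detection", "SLA enforcement", "Auto-remediation"],
    "WAF-OPS-100": ["No tracking", "Informal", "Register maintained", "Debt program", "Continuous Improvement"],
}


def next_steps(current: dict[str, int]) -> dict[str, str]:
    """Map each control below Level 5 to the label of its next level."""
    steps = {}
    for control, level in current.items():
        labels = MATURITY[control]
        if not 1 <= level <= len(labels):
            raise ValueError(f"{control}: level must be between 1 and {len(labels)}")
        if level < len(labels):
            # labels is zero-indexed, so labels[level] is the label of level + 1.
            steps[control] = f"Level {level + 1}: {labels[level]}"
    return steps


def weakest_controls(current: dict[str, int]) -> list[str]:
    """Return the controls sharing the lowest assessed level (an assumed heuristic)."""
    lowest = min(current.values())
    return sorted(control for control, level in current.items() if level == lowest)
```

For example, `next_steps({"WAF-OPS-010": 2})` points at "Level 3: Complete CI/CD" as the next target for the CI/CD pipeline control.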

Self-Assessment Checklist Level 2

The following questions help determine whether Level 2 has been reached:

CI/CD & Deployments

  • Is there a CI pipeline that runs tests on pull requests?

  • Are deployment scripts versioned and documented?

  • Are deployments to staging and production handled separately?

Infrastructure

  • Are the most important production resources defined as IaC?

  • Is there a remote state backend (not local state)?

  • Are IaC changes reviewed via pull request?

Observability

  • Are logs centrally aggregated (e.g., CloudWatch, Azure Monitor, Elasticsearch)?

  • Are there basic dashboards showing CPU, memory, and request counts?

  • Do critical errors trigger notifications via email or Slack?

Runbooks & Documentation

  • Are there runbooks for the 3 most common incident types?

  • Is there a deployment runbook with rollback procedure?

  • Are on-call escalation paths documented?

Self-Assessment Checklist Level 3

CI/CD & Deployments

  • Are ALL production deployments automated (no manual path)?

  • Are branch protection and approval requirements configured?

  • Are there approval gates before production deployments?

  • Are pipeline definitions kept in version control (e.g., as YAML or HCL)?

Infrastructure

  • Is 100% of production infrastructure defined as IaC?

  • Are manual console changes restricted via IAM/SCP?

  • Is there automated drift detection (at least daily)?
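
Conceptually, the drift detection asked about above is a comparison between what the IaC declares and what actually exists. The following sketch shows that comparison on plain dictionaries. The data shapes and the `detect_drift` name are hypothetical; in practice the IaC tool's own plan or diff output would typically supply this information.

```python
def detect_drift(declared: dict[str, dict], actual: dict[str, dict]) -> dict:
    """Compare declared resources with observed ones.

    Both arguments map a resource ID to its attributes. The result lists
    resources missing from reality, resources nobody declared, and
    attribute-level differences as (declared, actual) pairs.
    """
    changed = {}
    missing = []
    for resource_id, attrs in declared.items():
        if resource_id not in actual:
            missing.append(resource_id)
            continue
        observed = actual[resource_id]
        diff = {key: (value, observed.get(key))
                for key, value in attrs.items()
                if observed.get(key) != value}
        if diff:
            changed[resource_id] = diff
    return {
        "missing": sorted(missing),
        "unmanaged": sorted(set(actual) - set(declared)),
        "changed": changed,
    }


def has_drift(report: dict) -> bool:
    """True when any category of the report is non-empty."""
    return any(report[key] for key in ("missing", "unmanaged", "changed"))
```

A scheduled job (daily at minimum, per the checklist) could run such a comparison and raise an alert whenever `has_drift` returns true.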

Observability

  • Do all services emit structured JSON logs with trace ID?

  • Is distributed tracing configured and instrumented?

  • Are RED metrics (Rate, Errors, Duration) exported for all services?

  • Are alerts symptom-based (error rate, latency, availability)?
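
As an illustration of the first question in this list, the sketch below emits structured JSON logs that carry a trace ID, using only the Python standard library. The field names (`timestamp`, `level`, `logger`, `message`, `trace_id`) and the `checkout` logger name are assumptions; a real service would usually take the trace ID from its tracing library rather than generate one.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives via the `extra` argument; None if the caller omitted it.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr (hypothetical service name)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    trace_id = uuid.uuid4().hex  # placeholder; normally propagated from the incoming request
    log.info("payment authorized", extra={"trace_id": trace_id})
```

Because every line is a JSON object with the same keys, a log aggregator can filter by `trace_id` and join the entries with the corresponding distributed trace.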

Runbooks & Change Management

  • Are all paging alerts linked to runbooks?

  • Is a change management process with risk assessment defined?

  • Are deployment freeze policies in place for critical periods?

Postmortems

  • Is there a defined postmortem process for SEV-1 incidents?

  • Are at least 3 postmortems from the last 6 months documented?

  • Are action items from postmortems tracked?

Self-Assessment Checklist Level 4

DORA Metrics

  • Is deployment frequency measured and reported?

  • Is lead time for changes (commit to production) measured?

  • Is MTTR (Mean Time to Restore) captured per incident?

  • Is Change Failure Rate (deployments with rollback/incident) captured?

  • Are DORA trends reviewed quarterly?
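
The four metrics in this list can be derived from a simple record per deployment. The sketch below is one possible calculation, not a prescribed one. It assumes a hypothetical `Deployment` record, treats a deployment as failed if it caused a rollback or incident, and measures time to restore from the failed deployment to its restoration, which is a simplification of incident-based MTTR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                   # commit time of the change
    deployed_at: datetime                    # time the change reached production
    failed: bool = False                     # caused a rollback or incident
    restored_at: Optional[datetime] = None   # service restored (failed deployments only)


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics for one reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]

    mttr = None
    if restore_times:
        mttr = sum(restore_times, timedelta()) / len(restore_times)

    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time": timedelta(seconds=median(lead_seconds)),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": mttr,
    }
```

Running this per quarter and storing the results would give the trend data that the quarterly review question asks for.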

Progressive Delivery

  • Are canary or blue/green deployments used for all services?

  • Is automatic rollback configured to trigger when the error rate increases?

  • Are feature flags used for new features?
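
The rollback question above implies a decision rule that compares the new version's error rate with the stable baseline. The sketch below shows one such rule for a canary deployment. All names and thresholds (`max_ratio`, `min_requests`, `absolute_floor`) are hypothetical defaults chosen for illustration; suitable values depend on the service.

```python
def should_rollback(baseline_errors: int, baseline_requests: int,
                    canary_errors: int, canary_requests: int,
                    max_ratio: float = 2.0, min_requests: int = 100,
                    absolute_floor: float = 0.01) -> bool:
    """Decide whether a canary should be rolled back.

    Rolls back when the canary error rate is at least `absolute_floor` and
    more than `max_ratio` times the baseline error rate. With fewer than
    `min_requests` canary requests the function makes no decision yet.
    """
    if canary_requests < min_requests:
        return False  # not enough traffic to judge
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    if canary_rate < absolute_floor:
        return False  # error rate is negligible in absolute terms
    return canary_rate > baseline_rate * max_ratio
```

The absolute floor keeps a very quiet baseline from triggering rollbacks on a handful of errors, and the minimum request count avoids decisions based on too small a sample.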

Operational Debt

  • Is an Operational Debt Register version-controlled and current?

  • Does a quarterly debt review take place?

  • Is at least 10% of sprint capacity explicitly allocated to debt reduction?

Recommended Entry Path (Prioritized Action Table)

| Priority | Action | Why First | Control |
|---|---|---|---|
| 1 | Build CI/CD pipeline for all production workloads | Blocks all other OpsEx improvements; without a pipeline there is no automation path | WAF-OPS-010 |
| 2 | Enable structured logging + log aggregation | Without logs every incident diagnosis is guesswork; quickly implementable | WAF-OPS-030 |
| 3 | Symptom-based alerts + runbooks for top-5 incidents | Immediately reduces MTTR; prevents alert fatigue | WAF-OPS-040, WAF-OPS-060 |
| 4 | IaC for all production resources | Enables drift detection, reproducible environments, safe changes | WAF-OPS-020 |
| 5 | Introduce postmortem process | Breaks repeat-incident cycles; cultural change starts here | WAF-OPS-070 |
| 6 | Populate Operational Debt Register | Makes accumulating debt visible; foundation for prioritization | WAF-OPS-100 |
| 7 | Formalize Change Management | Reduces Change Failure Rate; foundation for deployment freeze and risk assessment | WAF-OPS-050 |
| 8 | Progressive Delivery (Canary/Feature Flags) | Reduces blast radius; enables safe deployment without fear | WAF-OPS-080 |
| 9 | Automate drift detection | Closes the gap between IaC and actual infrastructure | WAF-OPS-090 |