WAF++

Maturity Model: Reliability

The Reliability maturity model describes five stages from a chaotic initial state to self-healing, adaptive infrastructure. It serves as a self-assessment tool and roadmap for structured improvement.

Overview of the 5 Stages

| Stage | Name | Characteristics |
|-------|------|-----------------|
| 1 | Chaotic | No SLOs, no health checks, single-AZ, no backups tested. Incidents handled reactively. Reliability is not measured. |
| 2 | Documented | SLOs documented, basic monitoring, backups configured, on-call in place – but little of it tested or automated. |
| 3 | Enforced | IaC-enforced controls: Multi-AZ, Health Checks, Circuit Breakers. Backups tested, DR plan in place, runbooks linked for critical alerts. |
| 4 | Measured | SLOs with error budget, MTTR tracked, chaos tests quarterly, automated DR runbooks, Reliability Debt Register actively managed. |
| 5 | Self-Healing | Automatic remediation, adaptive SLOs, continuous chaos validation, GameDay culture, reliability as a board-level metric. |

Per-Control Maturity Overview

| Control | L1 | L2 | L3 | L4 | L5 |
|---------|----|----|----|----|----|
| REL-010 – SLO & SLA | No SLOs | Documented | Monitored + Alerts | Error Budget Active | Adaptive + Predictive |
| REL-020 – Health Checks | None | LB Checks | ReadinessProbe + LivenessProbe | StartupProbe + Deep Checks | Synthetic Monitoring |
| REL-030 – Multi-AZ | Single-AZ | DB Multi-AZ | Everything Multi-AZ | Auto-Failover Tested | Multi-Region |
| REL-040 – Backup & Recovery | No Backups | Backups present, untested | PITR + Restore Test | Automated Monthly | WORM + CDP |
| REL-050 – Circuit Breaker | No Timeouts | Timeouts Configured | Circuit Breaker + Retry | Bulkheads + Service Mesh | Adaptive Thresholds |
| REL-060 – Incident Response | Ad-hoc | Process Documented | Runbooks Linked, MTTR Tracked | Automated Triage | Self-Healing + AIOps |
| REL-070 – DR Testing | No DR Plan | Annual Test | Semi-Annual Documented | Quarterly Automated | GameDay + Continuous |
| REL-080 – Dependency Resilience | No Inventory | Basic Inventory | Classified + CB | Auto-Discovery + Monitoring | Proactive Risk Management |
| REL-090 – Chaos Engineering | No Chaos | Ad-hoc Tests | Structured + Documented | Production Chaos Controlled | Continuous + ML |
| REL-100 – Reliability Debt | No Tracking | Ad-hoc Notes | Formal Register + Quarterly Review | Integrated in ADRs | Automated Detection |
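
The REL-050 row moves from plain timeouts (L2) to a circuit breaker (L3). As a rough illustration of what that step means, here is a minimal circuit breaker sketch in Python. The class name, the default threshold, and the reset timeout are assumptions made for this example, not values prescribed by the model; the wrapped call is expected to carry its own explicit timeout.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker; thresholds are hypothetical defaults."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be unhealthy.
                raise CircuitOpenError("dependency call rejected: circuit open")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # A successful call closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outgoing dependency call, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical function that already sets an HTTP timeout. Production systems would more likely rely on a maintained library or a service mesh than on a hand-written class like this.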

Self-Assessment Checklist: Level 2

Reaching level 2 requires:

  • Availability SLO for all production workloads documented (WAF-REL-010)

  • Latency SLO (p99) for all HTTP services documented

  • At least one monitoring dashboard per service

  • ALB/NLB health checks configured (WAF-REL-020)

  • RDS or equivalent database in Multi-AZ mode (WAF-REL-030)

  • Automated backups for all production databases enabled (WAF-REL-040)

  • Backup retention period >= 7 days

  • On-call rotation configured (WAF-REL-060)

  • Severity definitions (SEV1–SEV4) documented

  • Basic dependency inventory in place (WAF-REL-080)
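
Several of the Level 2 items above are simple configuration facts that can be checked mechanically. The following sketch assumes a hypothetical `DatabaseConfig` record filled from whatever inventory or cloud API is available; the field names are illustrative, and only the 7-day retention minimum comes from the checklist itself.

```python
from dataclasses import dataclass

MIN_RETENTION_DAYS = 7  # from the Level 2 checklist


@dataclass
class DatabaseConfig:
    """Hypothetical inventory record for one production database."""
    name: str
    multi_az: bool
    automated_backups: bool
    backup_retention_days: int


def level2_findings(db: DatabaseConfig) -> list[str]:
    """Return the Level 2 database items this configuration does not meet."""
    findings = []
    if not db.multi_az:
        findings.append(f"{db.name}: not in Multi-AZ mode (WAF-REL-030)")
    if not db.automated_backups:
        findings.append(f"{db.name}: automated backups disabled (WAF-REL-040)")
    elif db.backup_retention_days < MIN_RETENTION_DAYS:
        findings.append(
            f"{db.name}: backup retention {db.backup_retention_days} days, "
            f"minimum is {MIN_RETENTION_DAYS}"
        )
    return findings


if __name__ == "__main__":
    orders = DatabaseConfig("orders-db", multi_az=True,
                            automated_backups=True, backup_retention_days=3)
    for finding in level2_findings(orders):
        print(finding)
```

An empty result means the database-related Level 2 items hold for that record. The remaining items (SLO documents, dashboards, on-call rotation, severity definitions) still need a manual review.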

Self-Assessment Checklist: Level 3

Reaching level 3 requires, in addition to all Level 2 items:

  • SLO monitoring with automatic alerting configured (WAF-REL-010)

  • Error budget burn rate alerts configured

  • readinessProbe and livenessProbe for all Kubernetes workloads (WAF-REL-020)

  • All compute resources distributed across at least 2 AZs (WAF-REL-030)

  • PITR for all production databases enabled (WAF-REL-040)

  • Backup restore tested and documented at least once

  • Explicit timeouts for all outgoing HTTP calls (WAF-REL-050)

  • Circuit breaker configured for all critical dependencies

  • All critical alerts have linked runbooks (WAF-REL-060)

  • Post-incident reviews for all SEV1/SEV2 incidents documented

  • DR plan documented with RTO/RPO targets (WAF-REL-070)

  • DR test conducted at least once with results documented

  • Dependency inventory with criticality assessment (WAF-REL-080)

  • Reliability Debt Register introduced (WAF-REL-100)
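
The error budget burn rate alerts in the list above rest on a small calculation: the error budget is the share of requests the SLO allows to fail (1 minus the SLO target), and the burn rate is the observed error rate divided by that budget. A minimal sketch follows, in which the function names and the alert threshold are assumptions chosen for illustration:

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate relative to the budget; 1.0 means burning exactly on budget."""
    budget = error_budget(slo_target)
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    return (failed / total) / budget


def should_alert(failed: int, total: int, slo_target: float, threshold: float) -> bool:
    """True when the window burns budget faster than the (hypothetical) threshold."""
    return burn_rate(failed, total, slo_target) >= threshold


if __name__ == "__main__":
    # 50 failures in 10,000 requests against a 99.9% availability SLO
    print(round(burn_rate(50, 10_000, 0.999), 2))
```

With a 99.9% SLO, 50 failed requests out of 10,000 give an error rate of 0.5% against a 0.1% budget, so the budget is being consumed roughly five times faster than planned. Suitable thresholds and window lengths depend on the SLO period and are left to the error budget policy.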

Self-Assessment Checklist: Level 4

Reaching level 4 requires, in addition to all Level 3 items:

  • Error budget policy documented and followed (WAF-REL-010)

  • Synthetic monitoring for all production endpoints (WAF-REL-020)

  • AZ failover tests conducted semi-annually (WAF-REL-030)

  • Automated monthly backup restore test (WAF-REL-040)

  • Bulkhead isolation for different dependency classes (WAF-REL-050)

  • Automated incident diagnostic data collection on alert (WAF-REL-060)

  • DR procedures automated via IaC; quarterly test (WAF-REL-070)

  • Dependency SLA compliance monitored in real time (WAF-REL-080)

  • Structured chaos experiments with documentation quarterly (WAF-REL-090)

  • Reliability Debt Register integrated into architecture governance (WAF-REL-100)
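
Because each checklist builds on the previous one, the level reached is the highest level whose checklist, and every checklist below it, is fully satisfied. One way to express that rule is sketched below; the function name and the input shape (a mapping from level to a list of yes/no answers) are assumptions for this example, and Level 5 is left out because this page provides no checklist for it.

```python
def achieved_level(results: dict[int, list[bool]]) -> int:
    """Return the highest level whose checklist and all lower ones are fully met.

    `results` maps a level (2, 3, 4) to the yes/no answers for that checklist.
    """
    level = 1  # Stage 1 has no checklist; it is the starting point
    for candidate in (2, 3, 4):
        items = results.get(candidate, [])
        if not items or not all(items):
            break  # a gap at this level blocks every higher level
        level = candidate
    return level


if __name__ == "__main__":
    answers = {
        2: [True] * 10,           # all ten Level 2 items met
        3: [True] * 13 + [False], # one Level 3 item still open
        4: [True] * 10,
    }
    print(achieved_level(answers))
```

In this example the result is 2: the open Level 3 item caps the assessment even though every Level 4 item is marked as met.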

Recommended Entry Path

| Transition | Focus | Recommended Actions (3–6 Months) |
|------------|-------|----------------------------------|
| 1 → 2 | Document | SLO workshop, retrofit health checks, configure on-call, enable backups |
| 2 → 3 | Enforce | IaC refactoring (Multi-AZ, Circuit Breaker), test backup restore, write runbooks |
| 3 → 4 | Measure | Activate error budget, MTTR dashboard, automate DR, start chaos program |
| 4 → 5 | Self-Heal | AIOps integration, continuous chaos validation, adaptive SLOs, GameDay culture |

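The greatest leverage typically lies in the move from level 2 to level 3: IaC-enforced controls are usually the most effective investment in reliability, because a control defined in code stays in place until someone changes the code, so a regression shows up as a visible change instead of silent drift.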