Maturity Model: Reliability
The Reliability maturity model describes five stages from a chaotic initial state to self-healing, adaptive infrastructure. It serves as a self-assessment tool and roadmap for structured improvement.
Overview of the 5 Stages
| Stage | Name | Characteristics |
|---|---|---|
| 1 | Chaotic | No SLOs, no health checks, single-AZ, no backups tested. Incidents handled reactively. Reliability is not measured. |
| 2 | Documented | SLOs documented, basic monitoring, backups configured, on-call in place – but little of it tested or automated. |
| 3 | Enforced | IaC-enforced controls: Multi-AZ, Health Checks, Circuit Breakers. Backups tested, DR plan in place, runbooks linked for critical alerts. |
| 4 | Measured | SLOs with error budget, MTTR tracked, chaos tests quarterly, automated DR runbooks, Reliability Debt Register actively managed. |
| 5 | Self-Healing | Automatic remediation, adaptive SLOs, continuous chaos validation, GameDay culture, reliability as a board-level metric. |

Per-Control Maturity Overview
| Control | L1 | L2 | L3 | L4 | L5 |
|---|---|---|---|---|---|
| WAF-REL-010 | No SLOs | Documented | Monitored + Alerts | Error Budget Active | Adaptive + Predictive |
| WAF-REL-020 | None | LB Checks | ReadinessProbe + LivenessProbe | StartupProbe + Deep Checks | Synthetic Monitoring |
| WAF-REL-030 | Single-AZ | DB Multi-AZ | Everything Multi-AZ | Auto-Failover Tested | Multi-Region |
| WAF-REL-040 | No Backups | Backups present, untested | PITR + Restore Test | Automated Monthly | WORM + CDP |
| WAF-REL-050 | No Timeouts | Timeouts Configured | Circuit Breaker + Retry | Bulkheads + Service Mesh | Adaptive Thresholds |
| WAF-REL-060 | Ad-hoc | Process Documented | Runbooks Linked, MTTR Tracked | Automated Triage | Self-Healing + AIOps |
| WAF-REL-070 | No DR Plan | Annual Test | Semi-Annual Documented | Quarterly Automated | GameDay + Continuous |
| WAF-REL-080 | No Inventory | Basic Inventory | Classified + CB | Auto-Discovery + Monitoring | Proactive Risk Management |
| WAF-REL-090 | No Chaos | Ad-hoc Tests | Structured + Documented | Production Chaos Controlled | Continuous + ML |
| WAF-REL-100 | No Tracking | Ad-hoc Notes | Formal Register + Quarterly Review | Integrated in ADRs | Automated Detection |

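The model does not say how to combine per-control levels into a single stage. One common convention, assumed here and not taken from the source, is "weakest link": the overall stage equals the lowest level reached by any control. A minimal sketch of that roll-up, using a hypothetical assessment keyed by the control IDs from the checklists:

```python
from typing import Dict, List

STAGE_NAMES = {1: "Chaotic", 2: "Documented", 3: "Enforced", 4: "Measured", 5: "Self-Healing"}


def overall_stage(control_levels: Dict[str, int]) -> int:
    """Return the overall stage as the lowest per-control level.

    Assumption: a weakest-link roll-up. The maturity model itself does not
    prescribe an aggregation rule.
    """
    if not control_levels:
        raise ValueError("at least one control level is required")
    for control, level in control_levels.items():
        if level not in STAGE_NAMES:
            raise ValueError(f"{control}: level must be 1-5, got {level}")
    return min(control_levels.values())


def lagging_controls(control_levels: Dict[str, int]) -> List[str]:
    """Return the controls that hold the overall stage down, sorted by ID."""
    lowest = overall_stage(control_levels)
    return sorted(c for c, lvl in control_levels.items() if lvl == lowest)


# Hypothetical self-assessment result (values are illustrative only).
assessment = {
    "WAF-REL-010": 3,
    "WAF-REL-020": 3,
    "WAF-REL-030": 2,
    "WAF-REL-040": 4,
}

stage = overall_stage(assessment)
print(stage, STAGE_NAMES[stage], lagging_controls(assessment))
```

Listing the lagging controls alongside the stage makes the next improvement target explicit instead of hiding it behind an average.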
Self-Assessment Checklist: Level 2
Reaching level 2 requires:
- Availability SLO for all production workloads documented (WAF-REL-010)
- Latency SLO (p99) for all HTTP services documented
- At least one monitoring dashboard per service
- ALB/NLB health checks configured (WAF-REL-020)
- RDS or equivalent database in Multi-AZ mode (WAF-REL-030)
- Automated backups for all production databases enabled (WAF-REL-040)
- Backup retention period >= 7 days
- On-call rotation configured (WAF-REL-060)
- Severity definitions (SEV1–SEV4) documented
- Basic dependency inventory in place (WAF-REL-080)
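When documenting an availability SLO, it helps to state what the percentage means in allowed downtime. The arithmetic below is a small sketch; the 30-day window and the function name are assumptions chosen for illustration, not requirements of the model.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime in minutes that an availability SLO permits per window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

For a 30-day window, 99.9% corresponds to roughly 43 minutes of downtime, which is usually a more useful number to put in front of stakeholders than the percentage alone.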
Self-Assessment Checklist: Level 3
Reaching level 3 requires (in addition to the level 2 items):
- SLO monitoring with automatic alerting configured (WAF-REL-010)
- Error budget burn rate alerts configured
- readinessProbe and livenessProbe for all Kubernetes workloads (WAF-REL-020)
- All compute resources distributed across at least 2 AZs (WAF-REL-030)
- PITR for all production databases enabled (WAF-REL-040)
- Backup restore tested and documented at least once
- Explicit timeouts for all outgoing HTTP calls (WAF-REL-050)
- Circuit breaker configured for all critical dependencies
- All critical alerts have linked runbooks (WAF-REL-060)
- Post-incident reviews for all SEV1/SEV2 incidents documented
- DR plan documented with RTO/RPO targets (WAF-REL-070)
- DR test conducted at least once with results documented
- Dependency inventory with criticality assessment (WAF-REL-080)
- Reliability Debt Register introduced (WAF-REL-100)
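The timeout and circuit-breaker items (WAF-REL-050) can be illustrated with a minimal, single-threaded sketch. The class, its parameter names, and the default thresholds are hypothetical; in practice a maintained library or a service mesh would normally provide this behavior. The `fetch` helper uses the standard library's `urllib.request.urlopen` with an explicit `timeout`.

```python
import time
import urllib.request
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker (not thread-safe)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch(url: str, timeout_s: float = 2.0) -> bytes:
    """HTTP GET with an explicit timeout instead of the global socket default."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read()


breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0)
# Example call (hypothetical URL, left commented out to avoid network access):
# body = breaker.call(lambda: fetch("https://example.com/health"))
```

After `failure_threshold` consecutive failures the breaker rejects calls immediately for `reset_after_s` seconds, which keeps a slow dependency from tying up callers; the timeout bounds how long each individual attempt can take.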
Self-Assessment Checklist: Level 4
Reaching level 4 requires (in addition to the level 3 items):
- Error budget policy documented and followed (WAF-REL-010)
- Synthetic monitoring for all production endpoints (WAF-REL-020)
- AZ failover tests conducted semi-annually (WAF-REL-030)
- Automated monthly backup restore test (WAF-REL-040)
- Bulkhead isolation for different dependency classes (WAF-REL-050)
- Automated incident diagnostic data collection on alert (WAF-REL-060)
- DR procedures automated via IaC; quarterly test (WAF-REL-070)
- Dependency SLA compliance monitored in real time (WAF-REL-080)
- Structured chaos experiments with documentation quarterly (WAF-REL-090)
- Reliability Debt Register integrated into architecture governance (WAF-REL-100)
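An error budget policy needs two numbers: how fast the budget is being spent (burn rate) and how much of it is left. The sketch below computes both from request counts. The `Window` type, the 25% threshold, and the three policy outcomes are hypothetical examples of what such a policy could encode, not values taken from the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over some time window."""
    total_requests: int
    failed_requests: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is spent exactly as fast as the SLO permits.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction in (0, 1)")
    if window.total_requests == 0:
        return 0.0
    observed = window.failed_requests / window.total_requests
    return observed / (1 - slo_target)


def budget_remaining(period: Window, slo_target: float) -> float:
    """Fraction of the period's error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction in (0, 1)")
    allowed_failures = period.total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1 - period.failed_requests / allowed_failures


def release_decision(period: Window, slo_target: float) -> str:
    """Hypothetical policy: freeze when the budget is spent, restrict below 25%."""
    remaining = budget_remaining(period, slo_target)
    if remaining <= 0:
        return "freeze"
    if remaining < 0.25:
        return "restrict"
    return "normal"


# Illustrative numbers only.
last_hour = Window(total_requests=100_000, failed_requests=500)
last_30_days = Window(total_requests=10_000_000, failed_requests=8_000)

print(round(burn_rate(last_hour, 0.999), 2))
print(round(budget_remaining(last_30_days, 0.999), 2))
print(release_decision(last_30_days, 0.999))
```

With a 99.9% target, the hourly window above burns budget at about five times the sustainable rate, and the 30-day window has about a fifth of its budget left, so this example policy would restrict releases.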
Recommended Entry Path
| From Stage | To Stage | Recommended Actions (3–6 Months) |
|---|---|---|
| 1 → 2 | Document | SLO workshop, retrofit health checks, configure on-call, enable backups |
| 2 → 3 | Enforce | IaC refactoring (Multi-AZ, Circuit Breaker), test backup restore, write runbooks |
| 3 → 4 | Measure | Activate error budget, MTTR dashboard, automate DR, start chaos program |
| 4 → 5 | Self-Heal | AIOps integration, continuous chaos validation, adaptive SLOs, GameDay culture |

The greatest leverage is typically in the step from level 2 to level 3: IaC-enforced controls tend to be the most effective reliability investment because a control that is enforced in code cannot silently regress.