Maturity Model: Reliability

The Reliability maturity model describes five stages, from a chaotic initial state to self-healing, adaptive infrastructure. It serves as a self-assessment tool and a roadmap for structured improvement. The tables and checklists below refer to these stages as levels (L1–L5).

Overview of the 5 Stages

Stage	Name	Characteristics
1	Chaotic	No SLOs, no health checks, single-AZ, no backups tested. Incidents handled reactively. Reliability is not measured.
2	Documented	SLOs documented, basic monitoring, backups configured, on-call in place – but little of it tested or automated.
3	Enforced	IaC-enforced controls: Multi-AZ, Health Checks, Circuit Breakers. Backups tested, DR plan in place, runbooks linked for critical alerts.
4	Measured	SLOs with error budget, MTTR tracked, chaos tests quarterly, automated DR runbooks, Reliability Debt Register actively managed.
5	Self-Healing	Automatic remediation, adaptive SLOs, continuous chaos validation, GameDay culture, reliability as a board-level metric.

Per-Control Maturity Overview

Control	L1	L2	L3	L4	L5
REL-010 – SLO & SLA	No SLOs	Documented	Monitored + Alerts	Error Budget Active	Adaptive + Predictive
REL-020 – Health Checks	None	LB Checks	ReadinessProbe + LivenessProbe	StartupProbe + Deep Checks	Synthetic Monitoring
REL-030 – Multi-AZ	Single-AZ	DB Multi-AZ	Everything Multi-AZ	Auto-Failover Tested	Multi-Region
REL-040 – Backup & Recovery	No Backups	Backups present, untested	PITR + Restore Test	Automated Monthly	WORM + CDP
REL-050 – Circuit Breaker	No Timeouts	Timeouts Configured	Circuit Breaker + Retry	Bulkheads + Service Mesh	Adaptive Thresholds
REL-060 – Incident Response	Ad-hoc	Process Documented	Runbooks Linked, MTTR Tracked	Automated Triage	Self-Healing + AIOps
REL-070 – DR Testing	No DR Plan	Annual Test	Semi-Annual Documented	Quarterly Automated	GameDay + Continuous
REL-080 – Dependency Resilience	No Inventory	Basic Inventory	Classified + Circuit Breaker	Auto-Discovery + Monitoring	Proactive Risk Management
REL-090 – Chaos Engineering	No Chaos	Ad-hoc Tests	Structured + Documented	Production Chaos Controlled	Continuous + ML
REL-100 – Reliability Debt	No Tracking	Ad-hoc Notes	Formal Register + Quarterly Review	Integrated in ADRs	Automated Detection
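
One way to condense the per-control table into a single self-assessment result is sketched below. It assumes, as a simplification that the model itself does not prescribe, that the overall level is capped by the weakest control. The function name `summarize` and the sample scores are hypothetical.

```python
from collections import Counter

# Control IDs from the table above; the assessed levels are illustrative only.
assessment = {
    "REL-010": 3,
    "REL-020": 3,
    "REL-030": 2,
    "REL-040": 3,
    "REL-050": 2,
    "REL-060": 3,
    "REL-070": 2,
    "REL-080": 2,
    "REL-090": 1,
    "REL-100": 2,
}


def summarize(levels: dict[str, int]) -> dict:
    """Summarize per-control levels (1-5) into an overall picture.

    Assumption: the overall level is the minimum across all controls
    (weakest-link scoring). Other aggregation rules are possible.
    """
    for control, level in levels.items():
        if not 1 <= level <= 5:
            raise ValueError(f"{control}: level must be 1-5, got {level}")
    overall = min(levels.values())
    return {
        "overall": overall,
        # Controls sitting at the minimum are the next candidates for work.
        "blocking": sorted(c for c, lvl in levels.items() if lvl == overall),
        # How many controls sit at each level.
        "distribution": dict(sorted(Counter(levels.values()).items())),
    }


result = summarize(assessment)
print(result)
```

Weakest-link scoring is deliberately strict: a single neglected control holds the overall level down, which keeps attention on the gap rather than on the average.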

Self-Assessment Checklist: Level 2

Reaching Level 2 requires:

Availability SLO for all production workloads documented (WAF-REL-010)
Latency SLO (p99) for all HTTP services documented
At least one monitoring dashboard per service
ALB/NLB health checks configured (WAF-REL-020)
RDS or equivalent database in Multi-AZ mode (WAF-REL-030)
Automated backups for all production databases enabled (WAF-REL-040)
Backup retention period >= 7 days
On-call rotation configured (WAF-REL-060)
Severity definitions (SEV1–SEV4) documented
Basic dependency inventory in place (WAF-REL-080)
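
When documenting an availability SLO, it can help to translate the percentage into the downtime it permits. The following is a minimal sketch of that arithmetic. The 30-day window and the function name `allowed_downtime_minutes` are assumptions for illustration, not part of the model.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO permits."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # The permitted unavailability is the complement of the SLO target.
    return window_minutes * (1 - slo_percent / 100)


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is worth keeping in mind when agreeing on a target in an SLO workshop.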

Self-Assessment Checklist: Level 3

Reaching Level 3 requires everything in Level 2, plus:

SLO monitoring with automatic alerting configured (WAF-REL-010)
Error budget burn rate alerts configured
readinessProbe and livenessProbe for all Kubernetes workloads (WAF-REL-020)
All compute resources distributed across at least 2 AZs (WAF-REL-030)
PITR for all production databases enabled (WAF-REL-040)
Backup restore tested and documented at least once
Explicit timeouts for all outgoing HTTP calls (WAF-REL-050)
Circuit breaker configured for all critical dependencies
All critical alerts have linked runbooks (WAF-REL-060)
Post-incident reviews for all SEV1/SEV2 incidents documented
DR plan documented with RTO/RPO targets (WAF-REL-070)
DR test conducted at least once with results documented
Dependency inventory with criticality assessment (WAF-REL-080)
Reliability Debt Register introduced (WAF-REL-100)
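
The timeout and circuit breaker items above can be illustrated with a small sketch. This is not a production implementation or any particular library's API: the class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen to keep the example self-contained.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the behavior can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func: Callable, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency call rejected: circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half-open state, or too many failures
            # in closed state, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The wrapped function should carry its own explicit timeout, for example `urllib.request.urlopen(url, timeout=2)`. Without one, a hanging dependency never raises, so the breaker never sees a failure to count.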

Self-Assessment Checklist: Level 4

Reaching Level 4 requires everything in Level 3, plus:

Error budget policy documented and followed (WAF-REL-010)
Synthetic monitoring for all production endpoints (WAF-REL-020)
AZ failover tests conducted semi-annually (WAF-REL-030)
Automated monthly backup restore test (WAF-REL-040)
Bulkhead isolation for different dependency classes (WAF-REL-050)
Automated incident diagnostic data collection on alert (WAF-REL-060)
DR procedures automated via IaC; quarterly test (WAF-REL-070)
Dependency SLA compliance monitored in real time (WAF-REL-080)
Structured chaos experiments with documentation quarterly (WAF-REL-090)
Reliability Debt Register integrated into architecture governance (WAF-REL-100)
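
An error budget policy becomes easier to follow when its decision rule is written down explicitly. The sketch below shows one possible rule. The thresholds and action names are hypothetical, and the projection of budget consumption assumes roughly uniform traffic across the SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    burn_rate: float        # 1.0 means the budget lasts exactly one window
    budget_consumed: float  # fraction of the window's budget used so far
    action: str


def evaluate_error_budget(slo: float, total_requests: int, failed_requests: int,
                          window_elapsed_fraction: float) -> BudgetStatus:
    """Apply a simple, hypothetical error budget policy.

    slo: target success ratio, for example 0.999
    window_elapsed_fraction: share of the SLO window that has passed, in (0, 1]
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < window_elapsed_fraction <= 1:
        raise ValueError("window_elapsed_fraction must be in (0, 1]")

    allowed_error_rate = 1 - slo
    observed_error_rate = failed_requests / total_requests
    burn_rate = observed_error_rate / allowed_error_rate
    budget_consumed = burn_rate * window_elapsed_fraction

    if budget_consumed >= 1.0:
        action = "freeze feature releases"
    elif burn_rate > 1.0:
        action = "prioritize reliability work"
    else:
        action = "ship normally"
    return BudgetStatus(burn_rate, budget_consumed, action)


status = evaluate_error_budget(slo=0.999, total_requests=1_000_000,
                               failed_requests=2_000, window_elapsed_fraction=0.25)
print(status)
```

In the sample call the service fails at twice the permitted rate a quarter of the way through the window. About half of the budget is gone, so the sketch recommends prioritizing reliability work without yet freezing releases.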

Recommended Entry Path

Transition	Focus	Recommended Actions (3–6 Months)
1 → 2	Document	SLO workshop, retrofit health checks, configure on-call, enable backups
2 → 3	Enforce	IaC refactoring (Multi-AZ, Circuit Breaker), test backup restore, write runbooks
3 → 4	Measure	Activate error budget, MTTR dashboard, automate DR, start chaos program
4 → 5	Self-Heal	AIOps integration, continuous chaos validation, adaptive SLOs, GameDay culture

The greatest leverage typically lies in the move from Level 2 to Level 3. Controls defined in IaC are reapplied on every deployment, so they guard against configuration regressions far more durably than documentation alone.
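
As an illustration of what enforcement can look like, the sketch below checks two Level 3 requirements (probes on every container, at least two AZs) before a deployment is accepted, for example as a CI step. The function name `check_workload` and the separate `zones` argument are assumptions. The manifest is a plain dictionary shaped like a Kubernetes Deployment, and no Kubernetes client library is involved.

```python
def check_workload(manifest: dict, zones: list[str]) -> list[str]:
    """Return policy violations; an empty list means the workload passes."""
    violations = []
    containers = (
        manifest.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    if not containers:
        violations.append("no containers defined")
    for container in containers:
        name = container.get("name", "<unnamed>")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                violations.append(f"container {name}: missing {probe}")
    # Duplicate zone names count once.
    if len(set(zones)) < 2:
        violations.append("workload must span at least 2 availability zones")
    return violations


# Illustrative manifest: the liveness probe is missing on purpose.
manifest = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",
                        "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}},
                    }
                ]
            }
        }
    },
}

problems = check_workload(manifest, zones=["eu-central-1a"])
for problem in problems:
    print(problem)
```

Failing the pipeline whenever the list is non-empty turns the checklist item into an enforced control: a change that drops a probe or collapses the workload into one zone is rejected before it reaches production.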