Definition: Reliability as a Discipline
What is Reliability?
Reliability is the ability of a system to perform its agreed function within defined parameters over a specified period of time – even under fault and failure conditions.
In the cloud context, this means specifically:
- The system is available when users need it
- Failures are tolerated, not prevented
- Outages are detected, isolated and resolved – automatically where possible
- Data can be recovered in the event of a failure
- Recovery procedures are demonstrably functional
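As one illustration of tolerating and isolating a failure rather than preventing it, the following is a minimal circuit-breaker sketch. The class name, the `max_failures` and `reset_after` parameters, and the fallback convention are assumptions made for this example and are not prescribed by WAF++.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the failing dependency until the cool-down elapses.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a dependency in `breaker.call(fetch, cached_value)`, so that a failing dependency degrades the response instead of taking the whole system down. A production implementation would also need thread safety and metrics, which this sketch omits.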
The Reliability Spectrum
Organizations typically progress through five maturity stages on the path to full reliability:
| Stage | Name | Characteristics |
|---|---|---|
| 1 | Chaotic | Failures unknown, no goals, reactive remediation, no documentation |
| 2 | Documented | SLOs in place, backups configured, processes described – but untested |
| 3 | Enforced | Multi-AZ, Health Checks, Circuit Breakers enforced in IaC; backups tested |
| 4 | Measured | SLOs with Error Budget, MTTR tracked, chaos tests quarterly, DR tested |
| 5 | Self-Healing | Automatic remediation, adaptive SLOs, continuous chaos validation |
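The "Measured" stage relies on two quantities that are simple to compute: the Error Budget implied by an SLO, and the MTTR across incidents. The sketch below assumes an availability SLO expressed as a fraction (for example 0.999) over a fixed window; the function names and sample figures are hypothetical.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed within the window for an availability SLO."""
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime: timedelta) -> float:
    """Fraction of the error budget still unspent; negative means a breach."""
    return 1.0 - downtime / error_budget(slo_target, window)


def mttr(recovery_times: list[timedelta]) -> timedelta:
    """Mean time to recovery across a non-empty list of incident durations."""
    return sum(recovery_times, timedelta()) / len(recovery_times)


# Hypothetical figures: a 99.9 % SLO over 30 days allows about 43 minutes.
window = timedelta(days=30)
budget = error_budget(0.999, window)
remaining = budget_remaining(0.999, window, timedelta(minutes=21, seconds=36))
mean_recovery = mttr([timedelta(minutes=m) for m in (10, 20, 30)])
```

With these inputs roughly half of the budget is spent, which is the kind of signal a team could use to decide whether to slow down risky changes.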
What Reliability is NOT
Reliability is distinct from the adjacent disciplines:
Not Security
Security deals with who is allowed to access systems and how data is protected. Reliability deals with whether systems function and recover from failures. A system can be secure but unreliable – and vice versa.
Not Operations
Operations deals with how systems are deployed, maintained and changed. Reliability deals with what happens when these operations fail – and how the system handles it.
Reliability in the WAF++ Context
WAF++ treats Reliability as an independent pillar because:
- Measurability requires its own controls: SLOs, RTO/RPO, MTTR, Error Budget have no equivalent in other pillars
- Reliability debt is structural: Deferred resilience improvements accumulate like technical debt and become invisible without tracking
- Testing is fundamental: Reliability without regular chaos and DR tests is an unvalidated claim
- Dependencies limit reliability: The reliability of a system is bounded by the weakest critical dependency
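The point about dependencies can be made concrete with a short calculation. The sketch below assumes that a request needs every listed dependency and that the dependencies fail independently, which is a simplification; the component names and availability figures are hypothetical.

```python
from math import prod


def serial_availability(availabilities) -> float:
    """Availability of a path that needs every dependency to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical availability figures for three critical dependencies.
deps = {"load_balancer": 0.9999, "app": 0.999, "database": 0.9995}
composite = serial_availability(deps.values())
weakest = min(deps.values())

# The composite can never exceed the weakest dependency.
print(f"composite={composite:.6f} weakest={weakest}")
```

Under these assumptions the composite availability comes out slightly below the weakest dependency, so an SLO for the overall system cannot credibly be set higher than what its critical dependencies deliver.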
Interaction with Other Pillars
| Pillar | Interaction |
|---|---|
| Security | Incident response processes overlap; data loss is both a Security and a Reliability event. |
| Cost | Multi-AZ and DR increase costs; reliability debt must be considered in the cost debt register. |
| Operations | Deployment processes must integrate reliability tests (Canary, Blue/Green). |
| Architecture | Architecture decisions must document reliability implications (ADR). |
| Governance | SLAs are contractual obligations; compliance audits require evidence. |
Target Vision
The target vision of the Reliability pillar is an organization where:
- Every workload has a documented, measured and monitored SLO
- Failures are planned for and tolerated by design – the design does not rely on preventing them
- Recovery demonstrably works through regular tests, not hope
- Reliability debt is visible and actively managed
- Chaos Engineering is part of normal engineering practice, not the exception
- Every significant architecture decision documents its reliability implications
A system that achieves this target vision can back SLA commitments to customers, regulators and internal stakeholders with empirical evidence – not just architectural claims.