Definition: Reliability as a Discipline

What is Reliability?

Reliability refers to the ability of a system to perform its agreed function within defined parameters over a specified period of time, even under fault and failure conditions.

In the cloud context, this means specifically:

The system is available when users need it
Failures are tolerated by design rather than assumed to be preventable
Outages are detected, isolated and resolved – automatically where possible
Data can be recovered in the event of a failure
Recovery procedures are demonstrably functional
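
As an illustration of tolerating and isolating a failing dependency, the following is a minimal circuit breaker sketch. The class name, parameters and default thresholds are hypothetical and not prescribed by WAF++; a production system would more likely use an established resilience library or a service-mesh feature.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical API, not part of WAF++).

    Opens after `failure_threshold` consecutive failures and allows a
    single trial call once `reset_timeout` seconds have passed.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock      # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the cooldown has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Cooldown over: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # A successful call closes the circuit and resets the counter.
        self.failures = 0
        self.opened_at = None
        return result
```

The fallback keeps the caller responsive while the dependency is down, which is the "tolerated and isolated" behaviour described above; detection and resolution still depend on monitoring and recovery procedures.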

The Reliability Spectrum

Organizations typically progress through five maturity stages on the path to full reliability:

Stage	Name	Characteristics
1	Chaotic	Failures unknown, no goals, reactive remediation, no documentation
2	Documented	SLOs in place, backups configured, processes described – but untested
3	Enforced	Multi-AZ, Health Checks, Circuit Breakers enforced in IaC; backups tested
4	Measured	SLOs with Error Budget, MTTR tracked, chaos tests quarterly, DR tested
5	Self-Healing	Automatic remediation, adaptive SLOs, continuous chaos validation
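
The "Measured" stage relies on two numbers that are simple to compute once the raw data exists. The sketch below assumes a request-based SLO and incident durations measured in minutes; the function names and the sample figures are illustrative assumptions, not WAF++ definitions.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_fraction) for a request-based SLO.

    Assumption: the SLO is expressed as a success ratio, e.g. 0.999.
    """
    allowed = (1.0 - slo_target) * total_requests
    if allowed <= 0:
        return 0.0, 0.0
    remaining = 1.0 - failed_requests / allowed
    return allowed, remaining


def mttr_minutes(incident_durations_minutes):
    """Mean time to recovery: average incident duration in minutes."""
    if not incident_durations_minutes:
        return 0.0
    return sum(incident_durations_minutes) / len(incident_durations_minutes)


# Hypothetical figures: 99.9% SLO, one million requests, 250 failures.
allowed, remaining = error_budget(0.999, 1_000_000, 250)
print(round(allowed), round(remaining, 2))   # 1000 0.75
print(mttr_minutes([30, 45, 15]))            # 30.0
```

A negative remaining fraction means the budget for the period is exhausted, which is typically the signal to prioritise reliability work over new changes.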

What Reliability is NOT

Reliability is distinct from the following adjacent disciplines:

Not Security

Security deals with who is allowed to access systems and how data is protected. Reliability deals with whether systems function and recover from failures. A system can be secure but unreliable – and vice versa.

Not Operations

Operations deals with how systems are deployed, maintained and changed. Reliability deals with what happens when these operations fail – and how the system handles it.

Not Performance

Performance deals with speed under nominal load. Reliability deals with stability under fault and failure conditions. A fast system can be unreliable if it collapses under load spikes.

Not Monitoring Alone

Monitoring is a tool for Reliability. But monitoring without SLOs, runbooks and incident response is merely data collection without consequence.

Reliability in the WAF++ Context

WAF++ treats Reliability as an independent pillar because:

Measurability requires its own controls: SLOs, RTO/RPO, MTTR, Error Budget have no equivalent in other pillars
Reliability debt is structural: Deferred resilience improvements accumulate like technical debt and become invisible without tracking
Testing is fundamental: Reliability without regular chaos and DR tests is an unvalidated claim
Dependencies limit reliability: The reliability of a system is bounded by the weakest critical dependency
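
The dependency bound in the last point can be made concrete with a short calculation. This is a sketch under a simplifying assumption that failures are independent; real dependencies often share failure modes, so the results are optimistic estimates. Function names and sample values are hypothetical.

```python
import math


def serial_availability(availabilities):
    """Availability of a request path that needs every dependency to be up.

    Assumes independent failures, so the individual values multiply.
    """
    return math.prod(availabilities)


def redundant_availability(availability, replicas):
    """Availability when any one of `replicas` identical, independent
    instances is enough to serve the request."""
    return 1.0 - (1.0 - availability) ** replicas


# Hypothetical path: three critical dependencies at 99.9% each.
path = [0.999, 0.999, 0.999]
print(f"{serial_availability(path):.6f}")            # 0.997003
print(f"{redundant_availability(0.99, 2):.4f}")      # 0.9999
```

The serial result can never exceed the lowest individual value, which is why a single weak critical dependency caps what the whole system can promise, and why redundancy such as Multi-AZ raises that cap.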

Interaction with Other Pillars

Pillar	Interaction
Security	Incident response processes overlap; data loss is both a Security and a Reliability event.
Cost	Multi-AZ and DR increase costs; reliability debt must be considered in the cost debt register.
Operations	Deployment processes must integrate reliability tests (Canary, Blue/Green).
Architecture	Architecture decisions must document reliability implications (ADR).
Governance	SLAs are contractual obligations; compliance audits require evidence.

Target Vision

The target vision of the Reliability pillar is an organization where:

Every workload has a documented, measured and monitored SLO
Failures are planned for and tolerated by design rather than assumed to be preventable
Recovery is proven to work through regular tests, not assumed to work
Reliability debt is visible and actively managed
Chaos Engineering is part of normal engineering practice, not the exception
Every significant architecture decision documents its reliability implications

A system that achieves this target vision can back SLA commitments to customers, regulators and internal stakeholders with empirical evidence – not just architectural claims.