WAF++

Definition: Reliability as a Discipline

What is Reliability?

Reliability is the ability of a system to perform its agreed function within defined parameters over a specified period of time – even under fault and failure conditions.

In the cloud context, this means specifically:

  • The system is available when users need it

  • Failures are tolerated by design rather than assumed to be preventable

  • Outages are detected, isolated and resolved – automatically where possible

  • Data can be recovered in the event of a failure

  • Recovery procedures are demonstrably functional
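
One way to make "demonstrably functional" concrete is to score every restore drill against the workload's recovery targets. The sketch below is illustrative only: it assumes the usual meanings of RPO (maximum tolerable data-loss window) and RTO (maximum tolerable downtime), and the `RestoreDrill` type, its field names and the example timestamps are hypothetical, not part of WAF++.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    """One recovery test; all field names are illustrative assumptions."""
    last_backup_at: datetime       # timestamp of the backup that was restored
    failure_at: datetime           # simulated moment of failure
    service_restored_at: datetime  # moment the workload was usable again


def evaluate_drill(drill: RestoreDrill, rpo: timedelta, rto: timedelta) -> dict:
    """Compare the measured data-loss window and downtime against RPO and RTO."""
    data_loss_window = drill.failure_at - drill.last_backup_at
    downtime = drill.service_restored_at - drill.failure_at
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical drill: backup at 02:00, failure at 02:45, service back at 04:15.
drill = RestoreDrill(
    last_backup_at=datetime(2024, 1, 10, 2, 0),
    failure_at=datetime(2024, 1, 10, 2, 45),
    service_restored_at=datetime(2024, 1, 10, 4, 15),
)
result = evaluate_drill(drill, rpo=timedelta(hours=1), rto=timedelta(hours=1))
print(result["rpo_met"], result["rto_met"])  # prints: True False
```

In this example the backup is recent enough, but the restore takes 90 minutes against a one-hour target, so the drill records a failed RTO even though the data was recoverable.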

The Reliability Spectrum

Organizations typically progress through five maturity stages on the path to full reliability:

| Stage | Name | Characteristics |
|-------|------|-----------------|
| 1 | Chaotic | Failures unknown, no goals, reactive remediation, no documentation |
| 2 | Documented | SLOs in place, backups configured, processes described – but untested |
| 3 | Enforced | Multi-AZ, Health Checks, Circuit Breakers enforced in IaC; backups tested |
| 4 | Measured | SLOs with Error Budget, MTTR tracked, chaos tests quarterly, DR tested |
| 5 | Self-Healing | Automatic remediation, adaptive SLOs, continuous chaos validation |
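
The "SLOs with Error Budget" characteristic of stage 4 comes down to simple arithmetic: the error budget is the share of a time window in which the SLO allows the service to be unavailable. The following is a minimal sketch under assumptions: the function names, the 30-day window and the 99.9 % target are illustrative choices, not values prescribed by WAF++.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(0.999)  # 99.9 % availability over 30 days
remaining = budget_remaining(0.999, downtime_minutes=10.8)
print(f"{budget:.1f} min budget, {remaining:.0%} remaining")
# prints: 43.2 min budget, 75% remaining
```

A negative result from `budget_remaining` means the window's budget is overspent, which is the usual trigger for pausing risky changes until reliability recovers.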

What Reliability is NOT

Clear demarcation from adjacent disciplines:

Not Security

Security deals with who is allowed to access systems and how data is protected. Reliability deals with whether systems function and recover from failures. A system can be secure but unreliable – and vice versa.

Not Operations

Operations deals with how systems are deployed, maintained and changed. Reliability deals with what happens when these operations fail – and how the system handles it.

Not Performance

Performance deals with speed under nominal load. Reliability deals with stability under fault and failure conditions. A fast system can be unreliable if it collapses under load spikes.

Not Monitoring Alone

Monitoring is a tool for Reliability. But monitoring without SLOs, runbooks and incident response is merely data collection without consequence.

Reliability in the WAF++ Context

WAF++ treats Reliability as an independent pillar because:

  1. Measurability requires its own controls: SLOs, RTO/RPO, MTTR and error budgets have no equivalent in other pillars

  2. Reliability debt is structural: Deferred resilience improvements accumulate like technical debt and stay invisible unless they are tracked

  3. Testing is fundamental: Reliability without regular chaos and DR tests is an unvalidated claim

  4. Dependencies limit reliability: The reliability of a system is bounded by its weakest critical dependency
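
The dependency bound in point 4 can be illustrated with a short calculation. When a request needs every critical dependency to be up and failures are independent (a simplifying assumption that real systems only approximate), the availability of the whole path is the product of the individual availabilities, which can never exceed the weakest one. The dependency names and figures below are hypothetical.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a request path that needs every dependency to be up.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """Availability of N independent replicas where one healthy replica suffices."""
    return 1.0 - (1.0 - availability) ** replicas


# Hypothetical critical dependencies of one workload.
deps = {"api": 0.9995, "database": 0.999, "auth-provider": 0.995}
path = serial_availability(list(deps.values()))
print(f"path: {path:.4f}, weakest: {min(deps.values()):.4f}")
# prints: path: 0.9935, weakest: 0.9950
```

Under the same independence assumption, `redundant_availability(0.995, 2)` evaluates to roughly 0.999975, which shows why redundancy for the weakest dependency (for example Multi-AZ) usually moves the overall figure more than further hardening an already strong component.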

Interaction with Other Pillars

| Pillar | Interaction |
|--------|-------------|
| Security | Incident response processes overlap; data loss is both a Security and a Reliability event. |
| Cost | Multi-AZ and DR increase costs; reliability debt must be considered in the cost debt register. |
| Operations | Deployment processes must integrate reliability tests (Canary, Blue/Green). |
| Architecture | Architecture decisions must document reliability implications (ADR). |
| Governance | SLAs are contractual obligations; compliance audits require evidence. |

Target Vision

The target vision of the Reliability pillar is an organization where:

  • Every workload has a documented, measured and monitored SLO

  • Failures are planned for and tolerated by design rather than assumed to be preventable

  • Recovery is proven to work by regular tests, not assumed to work

  • Reliability debt is visible and actively managed

  • Chaos Engineering is part of normal engineering practice, not the exception

  • Every significant architecture decision documents its reliability implications

A system that achieves this target vision can back SLA commitments to customers, regulators and internal stakeholders with empirical evidence – not just architectural claims.