
Reliability (Pillar: Reliability)

The Reliability pillar of WAF++ defines requirements, principles and measurable controls to operate cloud workloads in a resilient, recoverable and demonstrably available manner.

Reliability is not accidental. It is an architectural outcome achieved through measurable goals, technical enforcement and continuous testing, not through hope.

What does Reliability mean in WAF++?

Reliability means that an organization has demonstrable control over the following dimensions:

| Dimension | What is controlled? | WAF-REL Control |
|---|---|---|
| SLO & SLA Governance | Are availability and latency targets documented, measured and covered by alerts? | WAF-REL-010 |
| Health Monitoring | Are health checks and readiness probes configured for all services? | WAF-REL-020 |
| High Availability | Are all production workloads distributed across at least 2 Availability Zones? | WAF-REL-030 |
| Backup & Recovery | Are automated backups configured and recovery procedures demonstrably tested? | WAF-REL-040 |
| Resilience Patterns | Are circuit breakers, timeouts and retry logic configured for all dependencies? | WAF-REL-050 |
| Incident Response | Are documented runbooks, on-call rotation and MTTR tracking in place? | WAF-REL-060 |
| Disaster Recovery Testing | Are DR tests conducted at least twice a year and documented? | WAF-REL-070 |
| Dependency Resilience | Are all critical dependencies inventoried and equipped with fallback behavior? | WAF-REL-080 |
| Chaos Engineering | Are structured chaos experiments used to validate resilience claims? | WAF-REL-090 |
| Reliability Debt | Are known reliability debts documented, assessed and provided with a remediation plan? | WAF-REL-100 |
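
To make the Resilience Patterns dimension (WAF-REL-050) more concrete, a circuit breaker can be sketched in a few lines. This is a minimal illustration, not a WAF++ reference implementation: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_after_s`, and the injectable `clock` are assumptions introduced here.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker around calls to one dependency."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # cool-down before a trial call is allowed
        self._clock = clock                         # injectable so tests can control time
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast instead of waiting on a dependency known to be failing.
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

In practice the wrapped `func` would itself enforce a timeout (for example via the timeout argument of the HTTP client in use), so that a hanging dependency counts as a failure rather than blocking the caller.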

Why is Reliability its own pillar?

Reliability is cross-cutting: it emerges from Security, Operations, Architecture and Governance. Nevertheless, Reliability is an independent discipline because:

  • It has its own measurement dimension: SLOs, MTTR, RTO/RPO and error budgets.

  • It requires specific technical controls that no other pillar covers.

  • It treats reliability debt as a structural risk, analogous to technical debt.

  • It must be anchored in architecture processes as a strategic basis for decision-making.

  • It requires fundamentally different approaches in brownfield and greenfield scenarios.

Reliability without measurement is wishful thinking. Backups without restore tests are untested hopes. Multi-AZ without a failover test is an architectural claim, not a proven guarantee.
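
The relationship between an SLO and its error budget can be shown with a short worked example. This is a sketch under stated assumptions: the function names, the 30-day window and the availability-only view of the budget are illustrative choices, not definitions taken from WAF++.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9 % availability".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is violated)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


# A 99.9 % SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
```

A negative result from `budget_remaining` means the SLO has been missed for the window. That is the kind of measurable signal this pillar asks for in place of an assumed level of availability.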

Demarcation from other pillars

  • Security addresses: access control, encryption, incident response from a security perspective.

  • Operations addresses: change management, deployment processes, operational excellence.

  • Architecture addresses: system design, patterns, quality of technical decisions.

  • Reliability addresses: measurable availability, recoverability, resilience against failures.

Reliability presupposes that infrastructure exists and is monitored, and extends this with fault tolerance, recovery capacity, resilience patterns and structured failure management.

Controls Overview

The Reliability pillar is operationalized by 10 measurable controls (WAF-REL-010 to WAF-REL-100).

| Control ID | Title | Severity | Automatable |
|---|---|---|---|
| WAF-REL-010 | SLA & SLO Definition Documented | Critical | Medium |
| WAF-REL-020 | Health Checks & Readiness Probes Configured | High | High |
| WAF-REL-030 | Multi-AZ High Availability Deployment | High | High |
| WAF-REL-040 | Backup & Recovery Validation | Critical | High |
| WAF-REL-050 | Circuit Breaker & Timeout Configuration | High | High |
| WAF-REL-060 | Incident Response & Runbook Readiness | High | Medium |
| WAF-REL-070 | Disaster Recovery Testing | High | Partial |
| WAF-REL-080 | Dependency & Upstream Resilience Management | Medium | Medium |
| WAF-REL-090 | Chaos Engineering & Fault Injection | Medium | Medium |
| WAF-REL-100 | Reliability Debt Register & Quarterly Review | Medium | Low–Medium |
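
Controls rated as highly automatable, such as WAF-REL-030, lend themselves to simple automated checks. The following is a minimal sketch of such a check, assuming a hypothetical inventory that maps each production workload to the Availability Zones it runs in. The function name, the inventory shape and the zone names are illustrative and do not come from a real cloud API.

```python
from typing import Dict, Iterable, List

MIN_ZONES = 2  # WAF-REL-030 asks for at least 2 Availability Zones


def workloads_below_min_zones(
    inventory: Dict[str, Iterable[str]], min_zones: int = MIN_ZONES
) -> List[str]:
    """Return the names of workloads running in fewer than min_zones distinct zones."""
    return sorted(
        name
        for name, zones in inventory.items()
        if len(set(zones)) < min_zones  # duplicates of the same zone do not count
    )


# Hypothetical inventory: "batch" has two instances, but both are in the same zone.
inventory = {
    "api": ["zone-a", "zone-b"],
    "batch": ["zone-a", "zone-a"],
}
print(workloads_below_min_zones(inventory))  # prints ['batch']
```

A check like this only shows that the workload is distributed. As noted earlier, it does not replace an actual failover test.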

Quick Start

New to the Reliability pillar? Recommended reading order:

  1. Definition – What is Reliability as a discipline?

  2. Scope – Brownfield vs. Greenfield, what is in scope?

  3. Reliability Principles – 7 core principles

  4. Design Principles – 8 technical architecture principles

  5. Controls – The 10 measurable controls

  6. Maturity Model – Where does my organization stand?

  7. Best Practices – How to implement it concretely?