Scope: Reliability

What is in Scope?

The Reliability pillar addresses the following subject areas:

SLO & SLA Governance

  • Definition and documentation of Service Level Objectives (SLOs)

  • Measurement of availability, latency, error rate and throughput

  • Error budget management and burn rate alerting

  • SLAs with internal and external customers
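As an illustration of error budget management and burn rate alerting, the following is a minimal sketch, not a prescribed WAF++ implementation. The function names and the two-window paging rule are assumptions; the default threshold of 14.4 is a commonly cited starting value (it corresponds to spending 2% of a 30-day budget within one hour) and should be tuned per service.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests that may fail under the SLO (0.999 -> about 0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; values above 1.0 mean it will run out before the SLO
    period ends.
    """
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / error_budget(slo_target)


def should_page(short_window: float, long_window: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the threshold (assumed policy).

    Requiring the long window avoids paging on a brief spike; requiring
    the short window lets the alert clear soon after the problem stops.
    """
    return short_window >= threshold and long_window >= threshold
```

For example, 50 failed requests out of 10,000 against a 99.9% SLO gives a burn rate of about 5: the budget is being consumed roughly five times faster than the SLO allows.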

High Availability

  • Multi-AZ deployment for all production workloads

  • Automatic failover for databases and stateful services

  • Load balancers with cross-AZ configuration

  • Kubernetes pod distribution across availability zones

Health Monitoring

  • Health check endpoints for all services

  • Readiness and liveness probes (Kubernetes)

  • Load balancer health checks with explicit thresholds

  • Synthetic monitoring for external availability validation
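A health check endpoint that separates liveness from readiness could look like the sketch below. It uses only the Python standard library. The paths `/healthz` and `/readyz` and the `STATE` flag are illustrative assumptions, not names defined by WAF++; a real service would derive readiness from its actual dependencies.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical readiness flag. A real service would set this after its
# database connections, caches, and other dependencies are usable.
STATE = {"ready": False}


def probe_response(path: str) -> tuple[int, dict]:
    """Map a probe path to an HTTP status code and a JSON payload."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer at all.
        return 200, {"status": "alive"}
    if path == "/readyz":
        # Readiness: the service can take traffic right now.
        if STATE["ready"]:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"status": "unknown path"}


class ProbeHandler(BaseHTTPRequestHandler):
    """Serves the two probe endpoints as JSON."""

    def do_GET(self):
        status, payload = probe_response(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep frequent probe requests out of the log in this sketch.
        pass
```

The handler can be served with `http.server.HTTPServer(("0.0.0.0", 8080), ProbeHandler).serve_forever()`, where the port is again an assumption. Keeping the two probes separate matters because a failing readiness check should only remove the instance from load balancing, while a failing liveness check typically triggers a restart.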

Backup & Recovery

  • Automated backup configuration with defined retention periods

  • Point-in-Time Recovery (PITR) for databases

  • Cross-account/cross-region backup storage

  • Tested and documented recovery procedures

Resilience Patterns

  • Circuit breakers for all synchronous dependencies

  • Timeout configuration for all outgoing calls

  • Retry logic with exponential backoff and jitter

  • Bulkhead isolation for different dependency classes
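Retry logic with exponential backoff and jitter can be sketched as follows. This is one common variant ("full jitter", where each delay is drawn uniformly between zero and an exponentially growing cap); the function names, default values, and the choice of retryable exceptions are assumptions for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    op: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(), retrying transient failures with backoff and jitter.

    Only ConnectionError and TimeoutError are treated as transient here
    (an assumption); any other exception propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(backoff_delay(attempt, base, cap))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

The jitter spreads retries from many clients over time so that they do not hit a recovering dependency in synchronized waves. The operation passed as `op` should enforce its own timeout, since a retry wrapper cannot interrupt a call that never returns.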

Incident Response

  • Severity classification and escalation paths

  • Runbooks for all critical alerts

  • On-call rotation and notification configuration

  • Post-incident reviews and action item tracking

Disaster Recovery Testing

  • Documented DR plans with RTO/RPO targets

  • At least two DR tests per year

  • Results documentation with actual RTO/RPO achieved

  • Automated DR procedures via IaC
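Documenting DR test results against their targets can be as simple as the record sketched below. The class and field names are hypothetical and the service name and figures in the usage example are made up; the point is only that each test stores both the target and the achieved RTO/RPO so the comparison is explicit.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrTestResult:
    """One DR test run with target and achieved recovery figures."""
    service: str
    rto_target: timedelta   # maximum tolerated time to restore service
    rpo_target: timedelta   # maximum tolerated window of lost data
    rto_actual: timedelta   # time the restore actually took
    rpo_actual: timedelta   # data loss window actually observed

    def met_rto(self) -> bool:
        return self.rto_actual <= self.rto_target

    def met_rpo(self) -> bool:
        return self.rpo_actual <= self.rpo_target

    def passed(self) -> bool:
        # A test passes only if both targets were met.
        return self.met_rto() and self.met_rpo()


# Usage with made-up figures: RTO met, RPO missed, so the test fails.
result = DrTestResult(
    service="orders-db",
    rto_target=timedelta(hours=4),
    rpo_target=timedelta(minutes=15),
    rto_actual=timedelta(hours=3, minutes=20),
    rpo_actual=timedelta(minutes=22),
)
```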

Chaos Engineering

  • Hypothesis-driven fault injection tests

  • Structured chaos experiments (AWS FIS, Azure Chaos Studio)

  • GameDay events for holistic resilience tests

  • Continuous chaos validation in staging

Dependency & Reliability Debt

  • Inventory of all critical dependencies

  • Reliability Debt Register with priority and owner

  • Quarterly review process
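A Reliability Debt Register with priority, owner, and a quarterly review check might be modeled as in this sketch. The field names, the priority scale (1 = highest), and the 92-day approximation of a quarter are all assumptions; WAF++ may define the register differently.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DebtItem:
    """One entry in a (hypothetical) Reliability Debt Register."""
    title: str
    priority: int        # 1 = highest; the scale is an assumption
    owner: str
    last_reviewed: date


def due_for_review(
    items: list[DebtItem],
    today: date,
    max_age: timedelta = timedelta(days=92),  # roughly one quarter
) -> list[DebtItem]:
    """Return items not reviewed within max_age.

    Results are ordered by priority first, then by oldest review date.
    """
    overdue = [item for item in items if today - item.last_reviewed > max_age]
    return sorted(overdue, key=lambda item: (item.priority, item.last_reviewed))
```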

What is NOT in Scope?

  • Security Incident Response: Security incidents fall under the Security pillar

  • Performance Tuning: Latency optimization under nominal load is Performance Efficiency

  • Deployment Pipelines: CI/CD processes are in Operations

  • Data Protection: GDPR compliance and data categorization fall under the Sovereign pillar

  • Network Security: Firewall rules and VPN configuration fall under the Security pillar

  • Cost Optimization: Reliability has costs, but total cost of ownership (TCO) is covered by the Cost pillar

Brownfield vs. Greenfield

Greenfield (New Development)

For new development, reliability can be built in from the start:

| Phase | Reliability Requirement |
| --- | --- |
| Concept | SLO definition, RTO/RPO decision, dependency assessment |
| Design | Multi-AZ architecture, circuit breaker design, backup strategy |
| Implementation | IaC with all WAF-REL controls from the beginning; health checks in code |
| Go-Live | DR test completed before the first production load; chaos test passed in staging |
| Operations | Quarterly DR tests, chaos experiments, SLO review cycle |

Brownfield (Existing Systems)

For existing systems, a risk-based approach is recommended:

  1. Inventory: Identify all production workloads, classify by criticality

  2. SLO Baseline: Measure current availability to know the starting point

  3. Quick Wins: Health checks and alerting can be retrofitted quickly (1–2 sprints)

  4. Critical Systems First: Multi-AZ and backup tests for the most critical systems

  5. Document Debt: Record known gaps in the Reliability Debt Register

  6. Improve Iteratively: Quarterly review cycle for structured improvement

The most common risk in brownfield environments is a system that has never been through a DR test. Start with a restore test for a single service before planning more complex tests.
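For the SLO baseline step, current availability can be estimated from recorded downtime over a measurement period. The sketch below assumes downtime has already been totaled from incident or monitoring data; the function names are illustrative.

```python
from datetime import timedelta


def availability(period: timedelta, downtime: timedelta) -> float:
    """Fraction of the period during which the service was available."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must lie between zero and the period")
    return 1.0 - downtime / period


def allowed_downtime(slo_target: float, period: timedelta) -> timedelta:
    """Downtime that a given availability target permits over the period."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return period * (1.0 - slo_target)
```

As a reference point, a 99.9% target over 30 days permits about 43 minutes of downtime, so a service that was down for two hours in the last month starts well below that baseline.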

Reliability Drivers

| Driver | Description | WAF-REL Controls |
| --- | --- | --- |
| Customer Commitments | External SLAs require demonstrable availability | REL-010, REL-020, REL-030 |
| Regulatory Requirements | ISO 27001, GDPR, BSI C5 require demonstrable recovery capacity | REL-040, REL-060, REL-070 |
| Cost Risk | Unplanned outages cost more than preventive reliability investments | REL-030, REL-040, REL-100 |
| Engineering Productivity | High toil from reactive incident response is reliability debt | REL-060, REL-090, REL-100 |
| Growth Scaling | Systems that cannot handle 10x load block business growth | REL-020, REL-050, REL-080 |
| Partner Integration | B2B integrations require measurable availability and incident communication | REL-010, REL-060, REL-070 |