Scope: Reliability
What is in Scope?
The Reliability pillar addresses the following subject areas:
SLO & SLA Governance
- Definition and documentation of Service Level Objectives (SLOs)
- Measurement of availability, latency, error rate and throughput
- Error budget management and burn rate alerting
- SLA agreements with internal and external customers
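Error budget and burn rate are simple ratios, so a short sketch may help make the terms concrete. The function names and the 99.9% target below are illustrative assumptions, not part of the framework.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    1.0 means the budget is spent exactly by the end of the SLO period;
    values above 1.0 mean it runs out early.
    """
    if total <= 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def budget_remaining(failed: int, total: int, slo_target: float) -> float:
    """Share of the period's error budget still unspent (negative if overspent)."""
    if total <= 0:
        return 1.0
    allowed_failures = total * error_budget(slo_target)
    return 1.0 - failed / allowed_failures


# Hypothetical numbers: 50 failures out of 10,000 requests against a 99.9% SLO
# burn the budget roughly five times faster than the SLO allows.
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
```

A burn rate alert would compare `rate` against a threshold chosen per alert window; which thresholds and windows to use is a policy decision this sketch does not prescribe.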
High Availability
- Multi-AZ deployment for all production workloads
- Automatic failover for databases and stateful services
- Load balancers with cross-AZ configuration
- Kubernetes pod distribution across availability zones
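Whether pods are actually spread across zones can be checked with a small calculation. The sketch below assumes the zone of each running pod has already been collected (how that is done is left open) and computes the difference between the fullest and emptiest zone, similar in spirit to the `maxSkew` setting of Kubernetes topology spread constraints. The function name and zone labels are hypothetical.

```python
from collections import Counter


def zone_skew(pod_zones: list[str], zones: list[str]) -> int:
    """Difference between the most and least populated zone.

    pod_zones: the zone of each running pod (one entry per pod).
    zones: every zone the workload is expected to use, so that a zone
           with zero pods still counts toward the skew.
    """
    if not zones:
        raise ValueError("at least one expected zone is required")
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in zones]
    return max(per_zone) - min(per_zone)


# Hypothetical example: four pods over three zones are evenly spread (skew 1),
# while three pods in a single zone leave two zones empty (skew 3).
balanced = zone_skew(["az-a", "az-a", "az-b", "az-c"], ["az-a", "az-b", "az-c"])
lopsided = zone_skew(["az-a", "az-a", "az-a"], ["az-a", "az-b", "az-c"])
```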
Health Monitoring
- Health check endpoints for all services
- Readiness and liveness probes (Kubernetes)
- Load balancer health checks with explicit thresholds
- Synthetic monitoring for external availability validation
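"Explicit thresholds" usually means that a target changes state only after several consecutive check results disagree with its current state, which avoids flapping on a single failed probe. The following is a minimal sketch of that logic; the class name and default values are assumptions for illustration, not settings of any particular load balancer.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    healthy_threshold: int = 3    # consecutive passes needed to become healthy
    unhealthy_threshold: int = 2  # consecutive failures needed to become unhealthy
    healthy: bool = True
    _streak: int = 0              # consecutive results contradicting the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one health check result and return the resulting state."""
        if check_passed == self.healthy:
            # Result agrees with the current state: any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy


tracker = HealthTracker(healthy_threshold=2, unhealthy_threshold=2)
states = [tracker.record(result) for result in (False, True, False, False, True, True)]
```

In this run the isolated first failure does not change the state; only the two consecutive failures do, and two consecutive passes restore it.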
Backup & Recovery
- Automated backup configuration with defined retention periods
- Point-in-Time Recovery (PITR) for databases
- Cross-account/cross-region backup storage
- Tested and documented recovery procedures
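Retention periods and PITR windows both reduce to date arithmetic. The sketch below shows the two checks under the assumption that backup timestamps are available as timezone-aware `datetime` values; the function names and the 30-day and 7-day figures are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def expired_backups(backup_times: list[datetime], now: datetime,
                    retention: timedelta) -> list[datetime]:
    """Backups older than the retention period, oldest first."""
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)


def within_pitr_window(restore_point: datetime, now: datetime,
                       pitr_window: timedelta) -> bool:
    """True if the requested restore point is still covered by PITR."""
    return now - pitr_window <= restore_point <= now


# Hypothetical example: backups taken 1, 10 and 40 days ago, 30-day retention.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
backups = [now - timedelta(days=d) for d in (1, 10, 40)]
to_delete = expired_backups(backups, now, retention=timedelta(days=30))
restorable = within_pitr_window(now - timedelta(days=3), now, timedelta(days=7))
```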
Resilience Patterns
- Circuit breakers for all synchronous dependencies
- Timeout configuration for all outgoing calls
- Retry logic with exponential backoff and jitter
- Bulkhead isolation for different dependency classes
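Two of these patterns, retries with exponential backoff and jitter and a circuit breaker, can be sketched in a few lines. This is a simplified illustration using only the standard library: the class and function names, the thresholds, and the "full jitter" variant (a uniformly random delay up to the exponential bound) are assumptions, and a production setup would more likely rely on an established resilience library.

```python
import random
import time


def backoff_delays(retries, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter delays: uniform in [0, min(cap, base * 2**attempt))."""
    return [rng() * min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_retries(func, retries=3, base=0.1, cap=5.0, sleep=time.sleep):
    """Call func, retrying up to `retries` times with jittered backoff."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return func()
        except Exception:
            sleep(delay)
    return func()  # final attempt; its exception propagates to the caller


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency circuit is open")
            # Otherwise half-open: let this one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a dependency as `breaker.call(lambda: call_with_retries(fetch))`, where `fetch` is a hypothetical function that already enforces its own timeout. Timeouts and bulkhead isolation are not covered by this sketch.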
Incident Response
- Severity classification and escalation paths
- Runbooks for all critical alerts
- On-call rotation and notification configuration
- Post-incident reviews and action item tracking
Disaster Recovery Testing
- Documented DR plans with RTO/RPO targets
- At least two DR tests per year
- Results documentation with actual RTO/RPO achieved
- Automated DR procedures via IaC
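Recording DR test results in a structured form makes the comparison of achieved against target RTO/RPO mechanical. The sketch below is one possible shape for such a record; the type, field names, service name and figures are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DrTestResult:
    service: str
    rto_target: timedelta  # maximum acceptable time to restore service
    rpo_target: timedelta  # maximum acceptable window of data loss
    rto_actual: timedelta  # measured during the test
    rpo_actual: timedelta  # measured during the test

    @property
    def rto_met(self) -> bool:
        return self.rto_actual <= self.rto_target

    @property
    def rpo_met(self) -> bool:
        return self.rpo_actual <= self.rpo_target

    @property
    def passed(self) -> bool:
        return self.rto_met and self.rpo_met


def meets_test_cadence(test_dates: list[date], year: int, minimum: int = 2) -> bool:
    """True if at least `minimum` DR tests were run in the given year."""
    return sum(1 for d in test_dates if d.year == year) >= minimum


# Hypothetical result: recovery was fast enough, but more data was lost than allowed.
result = DrTestResult(
    service="orders-db",
    rto_target=timedelta(hours=4),
    rpo_target=timedelta(minutes=15),
    rto_actual=timedelta(hours=3, minutes=20),
    rpo_actual=timedelta(minutes=25),
)
```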
What is NOT in Scope?
- Security Incident Response: Security incidents fall under the Security pillar
- Performance Tuning: Latency optimization under nominal load falls under Performance Efficiency
- Deployment Pipelines: CI/CD processes fall under Operations
- Data Protection: GDPR compliance and data categorization fall under the Sovereign pillar
- Network Security: Firewall rules and VPN configuration fall under the Security pillar
- Cost Optimization: Reliability has costs, but total cost of ownership (TCO) is handled in the Cost pillar
Brownfield vs. Greenfield
Greenfield (New Development)
For new development, reliability can be built in from the start:
| Phase | Reliability Requirement |
|---|---|
| Concept | SLO definition, RTO/RPO decision, dependency assessment |
| Design | Multi-AZ architecture, circuit breaker design, backup strategy |
| Implementation | IaC with all WAF-REL controls from the beginning; health checks in code |
| Go-Live | DR test before first production loads; chaos test in staging passed |
| Operations | Quarterly DR tests, chaos experiments, SLO review cycle |
Brownfield (Existing Systems)
For existing systems, a risk-based approach is recommended:
- Inventory: Identify all production workloads, classify by criticality
- SLO Baseline: Measure current availability to know the starting point
- Quick Wins: Health checks and alerting can be retrofitted quickly (1–2 sprints)
- Critical Systems First: Multi-AZ and backup tests for the most critical systems
- Document Debt: Record known gaps in the Reliability Debt Register
- Improve Iteratively: Quarterly review cycle for structured improvement
Note: Brownfield systems without a DR test are the most common risk. Start with a single-service restore test before planning more complex tests.
Reliability Drivers
| Driver | Description | WAF-REL Controls |
|---|---|---|
| Customer Commitments | External SLAs require demonstrable availability | REL-010, REL-020, REL-030 |
| Regulatory Requirements | ISO 27001, GDPR, BSI C5 require demonstrable recovery capacity | REL-040, REL-060, REL-070 |
| Cost Risk | Unplanned outages cost more than preventive reliability investments | REL-030, REL-040, REL-100 |
| Engineering Productivity | High toil from reactive incident response is reliability debt | REL-060, REL-090, REL-100 |
| Growth Scaling | Systems that cannot handle 10x load block business growth | REL-020, REL-050, REL-080 |
| Partner Integration | B2B integrations require measurable availability and incident communication | REL-010, REL-060, REL-070 |