Reliability Principles
The Reliability pillar rests on seven core principles (RP1–RP7). None of them is an implementation detail; each is an architectural stance that underpins all reliability controls and best practices.
RP1 – Measure First
Explanation
Reliability investments without measurable goals produce activity without impact. Before a team deploys Multi-AZ, configures circuit breakers or runs chaos tests, it must know: What is the current reliability level? What is the acceptable minimum? What do users and stakeholders expect?
SLOs are the foundation. Without SLOs, an organization cannot tell whether it is over-engineered (investing too much in reliability) or under-engineered (investing too little). Error budgets turn SLOs from static goals into a dynamic basis for decisions: the budget is the amount of unreliability the SLO permits (1 minus the SLO target) over the measurement window, and the share that remains indicates whether the team can keep shipping changes or should prioritize reliability work.
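The error budget arithmetic can be sketched in a few lines. This is a minimal illustration, assuming a request-based SLI; the names `ErrorBudget` and `release_decision` and the 25 % freeze threshold are hypothetical and not part of the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # e.g. 0.999 means "99.9 % of requests succeed"
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that violated the SLI in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO allows to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative means the SLO is breached."""
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / self.allowed_failures


def release_decision(budget: ErrorBudget, freeze_below: float = 0.25) -> str:
    # Assumed policy: keep shipping while enough budget remains, otherwise
    # freeze feature releases and spend the time on reliability work.
    return "ship" if budget.remaining_fraction >= freeze_below else "freeze"
```

With a 99.9 % target and 1,000,000 requests, the budget is about 1,000 failed requests; 400 failures leave roughly 60 % of it, so the assumed policy returns "ship".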
RP2 – Design for Failure
Explanation
In cloud environments, hardware failures, AZ disruptions, network interruptions and software bugs are not exceptions – they are normal conditions that occur with statistical regularity. A system that responds to the failure of one component with a complete outage was not designed for reliability.
Designing for reliability means that the failure of any single component may impair overall availability only within defined limits. Multi-AZ deployment implements this principle for AZ failures; circuit breakers implement it for dependency failures.
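To make the circuit breaker idea concrete, here is a minimal single-threaded sketch. The class name, the thresholds and the injectable `clock` parameter are assumptions for illustration; production libraries add thread safety, metrics and finer-grained half-open handling.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of tying up resources on a dead dependency.
                raise CircuitOpenError("dependency call rejected: circuit open")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers are expected to catch `CircuitOpenError` and serve a fallback, so that the failing dependency degrades one feature instead of the whole request path.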
RP3 – Automate Recovery
Explanation
Manual intervention is too slow for high-availability services in modern, distributed systems: when recovery depends on a human noticing, diagnosing and acting, that response time dominates MTTR. Auto-healing mechanisms – health-check-based instance replacement, Kubernetes pod restarts, auto-scaling under load spikes – can reduce MTTR from minutes to seconds.
Automated recovery requires valid health checks (RP1), clearly defined failure states, idempotent restart procedures and well-maintained infrastructure-as-code. Automation is only safe if the automation itself has been tested (RP4).
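One pass of a health-check-based recovery loop might look like the sketch below. The function `reconcile_once`, its parameters and the three-strikes threshold are hypothetical; the `replace` callback stands in for whatever idempotent replacement procedure the platform provides.

```python
from typing import Callable, Dict, List


def reconcile_once(
    health_checks: Dict[str, Callable[[], bool]],
    replace: Callable[[str], None],
    failure_counts: Dict[str, int],
    unhealthy_after: int = 3,
) -> List[str]:
    """Run every health check once and replace instances that keep failing."""
    replaced = []
    for instance_id, is_healthy in health_checks.items():
        try:
            healthy = bool(is_healthy())
        except Exception:
            healthy = False  # a crashing probe counts as a failed check
        if healthy:
            failure_counts[instance_id] = 0
            continue
        failure_counts[instance_id] = failure_counts.get(instance_id, 0) + 1
        # Require consecutive failures so a single blip does not trigger a restart.
        if failure_counts[instance_id] >= unhealthy_after:
            replace(instance_id)  # assumed to be idempotent
            failure_counts[instance_id] = 0
            replaced.append(instance_id)
    return replaced
```

A scheduler would call this function at a fixed interval. Requiring several consecutive failures is the "clearly defined failure state" here: it trades a slightly longer detection time for fewer unnecessary replacements.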
RP4 – Test Everything
Explanation
Every reliability claim – "we are Multi-AZ", "our backup works", "the circuit breaker protects us" – remains an assertion until a corresponding test backs it. DR tests, chaos engineering and restore exercises are not nice-to-haves for advanced teams. They are the only way to convert reliability claims into evidence.
Untested systems fail differently than design reviews predict. Chaos Engineering as a systematic practice is the difference between a team that believes it is resilient and a team that knows it is.
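A chaos experiment in miniature: inject faults into a dependency and check that a steady-state hypothesis still holds. Everything named here (`FaultInjector`, `get_recommendations`, `experiment_passes`) is a hypothetical illustration, not a real chaos tooling API; real experiments inject faults at the infrastructure level and observe production-like SLIs.

```python
import random


class FaultInjector:
    """Wraps a callable and makes a configurable share of calls fail."""

    def __init__(self, func, failure_rate, seed=0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so experiments are repeatable

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def get_recommendations(fetch, user_id):
    """Hypothetical service function: degrade to an empty list on failure."""
    try:
        return fetch(user_id)
    except ConnectionError:
        return []


def experiment_passes(failure_rate, calls=100):
    """Steady-state hypothesis: every call returns a list, even under faults."""
    flaky = FaultInjector(lambda uid: ["item-1"], failure_rate)
    return all(
        isinstance(get_recommendations(flaky, uid), list) for uid in range(calls)
    )
```

If `get_recommendations` lacked the fallback, the injected `ConnectionError` would propagate and the experiment would fail loudly, which is exactly the evidence a design review cannot provide.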
RP5 – Limit Blast Radius
Explanation
Blast radius describes the extent of damage that a single failure can cause. Cascading failures occur when a failure in one component can consume the resources of other components without limit: thread pools, connection pools, request queues.
Blast radius limitation is the goal of bulkheads (isolation of resource pools), circuit breakers (fast-fail instead of resource drain), feature flags (selective deactivation) and AZ isolation (geographic fault boundary).
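A bulkhead can be sketched as a bounded pool of slots per dependency. The `Bulkhead` class below is an illustrative assumption built on the standard library's `threading.BoundedSemaphore`; it rejects excess calls immediately instead of letting them queue up.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when all slots of a bulkhead are in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        # Each dependency gets its own bounded slot pool, so a slow
        # dependency can exhaust only its own slots, not the shared ones.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: reject at once rather than pile up waiters.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance means a hanging payment service, for example, can block at most its own slots while calls to other dependencies continue to be served.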
RP6 – Eliminate Single Points of Failure
Explanation
A Single Point of Failure (SPOF) is a component whose failure leads to a complete outage or an SLO violation. In cloud architectures, SPOFs frequently arise from: single-AZ deployments, single-instance databases without failover, shared config endpoints, external dependencies without fallback.
Systematically identifying and eliminating SPOFs requires an architectural analysis (Failure Mode and Effects Analysis, FMEA). Its findings are documented in the Reliability Debt Register (WAF-REL-100) and validated empirically through Chaos Engineering (WAF-REL-090).
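A first-pass SPOF screen over a component inventory could look like this. The `Component` fields and the `spof_findings` function are hypothetical; the check only covers the SPOF sources listed above and does not replace an FMEA.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Component:
    name: str
    instances: int              # number of running instances
    zones: FrozenSet[str]       # availability zones the instances run in
    has_fallback: bool = False  # e.g. cached response or alternative provider


def spof_findings(components: List[Component]) -> List[Tuple[str, str]]:
    """Return (component name, reason) pairs for likely single points of failure."""
    findings = []
    for c in components:
        if c.has_fallback:
            continue  # a failure degrades the service but does not take it down
        if c.instances < 2:
            findings.append((c.name, "single instance without failover"))
        elif len(c.zones) < 2:
            findings.append((c.name, "all instances in a single AZ"))
    return findings
```

Each finding is a candidate entry for the debt register; whether it is a real SPOF still has to be confirmed by testing the failure.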
RP7 – Reliability as Architecture Concern
Explanation
Reliability cannot be retrofitted into a poorly designed system. Decisions about Multi-AZ, database failover, circuit breaker design and recovery strategy are made – or missed – in the architecture process. Operations teams can mitigate the impact through runbooks and incident response, but they cannot eliminate architectural debt.
Reliability must be anchored as an explicit requirement in Architecture Decision Records (ADRs), design reviews and sprint planning. Architecture decisions that reduce reliability must be documented in the Reliability Debt Register.
Implications
- Architecture reviews include a reliability assessment as a mandatory item
- ADRs document reliability implications explicitly
- Reliability debt from architecture decisions is registered in WAF-REL-100