Design Principles: Reliability Architecture
The following eight technical design principles (RD1–RD8) translate the seven Reliability Principles into concrete architecture requirements.
RD1 – Idempotent Operations
All mutating operations (write, delete, update) MUST be idempotent. An operation executed multiple times MUST produce the same result as a single execution.
Why: Retry logic and automatic recovery require that failed operations can be safely retried. Non-idempotent operations lead to data duplication or inconsistent states on retries.
Implementation:
* Unique request IDs for all API calls (idempotency keys)
* Upsert semantics instead of separate create/update paths
* Optimistic locking with version/ETag validation
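The idempotency-key pattern can be sketched as follows. This is a minimal in-memory toy, not a production handler; the class and method names are illustrative assumptions:

```python
class IdempotentStore:
    """Toy store illustrating RD1: replaying a request with the same
    idempotency key returns the recorded result instead of writing twice."""

    def __init__(self):
        self._results = {}  # idempotency key -> recorded response
        self._items = {}    # item id -> payload (upsert semantics)

    def upsert(self, idempotency_key, item_id, payload):
        # Replay: return the stored result, perform no second write.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # Upsert: create or overwrite, never a separate create/update path.
        self._items[item_id] = payload
        result = {"item_id": item_id, "status": "written"}
        self._results[idempotency_key] = result
        return result
```

Executing the same call twice yields the same result and exactly one stored item, which is what makes blind retries safe.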
RD2 – Stateless Services (Stateless First)
Services SHOULD be stateless. State MUST be held in dedicated data stores, not in service instances.
Why: Stateless services can be restarted, scaled and replaced without data loss. This is the prerequisite for auto-healing (RP3) and horizontal scaling.
Implementation:
* Sessions in Redis/Memcached or JWT-based (no local in-memory state)
* Uploads directly to object storage (S3, GCS, Azure Blob), not local filesystem
* Feature flags from a central service, not from local config files
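The stateless pattern can be sketched with an external session store injected into the service, so any instance can serve any request. A plain dict stands in for Redis here; all names are illustrative assumptions:

```python
class SessionStore:
    """Stands in for Redis/Memcached: state lives outside the service."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def put(self, session_id, session):
        self._data[session_id] = session


class StatelessService:
    """Holds no session state itself; any instance can be replaced."""

    def __init__(self, sessions):
        self.sessions = sessions  # injected shared store, never local state

    def handle(self, session_id):
        session = self.sessions.get(session_id) or {"hits": 0}
        session["hits"] += 1
        self.sessions.put(session_id, session)
        return session["hits"]
```

Because two instances share the store, a request can land on either one and still see the same session, which is the precondition for auto-healing and horizontal scaling.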
RD3 – Graceful Degradation Instead of Complete Failure
Services MUST remain functional when non-critical dependencies fail, even if with a reduced feature set.
Why: Complete outages on failure of optional features violate the blast radius principle (RP5). Users accept reduced functionality far better than system outages.
Implementation:
* Feature flags disable non-critical features on dependency failures
* Cached responses as a fallback when a dependency is unavailable
* Stub responses for optional enrichment services
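The fallback-to-stub pattern can be sketched as follows; the function and field names are illustrative assumptions, not a real API:

```python
def get_product(product_id, enrich):
    """Returns core product data always; recommendations only if the
    optional enrichment dependency is healthy (sketch of RD3)."""
    product = {"id": product_id, "name": "example"}
    try:
        product["recommendations"] = enrich(product_id)
        product["degraded"] = False
    except Exception:
        # Stub response: the optional feature degrades,
        # the core response still succeeds.
        product["recommendations"] = []
        product["degraded"] = True
    return product
```

When the enrichment call raises, the user still gets the product, just without recommendations, which keeps the blast radius confined to the optional feature.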
RD4 – Bulkhead Isolation of Resource Pools
Different dependency classes MUST use separate connection pools, thread pools and queue capacities.
Why: Without bulkheads, a slow optional service can exhaust all available threads and block critical services (cascading failure).
Implementation:
* Separate Resilience4j/Hystrix bulkheads per dependency class
* Separate HTTP client instances per external API
* Separate connection pool configuration per database instance
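The semaphore-based bulkhead that libraries like Resilience4j provide can be sketched in a few lines. This toy rejects excess calls outright rather than queueing them; the class name and fail-fast behavior are illustrative assumptions:

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency class so a slow
    dependency cannot exhaust the shared thread pool (sketch of RD4)."""

    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        # Fail fast instead of blocking: a full bulkhead rejects the call.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

One `Bulkhead` instance per dependency class means saturation of, say, an optional enrichment API only fills that bulkhead; calls guarded by other bulkheads proceed unaffected.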
RD5 – Immutable Infrastructure
Production infrastructure SHOULD be updated by replacement, not modification.
Why: Mutable infrastructure accumulates configuration drift, which impairs reliability and complicates recovery. Immutable infrastructure enables deterministic behavior and reproducible deployments.
Implementation:
* No SSH/RDP into production instances for configuration changes
* All configuration changes via IaC deploy, never in place
* Container images are immutable: running containers are never modified
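Replace-not-modify can be sketched as a deploy step that produces a fresh fleet and leaves the existing one untouched; the fleet representation and naming are illustrative assumptions:

```python
def deploy(fleet, new_image):
    """Immutable-infrastructure sketch (RD5): rolling out new_image
    returns a brand-new fleet; existing instances are never mutated."""
    return [
        {"id": f"{instance['id']}-next", "image": new_image,
         "replaces": instance["id"]}
        for instance in fleet
    ]
```

The old fleet remains byte-for-byte as it was, so a rollback is simply routing traffic back to it, and no instance ever accumulates configuration drift.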
RD6 – Defense in Depth for Data Persistence
Data persistence MUST have multiple independent protection layers: Multi-AZ deployment, backups, cross-region replication and point-in-time recovery (PITR).
Why: Individual protection measures fail. The combination of multiple independent layers reduces the risk of complete data loss to near zero.
Implementation:
* Multi-AZ primary database (synchronous replication, automatic failover)
* Daily backups in a separate account/region (asynchronous replication)
* PITR for recovery to a transaction-precise point in time
* Backup immutability via WORM storage or Vault Lock
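The PITR layer can be sketched as replaying an append-only transaction log up to a chosen transaction, which is roughly what database WAL-based recovery does. This is a deliberately tiny toy under that assumption:

```python
def restore_to(txn_log, upto_txn):
    """PITR sketch (RD6): rebuild state by replaying an ordered
    append-only log up to and including transaction `upto_txn`."""
    state = {}
    for txn_id, key, value in txn_log:
        if txn_id > upto_txn:
            break  # everything after the target point is discarded
        state[key] = value
    return state
```

Replaying to transaction 2 below recovers the state exactly as it was before the bad write in transaction 3, which is the recovery granularity the principle demands.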
RD7 – Observability Before Intervention
A production system MUST NOT be deployed without logs, metrics and traces being available for incident investigation.
Why: Recovery without observability is flying blind. MTTR increases dramatically when diagnostic data is missing or must first be configured after the incident has started.
Implementation:
* Structured JSON logs with correlated request ID
* RED metrics: Rate, Errors, Duration for each service
* Distributed tracing (OpenTelemetry) for inter-service dependencies
* SLO dashboard available before go-live
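A structured JSON log line with a correlated request ID can be sketched as follows; the field names are illustrative assumptions, not a prescribed schema:

```python
import json
import time


def log_event(request_id, service, msg, **fields):
    """Emits one machine-parseable JSON log line (sketch of RD7).
    The request_id lets investigators correlate lines across services."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "service": service,
        "msg": msg,
        **fields,  # e.g. status, duration_ms
    }
    return json.dumps(record)
```

Because every line is valid JSON carrying the same request ID across services, an incident investigation can filter the whole request path with one query instead of grepping free text.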
RD8 – Chaos-Ready by Design
New services SHOULD be designed so that chaos tests can be easily conducted.
Why: Services that are difficult to test are not tested. Chaos testability is a design quality characteristic that must be considered from the beginning.
Implementation:
* Chaos endpoints (feature-flag-controlled) for fault injection in staging
* Explicit timeout configuration (no hard-coded values)
* Health check endpoints that return real dependency status
* Stop conditions documented for all automated chaos tools
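Feature-flag-controlled fault injection can be sketched as a thin middleware wrapped around a handler; the class, flag name and error type are illustrative assumptions:

```python
class ChaosMiddleware:
    """Chaos-ready sketch (RD8): with the flag off this is a no-op
    pass-through; with it on, it injects a fault before the handler."""

    def __init__(self, handler, flags):
        self.handler = handler
        self.flags = flags  # e.g. {"inject_fault": True}, staging only

    def handle(self, request):
        if self.flags.get("inject_fault"):
            raise TimeoutError("chaos: injected dependency timeout")
        return self.handler(request)
```

Flipping the flag is the stop condition: turning it off immediately restores normal behavior, so a chaos experiment in staging can be aborted without a redeploy.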