Design Principles: Reliability Architecture
The following eight technical design principles (RD1–RD8) translate the seven Reliability Principles into concrete architecture requirements.
RD1 – Idempotent Operations
All mutating operations (write, delete, update) MUST be idempotent. An operation executed multiple times MUST produce the same result as a single execution.
Why: Retry logic and automatic recovery require that failed operations can be safely retried. Non-idempotent operations lead to data duplication or inconsistent states on retries.
Implementation:
* Unique request IDs for all API calls (idempotency keys)
* Upsert semantics instead of separate create/update paths
* Optimistic locking with version/ETag validation
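The idempotency-key pattern can be sketched as follows. This is a minimal in-memory toy, not a production handler; the class and method names are illustrative assumptions:

```python
class IdempotentStore:
    """Toy store illustrating RD1: replaying a request with the same
    idempotency key returns the recorded result instead of writing twice."""

    def __init__(self):
        self._results = {}  # idempotency key -> recorded response
        self._items = {}    # item id -> payload (upsert semantics)

    def upsert(self, idempotency_key, item_id, payload):
        # Replay: return the stored result, perform no second write.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # Upsert: create or overwrite, never a separate create/update path.
        self._items[item_id] = payload
        result = {"item_id": item_id, "status": "written"}
        self._results[idempotency_key] = result
        return result
```

Executing the same call twice yields the same result and exactly one stored item, which is what makes blind retries safe.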
RD2 – Stateless Services (Stateless First)
Services SHOULD be stateless. State MUST be held in dedicated data stores, not in service instances.
Why: Stateless services can be restarted, scaled and replaced without data loss. This is the prerequisite for auto-healing (RP3) and horizontal scaling.
Implementation:
* Sessions in Redis/Memcached or JWT-based (no local in-memory state)
* Uploads directly to object storage (S3, GCS, Azure Blob), not local filesystem
* Feature flags from a central service, not from local config files
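The stateless pattern can be sketched with an external session store injected into the service, so any instance can serve any request. A plain dict stands in for Redis here; all names are illustrative assumptions:

```python
class SessionStore:
    """Stands in for Redis/Memcached: state lives outside the service."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def put(self, session_id, session):
        self._data[session_id] = session


class StatelessService:
    """Holds no session state itself; any instance can be replaced."""

    def __init__(self, sessions):
        self.sessions = sessions  # injected shared store, never local state

    def handle(self, session_id):
        session = self.sessions.get(session_id) or {"hits": 0}
        session["hits"] += 1
        self.sessions.put(session_id, session)
        return session["hits"]
```

Because two instances share the store, a request can land on either one and still see the same session, which is the precondition for auto-healing and horizontal scaling.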
RD3 – Graceful Degradation Instead of Complete Failure
Services MUST remain functional when non-critical dependencies fail, even if with a reduced feature set.
Why: Complete outages on failure of optional features violate the blast radius principle (RP5). Users accept reduced functionality far better than system outages.
Implementation:
* Feature flags disable non-critical features on dependency failures
* Cached responses as a fallback when a dependency is unavailable
* Stub responses for optional enrichment services
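The fallback-to-stub pattern can be sketched as follows; the function and field names are illustrative assumptions, not a real API:

```python
def get_product(product_id, enrich):
    """Returns core product data always; recommendations only if the
    optional enrichment dependency is healthy (sketch of RD3)."""
    product = {"id": product_id, "name": "example"}
    try:
        product["recommendations"] = enrich(product_id)
        product["degraded"] = False
    except Exception:
        # Stub response: the optional feature degrades,
        # the core response still succeeds.
        product["recommendations"] = []
        product["degraded"] = True
    return product
```

When the enrichment call raises, the user still gets the product, just without recommendations, which keeps the blast radius confined to the optional feature.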
RD4 – Bulkhead Isolation of Resource Pools
Different dependency classes MUST use separate connection pools, thread pools and queue capacities.
Why: Without bulkheads, a slow optional service can exhaust all available threads and block critical services (cascading failure).
Implementation:
* Separate Resilience4j/Hystrix bulkheads per dependency class
* Separate HTTP client instances per external API
* Separate connection pool configuration per database instance
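The semaphore-based bulkhead that libraries like Resilience4j provide can be sketched in a few lines. This toy rejects excess calls outright rather than queueing them; the class name and fail-fast behavior are illustrative assumptions:

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency class so a slow
    dependency cannot exhaust the shared thread pool (sketch of RD4)."""

    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        # Fail fast instead of blocking: a full bulkhead rejects the call.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

One `Bulkhead` instance per dependency class means saturation of, say, an optional enrichment API only fills that bulkhead; calls guarded by other bulkheads proceed unaffected.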
RD5 – Immutable Infrastructure
Production infrastructure SHOULD be updated by replacement, not modification.
Why: Mutable infrastructure accumulates configuration drift, which impairs reliability and complicates recovery. Immutable infrastructure enables deterministic behavior and reproducible deployments.
Implementation:
* No SSH/RDP into production instances for configuration changes
* All configuration changes via IaC deploy, never in place
* Container images are immutable: running containers are never modified
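Replace-not-modify can be sketched as a deploy step that produces a fresh fleet and leaves the existing one untouched; the fleet representation and naming are illustrative assumptions:

```python
def deploy(fleet, new_image):
    """Immutable-infrastructure sketch (RD5): rolling out new_image
    returns a brand-new fleet; existing instances are never mutated."""
    return [
        {"id": f"{instance['id']}-next", "image": new_image,
         "replaces": instance["id"]}
        for instance in fleet
    ]
```

The old fleet remains byte-for-byte as it was, so a rollback is simply routing traffic back to it, and no instance ever accumulates configuration drift.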
RD6 – Defense in Depth for Data Persistence
Data persistence MUST have multiple independent protection layers: Multi-AZ deployment, backups, cross-region replication and point-in-time recovery (PITR).
Why: Individual protection measures fail. The combination of multiple independent layers reduces the risk of complete data loss to near zero.
Implementation:
* Multi-AZ primary database (synchronous replication, automatic failover)
* Daily backups in a separate account/region (asynchronous replication)
* PITR for recovery to a transaction-precise point in time
* Backup immutability via WORM storage or Vault Lock
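The PITR layer can be sketched as replaying an append-only transaction log up to a chosen transaction, which is roughly what database WAL-based recovery does. This is a deliberately tiny toy under that assumption:

```python
def restore_to(txn_log, upto_txn):
    """PITR sketch (RD6): rebuild state by replaying an ordered
    append-only log up to and including transaction `upto_txn`."""
    state = {}
    for txn_id, key, value in txn_log:
        if txn_id > upto_txn:
            break  # everything after the target point is discarded
        state[key] = value
    return state
```

Replaying to transaction 2 below recovers the state exactly as it was before the bad write in transaction 3, which is the recovery granularity the principle demands.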
RD7 – Observability Before Intervention
A production system MUST NOT be deployed without logs, metrics and traces being available for incident investigation.
Why: Recovery without observability is flying blind. MTTR increases dramatically when diagnostic data is missing or must first be configured after the incident has started.
Implementation:
* Structured JSON logs with correlated request ID
* RED metrics: Rate, Errors, Duration for each service
* Distributed tracing (OpenTelemetry) for inter-service dependencies
* SLO dashboard available before go-live
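A structured JSON log line with a correlated request ID can be sketched as follows; the field names are illustrative assumptions, not a prescribed schema:

```python
import json
import time


def log_event(request_id, service, msg, **fields):
    """Emits one machine-parseable JSON log line (sketch of RD7).
    The request_id lets investigators correlate lines across services."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "service": service,
        "msg": msg,
        **fields,  # e.g. status, duration_ms
    }
    return json.dumps(record)
```

Because every line is valid JSON carrying the same request ID across services, an incident investigation can filter the whole request path with one query instead of grepping free text.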
RD8 – Chaos-Ready by Design
New services SHOULD be designed so that chaos tests can be easily conducted.
Why: Services that are difficult to test are not tested. Chaos testability is a design quality characteristic that must be considered from the beginning.
Implementation:
* Chaos endpoints (feature-flag-controlled) for fault injection in staging
* Explicit timeout configuration (no hard-coded values)
* Health check endpoints that return real dependency status
* Stop conditions documented for all automated chaos tools
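Feature-flag-controlled fault injection can be sketched as a thin middleware wrapped around a handler; the class, flag name and error type are illustrative assumptions:

```python
class ChaosMiddleware:
    """Chaos-ready sketch (RD8): with the flag off this is a no-op
    pass-through; with it on, it injects a fault before the handler."""

    def __init__(self, handler, flags):
        self.handler = handler
        self.flags = flags  # e.g. {"inject_fault": True}, staging only

    def handle(self, request):
        if self.flags.get("inject_fault"):
            raise TimeoutError("chaos: injected dependency timeout")
        return self.handler(request)
```

Flipping the flag is the stop condition: turning it off immediately restores normal behavior, so a chaos experiment in staging can be aborted without a redeploy.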