Glossary: Reliability

A

Availability

Percentage of time during which a system correctly responds to requests. Measured as (uptime / total_time) * 100. Typical targets and the downtime they allow: 99.9% (about 8.8h/year), 99.95% (about 4.4h/year), 99.99% (about 53min/year).
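Example: the downtime figures follow directly from the formula. A minimal sketch, assuming a 365-day year; the function name allowed_downtime_hours is illustrative only.

```python
def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = 365 * 24) -> float:
    """Downtime permitted by an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours


for target in (99.9, 99.95, 99.99):
    hours = allowed_downtime_hours(target)
    # Print both hours and minutes so small budgets stay readable.
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.0f} min/year)")
```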

Availability Zone (AZ)

One or more physically isolated data centers within a cloud region. AZs are designed so that a failure affects only the resources in that zone, while the other AZs in the region keep running.

B

Backup

A copy of data at a specific point in time for recovery in case of data loss. Backups are scheduled according to RPO requirements. Untested backups are not backups.

Blast Radius

Extent of damage that a single failure can cause. Reliability design aims to limit the blast radius through bulkheads, circuit breakers and AZ isolation.

Bulkhead

A resource isolation pattern: each dependency class receives its own thread and connection pools. Prevents a slow service from exhausting all resources. Named after the watertight bulkheads in ships.
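Example: a minimal sketch of the idea, using a semaphore per dependency class instead of full thread and connection pools (a simplification). The class Bulkhead, the slot() method and the dependency names are hypothetical.

```python
import threading
from contextlib import contextmanager


class Bulkhead:
    """Caps the number of concurrent calls to one dependency class."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject immediately instead of queueing,
        # so a slow dependency cannot tie up callers indefinitely.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"bulkhead '{self.name}' is full")
        try:
            yield
        finally:
            self._slots.release()


# One bulkhead per dependency class (names are assumptions).
payments = Bulkhead("payments", max_concurrent=2)
search = Bulkhead("search", max_concurrent=2)
```

If all payments slots are taken, further payments calls are rejected at once, while search calls still get their own slots.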

C

Chaos Engineering

The discipline of experimenting on production systems through controlled fault injection to uncover systemic weaknesses. Hypothesis-driven: "If X fails, Y happens."

Circuit Breaker

A fault tolerance pattern: when the error rate exceeds a threshold, requests are rejected immediately (open state) instead of further stressing the failed system. After a timeout, limited trial traffic is permitted (half-open): on success the breaker resets (closed), on failure it opens again.
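Example: a minimal single-threaded sketch of the three states. It opens after a number of consecutive failures rather than tracking an error rate, which is a simplification; CircuitBreaker, CircuitOpenError and the parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half-open reopens the circuit at once.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The clock parameter is injectable so the timeout can be exercised in tests without waiting.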

Continuous Data Protection – CDP

A backup method that replicates data changes in real time instead of creating periodic snapshots. Enables RPO near zero.

D

Dependency

Any external service, API or system that a workload depends on for its function. Critical dependencies: failure leads to workload failure. Optional dependencies: failure leads to feature loss, not complete outage.

Disaster Recovery – DR

The totality of all plans, processes and systems for recovery after a catastrophic failure (region failure, ransomware, natural disaster).

E

Error Budget

Remaining error tolerance until an SLO violation. Calculation: (1 - SLO) * measurement_window. With a 99.9% SLO and a 30-day window: 43.2 minutes error budget per month.
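Example: the calculation above as a small function. A sketch only; the function name error_budget_minutes is an assumption, and the SLO is passed as a fraction (0.999 for 99.9%).

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget = (1 - SLO) * measurement window, in minutes."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


print(round(error_budget_minutes(0.999), 1))  # 43.2
```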

Error Budget Burn Rate

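The rate at which the error budget is being consumed, relative to the rate that would use it up exactly at the end of the measurement window. A burn rate of 14x means that a 30-day budget is exhausted in about 2 days (30 / 14 ≈ 2.1).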
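Example: a sketch of how a burn rate can be derived from an observed error rate, assuming the common definition "observed error rate divided by the error rate the SLO allows". The function names are illustrative.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors occur."""
    return observed_error_rate / (1 - slo)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Time until the whole window's budget is gone at a constant rate."""
    return window_days / rate


# 1.4% of requests failing against a 99.9% SLO is roughly a 14x burn.
rate = burn_rate(observed_error_rate=0.014, slo=0.999)
print(f"burn rate {rate:.1f}x, budget gone in "
      f"{days_until_exhausted(rate):.1f} days")
```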

F

Failover

Automatic switch to a standby resource when the primary resource fails. Example: Amazon RDS Multi-AZ fails over to the standby instance automatically, typically within 1 to 2 minutes.

Fault Injection

Controlled introduction of failures into a system for testing purposes. Tools: AWS FIS, Azure Chaos Studio, Chaos Monkey.

G

GameDay

A structured, team-wide chaos event to validate the resilience of the overall system. Typically: a half-day exercise with targeted scenarios and defined roles.

Graceful Degradation

The ability of a system to remain operational with reduced functionality or capacity when individual components fail. Opposite: a hard failure (fail-fast) or complete outage.

H

Health Check

An endpoint or probe that checks the state of a service instance. Types: Liveness (is the process alive?), Readiness (is the service accepting traffic?), Startup (has the service started?).

I

Idempotency

The property of an operation whereby executing it multiple times has the same effect as executing it once. Essential for safe retry logic.
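Example: one common way to make a non-idempotent operation safe to retry is an idempotency key supplied by the client. A minimal in-memory sketch; PaymentService, charge and the starting balance are hypothetical.

```python
class PaymentService:
    def __init__(self) -> None:
        self.balance = 100
        self._processed = {}  # idempotency key -> stored result

    def charge(self, idempotency_key: str, amount: int) -> int:
        # A repeated key returns the stored result without charging again.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        self.balance -= amount
        self._processed[idempotency_key] = self.balance
        return self.balance


svc = PaymentService()
first = svc.charge("order-1", 30)
retried = svc.charge("order-1", 30)  # e.g. a retry after a timeout
print(first, retried, svc.balance)   # 70 70 70
```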

L

Liveness Probe

A Kubernetes probe that checks whether a container is still alive. On failure: the container is restarted. Detects deadlocks and infinite loops.

M

Maturity Model

A framework for assessing the current state of a discipline on a defined scale. WAF++ Reliability: 5 stages (Chaotic → Self-Healing).

Mean Time Between Failures – MTBF

The average time between two consecutive failures. The higher, the more reliable. Relevant for hardware and long-lived systems.

Mean Time to Recovery – MTTR

The average time from the occurrence of a failure to full recovery. Includes detection time (MTTD) + diagnostic time + remediation time.

Mean Time to Detect – MTTD

The average time from failure occurrence to detection by monitoring or users. Good monitoring can reduce MTTD to under 5 minutes.

Mean Time to Failure – MTTF

The expected operating time until the first failure. Used for non-repairable systems (hardware components).

P

Point-in-Time Recovery – PITR

The ability to restore a database to any point in time within the retention window. Requires continuous transaction log archiving in addition to periodic base backups.

R

Readiness Probe

A Kubernetes probe that checks whether a container is ready to accept traffic. On failure: the pod is removed from the Service's endpoints and receives no traffic, but it is not restarted. Prevents premature traffic routing during startup.

Recovery Point Objective – RPO

The maximum acceptable data loss in a failure scenario, measured in time. RPO = 1h: up to 1 hour of data loss is acceptable. Determines backup frequency.

Recovery Time Objective – RTO

The maximum acceptable time for full recovery after a failure. RTO = 30min: the service must be restored within 30 minutes.

Reliability Debt

Known weaknesses or deferred reliability improvements that increase the risk of failures. Analogous to technical debt; tracked in the WAF-REL-100 register.

Retry with Exponential Backoff

A retry strategy where the wait time between attempts grows exponentially, typically doubling after each failed attempt up to a maximum. Prevents retry storms. With jitter: random variation prevents synchronized retries from many clients.
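Example: a minimal sketch of exponential backoff with "full jitter" (each wait is a random value between 0 and the current backoff). The function names, defaults and the injectable sleep and rng parameters are assumptions made for illustration.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=5):
    """Un-jittered delays: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


def retry(func, attempts=5, base=0.5, cap=30.0,
          sleep=time.sleep, rng=random.random):
    """Call func until it succeeds or `attempts` calls have failed."""
    delays = backoff_delays(base=base, cap=cap, attempts=attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random fraction of the current backoff.
            sleep(rng() * delay)


print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

Retrying is only safe when the wrapped operation is idempotent.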

Runbook

A documented step-by-step guide for diagnosing and resolving a specific incident or alert type. Must be directly linked from alert notifications.

S

Service Level Agreement – SLA

A contractual agreement on the availability and quality of a service. SLAs reference SLOs and define consequences for non-fulfillment.

Service Level Indicator – SLI

A concrete metric that measures an aspect of service quality. Examples: availability (%), latency (p99 ms), error rate (%), throughput (req/s).

Service Level Objective – SLO

An internal target for an SLI. Example: "HTTP availability >= 99.9% over 30 days." SLOs drive error budgets and reliability decisions.

Single Point of Failure – SPOF

A component whose failure renders the entire system or an SLO-relevant part unavailable. SPOFs are architectural debt and are tracked in WAF-REL-100.