Glossary: Reliability
A
B
Backup
A copy of data at a specific point in time for recovery in case of data loss. Backups are scheduled according to RPO requirements. Untested backups are not backups.
C
Chaos Engineering
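The discipline of experimenting on production systems through controlled fault injection to uncover systemic weaknesses. Experiments are hypothesis-driven: each one states an expected outcome up front ("If X fails, Y happens") and then checks whether the system behaves that way.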
D
E
G
M
Maturity Model
A framework for assessing the current state of a discipline on a defined scale. WAF++ Reliability: 5 stages (Chaotic → Self-Healing).
Mean Time Between Failures – MTBF
The average time between two consecutive failures. The higher the MTBF, the more reliable the system. Relevant for hardware and long-lived systems.
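As an illustration of the definition above, MTBF can be estimated from a list of failure timestamps. This is a minimal sketch: the function name and the sample dates are hypothetical, and it measures failure-to-failure gaps as the definition states, without subtracting repair time.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failure timestamps."""
    if len(failure_times) < 2:
        raise ValueError("at least two failures are needed to measure a gap")
    ordered = sorted(failure_times)
    # Pair each failure with the next one and measure the gap between them.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical failure log: gaps of 10 and 20 days.
failures = [datetime(2024, 1, 1), datetime(2024, 1, 11), datetime(2024, 1, 31)]
mtbf = mean_time_between_failures(failures)
```

With the sample log, the two gaps average out to 15 days.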
Mean Time to Recovery – MTTR
The average time from the occurrence of a failure to full recovery. It is the sum of detection time (MTTD), diagnostic time, and remediation time.
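The relationship between MTTR and MTTD can be sketched from incident records. The `Incident` fields and the sample timestamps below are assumptions made for illustration; real incident tooling will use its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # failure occurs
    detected: datetime   # failure is noticed (end of detection time)
    recovered: datetime  # service fully restored


def mean_duration(durations):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents):
    return mean_duration([i.detected - i.started for i in incidents])


def mttr(incidents):
    # Covers detection, diagnosis, and remediation end to end.
    return mean_duration([i.recovered - i.started for i in incidents])


# Hypothetical incidents: recoveries of 30 and 90 minutes.
incidents = [
    Incident(datetime(2024, 3, 1, 12, 0), datetime(2024, 3, 1, 12, 5),
             datetime(2024, 3, 1, 12, 30)),
    Incident(datetime(2024, 3, 9, 8, 0), datetime(2024, 3, 9, 8, 15),
             datetime(2024, 3, 9, 9, 30)),
]
```

For these two incidents, detection averages 10 minutes and full recovery averages 60 minutes, so MTTD is one component of MTTR rather than a separate figure.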
R
Readiness Probe
A Kubernetes probe that checks whether a container is ready to accept traffic. On failure, the pod is removed from the Service's endpoints, but the container is not restarted. Prevents premature traffic routing during startup.
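On the application side, an HTTP readiness probe only needs an endpoint that reports whether the process can take traffic; Kubernetes treats a status code from 200 to 399 as success. The sketch below uses only the Python standard library. The `/ready` path, the `ready` flag, and the `make_server` helper are hypothetical names chosen for this example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once startup work (cache warm-up, DB connections, ...) has finished.
ready = threading.Event()


def probe_response(is_ready):
    """Status code and body the readiness endpoint should return."""
    return (200, b"ready") if is_ready else (503, b"not ready")


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":  # hypothetical probe path
            self.send_error(404)
            return
        status, body = probe_response(ready.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port):
    """Build (but do not start) an HTTP server exposing the probe."""
    return HTTPServer(("0.0.0.0", port), ReadinessHandler)
```

A service would call `make_server(8080).serve_forever()` in a background thread and call `ready.set()` once initialization completes. Until then the endpoint answers 503, so the pod receives no traffic.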
Recovery Point Objective – RPO
The maximum acceptable data loss in a failure scenario, measured in time. RPO = 1h: up to 1 hour of data loss is acceptable. Determines backup frequency.
Recovery Time Objective – RTO
The maximum acceptable time for full recovery after a failure. RTO = 30min: the service must be restored within 30 minutes.
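The RPO and RTO definitions above translate into simple time comparisons. The following sketch assumes that the worst-case data loss equals the interval between backups (ignoring how long a backup takes to complete); all function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure_time):
    """Amount of data, measured in time, lost if we restore the last backup."""
    return failure_time - last_backup


def meets_rpo(backup_interval, rpo):
    """Worst case, a failure hits just before the next backup runs."""
    return backup_interval <= rpo


def meets_rto(outage_start, service_restored, rto):
    """True if the service was restored within the recovery time objective."""
    return service_restored - outage_start <= rto


rpo = timedelta(hours=1)
rto = timedelta(minutes=30)

# Backups every 30 minutes satisfy a 1h RPO; backups every 2 hours do not.
half_hourly_ok = meets_rpo(timedelta(minutes=30), rpo)
two_hourly_ok = meets_rpo(timedelta(hours=2), rpo)

# An outage from 12:00 to 12:25 is within a 30min RTO.
within_rto = meets_rto(datetime(2024, 5, 1, 12, 0),
                       datetime(2024, 5, 1, 12, 25), rto)
```

The same comparisons can be applied to the results of restore drills to verify that the objectives are actually met in practice, not just on paper.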
Reliability Debt
Known weaknesses or deferred reliability improvements that increase the risk of failures. Analogous to technical debt; tracked in the WAF-REL-100 register.
S
Service Level Agreement – SLA
A contractual agreement on the availability and quality of a service. SLAs reference SLOs and define consequences for non-fulfillment.
Service Level Indicator – SLI
A concrete metric that measures an aspect of service quality. Examples: availability (%), latency (p99 ms), error rate (%), throughput (req/s).
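The example SLIs listed above can be computed from raw request records. This is a minimal sketch under stated assumptions: a request is a `(status_code, latency_ms)` tuple, any 5xx status counts as an error, and the percentile uses the nearest-rank method (monitoring systems often interpolate instead).

```python
import math


def availability_pct(requests):
    """Share of requests that did not end in a server error (5xx)."""
    good = sum(1 for status, _ in requests if status < 500)
    return good / len(requests) * 100


def error_rate_pct(requests):
    return 100 - availability_pct(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of request latency in milliseconds."""
    ordered = sorted(latency for _, latency in requests)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: three successful requests and one server error.
sample = [(200, 120), (200, 95), (500, 900), (200, 110)]
availability = availability_pct(sample)
errors = error_rate_pct(sample)
```

For the sample, three of four requests succeed, giving 75% availability and a 25% error rate.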