Glossary: Operational Excellence
B
Blast Radius
The extent of the damage that a faulty deployment or change can cause. Progressive Delivery (Canary, Blue/Green) reduces it: a faulty canary version that receives 5% of traffic affects only about 5% of users, not 100%.
Blameless Culture (Blameless Postmortem)
Cultural principle: incident reviews focus on systemic causes rather than on finding culprits. Psychological safety is a prerequisite. The approach encourages open sharing of information and discourages hiding incidents.
C
Canary Release (Canary Deployment)
Deployment pattern: the new version gradually receives more traffic (e.g. 5% → 25% → 100%). At each step, metrics are compared between the old and the new version. The rollout is rolled back automatically if the error rate of the new version exceeds a defined threshold.
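The promotion and rollback decision described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `TRAFFIC_STEPS`, `Metrics`, and `next_step`, as well as the 1% default threshold, are assumptions made for this example.

```python
from dataclasses import dataclass

# Traffic steps from the glossary example (5% -> 25% -> 100%); assumed values.
TRAFFIC_STEPS = [5, 25, 100]


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat "no traffic yet" as an error rate of zero.
        return self.errors / self.requests if self.requests else 0.0


def next_step(current_pct: int, canary: Metrics, max_error_rate: float = 0.01) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    if canary.error_rate > max_error_rate:
        return 0  # threshold exceeded: roll back the canary
    idx = TRAFFIC_STEPS.index(current_pct)
    # Move to the next step; stay at 100% once fully rolled out.
    return TRAFFIC_STEPS[min(idx + 1, len(TRAFFIC_STEPS) - 1)]
```

A real rollout controller would usually also compare the canary against the baseline version and wait for a minimum sample size before deciding.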
Change Failure Rate (CFR)
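DORA metric: percentage of deployments that cause a failure in production and require remediation (incident response, rollback, or hotfix). Elite teams achieve roughly 5% or less; the exact benchmark varies between DORA report years. Measures deployment quality.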
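As a worked example of the arithmetic, CFR is the number of failed deployments divided by the total number of deployments. The function name and the sample numbers below are hypothetical.

```python
def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Return the change failure rate as a percentage."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    return 100.0 * failed_deployments / total_deployments


# Hypothetical month: 200 deployments, 8 of which needed a rollback or hotfix.
cfr = change_failure_rate(200, 8)  # 4.0 (percent)
```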
CI/CD (Continuous Integration / Continuous Delivery)
Continuous Integration: every commit is built and tested automatically. Continuous Delivery: tested artifacts are automatically kept ready for deployment; the release to production is still triggered manually. Continuous Deployment: fully automated deployment all the way to production, without manual release approval.
D
Deployment Frequency
DORA metric: how often a team deploys to production. Elite: multiple times per day. High: daily to weekly. Medium: weekly to monthly. Low: monthly to once every six months.
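The bands above can be turned into a simple classifier. This is a sketch under stated assumptions: the bands overlap at their edges, so the cut-offs chosen here (more than one deployment per day, at least one per 7 days, at least one per 30 days) are one possible reading, and `classify_frequency` is a hypothetical name.

```python
def classify_frequency(deployments: int, days: int) -> str:
    """Classify deployment frequency over an observation window of `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    if deployments > days:        # more than once per day on average
        return "Elite"
    if deployments * 7 >= days:   # at least once per week
        return "High"
    if deployments * 30 >= days:  # at least once per 30 days
        return "Medium"
    return "Low"
```

Integer comparisons are used instead of a deployments-per-day float to avoid rounding effects at the band boundaries.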
I
IaC (Infrastructure as Code)
Practice of describing cloud infrastructure in code, typically declaratively, instead of configuring it by hand. Terraform, Pulumi, AWS CDK, and Azure Bicep are common IaC tools. IaC definitions are versioned, reviewed, and deployed via CI/CD.
O
Observability
Ability to understand the internal state of a system from its outputs (logs, metrics, traces). Three pillars: Logs (structured events), Metrics (time series), Traces (distributed request tracking).
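To illustrate how the pillars connect, a structured log event can carry a trace ID so that log lines can be correlated with the trace of the same request. The helper `log_event` and its field names (`ts`, `msg`, `trace_id`) are assumptions for this sketch, not a standard schema.

```python
import json
import time


def log_event(message: str, trace_id: str, **fields) -> str:
    """Emit one structured log line (JSON) and return it."""
    record = {"ts": time.time(), "msg": message, "trace_id": trace_id, **fields}
    line = json.dumps(record)
    print(line)
    return line


# Hypothetical usage: the same trace_id would also be attached to the request's spans.
log_event("payment failed", trace_id="abc123", status=502, duration_ms=812)
```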
R
RED Metrics
Metrics framework for services (by Tom Wilkie): Rate – requests per second. Errors – error rate (HTTP 5xx, exceptions). Duration – latency (p50, p95, p99).
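A minimal sketch of how the three RED values could be derived from raw request records. It assumes that an error is any HTTP status of 500 or above and uses the nearest-rank method for percentiles; `Request`, `percentile`, and `red_metrics` are hypothetical names. In practice these values usually come from a metrics system rather than from in-memory lists.

```python
from typing import NamedTuple


class Request(NamedTuple):
    status: int         # HTTP status code
    duration_ms: float  # request latency in milliseconds


def percentile(sorted_values: list, p: int) -> float:
    """Nearest-rank percentile of an ascending list; p is an integer in 1..100."""
    rank = max(1, -(-p * len(sorted_values) // 100))  # integer ceiling
    return sorted_values[rank - 1]


def red_metrics(requests: list, window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration for one observation window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```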
S
SLO (Service Level Objective)
Internal target for service quality, measured over a defined time window: e.g. 99.9% availability, p99 latency < 500 ms, error rate < 0.1%. Foundation for SLO-based alerting and error budget management.
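The error budget that follows from an SLO is simple arithmetic: it is the share of time or requests that is allowed to fail. The sketch below assumes a 30-day window by default; both function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (100.0 - slo_percent) / 100.0 * window_days * 24 * 60


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - failed_requests / allowed_failures
```

For example, a 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime, and 250 failed requests out of 1,000,000 consume roughly a quarter of the corresponding request budget.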