Glossary: Operational Excellence

A

Alert Fatigue

A state in which on-call engineers receive so many alerts (many of them non-actionable) that they begin to ignore or snooze them, so real incidents are missed. Mitigation: symptom-based alerting and regular alert audits.

Artifact Versioning

Every deployment artifact (container image, Lambda ZIP) receives an immutable version (Git SHA or semantic version number). Deployments always reference a specific version, never a mutable tag such as latest.
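The rule above can be enforced with a simple check in the pipeline. The following is a minimal sketch, assuming container image references in the usual `repository:tag` or `repository@sha256:digest` form; the function name `is_pinned` is hypothetical. It only checks that a reference is explicit. Whether a tag is truly immutable depends on the registry's settings.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if the reference names a digest or an explicit tag other than 'latest'."""
    # A digest reference (repo@sha256:...) identifies exactly one image.
    if "@sha256:" in image_ref:
        return True
    # Look only at the last path segment so a registry port (host:5000/app)
    # is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag given: tooling falls back to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "latest"
```

A CI step could reject any manifest whose image references fail this check.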

B

Blast Radius

The extent of the damage caused by a faulty deployment or change. Reduced by Progressive Delivery (Canary, Blue/Green): for example, a faulty canary version that receives 5% of traffic affects only those users, not 100%.

Blameless Culture (Blameless Postmortem)

Cultural principle: in incident reviews, the focus is not on finding culprits but on systemic causes. Psychological safety is a prerequisite. Encourages open sharing of information and prevents incidents being hidden.

Blue/Green Deployment

Deployment pattern with two identical environments (Blue = currently live, Green = new version). Traffic switch happens atomically via load balancer. Immediate rollback by switching back to Blue.

Burn Rate Alert

SRE concept: alert triggered when the error budget of an SLO is being consumed faster than permitted. Burn rate alerts distinguish between fast-burn (page immediately) and slow-burn (create ticket, no immediate page).
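The burn rate is the observed error rate divided by the error rate the SLO permits. A burn rate of 1 uses up the budget exactly at the end of the SLO window. The sketch below illustrates the fast-burn/slow-burn split. The threshold values are assumptions for illustration: 14.4 corresponds to consuming 2% of a 30-day budget within one hour. Production setups usually evaluate several time windows, which this sketch omits.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than permitted the error budget is being consumed."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / error_budget


def classify(rate: float, fast_threshold: float = 14.4, slow_threshold: float = 1.0) -> str:
    """Map a burn rate to an action. Both thresholds are illustrative assumptions."""
    if rate >= fast_threshold:
        return "page"    # fast burn: page immediately
    if rate > slow_threshold:
        return "ticket"  # slow burn: create a ticket, no immediate page
    return "ok"
```

For a 99.9% SLO, a 2% error rate gives a burn rate of about 20 and would page; a 0.2% error rate gives about 2 and would create a ticket.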

C

Canary Release (Canary Deployment)

Deployment pattern: new version receives gradually more traffic (5% → 25% → 100%). Metrics are compared between old and new version. Automatic rollback if the error rate of the new version exceeds the threshold.
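The automatic rollback decision can be sketched as a comparison of error rates between the two versions. This is a simplified illustration, not the logic of any particular deployment tool; the function name, the tolerated increase of one percentage point, and the minimum sample size are all assumptions.

```python
def canary_decision(
    baseline_error_rate: float,
    canary_error_rate: float,
    canary_requests: int,
    max_absolute_increase: float = 0.01,  # assumed tolerance: one percentage point
    min_requests: int = 100,              # assumed minimum sample size
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current canary stage."""
    # Too little canary traffic makes the error rate statistically meaningless.
    if canary_requests < min_requests:
        return "wait"
    # Roll back if the canary is clearly worse than the old version.
    if canary_error_rate - baseline_error_rate > max_absolute_increase:
        return "rollback"
    return "promote"
```

"promote" here means moving to the next traffic step (for example from 5% to 25%), not switching all traffic at once.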

Change Failure Rate (CFR)

DORA metric: percentage of deployments that cause a failure in production and require remediation (incident response, rollback, or hotfix). Elite teams achieve < 5%. Measures deployment quality.
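As a worked example, the metric is simply failed deployments divided by all deployments in a period. The sketch below assumes each deployment record already carries a flag saying whether it needed remediation; the `Deployment` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    version: str
    failed: bool  # True if it led to an incident, rollback, or hotfix


def change_failure_rate(deployments: list) -> float:
    """Share of deployments that needed remediation, as a value between 0 and 1."""
    if not deployments:
        return 0.0
    return sum(1 for d in deployments if d.failed) / len(deployments)


history = [
    Deployment("a1f3", failed=False),
    Deployment("b7c9", failed=True),
    Deployment("c2d4", failed=False),
    Deployment("d8e1", failed=False),
]
```

With one failed deployment out of four, `change_failure_rate(history)` is 0.25, i.e. 25%.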

CI/CD (Continuous Integration / Continuous Delivery)

Continuous Integration: automatic building and testing on every commit. Continuous Delivery: every change that passes the pipeline yields a tested, release-ready artifact; the release to production is still triggered manually. Continuous Deployment: fully automated deployment all the way to production without manual release.

Configuration Drift

Difference between the declared state (IaC) and the actual state of the infrastructure. Arises from manual console changes, tool errors, or configuration changes outside of IaC.
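Conceptually, drift detection is a diff between two descriptions of the same resource. The sketch below compares two flat dictionaries; real tools compare full resource graphs, and the attribute names used here are made up for illustration.

```python
ABSENT = "<absent>"  # marker for an attribute that exists on only one side


def detect_drift(declared: dict, actual: dict) -> dict:
    """Return {attribute: (declared_value, actual_value)} for every difference."""
    drift = {}
    for key in declared.keys() | actual.keys():
        want = declared.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: someone resized the instance and opened SSH in the console.
declared = {"instance_type": "t3.small", "port": 443}
actual = {"instance_type": "t3.large", "port": 443, "open_ssh": True}
```

An empty result means the actual state matches the declared state.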

D

Deployment Frequency

DORA metric: how often does a team deploy to production? Elite: multiple times daily. High: daily to weekly. Medium: weekly to monthly. Low: monthly to every six months.

Distributed Tracing

Technology for following requests across multiple services. Each request receives a trace ID that is propagated through all involved services. Enables root cause analysis in microservices. Implementations: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry.

DORA Metrics

Four metrics from the DevOps Research and Assessment (DORA) program:
1. Deployment Frequency
2. Lead Time for Changes
3. Change Failure Rate
4. Mean Time to Restore (MTTR)

F

Feature Flag (Feature Toggle)

Mechanism for controlling the visibility and activation of features at runtime, without deployment. Enables dark launch (deployed but not active), canary (% of users), A/B testing. Services: LaunchDarkly, Unleash, AWS AppConfig Feature Flags, Flagsmith.
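A percentage rollout is commonly implemented by hashing the user ID into a stable bucket, so that the same user always gets the same answer and raising the percentage only adds users. The sketch below shows that idea with the standard library; it does not reproduce the API of any of the services named above, and the function and flag names are hypothetical.

```python
import hashlib


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the flag is active for this user at the given rollout percentage."""
    # Hash flag and user together so different flags select different user groups.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent
```

Setting `rollout_percent` to 0 gives a dark launch (deployed but inactive); 100 activates the feature for everyone.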

G

GitOps

Operational paradigm: the Git repository state is the single source of truth for infrastructure and configuration. Changes through Git commits; automatic synchronization to the target environment. Tools: Flux CD, ArgoCD (for Kubernetes), Atlantis (for Terraform).

H

Health Check (Health Endpoint)

HTTP endpoint that reports the operational status of a service: Liveness: service is running (a failing check, e.g. 503, triggers a restart). Readiness: service is ready for traffic (503 = do not route traffic to it). Startup: service initialization has completed.
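The three checks differ only in what they inspect. The sketch below maps each path to a status code from a small state object. It is framework-independent on purpose; the `ServiceState` fields and the `/health/startup` path are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    initialized: bool      # startup work (caches, migrations) has finished
    dependencies_ok: bool  # e.g. database and downstream services are reachable


def health_status(path: str, state: ServiceState) -> int:
    """Return the HTTP status code for a health endpoint."""
    if path == "/health/live":
        return 200  # being able to answer at all shows the process is alive
    if path == "/health/startup":
        return 200 if state.initialized else 503
    if path == "/health/ready":
        return 200 if state.initialized and state.dependencies_ok else 503
    return 404
```

Note that a failed dependency makes the service not ready but still live, so it is taken out of rotation rather than restarted.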

I

IaC (Infrastructure as Code)

Practice of declaratively describing cloud infrastructure in code. Terraform, Pulumi, AWS CDK, Azure Bicep are common IaC tools. IaC is versioned, reviewed, and deployed via CI/CD.

Idempotency

Property of an operation: executing it multiple times delivers the same result as executing it once. Declarative tools such as Terraform are designed to be idempotent: applying an unchanged configuration a second time makes no further changes. Critical for safe automations and retry logic.
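The difference is easiest to see by contrasting an operation that declares a target state with one that applies a change relative to the current state. Both functions below are hypothetical examples.

```python
def set_replica_count(state: dict, replicas: int) -> dict:
    """Idempotent: declares the desired end state."""
    return {**state, "replicas": replicas}


def add_replicas(state: dict, delta: int) -> dict:
    """Not idempotent: every call changes the result again."""
    return {**state, "replicas": state.get("replicas", 0) + delta}


start = {"replicas": 2}
```

If a retry repeats `set_replica_count(start, 5)`, the result stays at 5 replicas. If a retry repeats `add_replicas(start, 3)`, the service ends up with 8 replicas instead of 5, which is why retried automation should prefer the first style.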

Immutable Infrastructure

Infrastructure paradigm: components are never changed in-place, but replaced. Instead of SSH+patch: build and deploy a new AMI or container image. Prevents configuration drift from partial updates.

L

Lead Time for Changes

DORA metric: time from code commit to production deployment. Measures the speed of the delivery pipeline. Elite: < 1 hour. Low: > 6 months.

Liveness Probe

Kubernetes/ECS mechanism: checks whether a container instance is still functioning. If the probe fails, the orchestrator restarts the container. Typically references /health/live.

M

Mean Time to Restore (MTTR)

DORA metric: average time from the start of an incident to the complete restoration of the service. Elite: < 1 hour. Low: > 1 week.

Metrics

Numerical time series data about the system state. Types: Counter (monotonically increasing), Gauge (current value), Histogram (distribution), Summary. RED and USE are important metrics frameworks.

O

Observability

Ability to understand the internal state of a system from its outputs (logs, metrics, traces). Three pillars: Logs (structured events), Metrics (time series), Traces (distributed request tracking).

Operational Debt

Accumulated backlog of manual processes, workarounds, missing toil reduction, and undocumented knowledge that increases operational overhead. Analogous to technical debt but in the operational context.

OpenTelemetry (OTel)

Cloud Native Computing Foundation (CNCF) project and open standard for observability instrumentation. Vendor-agnostic. Provides APIs and SDKs for most common programming languages, auto-instrumentation for popular frameworks, and a collector for receiving, processing, and exporting telemetry data.

P

Postmortem (Post-Incident Review)

Structured review after an incident. Blameless, fact-based, focused on systemic causes. Includes: incident timeline, root cause, contributing factors, action items. Goal: prevent recurrence.

R

RED Metrics

Metrics framework for services (by Tom Wilkie): Rate – requests per second. Errors – error rate (HTTP 5xx, exceptions). Duration – latency (p50, p95, p99).
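As an illustration, the three RED values can be computed from raw request records. In practice a metrics system derives them from counters and histograms; this sketch works on an in-memory list, uses a nearest-rank percentile, and assumes a non-empty input. All names are hypothetical.

```python
import math


def percentile(values: list, p: int):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_metrics(requests: list, window_seconds: float) -> dict:
    """requests: list of (http_status, duration_ms) tuples seen in the window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = [duration for _, duration in requests]
    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_per_second": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

The error ratio here counts only HTTP 5xx responses; counting exceptions as well would need an extra field per request.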

Readiness Probe

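Kubernetes/ECS mechanism: checks whether a container instance is ready to receive traffic. If the probe fails, the instance stops receiving traffic (in Kubernetes it is removed from the Service endpoints; behind a load balancer it is taken out of rotation) but is not restarted. Typically references /health/ready.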

Runbook

Operational documentation for specific scenarios: incident response, routine tasks, escalation paths. Includes: trigger condition, impact, diagnosis steps, remediation steps, escalation. Must be linked to alerts.

S

SLO (Service Level Objective)

Internal target for service quality: e.g. 99.9% availability, p99 latency < 500ms, error rate < 0.1%. Foundation for SLO-based alerting and error budget management.
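The error budget follows directly from the SLO: it is the share of time or requests that is allowed to fail. The sketch below works through the arithmetic; the 30-day window and the function names are assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO permits within the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

A 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime. With 1,000,000 requests and 250 failures, about 75% of the budget remains.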

Structured Logging

Logging practice: logs are output as structured objects (typically JSON) rather than free text. Enables machine processing, search, and aggregation. Commonly required fields: timestamp, level, service, trace_id, message.
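With Python's standard `logging` module, structured output can be produced by a custom formatter. The sketch below emits the five fields listed above; the class name `JsonFormatter` and the convention of passing `trace_id` via the `extra` argument are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # trace_id is expected via logger.error(..., extra={"trace_id": ...})
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)
```

Attach it with `handler.setFormatter(JsonFormatter("checkout"))`, then log with `logger.error("payment failed", extra={"trace_id": "abc123"})`.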

Symptom-based Alerting

Alerting philosophy: alerts are triggered when users are affected (high error rate, latency exceeded), not when internal resources are busy (CPU > 80%). Results in fewer but more actionable alerts.

T

Toil

SRE term (Google): manual, repetitive, automatable work that scales linearly as the service grows and creates no lasting value. Examples: manual deployments, manual scaling, password resets via ticket. Google SRE goal: keep toil below 50% of each SRE's time.

Trace ID

Unique identifier for a single request through a distributed system. Propagated through all involved services and embedded in all log lines. Enables correlation of all logs, metrics, and traces of a request.
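Propagation boils down to three steps: reuse the ID from the incoming request or create one, attach it to every outgoing call, and include it in every log line. The sketch below uses a hypothetical `X-Trace-Id` header to keep the example short; OpenTelemetry and other tracing systems use the W3C Trace Context `traceparent` header instead, which carries more than the ID.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for illustration


def incoming_trace_id(headers: dict) -> str:
    """Reuse the caller's trace ID, or start a new trace at the edge of the system."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id: str, extra: dict = None) -> dict:
    """Headers for a downstream call, carrying the same trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def log_line(trace_id: str, message: str) -> str:
    """Embed the trace ID so logs from all services can be correlated."""
    return f"trace_id={trace_id} {message}"
```

Searching the central log store for one trace ID then returns every line written for that request, regardless of which service wrote it.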

U

USE Metrics

Metrics framework for infrastructure resources (by Brendan Gregg): Utilization – resource utilization (% of capacity used). Saturation – how much work is waiting (queue length). Errors – error rate of the resource.