Definition: Operational Excellence

What Is Operational Excellence?

Operational Excellence is the ability of an organization to operate its software systems in production reproducibly, automatically, observably, and with continuous improvement.

It is not about solving problems quickly – it is about building an organization that systematically prevents problems and generates organizational knowledge from every failure.

Core Definition: Operational Excellence is the discipline that ensures an organization knows what is happening in its systems, why changes are being made, how failures should be handled – and that this knowledge is anchored in code, processes, and culture, not just in individual minds.

The OpsEx Spectrum

Organizations exist on a continuum of operational maturity:

Level | Name | Characteristics
1 | Manual & Heroic | Deployments via SSH and console. "Florian knows how to restart that." On-call culture defined by heroes who never sleep. High turnover.
2 | Documented | Runbooks exist but become outdated quickly. Deployments follow scripts but not pipelines. Incidents are discussed but not systematically reviewed.
3 | Automated | CI/CD for all deployments. IaC for all infrastructure. Structured logging and alerting. Runbooks linked to alerts. Post-incident reviews take place.
4 | Measured | DORA metrics are measured and improved. SLO-based alerting. Toil is quantified. Operational Debt is visible and systematically reduced.
5 | Continuously Improved | Deployments multiple times daily. Change Failure Rate < 5%. MTTR < 1 hour. Systems learn from themselves. Toil < 20% of engineering time.
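The Level 4 and Level 5 thresholds above can be expressed as a small calculation. The following is a minimal sketch, not a reference implementation: the `Deployment` and `Incident` records and all function names are hypothetical, and "multiple times daily" is interpreted here as an average of at least two deployments per day.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    finished_at: datetime
    caused_failure: bool  # assumption: True if the change led to an incident or rollback


@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime


def change_failure_rate(deployments: List[Deployment]) -> float:
    """Share of deployments that caused a failure, between 0.0 and 1.0."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)


def mean_time_to_restore(incidents: List[Incident]) -> Optional[timedelta]:
    """Average time from incident start to resolution, or None without incidents."""
    if not incidents:
        return None
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


def deployments_per_day(deployments: List[Deployment], window_days: int) -> float:
    """Average number of deployments per day over the observation window."""
    return len(deployments) / window_days


def meets_level_5_delivery_thresholds(
    deployments: List[Deployment], incidents: List[Incident], window_days: int
) -> bool:
    """Check the delivery thresholds listed for Level 5 (toil is not covered here)."""
    mttr = mean_time_to_restore(incidents)
    return (
        deployments_per_day(deployments, window_days) >= 2  # assumed reading of "multiple"
        and change_failure_rate(deployments) < 0.05
        and (mttr is None or mttr < timedelta(hours=1))
    )
```

How a deployment is attributed to a failure, and when an incident counts as resolved, are organizational decisions that the sketch leaves to the caller.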

What Operational Excellence Is NOT

Common misconceptions:

Not just monitoring: Monitoring is a component of Observability – but Observability is only one of seven OpsEx dimensions. An organization can have excellent monitoring and still have poor change processes, missing runbooks, and accumulating Operational Debt.

Not just DevOps tooling: Tools (Jenkins, Terraform, Datadog) are enablers, not solutions. Tooling without process and culture creates complex toolchains without operational maturity.

Not just ITIL compliance: ITIL describes processes, not outcomes. An organization can have complete CAB processes and still have a high Change Failure Rate. Operational Excellence measures outcomes (DORA metrics), not processes.

Not just for DevOps teams: Operational Excellence applies to everyone who operates software in production – whether Platform Team, Application Team, or Operations Team.

Target State: Mature Operations Organization

An organization with excellent operations has the following characteristics:

Technical

  • Every production workload is fully described by IaC and deployed via CI/CD

  • All services emit structured logs, metrics, and traces, all of them through OpenTelemetry

  • All alerts are symptom-based with linked runbooks

  • No engineer needs to manually SSH into production servers – changes go through pipeline or runbook automation

  • Configuration drift is detected within hours and remediated
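Drift detection, as named in the last point, comes down to comparing the state declared in IaC with the state observed in production. The sketch below illustrates only that comparison step on flat key/value settings. The function name, the `ABSENT` marker, and the example settings are hypothetical; real IaC tools compare far richer resource models.

```python
from typing import Any, Dict, Tuple

ABSENT = "<absent>"  # hypothetical marker for a setting missing on one side


def detect_drift(
    declared: Dict[str, Any], observed: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (declared_value, observed_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, ABSENT)
        have = observed.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Example values are illustrative only.
declared = {"instance_type": "m5.large", "min_replicas": 3, "public_access": False}
observed = {
    "instance_type": "m5.large",
    "min_replicas": 2,
    "public_access": False,
    "debug_port": 9229,
}
drift = detect_drift(declared, observed)
```

In this example `drift` reports `min_replicas` (declared 3, observed 2) and `debug_port` (not declared, observed 9229). Running such a comparison on a schedule shorter than the "within hours" target would be one way to meet it.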

Process

  • Every incident with user impact has a postmortem within 5 business days

  • Postmortems are blameless and produce trackable action items

  • The Operational Debt Register is reviewed quarterly, and sprint capacity is allocated to reducing it

  • Runbook Coverage > 90% for critical services
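Two of the process targets above are directly computable: the postmortem deadline and Runbook Coverage. The following sketch makes two assumptions that the text does not specify: business days are Monday to Friday with public holidays ignored, and coverage is the share of critical services that have at least one runbook. All names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def postmortem_due(incident_date: date) -> date:
    """Latest date for the postmortem under the 5-business-day target."""
    return add_business_days(incident_date, 5)


def runbook_coverage(runbooks_by_service: Dict[str, List[str]]) -> float:
    """Share of services with at least one runbook, between 0.0 and 1.0."""
    if not runbooks_by_service:
        return 0.0
    covered = sum(1 for runbooks in runbooks_by_service.values() if runbooks)
    return covered / len(runbooks_by_service)


def meets_coverage_target(runbooks_by_service: Dict[str, List[str]]) -> bool:
    """The target is strictly greater than 90%."""
    return runbook_coverage(runbooks_by_service) > 0.90
```

For an incident on Thursday, 1 February 2024, `postmortem_due` returns Thursday, 8 February 2024. Ten critical services with nine documented ones yield a coverage of exactly 90%, which does not yet satisfy the "> 90%" target.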

Cultural

  • Blameless Culture: failures are learning opportunities, not occasions for punishment

  • Toil reduction is an explicit team goal (OKR/KPI)

  • On-call is fairly distributed, rotating, and not associated with burnout

  • Engineers spend < 20% of their time on toil (stricter than the 50% ceiling that Google's SRE book sets for its SRE teams)
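Making toil reduction an explicit goal presupposes that toil is measured. A minimal sketch of that measurement, assuming the team tracks hours per work category and agrees on which categories count as toil. The category names and the function name are hypothetical.

```python
from typing import Dict, Set

TOIL_TARGET = 0.20  # the "< 20% of their time" target from the list above


def toil_share(hours_by_category: Dict[str, float], toil_categories: Set[str]) -> float:
    """Fraction of tracked hours that fall into categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for category, h in hours_by_category.items() if category in toil_categories)
    return toil / total


# Illustrative week for one engineer; the numbers are made up.
week = {"feature_work": 24, "manual_deploys": 6, "ticket_triage": 4, "automation": 6}
share = toil_share(week, {"manual_deploys", "ticket_triage"})
within_target = share < TOIL_TARGET
```

Here 10 of 40 hours are toil, so `share` is 0.25 and `within_target` is `False`. The classification of categories is the judgment call; the arithmetic is trivial once that is settled.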

Operational Excellence in the WAF++ Context

Operational Excellence interacts with all other WAF++ pillars:

Pillar | Interaction with Operational Excellence
Reliability | SLOs (Reliability) are monitored through Observability and Alerting (OpsEx). Incident Response (Reliability) is improved through Runbooks and Postmortems (OpsEx).
Security | Security Controls (Security) are enforced through IaC (OpsEx). Security incidents are detected through Observability (OpsEx) and lessons are learned through the Postmortem Process (OpsEx).
Architecture | Architecture decisions (Architecture) influence Observability complexity (OpsEx). OpsEx insights (drift, toil, incidents) should inform Architecture Reviews.
Governance | Change Management (OpsEx) is the foundation for Governance compliance. Operational Debt Register (OpsEx) informs Governance prioritization.
Cost | Observability costs are governed by OpsEx controls (log retention, trace sampling). Operational Debt creates hidden costs through manual effort.
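Trace sampling, mentioned in the Cost row, is one of the simpler controls to illustrate. The sketch below shows a deterministic, ratio-based decision derived from the trace ID, so that every service reaches the same keep/drop decision for a given trace. It is a standalone illustration with a hypothetical function name, not the sampler of any particular tracing SDK.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace, keeping roughly `sample_rate` of all traces.

    The decision depends only on the trace ID, so it is repeatable across services.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    threshold = int(sample_rate * 2**64)       # integer comparison avoids float rounding
    return value < threshold


# Illustrative use: keep about 10% of traces.
kept = sum(keep_trace("trace-%d" % i, 0.10) for i in range(10_000))
```

A lower `sample_rate` reduces stored trace volume roughly in proportion, at the price of fewer traces being available during an investigation.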