
Principles of Operational Excellence

The seven principles define the philosophical foundation of the Operational Excellence pillar. They are not technical requirements (that is what controls are for), but guiding principles for decisions at the team and architecture level.

OP1 – Automate Everything Repeatable

Tagline: If you do something manually twice, that is once too many.

Explanation

Every task that is performed repeatedly is a candidate for automation. Manual, repeatable tasks are by definition toil – they are error-prone, time-consuming, and do not scale with organizational growth.

Automation creates freedom: when routine tasks are handled by systems, engineers can work on problems that require human judgment.

Implications

  • Production deployments are automated; manual production deployments do not exist

  • Infrastructure provisioning is automated via IaC

  • Incident notifications are automated via Alertmanager/PagerDuty

  • Certificate renewal, AMI patching, credential rotation are automated
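As an illustration of turning one repeatable task from the list above into an automated check, the following minimal sketch flags certificates that are close to expiry. The `Certificate` record, the 30-day window, and the host names are assumptions made for this example and are not defined by WAF++; a real inventory would come from a certificate manager or cloud API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Certificate:
    # Hypothetical inventory record used only for this sketch.
    name: str
    not_after: datetime


def due_for_renewal(certs, now, window=timedelta(days=30)):
    """Return names of certificates that expire within `window` of `now`.

    Already-expired certificates are included as well.
    """
    return [c.name for c in certs if c.not_after - now <= window]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
inventory = [
    Certificate("api.example.com", now + timedelta(days=10)),
    Certificate("admin.example.com", now + timedelta(days=90)),
]
print(due_for_renewal(inventory, now))  # ['api.example.com']
```

Run on a schedule, a check like this could feed the renewal job directly, so that nobody has to remember expiry dates.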

OP2 – Infrastructure as Code First

Tagline: If it is not in Git, it does not exist.

Explanation

All infrastructure must be reproducible from code. Not "we mostly have IaC" – but "all infrastructure is IaC and manual changes are forbidden".

IaC First means: before the first resource is created, the Terraform code exists. It also means: if an emergency change is made manually, it is transferred into IaC within 24 hours.

Implications

  • Remote state with locking is the only accepted state strategy

  • Console access for infrastructure changes is restricted (SCP/IAM)

  • Drift detection is automated and raises alerts

  • Disaster Recovery is testable from IaC
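Automated drift detection can be sketched with the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when the plan is empty, 1 on error, and 2 when changes are present. The function names and status strings below are assumptions for illustration, and the sketch assumes Terraform is installed and the working directory is already initialised. Note that a non-empty plan covers unapplied code changes as well as manual drift.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
NO_CHANGES, PLAN_ERROR, CHANGES_PRESENT = 0, 1, 2


def classify_plan(exit_code: int) -> str:
    """Map a plan exit code to a status string (names are illustrative)."""
    if exit_code == NO_CHANGES:
        return "in_sync"
    if exit_code == CHANGES_PRESENT:
        return "drift"
    return "error"


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and classify the result.

    Assumes the `terraform` binary is on PATH and `terraform init` has been run.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan(result.returncode)
```

A scheduled job could call `check_drift` per workspace and route any result other than `"in_sync"` to the team's alerting channel.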

OP3 – Observability Before Features

Tagline: Before a feature goes to production, it must be observable.

Explanation

A system that is not observable cannot be operated reliably. Observability is not an afterthought – it is a prerequisite for production operations.

"Observability before features" means: no new feature is deployed that does not emit structured logs, expose metrics, and integrate with the tracing infrastructure.

Implications

  • Observability requirements are part of the Definition of Done for every feature

  • Log format and trace ID propagation are defined in the service template

  • RED metrics (Rate, Errors, Duration) are configured for every service

  • Dashboards exist for every service before it receives critical traffic

OP4 – Fail Fast, Learn Faster

Tagline: Failures are not catastrophes – missing learning processes are.

Explanation

In complex distributed systems, failures are inevitable. The question is not whether a system will fail, but how quickly it is detected, how quickly it is restored, and what the organization learns from it.

Blameless Culture is the cultural foundation: if failures lead to punishment, they will be hidden. If failures lead to learning processes, they are openly communicated.

Implications

  • Every incident with user impact receives a postmortem

  • Postmortems are blameless – focus on systems, not people

  • Action items from postmortems are tracked in Jira/GitHub

  • Repeat incidents of the same class are a leadership failure, not an engineering failure

OP5 – Minimize Toil

Tagline: Toil is the invisible enemy of operational excellence.

Explanation

Toil is manual, repeatable, automatable work with no lasting value. Toil is not simply overhead or unwanted work. It is specifically work that is repetitive, manual, without lasting effect, and that grows in proportion to traffic.

Google's SRE book sets an upper bound of 50% of each SRE's time for toil. The target here is stricter: less than 20% of engineering time should be spent on toil. When a team spends more time on toil than on features, it is in a debt spiral.

Implications

  • Toil is measured: hours per week per engineer for manual tasks

  • Operational Debt Register catalogs all known toil sources

  • Automation of toil is an explicit sprint goal

  • On-call burden from toil is a team health indicator
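Measuring toil as hours per week per engineer makes the 20% target checkable. The sketch below is one possible way to do that; the log structure, function names, and sample figures are hypothetical.

```python
TOIL_TARGET = 0.20  # target from the text: less than 20% of engineering time


def toil_ratio(toil_hours, total_hours):
    """Fraction of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def engineers_over_target(weekly_log, target=TOIL_TARGET):
    """Return engineers whose toil share is at or above `target`.

    `weekly_log` maps engineer name -> (toil_hours, total_hours).
    """
    return sorted(
        name
        for name, (toil, total) in weekly_log.items()
        if toil_ratio(toil, total) >= target
    )


week = {"alice": (6, 40), "bob": (12, 40)}  # 15% and 30% toil
print(engineers_over_target(week))  # ['bob']
```

Tracking the same figure over several weeks shows whether automation work is actually reducing toil.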

OP6 – Safe by Default

Tagline: Deployment safety is not opt-in – it is the standard.

Explanation

Safe deployments are the default, not the exception. Progressive Delivery (Canary, Blue/Green, Feature Flags) is not optional for teams that deploy frequently.

"Safe by Default" means: the team must actively decide to choose an unsafe deployment path – not actively opt into a safe one.

Implications

  • Deployment configuration uses Canary or Blue/Green as default

  • Rollback is possible in < 5 minutes without a new deployment

  • Feature Flags enable rollback without a deployment cycle

  • Deployment freezes are enforced automatically during critical business periods
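One way to encode "safe by default" is to make the safe strategy the default value and to require an explicit justification for the unsafe path. The `DeploymentConfig` type, the strategy names, and the justification field in this sketch are assumptions for illustration, not a WAF++ schema.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    CANARY = "canary"
    BLUE_GREEN = "blue_green"
    ALL_AT_ONCE = "all_at_once"  # the unsafe path


@dataclass(frozen=True)
class DeploymentConfig:
    service: str
    strategy: Strategy = Strategy.CANARY  # safe unless the team decides otherwise
    unsafe_justification: str = ""

    def __post_init__(self):
        # Opting out of progressive delivery must be a documented decision.
        if self.strategy is Strategy.ALL_AT_ONCE and not self.unsafe_justification.strip():
            raise ValueError("all-at-once deployments need an explicit justification")


print(DeploymentConfig("checkout").strategy)  # Strategy.CANARY
```

A team that omits the strategy gets a canary rollout; choosing `ALL_AT_ONCE` without a justification fails at configuration time rather than in production.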

OP7 – Operational Debt Is Technical Risk

Tagline: Untracked Operational Debt is a ticking time bomb.

Explanation

Operational Debt is like technical debt – it accumulates interest in the form of increased incident rates, extended MTTR, on-call burnout, and dependency on specific individuals.

The difference: technical debt is visible in code. Operational Debt is often invisible – it lives in engineers' minds, in informal processes, in "that's how we've always done it" cultures.

Making it visible, quantifying it, prioritizing it – that is the first step.

Implications

  • All known toil sources and workarounds are documented in the register

  • Quarterly reviews ensure debt does not accumulate

  • Sprint capacity for debt reduction is explicitly allocated

  • New Operational Debt is consciously accepted – not blindly accumulated
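A register that supports these implications can be very small. The sketch below assumes a hypothetical `DebtItem` record, uses weekly toil hours as the quantification, and treats 92 days as "one quarter" for the review check; all of these are illustrative choices.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DebtItem:
    # Hypothetical register entry for this sketch.
    title: str
    toil_hours_per_week: float
    accepted_by: str       # who consciously accepted the debt
    last_reviewed: date


def overdue_for_review(register, today, max_age_days=92):
    """Titles of items not reviewed within roughly one quarter."""
    return [i.title for i in register if (today - i.last_reviewed).days > max_age_days]


def by_weekly_cost(register):
    """Order items by weekly toil cost, most expensive first."""
    return sorted(register, key=lambda i: i.toil_hours_per_week, reverse=True)


register = [
    DebtItem("hand-edited DNS records", 1.0, "platform lead", date(2025, 5, 1)),
    DebtItem("manual certificate rotation", 3.0, "platform lead", date(2025, 1, 15)),
]
today = date(2025, 6, 1)
print(overdue_for_review(register, today))           # ['manual certificate rotation']
print([i.title for i in by_weekly_cost(register)])   # ['manual certificate rotation', 'hand-edited DNS records']
```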