Principles of Operational Excellence

The seven principles define the philosophical foundation of the Operational Excellence pillar. They are not technical requirements (that is what controls are for), but guiding principles for decisions at the team and architecture level.

OP1 – Automate Everything Repeatable

Tagline: If you do something manually twice, that is once too many.

Explanation

Every task that is performed repeatedly is a candidate for automation. Manual, repeatable tasks are by definition toil – they are error-prone, time-consuming, and do not scale with organizational growth.

Automation creates freedom: when routine tasks are handled by systems, engineers can work on problems that require human judgment.

Implications

Production deployments are automated; manual production deployments do not exist
Infrastructure provisioning is automated via IaC
Incident notifications are automated via Alertmanager/PagerDuty
Certificate renewal, AMI patching, credential rotation are automated
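As an illustration of automating one of these routine tasks, the following minimal sketch decides which certificates are due for renewal. The 30-day renewal window and the names `needs_renewal` and `certificates_to_renew` are assumptions made for this example, not part of the framework.

```python
from datetime import datetime, timedelta, timezone

# Assumption: renew once 30 days or fewer of validity remain.
RENEW_BEFORE = timedelta(days=30)


def needs_renewal(not_after: datetime, now: datetime,
                  renew_before: timedelta = RENEW_BEFORE) -> bool:
    """Return True when the certificate expires within the renewal window."""
    return not_after - now <= renew_before


def certificates_to_renew(expiries: dict[str, datetime], now: datetime) -> list[str]:
    """Names of certificates due for renewal, soonest expiry first."""
    due = [name for name, expiry in expiries.items() if needs_renewal(expiry, now)]
    return sorted(due, key=lambda name: expiries[name])
```

A scheduled job could call `certificates_to_renew` and hand the result to whatever renewal mechanism the platform already uses, so that no engineer has to track expiry dates by hand.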

Related Controls

OP2 – Infrastructure as Code First

Tagline: If it is not in Git, it does not exist.

Explanation

All infrastructure must be reproducible from code. The bar is not "we mostly have IaC" but "all infrastructure is IaC and manual changes are forbidden".

IaC First means: before the first resource is created, the Terraform code exists. It also means: if an emergency change is made manually, it is transferred into IaC within 24 hours.

Implications

Remote state with locking is the only accepted state strategy
Console access for infrastructure changes is restricted (SCP/IAM)
Drift detection is automated and raises alerts
Disaster Recovery is testable from IaC
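Automated drift detection can be sketched with `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below assumes that the code in the working directory has already been fully applied, so a non-empty plan indicates drift; the function names and the status strings are hypothetical.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status.

    0 = no changes, 1 = error, 2 = changes present.
    """
    if code == 0:
        return "in_sync"
    if code == 2:
        return "drift"
    return "error"


def detect_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report whether real state matches the code.

    Assumption: the directory is already initialized (`terraform init`).
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled pipeline could run `detect_drift` per stack and raise an alert for any result other than `"in_sync"`.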

Related Controls

OP3 – Observability Before Features

Tagline: Before a feature goes to production, it must be observable.

Explanation

A system that is not observable cannot be operated reliably. Observability is not an afterthought – it is a prerequisite for production operations.

"Observability before features" means: no new feature is deployed that does not emit structured logs, expose metrics, and integrate with the tracing infrastructure.
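A minimal sketch of what "emits structured logs" with trace ID propagation might look like in a service template, using only the Python standard library. The field names and the `trace_id_var` context variable are assumptions for this example; a real template would take the trace ID from its tracing library.

```python
import json
import logging
from contextvars import ContextVar

# Assumption: request middleware sets this once per request.
trace_id_var: ContextVar[str] = ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object that carries the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        }
        return json.dumps(payload)


def configure_logging() -> None:
    """Attach the JSON formatter to the root logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])
```

Because the trace ID lives in a context variable, application code logs as usual and every line can still be correlated with its request.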

Implications

Observability requirements are part of the Definition of Done for every feature
Log format and trace ID propagation are defined in the service template
RED metrics (Rate, Errors, Duration) are configured for every service
Dashboards exist for every service before it receives critical traffic
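The RED metrics named above can be illustrated with a small in-memory aggregator over one fixed time window. This is a sketch for explanation only: the class `RedWindow` is hypothetical, and a production service would normally expose these values through its metrics library instead.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RedWindow:
    """Requests observed during one fixed window of `window_seconds`."""

    window_seconds: float
    durations: list[float] = field(default_factory=list)
    errors: int = 0

    def observe(self, duration_seconds: float, ok: bool) -> None:
        """Record one finished request."""
        self.durations.append(duration_seconds)
        if not ok:
            self.errors += 1

    def rate(self) -> float:
        """Rate: requests per second over the window."""
        return len(self.durations) / self.window_seconds

    def error_ratio(self) -> float:
        """Errors: share of requests that failed."""
        return self.errors / len(self.durations) if self.durations else 0.0

    def duration_quantile(self, q: float) -> float:
        """Duration: nearest-rank quantile, for example q=0.95."""
        if not self.durations:
            return 0.0
        ordered = sorted(self.durations)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]
```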

Related Controls

OP4 – Fail Fast, Learn Faster

Tagline: Failures are not catastrophes – missing learning processes are.

Explanation

In complex distributed systems, failures are inevitable. The question is not whether a system will fail, but how quickly it is detected, how quickly it is restored, and what the organization learns from it.

A blameless culture is the foundation: if failures lead to punishment, they will be hidden. If failures lead to learning processes, they will be communicated openly.

Implications

Every incident with user impact receives a postmortem
Postmortems are blameless – focus on systems, not people
Action items from postmortems are tracked in Jira/GitHub
Repeat incidents of the same class are a leadership failure, not an engineering failure
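Two of these implications can be checked mechanically from an incident log: which incidents with user impact still lack a postmortem, and which incident classes have occurred more than once. The sketch below assumes a hypothetical `Incident` record; the field names are illustrative and not tied to any particular incident tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    id: str
    incident_class: str   # e.g. "expired-certificate" (hypothetical label)
    user_impact: bool
    has_postmortem: bool


def missing_postmortems(incidents: list[Incident]) -> list[str]:
    """IDs of incidents with user impact that have no postmortem yet."""
    return [i.id for i in incidents if i.user_impact and not i.has_postmortem]


def repeat_classes(incidents: list[Incident]) -> list[str]:
    """Incident classes that occurred more than once, sorted by name."""
    counts = Counter(i.incident_class for i in incidents)
    return sorted(cls for cls, n in counts.items() if n > 1)
```

Reviewing the output of `repeat_classes` regularly is one way to surface the repeat incidents that this principle treats as a leadership concern.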

Related Controls

WAF-OPS-070 – Post-Incident Review

OP5 – Minimize Toil

Tagline: Toil is the invisible enemy of operational excellence.

Explanation

Toil is manual, repetitive, automatable work with no lasting value. Not all overhead or unwanted work is toil; the term applies specifically to work that is repetitive, manual, without lasting effect, and that grows in proportion to traffic.

Google SRE states the goal of keeping toil below 50% of each engineer's time. When a team produces more toil than features, it is caught in a debt spiral.

Implications

Toil is measured: hours per week per engineer for manual tasks
Operational Debt Register catalogs all known toil sources
Automation of toil is an explicit sprint goal
On-call burden from toil is a team health indicator
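Measuring toil as hours per week per engineer makes it straightforward to compare against a budget. In this minimal sketch the budget is passed in explicitly as `max_ratio`, so the team's chosen target is not hard-coded; the function names are hypothetical.

```python
def toil_ratio(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def engineers_over_budget(toil_hours_by_engineer: dict[str, float],
                          weekly_hours: float,
                          max_ratio: float) -> list[str]:
    """Engineers whose weekly toil share exceeds the budget, sorted by name."""
    return sorted(
        name
        for name, hours in toil_hours_by_engineer.items()
        if toil_ratio(hours, weekly_hours) > max_ratio
    )
```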

Related Controls

OP6 – Safe by Default

Tagline: Deployment safety is not opt-in – it is the standard.

Explanation

Safe deployments are the default, not the exception. Progressive Delivery (Canary, Blue/Green, Feature Flags) is not optional for teams that deploy frequently.

"Safe by Default" means: the team must actively decide to choose an unsafe deployment path – not actively opt into a safe one.
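One way to encode this in deployment tooling is a configuration type whose default is a safe strategy and whose unsafe option requires an explicit justification. The sketch below is an assumption about how such tooling could look; `DeployConfig`, `Strategy`, and `validate` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    CANARY = "canary"
    BLUE_GREEN = "blue_green"
    ALL_AT_ONCE = "all_at_once"   # the unsafe path


SAFE_STRATEGIES = {Strategy.CANARY, Strategy.BLUE_GREEN}


@dataclass(frozen=True)
class DeployConfig:
    # Teams that specify nothing get a safe strategy.
    strategy: Strategy = Strategy.CANARY
    # Must be filled in to choose an unsafe strategy.
    unsafe_justification: str = ""


def validate(config: DeployConfig) -> None:
    """Reject an unsafe strategy unless the team has justified it explicitly."""
    if config.strategy not in SAFE_STRATEGIES and not config.unsafe_justification.strip():
        raise ValueError("unsafe deployment strategy requires a justification")
```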

Implications

Deployment configuration uses Canary or Blue/Green as default
Rollback is possible in < 5 minutes without a new deployment
Feature Flags enable rollback without a deployment cycle
Deployment freezes are configured automatically for critical business periods
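An automated deployment freeze can be as simple as a pipeline gate that checks the current time against a list of configured windows. The following sketch assumes freeze windows are half-open intervals of timezone-aware timestamps; the names `FreezeWindow` and `is_frozen` are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: each window is (start, end) with start inclusive and end exclusive.
FreezeWindow = tuple[datetime, datetime]


def is_frozen(now: datetime, windows: list[FreezeWindow]) -> bool:
    """Return True when `now` falls inside any configured freeze window."""
    return any(start <= now < end for start, end in windows)
```

A pipeline step could call `is_frozen(datetime.now(timezone.utc), windows)` and stop the rollout when it returns `True`.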

Related Controls

OP7 – Operational Debt Is Technical Risk

Tagline: Untracked Operational Debt is a ticking time bomb.

Explanation

Operational Debt is like technical debt – it accumulates interest in the form of increased incident rates, extended MTTR, on-call burnout, and dependency on specific individuals.

The difference: technical debt is visible in code. Operational Debt is often invisible – it lives in engineers' minds, in informal processes, in "that’s how we always do it" cultures.

The first step is to make it visible, quantify it, and prioritize it.

Implications

All known toil sources and workarounds are documented in the Operational Debt Register
Quarterly reviews ensure debt does not accumulate
Sprint capacity for debt reduction is explicitly allocated
New Operational Debt is consciously accepted – not blindly accumulated
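A register that supports these implications needs little more than a record per debt item, an ordering for review, and a way to spot items nobody has consciously accepted. The sketch below is illustrative: the fields and the ordering rule (incidents first, then weekly toil hours) are assumptions, not a scoring model prescribed by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    title: str
    toil_hours_per_week: float
    incidents_last_quarter: int
    accepted_by: str = ""   # empty means nobody has consciously accepted it


def prioritize(items: list[DebtItem]) -> list[DebtItem]:
    """Order items for review: most incidents first, then most weekly toil."""
    return sorted(
        items,
        key=lambda item: (item.incidents_last_quarter, item.toil_hours_per_week),
        reverse=True,
    )


def unaccepted(items: list[DebtItem]) -> list[str]:
    """Titles of items that were accumulated without an explicit decision."""
    return [item.title for item in items if not item.accepted_by]
```

The quarterly review can then work through `prioritize(register)` from the top and assign an owner to everything returned by `unaccepted(register)`.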