Principles of Operational Excellence
The seven principles define the philosophical foundation of the Operational Excellence pillar. They are not technical requirements (that is what controls are for), but guidelines for decisions at the team and architecture level.
OP1 – Automate Everything Repeatable
Tagline: If you do something manually twice, that is once too many.
Explanation
Every task that is performed repeatedly is a candidate for automation. Manual, repeatable tasks are by definition toil – they are error-prone, time-consuming, and do not scale with organizational growth.
Automation creates freedom: when routine tasks are handled by systems, engineers can work on problems that require human judgment.
OP2 – Infrastructure as Code First
Tagline: If it is not in Git, it does not exist.
Explanation
All infrastructure must be reproducible from code. Not "we mostly have IaC" – but "all infrastructure is IaC and manual changes are forbidden".
IaC First means: before the first resource is created, the Terraform code exists. It also means: if an emergency change is made manually, it is codified in IaC within 24 hours.
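The 24-hour rule for emergency changes can be checked mechanically. The following is a minimal sketch in Python, assuming the team keeps some record of manual changes; the `ManualChange` type, its fields, and the `overdue_changes` helper are hypothetical names for illustration and do not belong to Terraform or any other tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Deadline taken from the principle: a manual change must be in IaC within 24 hours.
RECONCILE_WINDOW = timedelta(hours=24)


@dataclass
class ManualChange:
    """Hypothetical record of an emergency change made outside IaC."""
    resource: str
    made_at: datetime
    codified_at: Optional[datetime] = None  # when the change landed in Git, if it has


def overdue_changes(changes: List[ManualChange], now: datetime) -> List[ManualChange]:
    """Return changes that are still not codified after the 24-hour window."""
    return [
        c for c in changes
        if c.codified_at is None and now - c.made_at > RECONCILE_WINDOW
    ]
```

A check like this could run on a schedule and open a ticket for every overdue entry, so that the deadline does not depend on someone remembering it.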
OP3 – Observability Before Features
Tagline: Before a feature goes to production, it must be observable.
Explanation
A system that is not observable cannot be operated reliably. Observability is not an afterthought – it is a prerequisite for production operations.
"Observability before features" means: no new feature is deployed that does not emit structured logs, expose metrics, and integrate with the tracing infrastructure.
Implications
- Observability requirements are part of the Definition of Done for every feature
- Log format and trace ID propagation are defined in the service template
- RED metrics (Rate, Errors, Duration) are configured for every service
- Dashboards exist for every service before it receives critical traffic
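What a service template might provide for structured logs, trace ID propagation, and RED metrics can be sketched with the Python standard library alone. All names here (`RedMetrics`, `log_line`, `handle`) are hypothetical, and the in-memory counters stand in for whatever metrics backend is actually in use.

```python
import json
import time
from collections import defaultdict


class RedMetrics:
    """In-memory RED counters per service (a stand-in for a real metrics backend)."""

    def __init__(self):
        self.requests = defaultdict(int)    # Rate: request count, to be divided by a time window
        self.errors = defaultdict(int)      # Errors: failed request count
        self.durations = defaultdict(list)  # Duration: per-request latency in seconds

    def observe(self, service, duration_s, ok):
        self.requests[service] += 1
        if not ok:
            self.errors[service] += 1
        self.durations[service].append(duration_s)


def log_line(service, trace_id, message, **fields):
    """Render one structured log record as JSON; the trace ID is always present."""
    record = {"service": service, "trace_id": trace_id, "message": message, **fields}
    return json.dumps(record, sort_keys=True)


def handle(metrics, service, trace_id, work):
    """Run `work`, record its RED metrics, and return a structured log line.

    Exceptions are counted as errors and not re-raised, to keep the sketch short.
    """
    start = time.perf_counter()
    ok = True
    try:
        work()
    except Exception:
        ok = False
    metrics.observe(service, time.perf_counter() - start, ok)
    return log_line(service, trace_id, "request handled", ok=ok)
```

Because the wrapper lives in the template, a new service gets all three signals without any feature team having to remember them.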
OP4 – Fail Fast, Learn Faster
Tagline: Failures are not catastrophes – missing learning processes are.
Explanation
In complex distributed systems, failures are inevitable. The question is not whether a system will fail, but how quickly it is detected, how quickly it is restored, and what the organization learns from it.
A blameless culture is the foundation: if failures lead to punishment, they will be hidden. If failures lead to learning, they will be communicated openly.
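How quickly failures are detected and restored can be tracked with two simple averages. This is a minimal sketch, assuming each incident record carries three timestamps; the `Incident` type is hypothetical, and time to restore is measured from the start of the failure here, which is one of several common conventions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps that matter here."""
    started_at: datetime
    detected_at: datetime
    restored_at: datetime


def _mean(deltas: List[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average time between the start of a failure and its detection."""
    return _mean([i.detected_at - i.started_at for i in incidents])


def mean_time_to_restore(incidents: List[Incident]) -> timedelta:
    """Average time between the start of a failure and restored service."""
    return _mean([i.restored_at - i.started_at for i in incidents])
```

Reviewing both numbers after each postmortem gives a rough signal of whether the learning process is actually shortening detection and recovery.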
OP5 – Minimize Toil
Tagline: Toil is the invisible enemy of operational excellence.
Explanation
Toil is manual, repetitive, automatable work with no lasting value that grows in proportion to traffic. Not all overhead or unwanted work is toil: the term applies only to work with these specific properties.
Google's SRE organization states the goal of keeping toil below 50% of each SRE's time. When a team produces more toil than features, it is in a debt spiral.
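Tracking the toil share is simple arithmetic once hours are recorded. The sketch below assumes a team logs toil hours and total engineering hours per period; the function names are hypothetical, and the budget is passed as a parameter so the check works for whichever limit the team adopts.

```python
def toil_ratio(toil_hours: float, total_hours: float) -> float:
    """Fraction of engineering time spent on toil in a given period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_budget(toil_hours: float, total_hours: float, budget: float) -> bool:
    """True when the toil share exceeds the budget (a fraction between 0 and 1)."""
    return toil_ratio(toil_hours, total_hours) > budget
```

For example, 10 toil hours in a 40-hour week give a ratio of 0.25, which exceeds a budget of 0.2 but not a budget of 0.5.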
OP6 – Safe by Default
Tagline: Deployment safety is not opt-in – it is the standard.
Explanation
Safe deployments are the default, not the exception. Progressive Delivery (Canary, Blue/Green, Feature Flags) is not optional for teams that deploy frequently.
"Safe by Default" means: the team must actively decide to choose an unsafe deployment path – not actively opt into a safe one.
OP7 – Operational Debt Is Technical Risk
Tagline: Untracked Operational Debt is a ticking time bomb.
Explanation
Operational Debt is like technical debt – it accumulates interest in the form of increased incident rates, extended MTTR, on-call burnout, and dependency on specific individuals.
The difference: technical debt is visible in code. Operational Debt is often invisible – it lives in engineers' minds, in informal processes, in "that’s how we always do it" cultures.
Making it visible, quantifying it, prioritizing it – that is the first step.