Design Principles: Operational Excellence

The design principles are technical requirements on the architecture of systems that are meant to achieve operational excellence. They complement the principles (OP1–OP7) with concrete design decisions.

OD1 – Structured Logging with Consistent Schema

All services MUST output structured logs in JSON. The log schema MUST contain at minimum the following fields:

Field	Type	Description
`timestamp`	ISO 8601	Timestamp of the log line in UTC
`level`	String	`DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL`
`service`	String	Service name (from environment variable or deployment configuration)
`trace_id`	String	OpenTelemetry trace ID or AWS X-Ray trace ID
`span_id`	String	OpenTelemetry span ID
`request_id`	String	Request-specific ID for log correlation
`message`	String	Human-readable description of the event
`error`	Object (optional)	Error details with `type`, `message`, `stack`

Implication

Logging framework configuration is part of the service template
Log level is configurable via environment variable (no rebuild for level changes)
Sensitive data (passwords, PII) MUST NEVER appear in logs; the logger scrubs it before output
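
The schema and the implications above can be sketched with Python's standard logging module. This is a minimal illustration, not a prescribed implementation: the environment variable names SERVICE_NAME and LOG_LEVEL and the scrubbing pattern are assumptions made for the example, and a real service template would likely use a more thorough scrubbing rule set.

```python
import json
import logging
import os
import re
from datetime import datetime, timezone

# Hypothetical scrubbing rule: mask the value in key=value pairs with sensitive keys.
_SECRET = re.compile(r"(?i)\b(password|token|secret)=\S+")
# Python names two levels differently from the OD1 schema.
_LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


def _scrub(text: str) -> str:
    return _SECRET.sub(r"\1=[REDACTED]", text)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object using the OD1 field names."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(),
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": os.environ.get("SERVICE_NAME", "unknown"),  # assumed variable name
            # Correlation IDs are expected via `extra=`; empty when absent.
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
            "request_id": getattr(record, "request_id", ""),
            "message": _scrub(record.getMessage()),
        }
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc, _tb = record.exc_info
            entry["error"] = {
                "type": exc_type.__name__,
                "message": _scrub(str(exc)),
                "stack": _scrub(self.formatException(record.exc_info)),
            }
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Create a logger whose level comes from the environment (no rebuild needed)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(os.environ.get("LOG_LEVEL", "INFO").upper())  # assumed variable name
    return logger
```

A call such as build_logger("orders").info("payment accepted", extra={"trace_id": "...", "request_id": "..."}) would then emit one JSON line per event.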

OD2 – Trace ID Propagation Across Service Boundaries

All services MUST extract trace IDs and span IDs from incoming requests and propagate them in outgoing requests.

W3C TraceContext headers (traceparent, tracestate) are the standard
If a trace context is absent, a new trace MUST be created
Async communication (SQS, Kafka) MUST transport trace context in message metadata (SQS message attributes, Kafka record headers)

Implication

OpenTelemetry SDK is configured in the service template
HTTP clients and servers are automatically instrumented (auto-instrumentation)
Async messaging integration is explicitly instrumented
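
To make the extract-or-create rule concrete, the following sketch parses a version 00 traceparent header by hand. It is for illustration only: in practice the OpenTelemetry SDK's propagators would do this work. The function names, the assumption that header keys arrive lowercased, and the choice to mark new traces as sampled (flags 01) are all assumptions of this example.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract_context(headers: Dict[str, str]) -> Tuple[str, Optional[str], str]:
    """Return (trace_id, parent_span_id, flags) from incoming headers.

    A new trace is started when the header is absent or invalid
    (all-zero trace or parent IDs are invalid).
    """
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match and set(match.group(1)) != {"0"} and set(match.group(2)) != {"0"}:
        return match.group(1), match.group(2), match.group(3)
    return secrets.token_hex(16), None, "01"


def outgoing_headers(trace_id: str, flags: str,
                     tracestate: Optional[str] = None) -> Dict[str, str]:
    """Build headers for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)
    headers = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
    if tracestate:
        headers["tracestate"] = tracestate  # passed through unchanged
    return headers
```

The same dictionary could be attached as message metadata when publishing to a queue, which is the part that typically needs explicit instrumentation.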

OD3 – Health Endpoints as Operational Interface

Every service MUST provide the following health endpoints:

Endpoint	HTTP Method	Description
`/health/live`	GET	Liveness probe: service is running (200 = alive, 503 = restart required)
`/health/ready`	GET	Readiness probe: service is ready for traffic (checks DB connection, dependencies)
`/health/startup`	GET	Startup probe: service initialization completed
`/metrics`	GET	Metrics in Prometheus text exposition format or OpenMetrics format

Implication

Kubernetes/ECS liveness and readiness probes reference these endpoints
/health/ready MUST return 200 only when all required dependencies are reachable; otherwise it returns 503
/metrics is not publicly accessible (internal service mesh or VPC-only)
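
The probe semantics can be sketched as a framework-independent function that maps a request path to a status code. HealthState, its fields, and handle_health are hypothetical names for this example; a real service would wire the function into its HTTP framework and supply its own dependency checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class HealthState:
    started: bool = False  # set to True once initialization has completed
    # Hypothetical dependency checks: name -> callable returning True when reachable.
    checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)


def _is_up(check: Callable[[], bool]) -> bool:
    try:
        return bool(check())
    except Exception:
        return False  # a check that raises counts as an unreachable dependency


def handle_health(path: str, state: HealthState) -> Tuple[int, dict]:
    """Map a GET path to (HTTP status, JSON-serializable body)."""
    if path == "/health/live":
        # Answering at all shows the process is alive.
        return 200, {"status": "alive"}
    if path == "/health/startup":
        if state.started:
            return 200, {"status": "started"}
        return 503, {"status": "starting"}
    if path == "/health/ready":
        failed = sorted(name for name, check in state.checks.items()
                        if not _is_up(check))
        if state.started and not failed:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready", "failed": failed}
    return 404, {"status": "unknown path"}
```

Keeping liveness free of dependency checks is a deliberate choice in this sketch: an unreachable database should take the instance out of rotation (readiness), not trigger a restart (liveness).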

OD4 – Immutable Artifacts & Versioned Deployments

Every deployment artifact (container image, Lambda ZIP, AMI) MUST:

Be immutable – no in-place patching of deployed artifacts
Be versioned – Git SHA or semantic version as tag
Be signed (maturity level 4+) – container image signing via Cosign or Notation

Implication

latest tags in production deployments are forbidden
Container images are not modified after deployment
Rollback = deployment of the previous version (not patching the running one)
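
A deployment pipeline could enforce the tagging rule with a small check like the one below. This is a sketch under assumptions: the function name and the accepted tag formats (semantic version or 7 to 40 character hexadecimal Git SHA) are choices made for this example. Digest-only references are rejected here because OD4 asks for a version tag; accepting them would be an equally reasonable policy.

```python
import re

# Accepted tags in this sketch: semantic version (optional "v" prefix and
# pre-release/build suffix) or an abbreviated/full hexadecimal Git SHA.
_VERSION_TAG = re.compile(
    r"^(v?\d+\.\d+\.\d+([-+][0-9A-Za-z.-]+)?|[0-9a-f]{7,40})$"
)


def has_version_tag(image_ref: str) -> bool:
    """Return True if the image reference carries a version-like tag.

    Untagged references (which resolve to "latest") and explicit
    "latest" tags are rejected.
    """
    without_digest = image_ref.split("@", 1)[0]
    # Only the last path segment can carry a tag; a colon earlier in the
    # reference belongs to the registry port.
    last_segment = without_digest.rsplit("/", 1)[-1]
    _name, _sep, tag = last_segment.partition(":")
    return bool(_VERSION_TAG.match(tag))
```

For example, registry.example.com/app:1.4.2 passes, while app:latest and registry.example.com:5000/app (no tag) are rejected.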

OD5 – Blue/Green or Canary as Default Deployment Pattern

Production deployments MUST have a mechanism for gradual traffic shifting.

Pattern	Description	Recommended Use
Blue/Green	Two complete environments; traffic switch via load balancer	Stateful services, database migration deployments
Canary	Gradual traffic increase of 5% → 25% → 100%	Stateless services, frequent deployments
Feature Flags	Code deployed dark; feature activated via flag	New features, A/B tests, low-risk rollouts

Implication

Deployment configuration defines the traffic split mechanism
Health checks automatically determine whether to promote or roll back
Every deployment can be rolled back within 5 minutes
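
The promote-or-roll-back logic for a canary rollout can be sketched as a short loop. The callables set_traffic and is_healthy are hypothetical hooks standing in for the load balancer or service mesh API and the health evaluation; the step sizes are taken from the table above.

```python
from typing import Callable, Sequence

CANARY_STEPS = (5, 25, 100)  # percent of traffic sent to the new version


def run_canary(set_traffic: Callable[[int], None],
               is_healthy: Callable[[], bool],
               steps: Sequence[int] = CANARY_STEPS) -> bool:
    """Shift traffic step by step; roll back on the first failed health check.

    Returns True if the new version was fully promoted, False if the
    rollout was rolled back (traffic to the new version set to 0).
    """
    for percent in steps:
        set_traffic(percent)
        if not is_healthy():
            set_traffic(0)  # automatic rollback to the previous version
            return False
    return True
```

A real rollout would also wait for an observation window after each step before evaluating health; that wait is left out here to keep the control flow visible.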

OD6 – Idempotent Operations & Retry Safety

All operational actions (deployments, remediation scripts, runbook automations) MUST be idempotent: executing them multiple times must produce the same result as executing them once.

Implication

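Terraform is declarative: re-applying an unchanged configuration makes no changes (provisioners and external scripts need their own idempotency checks)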
Runbook automations check pre-state before action
API calls in automations use upsert semantics, not create
Retry logic in deployments and automations is explicitly configured (attempt limit and backoff)
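
The pre-state check, upsert semantics, and retry configuration can be illustrated together. In this sketch an in-memory dictionary stands in for the remote system, and ensure_record and with_retry are hypothetical helper names; the point is only that repeating the operation, whether by hand or through a retry, converges on the same state.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def ensure_record(store: dict, key: str, desired: dict) -> bool:
    """Upsert `desired` under `key`; return True only if a change was made."""
    if store.get(key) == desired:  # pre-state check: already as desired
        return False
    store[key] = dict(desired)  # create or overwrite, never "create only"
    return True


def with_retry(action: Callable[[], T], attempts: int = 3,
               base_delay: float = 0.1,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `action`, retrying on any exception with exponential backoff.

    Retrying is only safe because the action itself is idempotent.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # attempts exhausted: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Calling with_retry(lambda: ensure_record(store, "alarm", {"threshold": 5})) twice leaves the store in the same state as calling it once; the second call simply reports that nothing changed.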

OD7 – Configuration External and Auditable

All configuration MUST be separated from code (12-Factor App, Factor III).

Environment-specific configuration in environment variables or Secrets Manager
Configuration is defined in IaC (Terraform variables, Parameter Store, AppConfig)
Configuration is versioned – changes are traceable
Sensitive configuration (credentials, tokens) MUST NEVER be stored in code or IaC variables

Implication

The configuration and secret store of the target platform is integrated (AWS Parameter Store, AWS Secrets Manager, Azure Key Vault, GCP Secret Manager)
Configuration-as-Code approach: changes to configuration go through pull request
Configuration changes are visible in the audit trail
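
On the service side, separating configuration from code usually means reading it once at startup and failing fast when a required value is missing. The sketch below assumes hypothetical variable names (UPSTREAM_URL, LOG_LEVEL, REQUEST_TIMEOUT_SECONDS) and covers non-sensitive values only; secrets would be fetched from the secret store rather than read this way.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    upstream_url: str
    log_level: str
    request_timeout_seconds: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build immutable settings from environment variables.

    Raises RuntimeError when a required variable is missing, so a
    misconfigured deployment fails at startup instead of at first use.
    """
    required = ("UPSTREAM_URL",)  # hypothetical required variable
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("missing configuration: " + ", ".join(missing))
    return Settings(
        upstream_url=env["UPSTREAM_URL"],
        log_level=env.get("LOG_LEVEL", "INFO"),
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "5")),
    )
```

Passing the mapping as a parameter keeps the loader testable without touching the process environment.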

OD8 – Operational Readiness as Deployment Gate

Before a service goes to production, it MUST demonstrate "Operational Readiness":

Criterion	Evidence
Observability	Structured logs, metrics, tracing configured and verified
Health Endpoints	`/health/live` and `/health/ready` present and tested
Alerting	At least one symptom-based alert configured and linked to a runbook
Runbook	Runbook for deployment and known failure scenarios present
Rollback	Rollback procedure documented and tested
Dependency Inventory	All dependencies (DBs, external services, queues) documented

Implication

Operational Readiness Checklist is part of the deployment pull request
Peer review includes verification of readiness criteria
Services without Operational Readiness evidence may not go to production
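
If the checklist is kept in machine-readable form, the gate itself can be automated in the pipeline. The following sketch assumes a hypothetical representation in which each criterion from the table maps to a reference to its evidence (for example a link or document path); the key names are invented for this example.

```python
from typing import List, Mapping

# One key per criterion in the Operational Readiness table (names are assumptions).
REQUIRED_CRITERIA = (
    "observability",
    "health_endpoints",
    "alerting",
    "runbook",
    "rollback",
    "dependency_inventory",
)


def missing_evidence(checklist: Mapping[str, str]) -> List[str]:
    """Return the criteria that have no evidence recorded (empty or absent)."""
    return [name for name in REQUIRED_CRITERIA if not checklist.get(name)]


def may_go_to_production(checklist: Mapping[str, str]) -> bool:
    """The gate passes only when every criterion has evidence."""
    return not missing_evidence(checklist)
```

Such a check only confirms that evidence has been recorded; judging whether the evidence is adequate remains part of peer review.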