Design Principles: Operational Excellence
The design principles are technical requirements for the architecture of systems that are meant to achieve operational excellence. They complement the principles (OP1–OP7) with concrete design decisions.
OD1 – Structured Logging with Consistent Schema
All services MUST output structured logs in JSON. The log schema MUST contain at minimum the following fields:
| Field | Type | Description |
|---|---|---|
| | ISO 8601 | Timestamp of the log line in UTC |
| | String | |
| | String | Service name (from environment variable or deployment configuration) |
| | String | OpenTelemetry trace ID or AWS X-Ray trace ID |
| | String | OpenTelemetry span ID |
| | String | Request-specific ID for log correlation |
| | String | Human-readable description of the event |
| | Object (optional) | Error details with |
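A minimal sketch of a formatter that emits one JSON object per log line, using only the Python standard library. The key names (`timestamp`, `level`, `service`, `trace_id`, `span_id`, `request_id`, `message`, `error`) and the shape of the error object are assumptions made for illustration; substitute the names your schema prescribes.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # Key names are assumptions, not taken from the schema table.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # Correlation IDs are read from attributes set via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        # Attach error details only when an exception was actually captured.
        if record.exc_info and record.exc_info[0] is not None:
            entry["error"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
            }
        return json.dumps(entry)
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter("my-service"))` and pass correlation IDs per call, for example `logger.info("order created", extra={"trace_id": "..."})`.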
OD2 – Trace ID Propagation Across Service Boundaries
All services MUST extract trace IDs and span IDs from incoming requests and propagate them in outgoing requests.
- W3C Trace Context headers (`traceparent`, `tracestate`) are the standard
- If a trace context is absent, a new trace MUST be created
- Async communication (SQS, Kafka) MUST transport trace context in message headers
OD3 – Health Endpoints as Operational Interface
Every service MUST provide the following health endpoints:
| Endpoint | HTTP Method | Description |
|---|---|---|
| | GET | Liveness probe: service is running (200 = alive, 503 = restart required) |
| | GET | Readiness probe: service is ready for traffic (checks DB connection, dependencies) |
| | GET | Startup probe: service initialization completed |
| | GET | Prometheus metrics (Prometheus format or OpenMetrics) |
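As a rough illustration of the four endpoints, the sketch below uses Python's built-in `http.server`. The paths (`/livez`, `/readyz`, `/startupz`, `/metrics`), the `state` dictionary, and the `requests_total` metric are hypothetical placeholders; a real service would use its own paths and perform real dependency checks.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical paths, chosen for illustration only.
LIVE, READY, STARTUP, METRICS = "/livez", "/readyz", "/startupz", "/metrics"

# Placeholder service state; a real service would update this at runtime.
state = {"started": False, "db_ok": False, "requests_total": 0}


def probe(path: str) -> tuple:
    """Map a probe path to (HTTP status, response body)."""
    if path == LIVE:
        return 200, "alive"
    if path == READY:
        ok = state["started"] and state["db_ok"]
        return (200, "ready") if ok else (503, "not ready")
    if path == STARTUP:
        return (200, "started") if state["started"] else (503, "starting")
    if path == METRICS:
        # Prometheus text exposition format.
        return 200, (
            "# TYPE requests_total counter\n"
            f"requests_total {state['requests_total']}\n"
        )
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    """Serve the probe results over HTTP GET."""

    def do_GET(self) -> None:
        status, body = probe(self.path)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

The handler can be served with `http.server.HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()`. Keeping the decision logic in `probe` makes it testable without opening a socket.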
OD4 – Immutable Artifacts & Versioned Deployments
Every deployment artifact (container image, Lambda ZIP, AMI) MUST:
- Be immutable – no in-place patching of deployed artifacts
- Be versioned – Git SHA or semantic version as tag
- Be signed (maturity level 4+) – container image signing via Cosign or Notation
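The versioning rule could be enforced in a pipeline with a check like the one below. It is a sketch under simple assumptions: the function name and the two regular expressions are illustrative, digest-pinned references (`name@sha256:...`) are not handled, and signature verification is out of scope.

```python
import re

# A Git SHA (abbreviated or full) or a semantic version, optionally "v"-prefixed.
GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")
SEMVER = re.compile(r"^v?\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$")


def is_versioned_tag(image_ref: str) -> bool:
    """Return True if the tag of image_ref is a Git SHA or a semantic version."""
    _name, sep, tag = image_ref.rpartition(":")
    # No tag at all, or the last colon belonged to a registry port.
    if not sep or "/" in tag:
        return False
    return bool(GIT_SHA.match(tag) or SEMVER.match(tag))
```

A mutable tag such as `latest` is rejected, which keeps a deployed reference tied to one specific build.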
OD5 – Blue/Green or Canary as Default Deployment Pattern
Production deployments MUST have a mechanism for gradual traffic shifting.
| Pattern | Description | Recommended Use |
|---|---|---|
| Blue/Green | Two complete environments; traffic switch via load balancer | Stateful services, database migration deployments |
| Canary | Gradual traffic increase of 5% → 25% → 100% | Stateless services, frequent deployments |
| Feature Flags | Code deployed dark; feature activated via flag | New features, A/B tests, low-risk rollouts |
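The canary pattern can be sketched as a loop that raises the traffic share step by step and rolls back on the first failed health check. The callbacks `set_weight` and `is_healthy` are hypothetical hooks standing in for a load balancer API and a metrics check; the wait time between steps is left out.

```python
from typing import Callable, Sequence


def canary_rollout(
    set_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (5, 25, 100),
) -> bool:
    """Shift traffic to the new version in steps.

    Returns True when the final step passes its health check. On the
    first failed check, traffic is set back to 0% and False is returned.
    """
    for percent in steps:
        set_weight(percent)
        if not is_healthy():
            set_weight(0)  # roll back: all traffic to the old version
            return False
    return True
```

In practice `is_healthy` would observe error rate and latency over a soak period before the next step is taken.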
OD6 – Idempotent Operations & Retry Safety
All operational procedures (deployments, remediation scripts, runbook automations) MUST be idempotent: executing them multiple times MUST produce the same result as executing them once.
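One common way to achieve idempotency is to describe the desired end state and change the system only when it differs. The helper below is an illustrative example (the name `ensure_line` is hypothetical): it guarantees that a line exists in a file, and a second run leaves the file untouched.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Ensure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already in
    the desired state. Running it repeatedly yields the same file.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already reached, nothing to do
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(existing + [line]) + "\n")
    return True
```

An append-only version of the same script would add a duplicate line on every retry, which is exactly the behavior this principle rules out.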
OD7 – Configuration External and Auditable
All configuration MUST be separated from code (12-Factor App, Factor III).
- Environment-specific configuration in environment variables or Secrets Manager
- Configuration defined in IaC (Terraform variables, Parameter Store, AppConfig)
- Configuration is versioned – changes are traceable
- Sensitive configuration (credentials, tokens) NEVER in code or IaC variables
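A minimal sketch of reading configuration from the environment instead of hard-coding it. The variable names (`DB_HOST`, `DB_PORT`, `LOG_LEVEL`) and the defaults are assumptions for illustration; secrets would be fetched from a secrets store at runtime rather than given defaults in code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Typed, immutable view of the service configuration."""

    db_host: str
    db_port: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables.

    Fails fast with a clear error if a required variable is missing.
    """
    try:
        return Settings(
            db_host=env["DB_HOST"],  # required, no default
            db_port=int(env.get("DB_PORT", "5432")),
            log_level=env.get("LOG_LEVEL", "INFO"),
        )
    except KeyError as missing:
        raise RuntimeError(
            f"missing required environment variable: {missing}"
        ) from None
```

Passing the mapping as a parameter keeps the loader testable: `load_settings({"DB_HOST": "db.internal"})` works without touching the real environment.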
OD8 – Operational Readiness as Deployment Gate
Before a service goes to production, it MUST demonstrate "Operational Readiness":
| Criterion | Evidence |
|---|---|
| Observability | Structured logs, metrics, tracing configured and verified |
| Health Endpoints | |
| Alerting | At least one symptom-based alert configured and linked to a runbook |
| Runbook | Runbook for deployment and known failure scenarios present |
| Rollback | Rollback procedure documented and tested |
| Dependency Inventory | All dependencies (DBs, external services, queues) documented |
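The gate could be automated in a pipeline along the following lines. This is only a sketch: the criterion names come from the table above, while the function names and the idea of passing evidence as a dictionary of booleans are assumptions.

```python
# Criteria from the Operational Readiness table.
CRITERIA = (
    "Observability",
    "Health Endpoints",
    "Alerting",
    "Runbook",
    "Rollback",
    "Dependency Inventory",
)


def missing_criteria(evidence: dict) -> list:
    """Return the criteria for which no evidence was provided."""
    return [name for name in CRITERIA if not evidence.get(name)]


def gate_passes(evidence: dict) -> bool:
    """The deployment gate opens only when every criterion has evidence."""
    return not missing_criteria(evidence)
```

A pipeline step would call `missing_criteria` and fail the deployment with the returned list as its error message.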