WAF++

Design Principles: Operational Excellence

The design principles are technical requirements for the architecture of systems that must meet the operational excellence bar. They complement the principles (OP1–OP7) with concrete design decisions.

OD1 – Structured Logging with Consistent Schema

All services MUST output structured logs in JSON. The log schema MUST contain at minimum the following fields:

| Field | Type | Description |
|---|---|---|
| timestamp | ISO 8601 | Timestamp of the log line in UTC |
| level | String | DEBUG, INFO, WARN, ERROR, FATAL |
| service | String | Service name (from environment variable or deployment configuration) |
| trace_id | String | OpenTelemetry trace ID or AWS X-Ray trace ID |
| span_id | String | OpenTelemetry span ID |
| request_id | String | Request-specific ID for log correlation |
| message | String | Human-readable description of the event |
| error | Object (optional) | Error details with type, message, stack |

Implication

  • Logging framework configuration is part of the service template

  • Log level is configurable via environment variable (no rebuild for level changes)

  • Sensitive data (passwords, PII) MUST NEVER appear in logs; the logger scrubs it before output
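
The schema and implications above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a reference implementation: the environment variable names `SERVICE_NAME` and `LOG_LEVEL`, the helper `build_logger`, and the scrubbing pattern are assumptions chosen for the example, and a real scrubber would need a far broader rule set.

```python
import json
import logging
import os
import re
from datetime import datetime, timezone

# Hypothetical scrubbing rule: mask values written as password=..., token=..., secret=...
_SECRET_PATTERN = re.compile(r"(?i)\b(password|token|secret)=\S+")
# Python names two levels differently from the OD1 schema.
_LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the OD1 minimum fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            # Correlation IDs are expected via `extra=`; empty string if absent.
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
            "request_id": getattr(record, "request_id", ""),
            "message": _SECRET_PATTERN.sub(r"\1=[REDACTED]", record.getMessage()),
        }
        if record.exc_info and record.exc_info[0] is not None:
            entry["error"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
                "stack": self.formatException(record.exc_info),
            }
        return json.dumps(entry)


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger whose level comes from LOG_LEVEL (assumed variable name)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(os.environ.get("SERVICE_NAME", "unknown-service")))
    logger.handlers = [handler]
    logger.setLevel(os.environ.get("LOG_LEVEL", "INFO").upper())
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("order accepted", extra={"trace_id": "...", "request_id": "..."})` then emits a single JSON line; changing `LOG_LEVEL` and restarting the process changes verbosity without a rebuild.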


OD2 – Trace ID Propagation Across Service Boundaries

All services MUST extract trace IDs and span IDs from incoming requests and propagate them in outgoing requests.

  • W3C TraceContext headers (traceparent, tracestate) are the standard

  • If a trace context is absent, a new trace MUST be created

  • Async communication MUST transport trace context with the message: in message headers (Kafka) or message attributes (SQS)

Implication

  • OpenTelemetry SDK is configured in the service template

  • HTTP clients and servers are automatically instrumented (auto-instrumentation)

  • Async messaging integration is explicitly instrumented
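
The extract-or-create and propagate steps can be sketched by hand for the W3C `traceparent` header. In a real service the OpenTelemetry SDK's propagators would do this work; the function names `extract_or_create` and `inject` are hypothetical, header keys are assumed to be lower-cased already, and only version `00` of the header is accepted to keep the sketch short.

```python
import re
import secrets
from typing import Mapping, MutableMapping

# traceparent = version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract_or_create(headers: Mapping[str, str]) -> dict:
    """Return the trace context of an incoming request, or start a new trace."""
    match = _TRACEPARENT.match(headers.get("traceparent", "").strip().lower())
    # All-zero trace IDs and parent IDs are invalid and treated as absent.
    if match and set(match.group(1)) != {"0"} and set(match.group(2)) != {"0"}:
        trace_id, parent_span_id, flags = match.groups()
        tracestate = headers.get("tracestate")
    else:
        trace_id, parent_span_id, flags = secrets.token_hex(16), None, "01"
        tracestate = None  # vendor state belongs to the old trace, so it is dropped
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "span_id": secrets.token_hex(8),  # new span for the work done in this service
        "flags": flags,
        "tracestate": tracestate,
    }


def inject(ctx: dict, outgoing: MutableMapping[str, str]) -> MutableMapping[str, str]:
    """Write the context into outgoing HTTP headers or message headers/attributes."""
    outgoing["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-{ctx['flags']}"
    if ctx.get("tracestate"):
        outgoing["tracestate"] = ctx["tracestate"]
    return outgoing
```

The same `inject` call applies to asynchronous messaging: the mapping passed in is then the message's header or attribute map rather than an HTTP header map, and the consumer calls `extract_or_create` on it.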


OD3 – Health Endpoints as Operational Interface

Every service MUST provide the following health endpoints:

| Endpoint | HTTP Method | Description |
|---|---|---|
| /health/live | GET | Liveness probe: service is running (200 = alive, 503 = restart required) |
| /health/ready | GET | Readiness probe: service is ready for traffic (checks DB connection, dependencies) |
| /health/startup | GET | Startup probe: service initialization completed |
| /metrics | GET | Prometheus metrics (Prometheus format or OpenMetrics) |

Implication

  • Kubernetes/ECS liveness and readiness probes reference these endpoints

  • /health/ready may only return 200 when all dependencies are reachable

  • /metrics is not publicly accessible (internal service mesh or VPC-only)
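
A minimal sketch of the three health endpoints using only the Python standard library is shown below. The names `DEPENDENCY_CHECKS`, `STATE`, and `health_status` are hypothetical, the database check is a placeholder, and `/metrics` is left out because it would normally be served by a metrics client library.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict

# Hypothetical dependency checks; each returns True when the dependency is reachable.
DEPENDENCY_CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,  # placeholder for a real connection check
}
# Set STATE["started"] = True once initialization has finished.
STATE = {"started": False}


def _reachable(check: Callable[[], bool]) -> bool:
    """Treat an exception from a check the same as an unreachable dependency."""
    try:
        return bool(check())
    except Exception:
        return False


def health_status(path: str) -> int:
    """Map a health endpoint path to the HTTP status it should return."""
    if path == "/health/live":
        return 200  # the process is able to answer, so it is alive
    if path == "/health/startup":
        return 200 if STATE["started"] else 503
    if path == "/health/ready":
        ready = STATE["started"] and all(_reachable(c) for c in DEPENDENCY_CHECKS.values())
        return 200 if ready else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    """Serve the health endpoints; responses carry a status code and no body."""

    def do_GET(self) -> None:
        self.send_response(health_status(self.path))
        self.end_headers()
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()` from `http.server` would expose the endpoints; the address and port are arbitrary. Keeping liveness free of dependency checks avoids restarting a healthy process because a database is briefly unavailable.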


OD4 – Immutable Artifacts & Versioned Deployments

Every deployment artifact (container image, Lambda ZIP, AMI) MUST:

  • Be immutable – no in-place patching of deployed artifacts

  • Be versioned – Git SHA or semantic version as tag

  • Be signed (maturity level 4+) – container image signing via Cosign or Notation

Implication

  • latest tags in production deployments are forbidden

  • Container images are not modified after deployment

  • Rollback = deployment of the previous version (not patching the running one)
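
The tagging rule can be enforced with a small pre-deployment check. The sketch below is one possible interpretation: it accepts a tag that looks like a Git SHA (7 to 40 hex characters) or a semantic version and rejects `latest`, missing tags, and free-form tags. The function names and the exact patterns are assumptions, and signature verification is out of scope here.

```python
import re
from typing import Optional

_GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")
_SEMVER = re.compile(r"^v?\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$")


def image_tag(image_ref: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    name = image_ref.split("@", 1)[0]       # ignore a trailing digest, if any
    last_segment = name.rsplit("/", 1)[-1]  # avoids mistaking a registry port for a tag
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def is_deployable(image_ref: str) -> bool:
    """True if the reference carries a Git SHA or semantic version tag."""
    tag = image_tag(image_ref)
    if tag is None or tag == "latest":
        return False
    return bool(_GIT_SHA.match(tag) or _SEMVER.match(tag))
```

For example, `is_deployable("shop/api:1.4.2")` passes while `is_deployable("shop/api:latest")` does not. A check like this would typically run in the pipeline before the deployment step, so a rejected reference never reaches production.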


OD5 – Blue/Green or Canary as Default Deployment Pattern

Production deployments MUST have a mechanism for gradual traffic shifting.

| Pattern | Description | Recommended Use |
|---|---|---|
| Blue/Green | Two complete environments; traffic switch via load balancer | Stateful services, database migration deployments |
| Canary | Gradual traffic increase of 5% → 25% → 100% | Stateless services, frequent deployments |
| Feature Flags | Code deployed dark; feature activated via flag | New features, A/B tests, low-risk rollouts |

Implication

  • Deployment configuration defines the traffic split mechanism

  • Health checks automatically determine whether to promote or roll back

  • Every deployment can be rolled back within 5 minutes
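
The promote-or-roll-back decision for a canary rollout can be sketched as a small control loop. The callbacks `set_traffic` and `is_healthy` are hypothetical stand-ins for the load balancer API and the health or metrics check, and the soak time a real rollout would wait between steps is omitted.

```python
from typing import Callable, Sequence


def run_canary(
    set_traffic: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (5, 25, 100),
) -> bool:
    """Shift traffic to the new version step by step.

    Returns True when the last step was reached with passing health checks.
    On the first failed check, traffic to the new version is set back to 0%
    and False is returned.
    """
    for percent in steps:
        set_traffic(percent)
        if not is_healthy():
            set_traffic(0)  # rollback: all traffic returns to the previous version
            return False
    return True
```

With a health check that fails at the second step, the traffic sequence is 5, 25, 0: the rollout stops and the previous version serves all requests again.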


OD6 – Idempotent Operations & Retry Safety

All operational tasks (deployments, remediation scripts, runbook automations) MUST be idempotent: executing them multiple times must produce the same result as executing them once.

Implication

  • Terraform is declarative, so repeated applies converge on the same state; imperative steps such as provisioners need their own idempotency checks

  • Runbook automations check pre-state before action

  • API calls in automations use upsert semantics, not create

  • Retry logic in deployments and automations is configured explicitly (attempt limit, backoff)
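
The pre-state check, upsert semantics, and retry behaviour can be illustrated together. In this sketch an in-memory dictionary stands in for the remote system; `ensure_record` and `with_retry` are hypothetical helper names, and the backoff values are arbitrary.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def ensure_record(store: Dict[str, dict], key: str, desired: dict) -> bool:
    """Converge store[key] to `desired`. Returns True only if a change was made."""
    if store.get(key) == desired:  # pre-state check: already in the desired state
        return False
    store[key] = dict(desired)     # upsert: create or overwrite, never create-only
    return True


def with_retry(action: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Run `action`, retrying with exponential backoff on any exception.

    Retrying is only safe because the action is idempotent: a repeated
    attempt cannot produce a different end state than a single one.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # attempt limit reached: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Calling `ensure_record` twice with the same arguments changes the store once and reports no change the second time, which is the property that makes wrapping it in `with_retry` safe.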


OD7 – Configuration External and Auditable

All configuration MUST be separated from code (12-Factor App, Factor III).

  • Environment-specific configuration in environment variables or Secrets Manager

  • Configuration defined in IaC (Terraform variables, Parameter Store, App Config)

  • Configuration is versioned – changes are traceable

  • Sensitive configuration (credentials, tokens) NEVER in code or IaC variables

Implication

  • A parameter or secret store is integrated (for example AWS Parameter Store, AWS Secrets Manager, Azure Key Vault, GCP Secret Manager)

  • Configuration-as-Code approach: changes to configuration go through pull request

  • Configuration changes are visible in the audit trail
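
On the service side, externalized configuration usually means reading and validating settings once at startup. The sketch below assumes three hypothetical settings (`DATABASE_URL`, `REQUEST_TIMEOUT_S`, `FEATURE_NEW_CHECKOUT`) supplied as environment variables; sensitive values are assumed to be injected at runtime from a secret store rather than written into code or IaC variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Typed, immutable view of the service configuration."""

    database_url: str
    request_timeout_s: float
    feature_new_checkout: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment and fail fast on missing values."""
    required = ("DATABASE_URL",)
    missing = [key for key in required if key not in env]
    if missing:
        raise RuntimeError(f"missing required configuration: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "5")),
        feature_new_checkout=env.get("FEATURE_NEW_CHECKOUT", "false").lower() == "true",
    )
```

Passing the environment as a parameter keeps the loader testable: `load_settings({"DATABASE_URL": "..."})` works without touching the process environment, and a missing required value stops the service at startup instead of failing on the first request.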


OD8 – Operational Readiness as Deployment Gate

Before a service goes to production, it MUST demonstrate "Operational Readiness":

| Criterion | Evidence |
|---|---|
| Observability | Structured logs, metrics, tracing configured and verified |
| Health Endpoints | /health/live and /health/ready present and tested |
| Alerting | At least one symptom-based alert configured and linked to a runbook |
| Runbook | Runbook for deployment and known failure scenarios present |
| Rollback | Rollback procedure documented and tested |
| Dependency Inventory | All dependencies (DBs, external services, queues) documented |

Implication

  • Operational Readiness Checklist is part of the deployment pull request

  • Peer review includes verification of readiness criteria

  • Services without Operational Readiness evidence may not go to production
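
A deployment gate of this kind can be automated as a simple checklist evaluation. The sketch below assumes the checklist arrives as a mapping from criterion to an evidence string (for example a link to a dashboard or runbook); the key names and function names are hypothetical, and the check only verifies that evidence is present, not that it is valid.

```python
from typing import List, Mapping

# The six criteria from the Operational Readiness table, as assumed key names.
REQUIRED_CRITERIA = (
    "observability",
    "health_endpoints",
    "alerting",
    "runbook",
    "rollback",
    "dependency_inventory",
)


def missing_evidence(checklist: Mapping[str, str]) -> List[str]:
    """Return the criteria that have no non-empty evidence entry."""
    return [c for c in REQUIRED_CRITERIA if not checklist.get(c, "").strip()]


def gate_passes(checklist: Mapping[str, str]) -> bool:
    """True only when every required criterion has evidence attached."""
    return not missing_evidence(checklist)
```

A pipeline step could call `missing_evidence` on the checklist in the pull request and fail the build with the returned list, which tells the author exactly what is still outstanding. Judging the quality of the evidence remains a task for peer review.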