
Architectural Cost Debt

Architectural Cost Debt is the central new concept of the Cost Optimization pillar. It refers to recurring costs caused by past architecture decisions, often long after the link to the original decision has stopped being visible.

Definition

Architectural Cost Debt is the accumulation of long-term economic impacts of architectural decisions that, at the time of the decision, were not fully assessed or were consciously accepted.

Architectural Cost Debt arises from:
  ├── Decisions without cost impact assessment
  ├── HA/multi-region without SLO basis ("let's make it safe")
  ├── Managed services with high lock-in without exit plan
  ├── Infinite retention as default ("we keep everything")
  ├── Observability over-engineering (DEBUG in production, no sampling)
  └── Missing termination of expired usage (forgotten reservations, snapshots, buckets)

The analogy to technical debt is deliberate:

| Dimension  | Technical Debt                                                | Architectural Cost Debt                                                        |
|------------|---------------------------------------------------------------|--------------------------------------------------------------------------------|
| Origin     | Poor code, missing tests, outdated dependencies               | Missing TCO assessment, lock-in without alternative plan, HA over-engineering  |
| Visibility | Often visible in code review, Sonar metrics, incident patterns | Often invisible: buried in cloud bills, not attributed to any decision         |
| Interest   | Slowed development, growing bug rate, harder onboarding       | Monthly additional costs, rising fixed costs without business growth           |
| Paydown    | Refactoring, upgrades, increasing test coverage               | Architecture revisions, exiting lock-in services, retention cleanup            |
| Governance | Tech debt register, sprint allocation for debt reduction      | Cost debt register, quarterly cost debt review with architecture board         |

Typical Anti-Patterns (The 5 most common)

AP1 – HA without SLO basis

Pattern: Multi-AZ or multi-region deployment for services whose SLO requires 99.5% or less – which would be achievable with a single AZ.

Cost impact: 2–3x infrastructure costs for redundancy that covers no business requirement.

Detection signal: Service SLO is below what the provider guarantees for a single AZ. Or: SLO was never documented, but HA architecture was built anyway.

Paydown: SLO review → decision: raise SLO (and keep HA) or reduce HA. Both options are deliberate decisions that must be documented.
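The detection signal can be written as a simple check. This is a minimal sketch: the `Service` record and the `single_az_availability` parameter are hypothetical, and the value passed for the latter has to come from your provider's SLA rather than from this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    name: str
    slo_percent: Optional[float]  # None means the SLO was never documented
    multi_az: bool                # True if deployed multi-AZ or multi-region


def ha_without_slo_basis(service: Service, single_az_availability: float) -> bool:
    """Return True if the service matches the AP1 detection signal."""
    if not service.multi_az:
        return False
    if service.slo_percent is None:
        # HA architecture was built although no SLO was ever documented.
        return True
    # A single AZ would already satisfy the SLO, so redundancy has no SLO basis.
    return service.slo_percent <= single_az_availability
```

A match is only a prompt for the SLO review; whether to raise the SLO or reduce HA remains a documented decision.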


AP2 – Infinite Retention as Default

Pattern: S3 buckets without lifecycle policy, CloudWatch log groups without retention, RDS snapshots without expiry date. "Keeping it costs almost nothing" – until the total volume grows.

Cost impact: Continuously rising storage costs without proportional business value. For log data, observability costs can come to dominate the cloud budget.

Detection signal: Storage costs grow linearly or faster without proportional growth in workloads or business volume. Observability costs as a share of total budget > 20%.

Paydown: Lifecycle policies for all storage and log resources. Tiered retention: Hot (0–30d), Warm (30–90d), Cold (90–365d), Archive (>365d). Tagging for value-based prioritization.
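The tier boundaries can be sketched as a small lookup. The ranges above overlap at 30, 90 and 365 days, so this sketch assumes that a boundary day still belongs to the warmer tier; that choice is an assumption, not part of the guidance.

```python
def retention_tier(age_days: int) -> str:
    """Map the age of an object or log record (in days) to a retention tier."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 30:
        return "hot"
    if age_days <= 90:
        return "warm"
    if age_days <= 365:
        return "cold"
    return "archive"
```

In practice the same boundaries would be expressed as lifecycle rules on the storage or log service itself; the function is only meant to make the tiering explicit.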


AP3 – Lock-in without Exit Plan

Pattern: Heavy use of proprietary services (AWS Kinesis, Azure Service Bus, GCP Spanner, Snowflake) without documented alternative or exit strategy.

Cost impact: High and growing license/service costs without negotiating position. Provider price increases cannot be answered with a provider switch.

Detection signal: Enterprise agreement renewal without alternative analysis. New services chosen for convenience rather than evaluation.

Paydown: Document exit plan (even without migration intent). Lock-in score in ADRs (1–5). Score 4–5 requires architecture board approval.


AP4 – Observability Over-Engineering

Pattern: DEBUG level logging in production environments, no trace sampling, all logs in hot tier without tiering strategy.

Cost impact: Observability costs exceed compute costs. Expensive APM licenses for data nobody evaluates.

Detection signal: Log ingestion volume grows without proportional incident detection benefit. DEBUG logs in CloudWatch Logs Insights generate no alerts.

Paydown: Log level review, sampling configuration for traces, retention tiering. Value-based analysis: which logs were used in an alert or incident in the last 90 days?
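The 90-day value question can be answered with a simple filter once usage data exists. The sketch below assumes a hypothetical mapping from log source name to the date it was last referenced by an alert or incident (`None` if never); how that mapping is collected depends on your tooling.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def unused_log_sources(
    last_used: Dict[str, Optional[date]],
    today: date,
    window_days: int = 90,
) -> List[str]:
    """Return log sources not used by any alert or incident within the window."""
    cutoff = today - timedelta(days=window_days)
    return sorted(
        source
        for source, used_on in last_used.items()
        if used_on is None or used_on < cutoff
    )
```

Sources returned here are candidates for a lower log level, sampling, or a colder retention tier, not for automatic deletion.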


AP5 – Orphaned Resources

Pattern: Dev/test resources that were not shut down after project completion. Unused reserved instances that no longer match the current workload. Snapshots and backups without an active workload.

Cost impact: Direct waste without any business value.

Detection signal: Resources tagged environment: development or environment: test with no active usage in the last 7 days, i.e. CPU and network activity never above 5%.
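A minimal sketch of this signal follows. The `ResourceUsage` record is hypothetical; it assumes the peak values were measured over the 7-day window and that network activity is expressed as a percentage of provisioned bandwidth, which the guidance does not specify.

```python
from dataclasses import dataclass

NON_PROD_ENVIRONMENTS = {"development", "test"}


@dataclass
class ResourceUsage:
    resource_id: str
    environment: str             # value of the `environment` tag
    peak_cpu_percent: float      # peak over the observation window
    peak_network_percent: float  # peak over the observation window


def is_orphan_candidate(usage: ResourceUsage, threshold_percent: float = 5.0) -> bool:
    """Return True for non-prod resources whose activity never exceeded the threshold."""
    if usage.environment not in NON_PROD_ENVIRONMENTS:
        return False
    return (
        usage.peak_cpu_percent <= threshold_percent
        and usage.peak_network_percent <= threshold_percent
    )
```

Resources flagged this way are candidates for review or auto-shutdown; the owner tag should decide who confirms the cleanup.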

Paydown: Idle detection automation. Auto-shutdown policy for non-prod outside business hours. Reservation audit quarterly.

Detection Signals

The following signals indicate accumulating Architectural Cost Debt:

| Signal                  | Description                                                                               | Threshold (guidance)                                               |
|-------------------------|-------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
| Fixed Cost Drift        | Fixed costs (reservations, support, licenses) grow without proportional business growth.  | Fixed cost share > 60% of cloud budget without commitment strategy |
| Observability Dominance | Logging/monitoring costs exceed compute costs.                                            | Observability > 20% of total cloud budget                          |
| HA Cost Share           | Redundancy costs (multi-AZ, multi-region) without documented SLO requirement.             | HA overhead > 30% without SLO evidence                             |
| Egress Dominance        | Data transfer costs grow faster than data volume.                                         | Egress > 15% of total cloud budget                                 |
| Storage Growth          | Storage costs grow linearly without lifecycle policies.                                   | Monthly storage growth > 5% without new workloads                  |
| Unused Commitment       | Reserved instances or savings plans are paid for but not fully used by running workloads. | RI utilization < 80%                                               |
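The purely numeric thresholds above can be checked mechanically. The sketch below uses a hypothetical `MonthlyCostSnapshot` record and covers only the ratios; qualifiers such as "without commitment strategy" or "without new workloads", and the HA Cost Share signal that depends on SLO evidence, still need human judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonthlyCostSnapshot:
    total: float                   # total cloud spend for the month
    fixed: float                   # reservations, support, licenses
    observability: float           # logging and monitoring
    egress: float                  # data transfer out
    storage: float                 # storage spend this month
    storage_previous_month: float  # storage spend last month
    ri_utilization: float          # 0.0 to 1.0


def triggered_signals(s: MonthlyCostSnapshot) -> List[str]:
    """Return the names of the numeric guidance thresholds the snapshot crosses."""
    if s.total <= 0:
        raise ValueError("total must be positive")
    signals: List[str] = []
    if s.fixed / s.total > 0.60:
        signals.append("fixed_cost_drift")
    if s.observability / s.total > 0.20:
        signals.append("observability_dominance")
    if s.egress / s.total > 0.15:
        signals.append("egress_dominance")
    if s.storage_previous_month > 0:
        growth = (s.storage - s.storage_previous_month) / s.storage_previous_month
        if growth > 0.05:
            signals.append("storage_growth")
    if s.ri_utilization < 0.80:
        signals.append("unused_commitment")
    return signals
```

A triggered signal is a starting point for a register entry in status monitoring, not proof of cost debt.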

Impact Assessment

For each identified cost debt entry, an impact assessment is performed:

# cost-debt-register.yml – Example entry
- id: CD-2025-003
  title: "Multi-AZ PostgreSQL without SLO basis"
  description: >
    Production database deployed in 3 AZs. SLO of the service is 99.5%,
    which would also be achievable with single-AZ. No documented SLO requirement
    for multi-AZ.
  detected: "2025-03-01"
  owner: "platform-team"
  estimated_annual_impact_eur: 18400
  status: "paydown"  # monitoring | paydown | accepted
  paydown_plan: >
    SLO review with product owner by 2025-04-15.
    If SLO <= 99.5%: downgrade to single-AZ with multi-AZ for DR tests only.
    Expected saving: ~1,530 EUR/month.
  target_resolution: "2025-06-30"
  related_adr: "docs/adr/ADR-0042-database-ha-strategy.md"
  related_controls:
    - WAF-COST-050
    - WAF-COST-100

Governance Approach

Cost Debt Register

The cost debt register is a versioned YAML or Markdown document in the repository:

  • Each entry has: ID, title, description, owner, annual impact (€), status, paydown plan or acceptance rationale

  • Status options: monitoring (being observed), paydown (actively reduced), accepted (consciously accepted with rationale)

  • Link to the related ADR (if available)
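These rules lend themselves to an automated check, for example in CI. The following is a sketch only: the field `acceptance_rationale` is a hypothetical name (the example entry does not show one), and loading the YAML file is left out so the block needs nothing beyond the standard library.

```python
from dataclasses import dataclass
from typing import List, Optional

ALLOWED_STATUSES = {"monitoring", "paydown", "accepted"}


@dataclass
class CostDebtEntry:
    id: str
    title: str
    description: str
    owner: str
    estimated_annual_impact_eur: float
    status: str
    paydown_plan: Optional[str] = None
    acceptance_rationale: Optional[str] = None  # hypothetical field name
    related_adr: Optional[str] = None           # optional link to the ADR

    def validation_errors(self) -> List[str]:
        """Return a list of rule violations; an empty list means the entry is valid."""
        errors: List[str] = []
        if self.status not in ALLOWED_STATUSES:
            errors.append(f"unknown status '{self.status}'")
        if self.estimated_annual_impact_eur < 0:
            errors.append("annual impact must not be negative")
        if self.status == "paydown" and not self.paydown_plan:
            errors.append("status 'paydown' requires a paydown plan")
        if self.status == "accepted" and not self.acceptance_rationale:
            errors.append("status 'accepted' requires an acceptance rationale")
        return errors
```

Returning a list of messages rather than raising on the first problem lets a review see every issue with an entry at once.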

Quarterly Cost Debt Review

The architecture board conducts a quarterly cost debt review:

  1. Identify new entries: From FinOps reviews, anomaly detection, ADR processes

  2. Update existing entries: Paydown progress, changed status

  3. Prioritization: By impact (€/year) and paydown effort

  4. Acceptance decisions: Conscious acceptance with documented rationale

  5. Sign-off: Architecture board confirms awareness and prioritization
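For the prioritization step, one possible ordering is annual impact per unit of paydown effort. This is an assumed scoring rule, not one prescribed here, and the effort unit (person-days) and the `Candidate` type are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    entry_id: str
    annual_impact_eur: float
    paydown_effort_days: float  # hypothetical unit: person-days


def prioritize(candidates: List[Candidate]) -> List[Candidate]:
    """Sort entries by EUR saved per person-day of effort, highest first."""
    for c in candidates:
        if c.paydown_effort_days <= 0:
            raise ValueError(f"{c.entry_id}: effort must be positive")
    return sorted(
        candidates,
        key=lambda c: c.annual_impact_eur / c.paydown_effort_days,
        reverse=True,
    )
```

A ratio favors cheap wins; a board may still pull a large, expensive item forward when its absolute impact justifies it.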

ADR Integration

Every architecture decision record with infrastructure impact includes a cost impact section:

## Cost Impact Assessment (WAF-COST-050)

| Dimension            | Assessment                         |
|---------------------|------------------------------------|
| Estimated TCO/year  | 24,000 EUR                         |
| Lock-in risk        | 3/5 (Medium lock-in)               |
| Data transfer costs | ~200 EUR/month (egress estimate)   |
| Operational effort  | 0.3 FTE/year                       |
| Exit costs (est.)   | ~50,000 EUR migration effort       |
| 3-year NPV          | -68,000 EUR (excl. business value) |

**Decision:** Accepted because [rationale].
**Cost debt risk:** Low / Medium / High
**Cost Debt Register:** Entry CD-2025-XXX created (if risk is Medium/High).
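The 3-year NPV row can be reproduced with a standard discounted cash flow calculation. The template does not state its inputs, so the sketch below assumes the yearly TCO is the only cash flow, payments fall at the end of each year, and the discount rate is 3%; under those assumptions the result is roughly -67,900 EUR, close to the rounded figure in the table.

```python
from typing import Sequence


def npv(cash_flows: Sequence[float], discount_rate: float) -> float:
    """Net present value of end-of-year cash flows for years 1, 2, 3, ..."""
    return sum(
        cash_flow / (1 + discount_rate) ** year
        for year, cash_flow in enumerate(cash_flows, start=1)
    )


# Three years of 24,000 EUR TCO as outflows, discounted at an assumed 3%.
three_year_npv = npv([-24_000, -24_000, -24_000], discount_rate=0.03)
print(round(three_year_npv))
```

Exit costs and business value are deliberately left out here, matching the "excl. business value" note; adding them means appending further cash flows to the list.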

Difference from Technical Debt

Architectural Cost Debt differs from technical debt in important ways:

  • Visibility: Technical debt is visible in code. Cost debt hides in cloud bills.

  • Measurement: Technical debt is measured in code metrics. Cost debt is measured in €/month.

  • Ownership: Technical debt belongs to the engineering team. Cost debt requires architecture board involvement.

  • Paydown complexity: Technical debt can often be reduced incrementally. Cost debt often requires structural architecture changes.