Architectural Cost Debt
Architectural Cost Debt is the central new concept of the Cost Optimization pillar. It refers to economic burdens arising from past architecture decisions that now generate monthly costs – often without the connection to the original decision still being visible.
Definition
Architectural Cost Debt is the accumulation of long-term economic impacts of architectural decisions that, at the time of the decision, were not fully assessed or were consciously accepted.
Architectural Cost Debt arises from:
├── Decisions without cost impact assessment
├── HA/multi-region without SLO basis ("let's make it safe")
├── Managed services with high lock-in without exit plan
├── Infinite retention as default ("we keep everything")
├── Observability over-engineering (DEBUG in production, no sampling)
└── Missing termination of expired usage (forgotten reservations, snapshots, buckets)
The analogy to technical debt is deliberate:
| Dimension | Technical Debt | Architectural Cost Debt |
|---|---|---|
| Origin | Poor code, missing tests, outdated dependencies | Missing TCO assessment, lock-in without alternative plan, HA over-engineering |
| Visibility | Often visible in code review, Sonar metrics, incident patterns | Often invisible: buried in cloud bills, not attributed to any decision |
| Interest | Slowed development, growing bug rate, harder onboarding | Monthly additional costs, rising fixed costs without business growth |
| Paydown | Refactoring, upgrades, increasing test coverage | Architecture revisions, exiting lock-in services, retention cleanup |
| Governance | Tech debt register, sprint allocation for debt reduction | Cost debt register, quarterly cost debt review with architecture board |
Typical Anti-Patterns (The 5 most common)
AP1 – HA without SLO basis
Pattern: Multi-AZ or multi-region deployment for services whose SLO requires 99.5% or less – which would be achievable with a single AZ.
Cost impact: 2–3x infrastructure costs for redundancy that covers no business requirement.
Detection signal: Service SLO is below what the provider guarantees for a single AZ. Or: SLO was never documented, but HA architecture was built anyway.
Paydown: SLO review → decision: raise SLO (and keep HA) or reduce HA. Both options are deliberate decisions that must be documented.
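The detection signal for AP1 can be sketched as a small check. This is a minimal illustration, not a definitive implementation: the single-AZ availability figure is a placeholder assumption (the real value must come from the SLA of the service in question), and `Service` and `ha_without_slo_basis` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed placeholder: availability a provider guarantees for a single-AZ
# deployment. Take the real figure from the SLA of the service in question.
SINGLE_AZ_AVAILABILITY = 99.5


@dataclass
class Service:
    name: str
    slo_percent: Optional[float]  # None means the SLO was never documented
    multi_az: bool


def ha_without_slo_basis(svc: Service, single_az: float = SINGLE_AZ_AVAILABILITY) -> bool:
    """Return True if the service runs HA without an SLO that requires it."""
    if not svc.multi_az:
        return False  # no redundancy deployed, nothing to flag
    if svc.slo_percent is None:
        return True  # HA was built although no SLO was ever documented
    return svc.slo_percent <= single_az  # a single AZ would already meet the SLO
```

A flagged service is a candidate for the SLO review, not an automatic downgrade: the outcome may equally be a raised and documented SLO.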
AP2 – Infinite Retention as Default
Pattern: S3 buckets without lifecycle policy, CloudWatch log groups without retention, RDS snapshots without expiry date. "Keeping it costs almost nothing" – until the total volume grows.
Cost impact: Continuously rising storage costs without proportional business value. Unbounded log retention lets observability costs take up a growing share of the cloud budget.
Detection signal: Storage costs grow linearly or faster without a matching growth in workloads or business volume. Observability costs as a share of total budget > 20%.
Paydown: Lifecycle policies for all storage and log resources. Tiered retention: Hot (0–30d), Warm (30–90d), Cold (90–365d), Archive (>365d). Tagging for value-based prioritization.
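The tiered retention scheme can be expressed as a simple age-based classification. The sketch below assumes that the boundary days (30, 90, 365) belong to the younger tier, since the ranges in the text overlap at their edges; `retention_tier` is a hypothetical helper name.

```python
def retention_tier(age_days: int) -> str:
    """Map the age of an object or log record to a retention tier.

    Assumption: boundary days (30, 90, 365) fall into the younger tier.
    """
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days <= 30:
        return "hot"
    if age_days <= 90:
        return "warm"
    if age_days <= 365:
        return "cold"
    return "archive"
```

In practice the same boundaries would be configured as lifecycle rules on the storage or logging service rather than evaluated in application code.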
AP3 – Lock-in without Exit Plan
Pattern: Heavy use of proprietary services (Amazon Kinesis, Azure Service Bus, Google Cloud Spanner, Snowflake) without a documented alternative or exit strategy.
Cost impact: High and growing license/service costs with no negotiating leverage. A provider price increase cannot be answered by switching providers.
Detection signal: Enterprise agreement renewal without alternative analysis. New services chosen for convenience rather than evaluation.
Paydown: Document exit plan (even without migration intent). Lock-in score in ADRs (1–5). Score 4–5 requires architecture board approval.
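The approval rule for the lock-in score is easy to enforce mechanically, for example in an ADR lint step. A minimal sketch, with `requires_board_approval` as a hypothetical function name:

```python
def requires_board_approval(lock_in_score: int) -> bool:
    """Lock-in scores run from 1 to 5; scores 4 and 5 need architecture board approval."""
    if not 1 <= lock_in_score <= 5:
        raise ValueError("lock-in score must be between 1 and 5")
    return lock_in_score >= 4
```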
AP4 – Observability Over-Engineering
Pattern: DEBUG level logging in production environments, no trace sampling, all logs in hot tier without tiering strategy.
Cost impact: Observability costs exceed compute costs. Expensive APM licenses for data nobody evaluates.
Detection signal: Log ingestion volume grows without proportional incident detection benefit. DEBUG-level entries are ingested into CloudWatch Logs but never feed an alert.
Paydown: Log level review, sampling configuration for traces, retention tiering. Value-based analysis: which logs were used in an alert or incident in the last 90 days?
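The value-based analysis boils down to one question per log source: when was it last used in an alert or incident? The sketch below assumes this "last used" date is already known per source (how it is collected depends on the tooling and is not covered here); `unused_log_sources` and the source names in the usage example are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def unused_log_sources(
    last_used: Dict[str, Optional[date]],
    today: date,
    window_days: int = 90,
) -> List[str]:
    """Return log sources not used in any alert or incident within the window.

    last_used maps a log source name to the date it last contributed to an
    alert or incident; None means it was never used.
    """
    cutoff = today - timedelta(days=window_days)
    return sorted(
        name for name, used in last_used.items() if used is None or used < cutoff
    )
```

For example, with `today = date(2025, 6, 30)` a source last used on 2025-01-15 and a source that was never used are both returned, while one used on 2025-06-01 is not. Returned sources are candidates for a lower log level, sampling, or a shorter retention tier.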
AP5 – Orphaned Resources
Pattern: Dev/test resources that were not shut down after project completion. Unused reserved instances that no longer match the current workload. Snapshots and backups without an active workload.
Cost impact: Direct waste without any business value.
Detection signal: Resources tagged environment: development or environment: test without active usage in the last 7 days, i.e. CPU and network activity never above 5%.
Paydown: Idle detection automation. Auto-shutdown policy for non-prod outside business hours. Reservation audit quarterly.
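Idle detection automation could start from a rule like the following. It is a sketch under stated assumptions: the 7-day peak values are assumed to be pre-aggregated from the monitoring system, and `ResourceUsage` and `is_orphan_candidate` are hypothetical names.

```python
from dataclasses import dataclass

NON_PROD_ENVIRONMENTS = {"development", "test"}
IDLE_THRESHOLD_PERCENT = 5.0


@dataclass
class ResourceUsage:
    resource_id: str
    environment: str                # value of the `environment` tag
    peak_cpu_percent_7d: float      # highest CPU utilization in the last 7 days
    peak_network_percent_7d: float  # highest network utilization in the last 7 days


def is_orphan_candidate(res: ResourceUsage) -> bool:
    """Flag non-prod resources whose CPU and network activity never exceeded 5%."""
    if res.environment not in NON_PROD_ENVIRONMENTS:
        return False
    return (
        res.peak_cpu_percent_7d <= IDLE_THRESHOLD_PERCENT
        and res.peak_network_percent_7d <= IDLE_THRESHOLD_PERCENT
    )
```

A flagged resource is only a candidate; confirming with the owning team before shutdown avoids removing infrequently used but still needed environments.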
Detection Signals
The following signals indicate accumulating Architectural Cost Debt:
| Signal | Description | Threshold (guidance) |
|---|---|---|
| Fixed Cost Drift | Fixed costs (reservations, support, licenses) grow without proportional business growth. | Fixed cost share > 60% of cloud budget without commitment strategy |
| Observability Dominance | Logging/monitoring costs exceed compute costs. | Observability > 20% of total cloud budget |
| HA Cost Share | Redundancy costs (multi-AZ, multi-region) without documented SLO requirement. | HA overhead > 30% without SLO evidence |
| Egress Dominance | Data transfer costs grow faster than data volume. | Egress > 15% of total cloud budget |
| Storage Growth | Storage costs grow linearly without lifecycle policies. | Monthly storage growth > 5% without new workloads |
| Unused Commitment | Reserved instances or savings plans are only partly used by running workloads. | RI/savings plan utilization < 80% |
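The guidance thresholds above can be evaluated against a monthly cost snapshot. The following is a minimal sketch: `CostSnapshot` and its field names are hypothetical, all shares are fractions between 0.0 and 1.0, and how the figures are derived from billing data is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CostSnapshot:
    fixed_cost_share: float        # fixed costs / total cloud budget
    has_commitment_strategy: bool
    observability_share: float     # logging + monitoring / total cloud budget
    ha_overhead_share: float       # redundancy overhead as a fraction of cost
    has_slo_evidence: bool
    egress_share: float            # data transfer / total cloud budget
    monthly_storage_growth: float  # month-over-month storage growth
    new_workloads_added: bool
    ri_utilization: float          # used commitment / purchased commitment


def triggered_signals(s: CostSnapshot) -> List[str]:
    """Return the names of all detection signals whose guidance threshold is crossed."""
    checks = [
        ("Fixed Cost Drift", s.fixed_cost_share > 0.60 and not s.has_commitment_strategy),
        ("Observability Dominance", s.observability_share > 0.20),
        ("HA Cost Share", s.ha_overhead_share > 0.30 and not s.has_slo_evidence),
        ("Egress Dominance", s.egress_share > 0.15),
        ("Storage Growth", s.monthly_storage_growth > 0.05 and not s.new_workloads_added),
        ("Unused Commitment", s.ri_utilization < 0.80),
    ]
    return [name for name, hit in checks if hit]
```

Since the thresholds are guidance values, they are better treated as configurable defaults than as hard limits.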
Impact Assessment
For each identified cost debt entry, an impact assessment is performed:
# cost-debt-register.yml – Example entry
- id: CD-2025-003
  title: "Multi-AZ PostgreSQL without SLO basis"
  description: >
    Production database deployed in 3 AZs. SLO of the service is 99.5%,
    which would also be achievable with single-AZ. No documented SLO requirement
    for multi-AZ.
  detected: "2025-03-01"
  owner: "platform-team"
  estimated_annual_impact_eur: 18400
  status: "paydown"  # monitoring | paydown | accepted
  paydown_plan: >
    SLO review with product owner by 2025-04-15.
    If SLO <= 99.5%: downgrade to single-AZ with multi-AZ for DR tests only.
    Expected saving: ~1,530 EUR/month.
  target_resolution: "2025-06-30"
  related_adr: "docs/adr/ADR-0042-database-ha-strategy.md"
  related_controls:
    - WAF-COST-050
    - WAF-COST-100
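Register entries like the one above lend themselves to automated validation, for example in CI. The sketch below works on an already parsed entry (a plain dictionary) to stay free of third-party dependencies. `validate_entry` is a hypothetical helper, and the `acceptance_rationale` key is an assumption: the example entry shows only `paydown_plan`.

```python
from typing import Any, Dict, List

ALLOWED_STATUSES = {"monitoring", "paydown", "accepted"}
REQUIRED_FIELDS = (
    "id", "title", "description", "owner",
    "estimated_annual_impact_eur", "status",
)


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    status = entry.get("status")
    if status is not None and status not in ALLOWED_STATUSES:
        problems.append(f"invalid status: {status}")
    if status == "paydown" and not entry.get("paydown_plan"):
        problems.append("status 'paydown' needs a paydown_plan")
    # 'acceptance_rationale' is a hypothetical key, not shown in the example entry.
    if status == "accepted" and not entry.get("acceptance_rationale"):
        problems.append("status 'accepted' needs an acceptance_rationale")
    return problems
```

The figures in the example entry are consistent with each other: a saving of about 1,530 EUR/month corresponds to roughly 18,400 EUR/year.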
Governance Approach
Cost Debt Register
The cost debt register is a versioned YAML or Markdown document in the repository:
- Each entry has: ID, title, description, owner, annual impact (€), status, paydown plan or acceptance rationale
- Status options: monitoring (being observed), paydown (actively reduced), accepted (consciously accepted with rationale)
- Link to the related ADR (if available)
Quarterly Cost Debt Review
The architecture board conducts a quarterly cost debt review:
- Identify new entries: From FinOps reviews, anomaly detection, ADR processes
- Update existing entries: Paydown progress, changed status
- Prioritization: By impact (€/year) and paydown effort
- Acceptance decisions: Conscious acceptance with documented rationale
- Sign-off: Architecture board confirms awareness and prioritization
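The prioritization step combines impact and paydown effort, but the text does not say how. One possible reading, shown here purely as an assumption, is to rank entries by annual impact per day of paydown effort. `DebtItem`, `paydown_effort_days` and `prioritize` are hypothetical names.

```python
from typing import List, NamedTuple


class DebtItem(NamedTuple):
    entry_id: str
    annual_impact_eur: float
    paydown_effort_days: float  # hypothetical field: estimated effort in person-days


def prioritize(items: List[DebtItem]) -> List[DebtItem]:
    """Sort entries by annual impact per effort day, highest ratio first.

    Effort below one day is treated as one day to avoid division by zero.
    """
    return sorted(
        items,
        key=lambda i: i.annual_impact_eur / max(i.paydown_effort_days, 1.0),
        reverse=True,
    )
```

A pure ratio favors quick wins. A board that wants large structural items to surface as well may prefer a two-dimensional view (impact versus effort) over a single ranking.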
ADR Integration
Every architecture decision record with infrastructure impact includes a cost impact section:
## Cost Impact Assessment (WAF-COST-050)
| Dimension | Assessment |
|---------------------|------------------------------------|
| Estimated TCO/year | 24,000 EUR |
| Lock-in risk | 3/5 (Medium lock-in) |
| Data transfer costs | ~200 EUR/month (egress estimate) |
| Operational effort | 0.3 FTE/year |
| Exit costs (est.) | ~50,000 EUR migration effort |
| 3-year NPV | -68,000 EUR (excl. business value) |
**Decision:** Accepted because [rationale].
**Cost debt risk:** Low / Medium / High
**Cost Debt Register:** Entry CD-2025-XXX created (if risk is Medium/High).
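The 3-year NPV row can be derived from the yearly TCO with a standard discounted cash flow calculation. The sketch below assumes a constant yearly cost paid at the end of each year. The discount rate is not given in the template, so the 3% used in the usage note is purely an assumption, and `npv_of_costs` is a hypothetical helper name.

```python
def npv_of_costs(annual_cost_eur: float, years: int, discount_rate: float) -> float:
    """Net present value of a constant yearly cost (negative value = cash outflow).

    Assumes each payment falls at the end of its year.
    """
    return -sum(
        annual_cost_eur / (1 + discount_rate) ** year
        for year in range(1, years + 1)
    )
```

With `npv_of_costs(24_000, 3, 0.03)` the result is close to the -68,000 EUR shown in the table; with a rate of 0 it is simply -72,000 EUR. Exit costs are not part of this figure and would have to be added as a separate cash flow.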
Difference from Technical Debt
Architectural Cost Debt differs from technical debt in important ways:
- Visibility: Technical debt is visible in code. Cost debt hides in cloud bills.
- Measurement: Technical debt is measured in code metrics. Cost debt is measured in €/month.
- Ownership: Technical debt belongs to the engineering team. Cost debt requires architecture board involvement.
- Paydown complexity: Technical debt can often be reduced incrementally. Cost debt often requires structural architecture changes.