Best Practice: Managing Architectural Cost Debt
Context
Architectural cost debt accumulates silently. Every architectural decision made without a complete cost impact assessment is a potential cost debt entry. High-availability (HA) designs without an SLO basis, proprietary services without an exit plan, retention defaults without a lifecycle strategy: each adds recurring monthly costs that become hard to trace back to the original decision.
This best practice shows how cost debt is systematically prevented, detected, and reduced.
Related Controls
- WAF-COST-050 – Cost Impact Assessment in Architecture Decision Records
- WAF-COST-100 – Architectural Cost Debt Register & Quarterly Review
Target State
- All architectural decisions with infrastructure impact have a complete cost impact assessment
- Known cost debts are recorded in the cost debt register with owner and status
- Quarterly architecture board review ensures no debt goes unnoticed
- Deliberate acceptance of cost debt is possible – but documented
ADR Template with Cost Impact Section
Complete ADR Template
# ADR-XXXX: [Title of the Decision]
**Status:** [Proposed | Accepted | Superseded | Deprecated]
**Date:** YYYY-MM-DD
**Decision Makers:** [Team / Architecture Board]
## Context
[Description of the problem and decision context]
## Options
### Option A: [Name]
[Description]
### Option B: [Name]
[Description]
## Decision
[Chosen option and rationale]
## Consequences
[Positive and negative consequences of the decision]
---
## Cost Impact Assessment (WAF-COST-050)
> Mandatory section for all ADRs with infrastructure impact.
| Dimension | Assessment |
|--------------------------|--------------------------------------------------|
| **Estimated TCO/Year** | XX,XXX EUR |
| **Lock-in Risk** | X/5 – [Rationale] |
| **Data Transfer Costs** | ~XXX EUR/month (estimated egress share) |
| **Operational Effort** | X.X FTE/year (ops + engineering) |
| **Exit Costs (est.)** | ~XX,XXX EUR migration effort |
| **3-Year NPV** | -XX,XXX EUR (infrastructure, excl. business value)|
**Lock-in Score Explanation:**
- 1: Standard APIs, open format, easy vendor switch
- 2: Low proprietary nature, alternatives available
- 3: Medium lock-in, migration possible with manageable effort
- 4: High lock-in, migration costly (requires Architecture Board (AB) approval)
- 5: Extreme lock-in, migration very costly or practically impossible
**Cost Debt Risk:** [Low / Medium / High]
**Rationale:**
[Why is this option the right choice despite the cost debt risk?]
**Cost Debt Register:**
[Entry CD-YYYY-XXX created / No entry required because: ...]
**Next Review:** [Date of the next cost debt review]
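Because the cost section is mandatory, its presence can be checked mechanically, for example in a pull request pipeline. The following is a minimal sketch under stated assumptions: the function name `missing_cost_fields` and the placeholder heuristic are hypothetical, and only the heading and the six dimension labels are taken from the template above.

```python
import re

# Heading and dimension labels copied from the template; everything else
# in this check is an illustrative assumption, not part of WAF-COST-050.
SECTION_HEADING = "## Cost Impact Assessment (WAF-COST-050)"
REQUIRED_DIMENSIONS = [
    "Estimated TCO/Year",
    "Lock-in Risk",
    "Data Transfer Costs",
    "Operational Effort",
    "Exit Costs (est.)",
    "3-Year NPV",
]


def missing_cost_fields(adr_text: str) -> list[str]:
    """Return required items that are absent or still hold template placeholders."""
    if SECTION_HEADING not in adr_text:
        return [SECTION_HEADING]
    section = adr_text.split(SECTION_HEADING, 1)[1]
    missing = []
    for name in REQUIRED_DIMENSIONS:
        # Match a table row of the form "| **<name>** | <value> |".
        row = re.search(
            r"\|\s*\*\*" + re.escape(name) + r"\*\*\s*\|([^|\n]*)\|", section
        )
        value = row.group(1).strip() if row else ""
        # Unfilled template values look like "XX,XXX EUR", "X.X FTE" or "X/5".
        if not value or re.search(r"X[X,.]*X|X/5", value):
            missing.append(name)
    return missing
```

Run against the unfilled template, the function would report all six dimensions; run against a completed ADR, it should return an empty list.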
Example: Completed Cost Impact Assessment
## Cost Impact Assessment (WAF-COST-050)
| Dimension | Assessment |
|--------------------------|--------------------------------------------------|
| **Estimated TCO/Year** | 48,000 EUR |
| **Lock-in Risk** | 4/5 – Proprietary data format, high migration effort to alternative streaming services |
| **Data Transfer Costs** | ~400 EUR/month (Kinesis→S3 egress estimate) |
| **Operational Effort** | 0.3 FTE/year (monitoring, shard management) |
| **Exit Costs (est.)** | ~80,000 EUR (consumer migration to Kafka/etc.) |
| **3-Year NPV** | -224,000 EUR |
**Cost Debt Risk:** High
**Rationale:**
Kinesis is the best-integrated AWS-native streaming solution for our
existing AWS stack. The lock-in is deliberately accepted because:
- No multi-cloud requirement in the next 3 years (Architecture Decision AD-2023-11)
- Self-hosting Kafka would cost ~0.8 FTE more in operational effort
- Exit path identified: consumers can be migrated to Apache Kafka on MSK (formal exit plan document due by Q2 2025)
**Cost Debt Register:** Entry CD-2025-003 created.
**Next Review:** Q3 2025 (Architecture Board)
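The 3-year NPV of -224,000 EUR in this example matches three years of TCO plus the estimated exit costs with no discounting (3 × 48,000 + 80,000). That reading is inferred from the numbers, not stated in the assessment. The sketch below reproduces it and shows how a discount rate would change the figure; the function name, the payment timing, and the 5% rate are assumptions for illustration.

```python
def three_year_npv(tco_per_year: float, exit_costs: float,
                   discount_rate: float = 0.0, years: int = 3) -> float:
    """Net present value of pure cost flows (a negative result is a net cost).

    Assumptions: the yearly TCO is paid at the end of years 1..N, and the
    exit costs fall due at the end of the final year.
    """
    npv = 0.0
    for year in range(1, years + 1):
        npv -= tco_per_year / (1 + discount_rate) ** year
    npv -= exit_costs / (1 + discount_rate) ** years
    return npv


# Values from the completed assessment above.
undiscounted = three_year_npv(48_000, 80_000)
print(f"{undiscounted:,.0f} EUR")  # -224,000 EUR

# Hypothetical 5% discount rate: later payments weigh less, so the
# magnitude of the result shrinks.
discounted = three_year_npv(48_000, 80_000, discount_rate=0.05)
print(f"{discounted:,.0f} EUR")
```

Whichever convention is used, stating it next to the NPV row keeps figures comparable across ADRs.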
Setting Up the Cost Debt Register
File Structure
repository/
├── docs/
│   ├── cost-debt-register.yml   # Main register
│   └── adr/
│       └── ADR-0042-*.md
└── infrastructure/
    └── ...
Register Format
# docs/cost-debt-register.yml
version: "1.0"
last_reviewed: "2025-03-01"
reviewed_by: "Architecture Board"
next_review: "2025-06-01"
entries:
  - id: CD-2025-001
    title: "Multi-AZ PostgreSQL without SLO basis"
    category: "ha-over-engineering"
    description: >
      Production database deployed in 3 AZs. Current SLO of the service is 99.5%,
      which matches the AWS RDS SLA commitment for Single-AZ instances
      (99.5%; Multi-AZ: 99.95%).
      No documented business requirement for multi-AZ.
    detected: "2025-01-15"
    owner: "platform-team"
    estimated_annual_impact_eur: 22000
    status: "paydown"
    paydown_plan: >
      SLO review with product owner by 2025-04-15.
      If SLO <= 99.5%: downgrade to single-AZ, multi-AZ only for DR tests.
      Expected saving: ~1,833 EUR/month.
    target_resolution: "2025-06-30"
    related_adr: "docs/adr/ADR-0038-database-ha.md"
    related_controls: [WAF-COST-050, WAF-COST-100]
  - id: CD-2025-002
    title: "CloudWatch Logs without retention – 3 production log groups"
    category: "infinite-retention"
    description: >
      Three CloudWatch Log Groups from 2022 have retention_in_days = 0 (unlimited).
      Cumulative log volume: ~800 GB. Growing at ~15 GB/month.
    detected: "2025-02-01"
    owner: "infrastructure-team"
    estimated_annual_impact_eur: 1200
    status: "paydown"
    paydown_plan: >
      Terraform configuration with retention_in_days = 90 for all three log groups.
      PR planned for Sprint 2025-03-1.
    target_resolution: "2025-03-15"
    related_controls: [WAF-COST-040, WAF-COST-070]
  - id: CD-2025-003
    title: "AWS Kinesis Data Streams – lock-in without exit plan"
    category: "lock-in"
    description: >
      Kinesis-based event streaming with high lock-in (score 4/5).
      Exit plan not yet documented.
    detected: "2025-03-01"
    owner: "data-team"
    estimated_annual_impact_eur: 48000
    status: "monitoring"
    paydown_plan: "Not planned for 2025. Document exit plan by Q2 2025."
    acceptance_rationale: >
      Deliberately accepted: Kinesis is optimal for AWS-native stack. Operational savings
      vs. self-hosted Kafka outweigh lock-in risk.
      Prerequisite: exit plan document by Q2 2025.
    target_resolution: "2025-06-30"
    related_adr: "docs/adr/ADR-0042-streaming-platform.md"
    related_controls: [WAF-COST-050]
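A register kept as YAML under version control can be linted before each merge. The sketch below assumes the file has already been parsed into Python dictionaries (for example with a YAML library) and validates a single entry. The required-field list, the two status values, and the rule that a `monitoring` entry needs an `acceptance_rationale` are inferred from the three example entries and are assumptions, not a published schema.

```python
from datetime import date

# Fields present in all three example entries (related_adr is optional there).
REQUIRED_FIELDS = (
    "id", "title", "category", "description", "detected", "owner",
    "estimated_annual_impact_eur", "status", "target_resolution",
    "related_controls",
)
# Only these two values occur in the example register; extend as needed.
KNOWN_STATUSES = {"paydown", "monitoring"}


def validate_entry(entry: dict) -> list[str]:
    """Return human-readable problems for one entry (empty list = valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if entry.get(field) in (None, "", []):
            problems.append(f"missing field: {field}")
    status = entry.get("status")
    if status and status not in KNOWN_STATUSES:
        problems.append(f"unknown status: {status}")
    for field in ("detected", "target_resolution"):
        value = entry.get(field)
        if value:
            try:
                date.fromisoformat(str(value))  # expects YYYY-MM-DD
            except ValueError:
                problems.append(f"invalid date in {field}: {value}")
    if status == "paydown" and not entry.get("paydown_plan"):
        problems.append("status 'paydown' needs a paydown_plan")
    if status == "monitoring" and not entry.get("acceptance_rationale"):
        problems.append("status 'monitoring' needs an acceptance_rationale")
    return problems


example = {
    "id": "CD-2025-002",
    "title": "CloudWatch Logs without retention",
    "category": "infinite-retention",
    "description": "Three log groups have unlimited retention.",
    "detected": "2025-02-01",
    "owner": "infrastructure-team",
    "estimated_annual_impact_eur": 1200,
    "status": "paydown",
    "paydown_plan": "Set retention_in_days = 90 via Terraform.",
    "target_resolution": "2025-03-15",
    "related_controls": ["WAF-COST-040", "WAF-COST-070"],
}
print(validate_entry(example))  # an empty list means no problems were found
```

Checking entries one at a time keeps the messages specific enough to attach to the offending entry ID in a review comment.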
Quarterly Cost Debt Review Process
Preparation (1 week before review)
- FinOps team creates a preparatory report:
  - New cost debt candidates from the last quarter
  - Paydown progress of existing entries
  - Cost anomalies from the last 90 days
  - ADRs from the quarter with lock-in score >= 3
- Owners of each cost debt entry update their status in advance
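Three of the four report items can be derived from the register and the quarter's ADRs; cost anomalies need billing data and are left out here. The sketch below is one possible way to assemble them. The function name, the output keys, and the lock-in score for ADR-0038 are hypothetical; the caller is assumed to pass only ADRs from the quarter under review.

```python
from datetime import date


def preparation_report(entries, adr_lock_in_scores, quarter_start, quarter_end):
    """Collect the register- and ADR-based items of the preparatory report."""
    new_candidates, existing = [], []
    for entry in entries:
        detected = date.fromisoformat(entry["detected"])
        if quarter_start <= detected <= quarter_end:
            new_candidates.append(entry["id"])
        elif detected < quarter_start:
            existing.append({"id": entry["id"], "status": entry["status"]})
    # Threshold from the preparation list: lock-in score >= 3.
    high_lock_in = sorted(
        adr for adr, score in adr_lock_in_scores.items() if score >= 3
    )
    return {
        "new_candidates": new_candidates,
        "existing_entries": existing,
        "adrs_lock_in_ge_3": high_lock_in,
    }


entries = [
    {"id": "CD-2025-001", "detected": "2025-01-15", "status": "paydown"},
    {"id": "CD-2025-003", "detected": "2025-03-01", "status": "monitoring"},
]
# Hypothetical scores; only the 4/5 for ADR-0042 appears in the register.
scores = {"ADR-0038": 2, "ADR-0042": 4}

report = preparation_report(entries, scores, date(2025, 1, 1), date(2025, 3, 31))
print(report["new_candidates"])     # ['CD-2025-001', 'CD-2025-003']
print(report["adrs_lock_in_ge_3"])  # ['ADR-0042']
```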
Review Agenda (60–90 minutes)
- Quarterly report (10 min): overall cost trend, biggest changes
- New entries (20 min): for each new addition – confirm owner, impact assessment, status decision
- Existing entries (20 min): paydown progress, changed assessments, status updates
- Acceptance decisions (15 min): deliberate acceptance with documented rationale
- Prioritization (15 min): top 3 paydown measures for the next quarter
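The agenda does not say how the top 3 paydown measures are chosen. One simple starting point, shown here as an assumption and not as the prescribed method, is to rank entries in status `paydown` by their estimated annual impact and let the board adjust the order for effort and risk.

```python
def top_paydown_measures(entries, limit=3):
    """Rank entries with status 'paydown' by estimated annual impact (descending)."""
    candidates = [e for e in entries if e.get("status") == "paydown"]
    ranked = sorted(
        candidates,
        key=lambda e: e.get("estimated_annual_impact_eur", 0),
        reverse=True,
    )
    return [e["id"] for e in ranked[:limit]]


# Impact figures and statuses from the example register.
register_entries = [
    {"id": "CD-2025-001", "status": "paydown", "estimated_annual_impact_eur": 22000},
    {"id": "CD-2025-002", "status": "paydown", "estimated_annual_impact_eur": 1200},
    {"id": "CD-2025-003", "status": "monitoring", "estimated_annual_impact_eur": 48000},
]
print(top_paydown_measures(register_entries))  # ['CD-2025-001', 'CD-2025-002']
```

CD-2025-003 has the largest impact but is excluded because it is accepted and monitored, not scheduled for paydown.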
Review Minutes
# docs/cost-debt-reviews/2025-Q1-review.yml
quarter: "2025-Q1"
date: "2025-03-01"
attendees:
  - "CTO (Architecture Board Chair)"
  - "Principal Engineer Platform"
  - "FinOps Lead"
  - "Team Leads: platform, data, infrastructure"
new_entries:
  - id: CD-2025-003
    decision: "Accepted – exit plan required by Q2 2025"
paydown_updates:
  - id: CD-2025-001
    progress: "SLO review planned for April. No paydown in Q1."
  - id: CD-2025-002
    progress: "COMPLETED. PR merged 2025-03-10. Saving: 100 EUR/month confirmed."
strategic_decisions:
  - "No new Kinesis deployment without AB approval (lock-in score 4)"
  - "All new ADRs with infrastructure impact: cost impact assessment mandatory effective immediately"
next_review: "2025-06-01"
sign_off: "Architecture Board"
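Minutes and register live in separate files, so they can drift apart. A small cross-check, sketched below on already-parsed dictionaries, flags entry IDs in the minutes that are unknown to the register and a mismatching `next_review` date. The function name and the choice of checks are assumptions for illustration.

```python
def cross_check(minutes: dict, register: dict) -> list[str]:
    """Return inconsistencies between review minutes and the register."""
    problems = []
    known = {entry["id"] for entry in register.get("entries", [])}
    for section in ("new_entries", "paydown_updates"):
        for item in minutes.get(section, []):
            if item["id"] not in known:
                problems.append(f"{section}: unknown entry {item['id']}")
    if minutes.get("next_review") != register.get("next_review"):
        problems.append("next_review differs between minutes and register")
    return problems


# Reduced versions of the two example files.
register = {
    "next_review": "2025-06-01",
    "entries": [{"id": "CD-2025-001"}, {"id": "CD-2025-002"}, {"id": "CD-2025-003"}],
}
minutes = {
    "next_review": "2025-06-01",
    "new_entries": [{"id": "CD-2025-003"}],
    "paydown_updates": [{"id": "CD-2025-001"}, {"id": "CD-2025-002"}],
}
print(cross_check(minutes, register))  # an empty list means both files agree
```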
Common Anti-Patterns
- Register as PDF: version control and ADR linking impossible
- No owner assignment: entries without owners are not paid down
- Acceptance without rationale: "Accepted" without documented reason is wasted documentation effort
- Review as status report only: review must produce decisions, not just report status
Metrics
- Register completeness: all known cost debts recorded (100%)
- Paydown rate: share of entries with an active paydown plan (target: >= 50%)
- Acceptance ratio: share of deliberately accepted debts (benchmark: < 30%)
- Average impact per entry: total impact ÷ number of entries (KPI for prioritization)
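Three of these metrics can be computed directly from the register; register completeness cannot, because it depends on debts that are not yet recorded. The sketch below makes two assumptions about how the register encodes the concepts: an entry counts as having an active paydown plan when its status is `paydown`, and as deliberately accepted when it carries an `acceptance_rationale`.

```python
def register_metrics(entries):
    """Compute paydown rate, acceptance ratio and average impact per entry."""
    total = len(entries)
    if total == 0:
        return {"paydown_rate": 0.0, "acceptance_ratio": 0.0, "average_impact_eur": 0.0}
    paydown = sum(1 for e in entries if e.get("status") == "paydown")
    accepted = sum(1 for e in entries if e.get("acceptance_rationale"))
    total_impact = sum(e.get("estimated_annual_impact_eur", 0) for e in entries)
    return {
        "paydown_rate": paydown / total,          # target: >= 0.5
        "acceptance_ratio": accepted / total,     # benchmark: < 0.3
        "average_impact_eur": total_impact / total,
    }


# The three entries of the example register, reduced to the relevant fields.
sample = [
    {"status": "paydown", "estimated_annual_impact_eur": 22000},
    {"status": "paydown", "estimated_annual_impact_eur": 1200},
    {"status": "monitoring", "estimated_annual_impact_eur": 48000,
     "acceptance_rationale": "Deliberately accepted"},
]
metrics = register_metrics(sample)
print(f"{metrics['paydown_rate']:.0%}")         # 67%
print(f"{metrics['acceptance_ratio']:.0%}")     # 33%
print(f"{metrics['average_impact_eur']:,.0f}")  # 23,733
```

With only three entries, a single accepted debt already puts the acceptance ratio above the 30% benchmark, so the ratios are more meaningful once the register has grown.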