Best Practice: Managing Architectural Cost Debt

Context

Architectural cost debt accumulates silently. Every architectural decision made without a complete cost impact assessment is a potential cost debt entry. HA designs without an SLO basis, proprietary services without an exit plan, and retention defaults without a lifecycle strategy all add recurring monthly costs that, over time, can no longer be traced back to the decision that caused them.

This best practice shows how to prevent, detect, and reduce cost debt systematically. Related controls:

  • WAF-COST-050 – Cost Impact Assessment in Architecture Decision Records

  • WAF-COST-100 – Architectural Cost Debt Register & Quarterly Review

Target State

  • All architectural decisions with infrastructure impact have a complete cost impact assessment

  • Known cost debts are recorded in the cost debt register with owner and status

  • A quarterly Architecture Board review ensures that no debt goes unnoticed

  • Deliberate acceptance of cost debt is possible, but it must be documented

ADR Template with Cost Impact Section

Complete ADR Template

# ADR-XXXX: [Title of the Decision]

**Status:** [Proposed | Accepted | Superseded | Deprecated]
**Date:** YYYY-MM-DD
**Decision Makers:** [Team / Architecture Board]

## Context

[Description of the problem and decision context]

## Options

### Option A: [Name]
[Description]

### Option B: [Name]
[Description]

## Decision

[Chosen option and rationale]

## Consequences

[Positive and negative consequences of the decision]

---

## Cost Impact Assessment (WAF-COST-050)

> Mandatory section for all ADRs with infrastructure impact.

| Dimension                | Assessment                                       |
|--------------------------|--------------------------------------------------|
| **Estimated TCO/Year**   | XX,XXX EUR                                       |
| **Lock-in Risk**         | X/5 – [Rationale]                               |
| **Data Transfer Costs**  | ~XXX EUR/month (estimated egress share)          |
| **Operational Effort**   | X.X FTE/year (ops + engineering)                 |
| **Exit Costs (est.)**    | ~XX,XXX EUR migration effort                     |
| **3-Year NPV**           | -XX,XXX EUR (infrastructure, excl. business value)|

**Lock-in Score Explanation:**
- 1: Standard APIs, open format, easy vendor switch
- 2: Low proprietary nature, alternatives available
- 3: Medium lock-in, migration possible with manageable effort
- 4: High lock-in, migration costly (requires Architecture Board (AB) approval)
- 5: Extreme lock-in, migration very costly or practically impossible

**Cost Debt Risk:** [Low / Medium / High]

**Rationale:**
[Why is this option the right choice despite the cost debt risk?]

**Cost Debt Register:**
[Entry CD-YYYY-XXX created / No entry required because: ...]

**Next Review:** [Date of the next cost debt review]
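
Because the section is mandatory, a repository check can flag ADRs that omit it. The following is a minimal sketch, assuming ADRs are Markdown files that follow the template above. The function name `missing_cost_dimensions` and the idea of running it in a CI job are illustrative assumptions, not part of WAF-COST-050.

```python
import re

# Dimension labels taken from the template table above.
REQUIRED_DIMENSIONS = [
    "Estimated TCO/Year",
    "Lock-in Risk",
    "Data Transfer Costs",
    "Operational Effort",
    "Exit Costs (est.)",
    "3-Year NPV",
]

SECTION_HEADING = "## Cost Impact Assessment (WAF-COST-050)"


def missing_cost_dimensions(adr_text: str) -> list[str]:
    """Return the required items missing from an ADR's cost impact section.

    Hypothetical helper: an empty list means the section heading and all six
    table rows are present. It does not judge the quality of the values.
    """
    if SECTION_HEADING not in adr_text:
        return [SECTION_HEADING]
    # Only inspect the text that follows the section heading.
    section = adr_text.split(SECTION_HEADING, 1)[1]
    missing = []
    for label in REQUIRED_DIMENSIONS:
        # Match a table row whose first cell is the bold dimension label.
        pattern = r"^\|\s*\*\*" + re.escape(label) + r"\*\*\s*\|"
        if not re.search(pattern, section, flags=re.MULTILINE):
            missing.append(label)
    return missing


if __name__ == "__main__":
    # Made-up ADR fragment with only two of the six rows filled in.
    sample = """# ADR-0001: Example
## Cost Impact Assessment (WAF-COST-050)
| **Estimated TCO/Year**   | 48,000 EUR |
| **Lock-in Risk**         | 4/5 - Proprietary data format |
"""
    for item in missing_cost_dimensions(sample):
        print(f"missing: {item}")
```

A check like this only proves that the rows exist; whether the estimates are plausible remains a review task for the decision makers.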

Example: Completed Cost Impact Assessment

## Cost Impact Assessment (WAF-COST-050)

| Dimension                | Assessment                                       |
|--------------------------|--------------------------------------------------|
| **Estimated TCO/Year**   | 48,000 EUR                                       |
| **Lock-in Risk**         | 4/5 – Proprietary data format, high migration effort to alternative streaming services |
| **Data Transfer Costs**  | ~400 EUR/month (Kinesis→S3 egress estimate)      |
| **Operational Effort**   | 0.3 FTE/year (monitoring, shard management)      |
| **Exit Costs (est.)**    | ~80,000 EUR (consumer migration to Kafka/etc.)   |
| **3-Year NPV**           | -224,000 EUR                                     |

**Cost Debt Risk:** High

**Rationale:**
Kinesis is the best-integrated AWS-native streaming solution for our
existing AWS stack. The lock-in is deliberately accepted because:
- No multi-cloud requirement in the next 3 years (Architecture Decision AD-2023-11)
- Self-hosting Kafka would cost ~0.8 FTE more in operational effort
- Exit path identified: consumers can be migrated to Apache Kafka on MSK
  (formal exit plan to be documented by Q2 2025)

**Cost Debt Register:** Entry CD-2025-003 created.
**Next Review:** Q3 2025 (Architecture Board)
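
The 3-Year NPV in this example matches a simple undiscounted sum: three years of TCO plus the estimated exit costs. The sketch below reproduces that arithmetic and shows how a discount rate would change the figure. The function name, the timing of the cash flows, and the 5% rate are illustrative assumptions; whether exit costs belong in the figure at all is a policy choice for the Architecture Board.

```python
def three_year_npv(annual_tco_eur: float, exit_costs_eur: float,
                   discount_rate: float = 0.0) -> float:
    """Net present value of infrastructure cash outflows over three years.

    Assumptions (not prescribed by WAF-COST-050): costs fall at the end of
    years 1 to 3, the exit cost is incurred at the end of year 3, and
    business value is excluded, so the result is negative.
    """
    npv = 0.0
    for year in (1, 2, 3):
        npv -= annual_tco_eur / (1 + discount_rate) ** year
    npv -= exit_costs_eur / (1 + discount_rate) ** 3
    return npv


# Figures from the completed example: 48,000 EUR/year TCO, ~80,000 EUR exit.
print(round(three_year_npv(48_000, 80_000)))        # -224000 (undiscounted)
print(round(three_year_npv(48_000, 80_000, 0.05)))  # discounted at a hypothetical 5%
```

With a discount rate of zero the function returns exactly the -224,000 EUR shown in the table; any positive rate yields a smaller absolute value.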

Setting Up the Cost Debt Register

File Structure

repository/
├── docs/
│   ├── cost-debt-register.yml    # Main register
│   └── adr/
│       └── ADR-0042-*.md
└── infrastructure/
    └── ...

Register Format

# docs/cost-debt-register.yml
version: "1.0"
last_reviewed: "2025-03-01"
reviewed_by: "Architecture Board"
next_review: "2025-06-01"

entries:
  - id: CD-2025-001
    title: "Multi-AZ PostgreSQL without SLO basis"
    category: "ha-over-engineering"
    description: >
      Production database deployed in 3 AZs. The service's current SLO is 99.5%,
      which a single-AZ deployment could also meet (the published AWS RDS SLA
      lists 99.5% for Single-AZ and 99.95% for Multi-AZ instances).
      No documented business requirement for multi-AZ.
    detected: "2025-01-15"
    owner: "platform-team"
    estimated_annual_impact_eur: 22000
    status: "paydown"
    paydown_plan: >
      SLO review with product owner by 2025-04-15.
      If SLO <= 99.5%: downgrade to single-AZ, multi-AZ only for DR tests.
      Expected saving: ~1,833 EUR/month.
    target_resolution: "2025-06-30"
    related_adr: "docs/adr/ADR-0038-database-ha.md"
    related_controls: [WAF-COST-050, WAF-COST-100]

  - id: CD-2025-002
    title: "CloudWatch Logs without retention – 3 production log groups"
    category: "infinite-retention"
    description: >
      Three CloudWatch Log Groups from 2022 have retention_in_days = 0 (unlimited).
      Cumulative log volume: ~800 GB. Growing at ~15 GB/month.
    detected: "2025-02-01"
    owner: "infrastructure-team"
    estimated_annual_impact_eur: 1200
    status: "paydown"
    paydown_plan: >
      Terraform configuration with retention_in_days = 90 for all three log groups.
      PR planned for Sprint 2025-03-1.
    target_resolution: "2025-03-15"
    related_controls: [WAF-COST-040, WAF-COST-070]

  - id: CD-2025-003
    title: "AWS Kinesis Data Streams – lock-in without exit plan"
    category: "lock-in"
    description: >
      Kinesis-based event streaming with high lock-in (score 4/5).
      Exit plan not yet documented.
    detected: "2025-03-01"
    owner: "data-team"
    estimated_annual_impact_eur: 48000
    status: "monitoring"
    paydown_plan: "Not planned for 2025. Document exit plan by Q2 2025."
    acceptance_rationale: >
      Deliberately accepted: Kinesis is optimal for AWS-native stack. Operational savings
      vs. self-hosted Kafka outweigh lock-in risk.
      Prerequisite: exit plan document by Q2 2025.
    target_resolution: "2025-06-30"
    related_adr: "docs/adr/ADR-0042-streaming-platform.md"
    related_controls: [WAF-COST-050]
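
A register kept as YAML can be linted automatically. The sketch below checks parsed entries for the fields used in the examples above. It assumes the file has already been loaded into Python dictionaries (for instance with PyYAML's `yaml.safe_load`); the rule set, the function name, and the reading of status "monitoring" as "deliberately accepted" are illustrative assumptions rather than a WAF++ schema.

```python
import re

# Fields that every example entry above carries.
REQUIRED_FIELDS = [
    "id", "title", "category", "description", "detected",
    "owner", "estimated_annual_impact_eur", "status", "target_resolution",
]
ID_PATTERN = re.compile(r"^CD-\d{4}-\d{3}$")


def validate_entry(entry: dict) -> list[str]:
    """Return human-readable problems for one register entry (hypothetical rules)."""
    entry_id = entry.get("id", "<no id>")
    # A field counts as missing when it is absent or an empty string.
    problems = [f"{entry_id}: missing field '{name}'"
                for name in REQUIRED_FIELDS if entry.get(name) in (None, "")]
    if "id" in entry and not ID_PATTERN.match(str(entry["id"])):
        problems.append(f"{entry_id}: id does not match CD-YYYY-XXX")
    # Assumption: an entry that is only monitored has been deliberately
    # accepted and therefore needs a documented rationale.
    if entry.get("status") == "monitoring" and not entry.get("acceptance_rationale"):
        problems.append(f"{entry_id}: status 'monitoring' without acceptance_rationale")
    return problems


if __name__ == "__main__":
    # Deliberately broken variant of an entry: no owner, no rationale.
    broken = {
        "id": "CD-2025-003", "title": "Kinesis lock-in", "category": "lock-in",
        "description": "...", "detected": "2025-03-01", "owner": "",
        "estimated_annual_impact_eur": 48000, "status": "monitoring",
        "target_resolution": "2025-06-30",
    }
    for problem in validate_entry(broken):
        print(problem)
```

Running such a check on every pull request that touches the register catches entries without an owner or without a rationale before they reach the quarterly review.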

Quarterly Cost Debt Review Process

Preparation (1 week before review)

  1. FinOps team creates a preparatory report:

    • New cost debt candidates from the last quarter

    • Paydown progress of existing entries

    • Cost anomalies from the last 90 days

    • ADRs from the quarter with lock-in score >= 3

  2. Owners of each cost debt entry update their status in advance
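
Parts of the preparatory report can be derived directly from the register. The following sketch assumes entries are already parsed into dictionaries with ISO-formatted dates, as in the register example. It lists entries detected in the period under review and entries whose target_resolution has passed. The function name, the inclusive date window, and the "resolved" status value are assumptions made for illustration.

```python
from datetime import date


def prepare_review(entries: list[dict], quarter_start: date, review_date: date) -> dict:
    """Split register entries into new candidates and overdue paydowns."""
    new_candidates = []
    overdue = []
    for entry in entries:
        detected = date.fromisoformat(entry["detected"])
        target = date.fromisoformat(entry["target_resolution"])
        # Entries detected inside the (inclusive) review window are new candidates.
        if quarter_start <= detected <= review_date:
            new_candidates.append(entry["id"])
        # Assumption: any entry past its target date counts as overdue
        # unless it has been closed; "resolved" is a hypothetical status value.
        if target < review_date and entry.get("status") != "resolved":
            overdue.append(entry["id"])
    return {"new_candidates": new_candidates, "overdue": overdue}


# IDs and dates taken from the register example above.
entries = [
    {"id": "CD-2025-001", "detected": "2025-01-15",
     "target_resolution": "2025-06-30", "status": "paydown"},
    {"id": "CD-2025-002", "detected": "2025-02-01",
     "target_resolution": "2025-03-15", "status": "paydown"},
    {"id": "CD-2025-003", "detected": "2025-03-01",
     "target_resolution": "2025-06-30", "status": "monitoring"},
]
report = prepare_review(entries, quarter_start=date(2025, 3, 1),
                        review_date=date(2025, 6, 1))
print(report)
```

For a review on 2025-06-01 with a window starting 2025-03-01, this lists CD-2025-003 as a new candidate and CD-2025-002 as overdue, provided its status has not been updated by then. Cost anomalies and lock-in scores from ADRs would have to come from other sources.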

Review Agenda (60–90 minutes)

  1. Quarterly report (10 min): overall cost trend, biggest changes

  2. New entries (20 min): for each new addition – confirm owner, impact assessment, status decision

  3. Existing entries (20 min): paydown progress, changed assessments, status updates

  4. Acceptance decisions (15 min): deliberate acceptance with documented rationale

  5. Prioritization (15 min): top 3 paydown measures for the next quarter

Review Minutes

# docs/cost-debt-reviews/2025-Q1-review.yml
quarter: "2025-Q1"
date: "2025-03-01"
attendees:
  - "CTO (Architecture Board Chair)"
  - "Principal Engineer Platform"
  - "FinOps Lead"
  - "Team Leads: platform, data, infrastructure"

new_entries:
  - id: CD-2025-003
    decision: "Accepted – exit plan required by Q2 2025"

paydown_updates:
  - id: CD-2025-001
    progress: "SLO review planned for April. No paydown in Q1."
  - id: CD-2025-002
    progress: "On track. PR planned for Sprint 2025-03-1; expected saving: ~100 EUR/month."

strategic_decisions:
  - "No new Kinesis deployment without AB approval (lock-in score 4)"
  - "All new ADRs with infrastructure impact: cost impact assessment mandatory effective immediately"

next_review: "2025-06-01"
sign_off: "Architecture Board"

Common Anti-Patterns

  • Register as PDF: version control and ADR linking impossible

  • No owner assignment: entries without owners are not paid down

  • Acceptance without rationale: "Accepted" without documented reason is wasted documentation effort

  • Review as status report only: review must produce decisions, not just report status

Metrics

  • Register completeness: share of known cost debts recorded in the register (target: 100%)

  • Paydown rate: share of entries with an active paydown plan (target: >= 50%)

  • Acceptance ratio: share of deliberately accepted debts (benchmark: < 30%)

  • Average impact per entry: total impact ÷ number of entries (KPI for prioritization)
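
The last three metrics can be computed from the register itself. The sketch below is one possible reading, with two assumptions that are not defined by WAF-COST-100: status "paydown" marks an active paydown plan, and an entry counts as deliberately accepted when it carries an acceptance_rationale. The function name is hypothetical.

```python
def register_metrics(entries: list[dict]) -> dict:
    """Compute paydown rate, acceptance ratio and average impact per entry."""
    total = len(entries)
    if total == 0:
        return {"paydown_rate": 0.0, "acceptance_ratio": 0.0, "average_impact_eur": 0.0}
    # Assumption: status "paydown" means an active paydown plan exists.
    paydown = sum(1 for e in entries if e.get("status") == "paydown")
    # Assumption: a non-empty acceptance_rationale marks deliberate acceptance.
    accepted = sum(1 for e in entries if e.get("acceptance_rationale"))
    impact = sum(e.get("estimated_annual_impact_eur", 0) for e in entries)
    return {
        "paydown_rate": paydown / total,
        "acceptance_ratio": accepted / total,
        "average_impact_eur": impact / total,
    }


# The three entries from the register example, reduced to the relevant fields.
entries = [
    {"id": "CD-2025-001", "status": "paydown", "estimated_annual_impact_eur": 22000},
    {"id": "CD-2025-002", "status": "paydown", "estimated_annual_impact_eur": 1200},
    {"id": "CD-2025-003", "status": "monitoring", "estimated_annual_impact_eur": 48000,
     "acceptance_rationale": "Deliberately accepted: Kinesis is optimal for AWS-native stack."},
]
metrics = register_metrics(entries)
print(f"Paydown rate:     {metrics['paydown_rate']:.0%}")
print(f"Acceptance ratio: {metrics['acceptance_ratio']:.0%}")
print(f"Average impact:   {metrics['average_impact_eur']:,.0f} EUR")
```

For the three example entries this gives a paydown rate of about 67% (above the 50% target), an acceptance ratio of about 33% (slightly above the 30% benchmark), and an average impact of roughly 23,733 EUR per entry. Register completeness cannot be derived this way, because unknown debts are by definition not in the file.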