# Best Practice: Performance Observability & SLOs
## Context

Without defined SLOs there is no objective criterion for "our performance is good enough"; teams argue subjectively about whether latency is "acceptable". Without error budgets there is no objective basis for deciding between "we must prioritize reliability work" and "we can keep shipping features".
Typical problems without SLO-based performance monitoring:

- Averages mask P99 spikes: avg = 50 ms, P99 = 5000 ms; users suffer while the dashboard looks fine.
- Performance degradation gets accepted as the "baseline" because no historical comparison exists.
- The on-call team is alerted on wrong thresholds (too sensitive or too tolerant).
- Deployments happen without knowing whether performance got worse.
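The first failure mode above can be made concrete in a few lines of Python (illustrative numbers, not from a real service): a handful of very slow requests barely moves the average but completely dominates the tail.

```python
# Illustrative only: 98 fast requests and 2 very slow ones.
latencies_ms = [50] * 98 + [5000] * 2

avg = sum(latencies_ms) / len(latencies_ms)

# p99 via the nearest-rank method on the sorted sample
ranked = sorted(latencies_ms)
p99 = ranked[int(0.99 * (len(ranked) - 1))]

print(f"avg = {avg:.0f} ms, p99 = {p99} ms")  # avg = 149 ms, p99 = 5000 ms
```

An average-based dashboard shows 149 ms and looks healthy, while 1 in 50 users waits five seconds.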
## Related Controls

- WAF-PERF-050 – Performance Monitoring & SLO Definition
- WAF-PERF-100 – Performance Debt Register & Quarterly Review
## Target State

SLO-based performance management:

- Defined: SLOs for all production services covering P95/P99 latency, error rate, and availability.
- Instrumented: SLIs are measured continuously (not just by sampling).
- Alerting: burn-rate alerts instead of static thresholds.
- Error budget: deployment decisions are based on budget status.
## Technical Implementation

### Step 1: Create the SLO document
```yaml
# docs/slos/payment-api.yml
version: "1.0"
service: "payment-api"
owner: "payments-team"
last_reviewed: "2026-03-18"

slos:
  - name: availability
    description: "Percentage of requests that succeed (status 2xx or 4xx)"
    sli:
      numerator: "count of HTTP requests with status not in [5xx]"
      denominator: "count of all HTTP requests"
    target: 99.9   # 99.9% = 8.77h error budget per year
    window: "30d"

  - name: latency_p95
    description: "95th percentile request latency at load balancer"
    sli: "p95 of request duration measured at ALB"
    target: 200    # ms
    window: "30d"

  - name: latency_p99
    description: "99th percentile request latency"
    sli: "p99 of request duration measured at ALB"
    target: 500    # ms
    window: "30d"

error_budgets:
  availability_30d:
    slo: 99.9
    window: "30d"
    total_minutes: 43200
    allowed_downtime_minutes: 43.2   # 0.1% of 30 days

deployment_policy:
  - condition: "error_budget_remaining > 50%"
    action: "deploy freely"
  - condition: "error_budget_remaining 10-50%"
    action: "deploy with extra caution; require load test"
  - condition: "error_budget_remaining < 10%"
    action: "freeze new features; focus on reliability"

review_cadence: "quarterly"
```
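The budget numbers in the document above follow from simple arithmetic; a minimal sketch, assuming availability SLOs are expressed as percentages:

```python
def error_budget_minutes(slo_target_pct: float, window_days: float) -> float:
    """Allowed 'bad' minutes in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target_pct / 100)

print(round(error_budget_minutes(99.9, 30), 1))           # 43.2 minutes per 30 days
print(round(error_budget_minutes(99.9, 365.25) / 60, 2))  # 8.77 hours per year (365.25-day year)
```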
### Step 2: CloudWatch SLO dashboard (AWS)
```hcl
resource "aws_cloudwatch_metric_alarm" "p99_slo_burn" {
  alarm_name          = "payment-api-p99-slo-burn-rate"
  alarm_description   = "P99 latency SLO burn rate high – investigate immediately. SLO: < 500ms P99. Runbook: https://wiki/slo-runbook"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 500

  metric_query {
    id          = "p99_latency"
    return_data = true

    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "TargetResponseTime"
      period      = 60
      stat        = "p99"

      dimensions = {
        LoadBalancer = aws_lb.api.arn_suffix
        TargetGroup  = aws_lb_target_group.api.arn_suffix
      }
    }
  }

  alarm_actions             = [aws_sns_topic.slo_alerts.arn]
  ok_actions                = [aws_sns_topic.slo_alerts.arn]
  insufficient_data_actions = [aws_sns_topic.slo_alerts.arn]
  treat_missing_data        = "breaching" # missing data counts as an SLO violation (fail safe)
}

# Composite alarm: either condition marks the SLO as breached
resource "aws_cloudwatch_composite_alarm" "slo_breach" {
  alarm_name        = "payment-api-slo-breach"
  alarm_description = "Payment API SLO breached – P99 > 500ms OR error rate > 0.1%"
  alarm_rule        = "ALARM(${aws_cloudwatch_metric_alarm.p99_slo_burn.alarm_name}) OR ALARM(${aws_cloudwatch_metric_alarm.error_rate.alarm_name})"
  alarm_actions     = [aws_sns_topic.critical_alerts.arn]
}
```
### Step 3: Multi-window burn-rate alerts (Google SRE method)
```python
# scripts/slo-burn-rate-check.py
# Implements the Google SRE multi-window burn-rate alert pattern.

SLO_TARGET = 0.999                   # 99.9% availability
ERROR_BUDGET_RATIO = 1 - SLO_TARGET  # 0.001 = 0.1%

# Burn-rate windows (Google SRE recommendation)
ALERT_WINDOWS = [
    # (short_window_hours, long_window_hours, burn_rate_threshold, severity)
    (1, 6, 14.4, "page"),     # 1h window, 6h window, 14.4x burn rate
    (6, 24, 6.0, "page"),     # 6h window, 24h window, 6x burn rate
    (24, 72, 3.0, "ticket"),  # 24h window, 72h window, 3x burn rate
]


def calculate_burn_rate(error_rate: float) -> float:
    """Burn rate = observed error rate / error rate that exactly consumes the budget."""
    return error_rate / ERROR_BUDGET_RATIO


def check_burn_rates(short_error_rate: float, long_error_rate: float,
                     threshold: float, severity: str) -> bool:
    """Alert only when BOTH windows burn above the threshold (reduces flapping)."""
    short_burn = calculate_burn_rate(short_error_rate)
    long_burn = calculate_burn_rate(long_error_rate)
    if short_burn > threshold and long_burn > threshold:
        print(f"🔴 ALERT [{severity.upper()}]: Burn rate {short_burn:.1f}x "
              f"(threshold: {threshold}x) – SLO exhaustion predicted")
        return True
    return False
```
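The thresholds map directly to how fast the budget disappears: at a constant burn rate of B, a 30-day error budget is fully consumed after 30 / B days, which is why 14.4x pages immediately while 3x only opens a ticket. A quick check:

```python
# Days until a 30-day error budget is fully consumed at a constant burn rate.
for burn_rate, severity in [(14.4, "page"), (6.0, "page"), (3.0, "ticket")]:
    days = 30 / burn_rate
    print(f"{burn_rate:>4}x ({severity}): budget exhausted in {days:.2f} days")
```

At 14.4x the budget is gone in about two days, so a human has to look now; at 3x there are ten days left, which a ticket can absorb.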
### Step 4: Error budget dashboard
```json
{
  "slo_dashboard": {
    "service": "payment-api",
    "period": "last_30_days",
    "widgets": [
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "value_expression": "100 * (1 - error_rate / 0.001)",
        "thresholds": {
          "red": 10,
          "yellow": 25,
          "green": 100
        }
      },
      {
        "title": "P99 Latency vs SLO",
        "type": "timeseries",
        "metrics": ["p99_latency"],
        "threshold_line": 500
      },
      {
        "title": "SLO Burn Rate (1h window)",
        "type": "timeseries",
        "reference_lines": [
          {"value": 14.4, "label": "Page threshold"},
          {"value": 6.0, "label": "Page threshold (6h window)"}
        ]
      }
    ]
  }
}
```
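The gauge's `value_expression` can be read as a function; a sketch, assuming a 99.9% SLO (error-budget ratio 0.001) and `error_rate` as the observed 30-day failure ratio:

```python
def error_budget_remaining_pct(error_rate: float, budget_ratio: float = 0.001) -> float:
    """Mirrors the dashboard expression 100 * (1 - error_rate / 0.001)."""
    return 100 * (1 - error_rate / budget_ratio)

print(error_budget_remaining_pct(0.0))     # 100.0 – no errors, full budget
print(error_budget_remaining_pct(0.0005))  # 50.0  – half the budget consumed
print(error_budget_remaining_pct(0.001))   # 0.0   – budget fully consumed
```

Note the result goes negative once the error rate exceeds the budget ratio, which is exactly the regime where the deployment policy mandates a feature freeze.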
## Performance Debt Register Template

```markdown
## Performance Debt Register – Payment Service

| ID | Description | Impact | Affected Services | Owner | Priority | Target Date |
|----|-------------|--------|-------------------|-------|----------|-------------|
| PERF-001 | CDN not configured – static assets served directly from origin | +50 ms for static assets | payment-frontend | @platform-team | Medium | Q2 2026 |
| PERF-002 | gp2 EBS volume on db-payment-staging | Burst-depletion risk | payment-db | @dba-team | High | 2026-04-01 |
| PERF-003 | No connection pooling in front of RDS | Pool exhaustion at > 200 concurrent users | payment-api | @payments-team | High | 2026-03-31 |

### Review History

- 2026-03-18: Quarterly review – PERF-003 added, PERF-001 reprioritized
- 2025-12-15: PERF-002 identified after storage I/O incident
```
## Common Anti-Patterns

- Alerting on averages: average-latency alerts miss P99 spikes; always alert on percentiles.
- Too many alerts: alert fatigue leads to ignored alerts; prioritize by severity and burn rate.
- SLOs without review: SLOs that are never adjusted become irrelevant; review them quarterly.
- Error budget observed but not used: a budget without a policy ("at < 10%: feature freeze") has no value.
## Metrics

- SLO compliance rate (share of days in the last 30 days with the SLO met)
- Error-budget burn rate (current; alarm at > 6x)
- Alert precision (share of alerts that are actually actionable; target: > 80%)
- Time to detect performance degradation (from onset to alert; target: < 5 minutes)
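Alert precision from the list above is a plain ratio; a sketch with hypothetical counts from one on-call month:

```python
actionable, noise = 42, 8  # hypothetical alert outcomes for one month
precision = actionable / (actionable + noise)
print(f"alert precision: {precision:.0%}")  # 84% – above the 80% target
```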
## Maturity Levels

- Level 1 – No SLOs; performance degradation discovered by users
- Level 2 – Informal targets; alerting on averages
- Level 3 – Formal SLOs; P99 alerting; instrumented SLIs
- Level 4 – Error-budget management; deployment gates
- Level 5 – Predictive burn-rate alerts; automatic capacity adjustment