
Best Practice: Performance Observability & SLOs

Context

Without defined SLOs there is no objective criterion for "our performance is good enough" – teams end up debating subjectively whether latency is "acceptable". Without error budgets there is no objective basis for deciding between "we need to prioritize reliability work" and "we can keep building features".

Typical problems without SLO-based performance monitoring:

  • Averages mask P99 spikes: Avg = 50ms, P99 = 5000ms – users suffer, dashboard looks fine

  • Performance degradation accepted as "baseline" because no historical comparison exists

  • On-call team is alerted with wrong thresholds (too sensitive or too tolerant)

  • Deployments happen without knowing whether performance got worse

Related requirements:

  • WAF-PERF-050 – Performance Monitoring & SLO Definition

  • WAF-PERF-100 – Performance Debt Register & Quarterly Review

Target State

SLO-based performance management:

  • Defined: SLOs for all production services with P95/P99, error rate, availability

  • Instrumented: SLIs are continuously measured (not just sampled)

  • Alerting: Burn-rate alerts instead of static thresholds

  • Error Budget: Deployment decisions are based on budget status

Technical Implementation

Step 1: Create SLO Document

# docs/slos/payment-api.yml
version: "1.0"
service: "payment-api"
owner: "payments-team"
last_reviewed: "2026-03-18"

slos:
  - name: availability
    description: "Percentage of requests that do not fail with a server error (any non-5xx status counts as success)"
    sli:
      numerator: "count of HTTP requests with status not in [5xx]"
      denominator: "count of all HTTP requests"
    target: 99.9    # 99.9% = 8.77h error budget per year
    window: "30d"

  - name: latency_p95
    description: "95th percentile request latency at load balancer"
    sli: "p95 of request duration measured at ALB"
    target: 200   # ms
    window: "30d"

  - name: latency_p99
    description: "99th percentile request latency"
    sli: "p99 of request duration measured at ALB"
    target: 500   # ms
    window: "30d"

error_budgets:
  availability_30d:
    slo: 99.9
    window: "30d"
    total_minutes: 43200
    allowed_downtime_minutes: 43.2  # 0.1% of 30 days

deployment_policy:
  - condition: "error_budget_remaining > 50%"
    action: "deploy freely"
  - condition: "error_budget_remaining 10-50%"
    action: "deploy with extra caution; require load test"
  - condition: "error_budget_remaining < 10%"
    action: "freeze new features; focus on reliability"

review_cadence: "quarterly"
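The budget arithmetic and the deployment tiers in the document above can be sketched in a few lines of Python. The function names are illustrative, not from any library:

```python
# Minimal sketch of the error-budget math and deployment_policy tiers above.

def error_budget_minutes(slo_target_pct: float, window_days: int) -> float:
    """Allowed downtime in minutes for a given SLO target and window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target_pct / 100)

def deployment_action(budget_remaining_pct: float) -> str:
    """Map remaining error budget to the deployment_policy tiers."""
    if budget_remaining_pct > 50:
        return "deploy freely"
    if budget_remaining_pct >= 10:
        return "deploy with extra caution; require load test"
    return "freeze new features; focus on reliability"

print(round(error_budget_minutes(99.9, 30), 1))  # 43.2 – matches allowed_downtime_minutes
print(deployment_action(35))                     # deploy with extra caution; require load test
```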

Step 2: CloudWatch SLO Alarms (AWS)

resource "aws_cloudwatch_metric_alarm" "p99_slo_burn" {
  alarm_name          = "payment-api-p99-slo-threshold"
  alarm_description   = "P99 latency above SLO target (< 500ms P99) – investigate immediately. Runbook: https://wiki/slo-runbook"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 500

  metric_query {
    id          = "p99_latency"
    return_data = true
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "TargetResponseTime"
      period      = 60
      stat        = "p99"
      dimensions = {
        LoadBalancer = aws_lb.api.arn_suffix
        TargetGroup  = aws_lb_target_group.api.arn_suffix
      }
    }
  }

  alarm_actions             = [aws_sns_topic.slo_alerts.arn]
  ok_actions                = [aws_sns_topic.slo_alerts.arn]
  insufficient_data_actions = [aws_sns_topic.slo_alerts.arn]

  treat_missing_data = "breaching"  # Missing data = SLO violated (fail safe)
}

# Composite alarm: either condition counts as "SLO breached"
resource "aws_cloudwatch_composite_alarm" "slo_breach" {
  alarm_name        = "payment-api-slo-breach"
  alarm_description = "Payment API SLO breached – P99 > 500ms OR error rate > 0.1%"
  alarm_rule        = "ALARM(${aws_cloudwatch_metric_alarm.p99_slo_burn.alarm_name}) OR ALARM(${aws_cloudwatch_metric_alarm.error_rate.alarm_name})"
  alarm_actions     = [aws_sns_topic.critical_alerts.arn]
}
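The composite alarm references an `error_rate` alarm that is not shown above. A sketch of what it might look like, assuming the standard ALB metrics `HTTPCode_Target_5XX_Count` and `RequestCount` from the `AWS/ApplicationELB` namespace; the resource layout mirrors the latency alarm:

```hcl
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "payment-api-error-rate"
  alarm_description   = "5xx error rate above the 0.1% budget implied by the 99.9% availability SLO"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 0.1  # percent

  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / requests"
    label       = "5xx error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "HTTPCode_Target_5XX_Count"
      period      = 60
      stat        = "Sum"
      dimensions  = { LoadBalancer = aws_lb.api.arn_suffix }
    }
  }

  metric_query {
    id = "requests"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "RequestCount"
      period      = 60
      stat        = "Sum"
      dimensions  = { LoadBalancer = aws_lb.api.arn_suffix }
    }
  }

  alarm_actions      = [aws_sns_topic.slo_alerts.arn]
  treat_missing_data = "notBreaching"  # no traffic is not an error-rate breach
}
```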

Step 3: Multi-Window Burn Rate Alerts (Google SRE Method)

# scripts/slo-burn-rate-check.py
# Implements the Google SRE multi-window burn rate alert pattern

SLO_TARGET = 0.999  # 99.9% availability
ERROR_BUDGET_RATIO = 1 - SLO_TARGET  # 0.001 = 0.1%

# Burn Rate Windows (Google SRE recommendation)
ALERT_WINDOWS = [
    # (short_window_hours, long_window_hours, burn_rate_threshold, severity)
    (1, 6, 14.4, "page"),    # 1h window, 6h window, 14.4x burn rate
    (6, 24, 6.0, "page"),    # 6h window, 24h window, 6x burn rate
    (24, 72, 3.0, "ticket"), # 24h window, 72h window, 3x burn rate
]

def calculate_burn_rate(error_rate: float) -> float:
    """Burn rate = current error rate / error rate that exactly exhausts the budget."""
    return error_rate / ERROR_BUDGET_RATIO

def check_burn_rates(short_error_rate: float, long_error_rate: float,
                     threshold: float, severity: str) -> bool:
    """Alert only if BOTH the short and the long window exceed the threshold
    (fast detection without flapping on brief spikes)."""
    short_burn = calculate_burn_rate(short_error_rate)
    long_burn = calculate_burn_rate(long_error_rate)

    if short_burn > threshold and long_burn > threshold:
        print(f"🔴 ALERT [{severity.upper()}]: Burn rate {short_burn:.1f}x "
              f"(threshold: {threshold}x) – SLO exhaustion predicted")
        return True
    return False
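The 14.4x / 6x / 3x thresholds above are not arbitrary: each corresponds to a fixed fraction of the 30-day budget being consumed within the alert window. A sketch of that arithmetic, assuming the conventional budget fractions (2%, 5%, 10%):

```python
# burn_rate_threshold = budget_fraction_consumed * slo_window / alert_window

SLO_WINDOW_HOURS = 30 * 24  # 720h for a 30-day SLO window

def burn_rate_threshold(budget_fraction: float, alert_window_hours: float) -> float:
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours

print(round(burn_rate_threshold(0.02, 1), 1))   # 14.4 -> page: 2% of budget gone in 1h
print(round(burn_rate_threshold(0.05, 6), 1))   # 6.0  -> page: 5% of budget gone in 6h
print(round(burn_rate_threshold(0.10, 24), 1))  # 3.0  -> ticket: 10% of budget gone in 24h
```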

Step 4: Error Budget Dashboard

{
  "slo_dashboard": {
    "service": "payment-api",
    "period": "last_30_days",
    "widgets": [
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "value_expression": "100 * (1 - error_rate / 0.001)",
        "thresholds": {
          "red": 10,
          "yellow": 25,
          "green": 100
        }
      },
      {
        "title": "P99 Latency vs SLO",
        "type": "timeseries",
        "metrics": ["p99_latency"],
        "threshold_line": 500
      },
      {
        "title": "SLO Burn Rate",
        "type": "timeseries",
        "reference_lines": [
          {"value": 14.4, "label": "Page threshold (1h window)"},
          {"value": 6.0, "label": "Page threshold (6h window)"},
          {"value": 3.0, "label": "Ticket threshold (24h window)"}
        ]
      }
    ]
  }
}
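How the gauge's `value_expression` and its color thresholds interact can be sketched as follows (0.001 is the 0.1% error budget from the availability SLO; function names are illustrative):

```python
# "Error Budget Remaining" gauge: value_expression and color thresholds.

def budget_remaining_pct(error_rate: float, budget_ratio: float = 0.001) -> float:
    """100 * (1 - error_rate / 0.001), as in the widget's value_expression."""
    return 100 * (1 - error_rate / budget_ratio)

def gauge_color(remaining_pct: float) -> str:
    """Mirror the red/yellow/green thresholds of the gauge widget."""
    if remaining_pct < 10:
        return "red"
    if remaining_pct < 25:
        return "yellow"
    return "green"

print(round(budget_remaining_pct(0.0004), 1))      # 60.0 -> green, deploy freely
print(gauge_color(budget_remaining_pct(0.00095)))  # ~5% budget left -> red
```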

Performance Debt Register Template

## Performance Debt Register – Payment Service

| ID | Description | Impact | Affected Services | Owner | Priority | Target Date |
|----|-------------|--------|-------------------|-------|----------|-------------|
| PERF-001 | CDN not configured – static assets served directly from origin | +50ms latency for static assets | payment-frontend | @platform-team | Medium | Q2 2026 |
| PERF-002 | gp2 EBS volume on db-payment-staging | Burst-credit depletion risk | payment-db | @dba-team | High | 2026-04-01 |
| PERF-003 | No connection pooling in front of RDS | Pool exhaustion at > 200 concurrent users | payment-api | @payments-team | High | 2026-03-31 |

### Review History
- 2026-03-18: Quarterly Review – PERF-003 added, PERF-001 reprioritized
- 2025-12-15: PERF-002 identified after storage I/O incident

Common Anti-Patterns

  • Alerting on averages: Average latency alerts miss P99 spikes; always alert on percentiles.

  • Too many alerts: Alert fatigue leads to ignored alerts; prioritize by severity and burn rate.

  • SLOs without review: SLOs that are never adjusted become irrelevant – review quarterly.

  • Error budget observed but not used: A budget without a policy ("at < 10%: feature freeze") has no value.
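The first anti-pattern is easy to demonstrate: a single 5-second outlier among 100 requests barely moves the mean but dominates the P99. A stdlib-only sketch with illustrative numbers:

```python
import statistics

# 99 fast requests plus one 5-second outlier
latencies_ms = [50] * 99 + [5000]

mean = statistics.mean(latencies_ms)
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]  # simple nearest-rank p99

print(mean)  # 99.5 -> an average-based alert at 200ms never fires
print(p99)   # 5000 -> a p99-based alert fires immediately
```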

Metrics

  • SLO compliance rate (proportion of days in the last 30 days with SLO met)

  • Error budget burn rate (current; alert at > 6x)

  • Alert precision (proportion of alerts that are actually actionable; target: > 80%)

  • Time to detect performance degradation (from onset to alert; target: < 5 minutes)
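The SLO compliance rate can be computed directly from daily percentile data; a minimal sketch (input values are illustrative):

```python
# Share of days in the window where the daily p99 stayed under the 500ms target.

def slo_compliance_rate(daily_p99_ms: list[float], target_ms: float = 500) -> float:
    days_met = sum(1 for p in daily_p99_ms if p <= target_ms)
    return 100 * days_met / len(daily_p99_ms)

daily = [420] * 27 + [510, 650, 480]  # 28 of the last 30 days within target
print(round(slo_compliance_rate(daily), 1))  # 93.3
```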

Maturity Level

Level 1 – No SLO; performance degradation discovered by users
Level 2 – Informal targets; average value alerting
Level 3 – Formal SLOs; P99 alerting; instrumented SLIs
Level 4 – Error budget management; deployment gates
Level 5 – Predictive burn-rate alerts; automatic capacity adjustment