WAF++

Best Practice: Alerting on Symptoms, Not Causes

Context

The most common cause of on-call burnout is not the work itself – it is the wrong alerts. A team that receives 50 alerts per night, of which 45 are not actionable, will soon ignore all alerts.

Symptom-based alerting is the answer: only alert when users are affected.

Target State

In a mature alerting system:

  • Every paging alert is symptom-based (error rate, latency, availability)

  • Every alert has a runbook URL

  • On-call engineers receive < 5 pages per shift (with 0 false positives)

  • SLOs are defined and burn-rate-based alerts are configured

  • Alert noise metric is tracked weekly

Technical Implementation

Step 1: Define SLOs

# slo-definitions.yaml (store in version control)
services:
  payment-service:
    slos:
      - name: availability
        description: "Availability of the Payment Service"
        target: 99.9  # 99.9% = ~8.76 hours of downtime/year allowed
        window: 30d
        metric:
          good_events: "http_requests_total{service='payment',code!~'5..'}"
          total_events: "http_requests_total{service='payment'}"

      - name: latency-p99
        description: "p99 latency for payment requests"
        target: 99.0  # 99% of requests under 500ms
        window: 30d
        metric:
          good_events: "http_request_duration_seconds_bucket{service='payment',le='0.5'}"
          total_events: "http_request_duration_seconds_count{service='payment'}"

      - name: error-rate
        description: "Error rate of all requests"
        target: 99.9  # < 0.1% error rate
        window: 30d
        metric:
          good_events: "http_requests_total{service='payment',code!~'5..'}"
          total_events: "http_requests_total{service='payment'}"
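
The error-budget arithmetic behind these targets is worth making explicit. A minimal sketch (the helper function is hypothetical, not part of any SLO tooling):

```python
def error_budget_minutes(target_pct, window_days):
    """Allowed 'bad' minutes within the window for a given SLO target."""
    budget_fraction = 1 - target_pct / 100  # e.g. 99.9% -> 0.001
    return window_days * 24 * 60 * budget_fraction

# 99.9% over a 30-day window -> ~43.2 minutes of error budget
print(error_budget_minutes(99.9, 30))

# Over a full year that is ~8.76 hours of allowed downtime
print(error_budget_minutes(99.9, 365) / 60)
```

Keeping this calculation next to the SLO definitions makes threshold discussions concrete: tightening the target from 99.9% to 99.99% shrinks the monthly budget tenfold.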

Step 2: Configure Burn-Rate Alerts (Prometheus/AlertManager)

# prometheus-alerts.yaml
groups:
  - name: payment-service-slo
    rules:
      # Fast-Burn Alert: Page immediately when error budget burns 5x faster
      - alert: PaymentServiceHighErrorBudgetBurn
        expr: |
          (
            rate(http_requests_total{service="payment",code=~"5.."}[1h])
            /
            rate(http_requests_total{service="payment"}[1h])
          ) > (5 * 0.001)  # 5x burn rate, SLO target = 99.9% -> 0.1% error budget
        for: 2m
        labels:
          severity: critical
          team: payment
        annotations:
          summary: "Payment Service high error budget burn rate"
          description: "Error budget burning 5x faster than allowed; current error rate is {{ $value | humanizePercentage }}. At this rate, the monthly budget is exhausted in ~6 days."
          runbook_url: "https://wiki.company.com/runbooks/payment-service/error-budget-burn"

      # Slow-Burn Alert: Create ticket (no page) when budget burns slowly
      - alert: PaymentServiceSlowErrorBudgetBurn
        expr: |
          (
            rate(http_requests_total{service="payment",code=~"5.."}[6h])
            /
            rate(http_requests_total{service="payment"}[6h])
          ) > (2 * 0.001)  # 2x burn rate over 6 hours
        for: 60m
        labels:
          severity: warning
          team: payment
        annotations:
          summary: "Payment Service slow error budget burn"
          description: "Error budget burning 2x expected rate. Review and address within 24 hours."
          runbook_url: "https://wiki.company.com/runbooks/payment-service/slow-error-burn"

      # Latency SLO Alert
      - alert: PaymentServiceHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket{service="payment"}[5m])
          ) > 0.5  # p99 > 500ms
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Payment Service p99 latency above SLO"
          description: "p99 latency is {{ $value | humanizeDuration }}, exceeding 500ms SLO threshold."
          runbook_url: "https://wiki.company.com/runbooks/payment-service/high-latency"
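
The 5x and 2x multipliers above come from the relationship between burn rate and time to budget exhaustion: burning the budget at N times the sustainable rate exhausts a 30-day budget in 30/N days. A sketch of that arithmetic (hypothetical helper, not part of Prometheus):

```python
def hours_to_exhaustion(burn_rate, window_days=30):
    """At a constant burn rate, hours until the window's error budget is gone."""
    return window_days * 24 / burn_rate

# Fast-burn page: 5x burn consumes the 30-day budget in 144 hours (6 days)
print(hours_to_exhaustion(5))

# Slow-burn ticket: 2x burn leaves 360 hours (~15 days) to react
print(hours_to_exhaustion(2))
```

This is why the fast-burn condition pages immediately while the slow-burn condition only creates a ticket: the former threatens the budget within days, the latter within weeks.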

Step 3: Configure CloudWatch Symptom Alarm with Terraform

# Symptom-based alert: HTTP 5xx Error Rate
resource "aws_cloudwatch_metric_alarm" "payment_error_rate" {
  alarm_name          = "payment-service-5xx-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  datapoints_to_alarm = 2
  threshold           = 10
  treat_missing_data  = "notBreaching"

  metric_query {
    id          = "error_rate"
    expression  = "errors / total * 100"
    label       = "Error Rate %"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
        TargetGroup  = aws_lb_target_group.app.arn_suffix
      }
    }
  }

  metric_query {
    id = "total"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
        TargetGroup  = aws_lb_target_group.app.arn_suffix
      }
    }
  }

  alarm_description = <<-EOT
    Payment Service 5xx Error Rate > 10%

    Symptom: Users are receiving HTTP 5xx errors from the Payment Service.
    Runbook: https://wiki.company.com/runbooks/payment-service/5xx-errors
    Dashboard: https://monitoring.company.com/d/payment-service
    Escalation: payment-oncall@company.com → platform-team@company.com
  EOT

  alarm_actions = [aws_sns_topic.payment_oncall.arn]
  ok_actions    = [aws_sns_topic.payment_oncall.arn]
}

# Latency Alert (Symptom: Slow response times for users)
resource "aws_cloudwatch_metric_alarm" "payment_latency_p99" {
  alarm_name          = "payment-service-p99-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  datapoints_to_alarm = 2
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  extended_statistic  = "p99"
  threshold           = 0.5  # 500ms

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
    TargetGroup  = aws_lb_target_group.app.arn_suffix
  }

  alarm_description = <<-EOT
    Payment Service p99 Latency > 500ms

    Symptom: More than 1% of requests take longer than 500ms (SLO violation).
    Runbook: https://wiki.company.com/runbooks/payment-service/high-latency
  EOT

  alarm_actions = [aws_sns_topic.payment_oncall.arn]
}
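
The `evaluation_periods` / `datapoints_to_alarm` pair implements CloudWatch's "M out of N" alarm semantics, which suppresses one-off spikes. A sketch of that logic (hypothetical, not the actual CloudWatch engine):

```python
def in_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """True if at least M of the last N datapoints breach the threshold,
    mirroring CloudWatch's 'M out of N' alarm evaluation."""
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for v in recent if v > threshold)
    return breaching >= datapoints_to_alarm

# p99 latency samples in seconds; config above: 2 of 3 periods above 0.5s
samples = [0.31, 0.62, 0.58]
print(in_alarm(samples, 0.5, evaluation_periods=3, datapoints_to_alarm=2))  # True
```

Requiring 2 of 3 breaching datapoints means a single noisy minute does not page anyone, while a sustained latency regression still alerts within three minutes.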

Step 4: Perform an Alert Audit

#!/bin/bash
# alert-audit.sh – Check existing alerts for quality

# List all CloudWatch Alarms
aws cloudwatch describe-alarms \
  --query 'MetricAlarms[].{
    Name: AlarmName,
    Metric: MetricName,
    Description: AlarmDescription,
    Actions: AlarmActions
  }' \
  --output table

# For each alarm, check:
# 1. Is the metric symptom-based? (not CPU, Memory)
# 2. Does the description contain a runbook URL?
# 3. Are there AlarmActions (SNS Topic)?
# 4. Was the alarm triggered in the last 90 days? Was it actionable?
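
The manual checks above can be scripted against the `describe-alarms` output. A minimal sketch (the dict keys follow the CloudWatch response shape; the CAUSE_METRICS list and the "runbook" keyword check are assumptions to adapt to your conventions):

```python
# Cause-based metrics that should not page on their own (assumed list)
CAUSE_METRICS = {"CPUUtilization", "MemoryUtilization", "DiskReadOps"}

def audit_alarm(alarm):
    """Return audit findings for one alarm dict (describe-alarms shape)."""
    findings = []
    if alarm.get("MetricName") in CAUSE_METRICS:
        findings.append("cause-based metric, not a user-facing symptom")
    if "runbook" not in (alarm.get("AlarmDescription") or "").lower():
        findings.append("no runbook URL in description")
    if not alarm.get("AlarmActions"):
        findings.append("silent alarm: no AlarmActions configured")
    return findings

alarm = {"AlarmName": "cpu-high", "MetricName": "CPUUtilization",
         "AlarmDescription": "CPU over 80%", "AlarmActions": []}
for finding in audit_alarm(alarm):
    print(f"{alarm['AlarmName']}: {finding}")
```

Running such a check in CI keeps alert quality from regressing as new alarms are added.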

Common Anti-Patterns

  • CPU > 80% alert: Not symptom-based; high CPU does not necessarily mean user impact

  • Alert without runbook URL: The on-call engineer wakes up at 3am and does not know what to do

  • Alerts without alarm actions (silent alarms): Alarms fire but nobody is notified

  • Thresholds set too low: Constant false positives train engineers to ignore alerts

  • No OK action: Engineers do not know when the problem is resolved; manual dashboard monitoring is required

  • All alerts with the same severity: No distinction between "service down" and "anomaly detected"

Metrics

  • Pages per shift: Target: < 5 pages per on-call shift (8h)

  • False positive rate: % of pages without an actionable response (target: < 10%)

  • Alert actionability rate: % of pages that led to an action (target: > 90%)

  • MTTR with vs. without runbook: Comparison of incident duration (from postmortems)
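
The false positive and actionability rates can be derived from a simple page log. A sketch with an assumed record shape of (alarm name, was the page actionable):

```python
# Assumed page-log records for one on-call shift
pages = [
    ("payment-5xx", True), ("payment-5xx", True),
    ("cpu-high", False), ("payment-latency", True),
]

actionable = sum(1 for _, acted in pages if acted)
false_positive_rate = 100 * (len(pages) - actionable) / len(pages)
actionability_rate = 100 * actionable / len(pages)

print(f"pages this shift: {len(pages)}")
print(f"false positive rate: {false_positive_rate:.0f}%")   # 25%
print(f"actionability rate: {actionability_rate:.0f}%")     # 75%
```

Tracking these numbers per shift, rather than per month, surfaces noisy alarms before they train the team to ignore pages.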

Maturity Levels

  • Level 1: No alerts, or exclusively infrastructure metrics (CPU, memory). High alert noise.

  • Level 2: HTTP 5xx and service availability alerts. No runbooks linked.

  • Level 3: All alerts symptom-based with runbook URLs. SLOs defined. Alert noise < 10 pages/shift.

  • Level 4: Burn-rate alerts for all SLOs. Alert noise metric tracked and reported.

  • Level 5: Alert-as-code. Automatic anomaly detection. Alert coverage report: all services covered.