# Best Practice: Alerting on Symptoms, Not Causes

## Context

The most common cause of on-call burnout is not the work itself – it is the wrong alerts. A team that receives 50 alerts per night, of which 45 are not actionable, will soon ignore all alerts.

Symptom-based alerting is the answer: only alert when users are affected.

## Target State

A mature alerting system:
- Every paging alert is symptom-based (error rate, latency, availability)
- Every alert has a runbook URL
- On-call engineers receive < 5 pages per shift (with 0 false positives)
- SLOs are defined and burn-rate-based alerts are configured
- The alert noise metric is tracked weekly
## Technical Implementation

### Step 1: Define SLOs

```yaml
# slo-definitions.yaml (store in version control)
services:
  payment-service:
    slos:
      - name: availability
        description: "Availability of the Payment Service"
        target: 99.9  # 99.9% ≈ 8.76 hours of downtime/year allowed
        window: 30d
        metric:
          good_events: "http_requests_total{service='payment',code!~'5..'}"
          total_events: "http_requests_total{service='payment'}"
      - name: latency-p99
        description: "p99 latency for payment requests"
        target: 99.0  # 99% of requests under 500ms
        window: 30d
        metric:
          good_events: "http_request_duration_seconds_bucket{service='payment',le='0.5'}"
          total_events: "http_requests_total{service='payment'}"
      - name: error-rate
        description: "Error rate of all requests"
        target: 99.9  # < 0.1% error rate
        window: 30d
        metric:
          good_events: "http_requests_total{service='payment',code!~'5..'}"
          total_events: "http_requests_total{service='payment'}"
```
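The downtime figures in the comments follow directly from the SLO target: the error budget is simply `(1 - target)` of the window. A minimal sketch (plain Python, function name is illustrative):

```python
# Error-budget arithmetic for an availability SLO.
# The budget is the fraction of the window in which the SLO may be missed.

def error_budget(target_pct: float, window_hours: float) -> float:
    """Hours of allowed 'badness' in the window for a given SLO target."""
    return (1 - target_pct / 100.0) * window_hours

# 99.9% over the 30-day SLO window -> ~43 minutes of downtime allowed
monthly = error_budget(99.9, 30 * 24)   # 0.72 hours ≈ 43.2 minutes
# The same target over a full year -> 8.76 hours
yearly = error_budget(99.9, 365 * 24)

print(f"30d budget: {monthly * 60:.1f} min, yearly: {yearly:.2f} h")
```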
### Step 2: Configure Burn-Rate Alerts (Prometheus/Alertmanager)

```yaml
# prometheus-alerts.yaml
groups:
  - name: payment-service-slo
    rules:
      # Fast-burn alert: page immediately when the error budget burns 5x faster
      - alert: PaymentServiceHighErrorBudgetBurn
        expr: |
          (
            rate(http_requests_total{service="payment",code=~"5.."}[1h])
            /
            rate(http_requests_total{service="payment"}[1h])
          ) > (5 * 0.001)  # 5x burn rate, SLO target = 99.9% -> 0.1% error budget
        for: 2m
        labels:
          severity: critical
          team: payment
        annotations:
          summary: "Payment Service high error budget burn rate"
          description: "Error budget burning 5x faster than allowed. Current error rate: {{ $value | humanizePercentage }}."
          runbook_url: "https://wiki.company.com/runbooks/payment-service/error-budget-burn"

      # Slow-burn alert: create a ticket (no page) when the budget burns slowly
      - alert: PaymentServiceSlowErrorBudgetBurn
        expr: |
          (
            rate(http_requests_total{service="payment",code=~"5.."}[6h])
            /
            rate(http_requests_total{service="payment"}[6h])
          ) > (2 * 0.001)  # 2x burn rate over 6 hours
        for: 60m
        labels:
          severity: warning
          team: payment
        annotations:
          summary: "Payment Service slow error budget burn"
          description: "Error budget burning 2x the expected rate. Review and address within 24 hours."
          runbook_url: "https://wiki.company.com/runbooks/payment-service/slow-error-burn"

      # Latency SLO alert
      - alert: PaymentServiceHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket{service="payment"}[5m])
          ) > 0.5  # p99 > 500ms
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Payment Service p99 latency above SLO"
          description: "p99 latency is {{ $value | humanizeDuration }}, exceeding the 500ms SLO threshold."
          runbook_url: "https://wiki.company.com/runbooks/payment-service/high-latency"
```
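The 5x and 2x multipliers trade detection speed against how much budget is spent before anyone is notified; a small sketch of that arithmetic (plain Python, numbers match the 99.9%/30d SLO above):

```python
# Burn-rate arithmetic: how long until the 30-day error budget is gone,
# and how much of it a burn-rate alert consumes before firing.

WINDOW_HOURS = 30 * 24   # SLO window from slo-definitions.yaml

def hours_to_exhaustion(burn_rate: float) -> float:
    """At burn_rate x the sustainable error rate, the budget lasts
    window / burn_rate hours."""
    return WINDOW_HOURS / burn_rate

def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the 30d budget spent while the condition persists
    for the alert's lookback window before it fires."""
    return burn_rate * alert_window_hours / WINDOW_HOURS

# Fast burn (5x over 1h): whole budget gone in 6 days,
# ~0.7% of it spent before the page fires
print(hours_to_exhaustion(5) / 24, budget_consumed(5, 1))

# Slow burn (2x over 6h): budget gone in 15 days,
# ~1.7% spent before the ticket is created
print(hours_to_exhaustion(2) / 24, budget_consumed(2, 6))
```

This is why the fast-burn condition pages and the slow-burn condition only tickets: the fast burn leaves days, not weeks, of budget.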
### Step 3: Configure CloudWatch Symptom Alarms with Terraform

```hcl
# Symptom-based alert: HTTP 5xx error rate
resource "aws_cloudwatch_metric_alarm" "payment_error_rate" {
  alarm_name          = "payment-service-5xx-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  datapoints_to_alarm = 2
  threshold           = 10
  treat_missing_data  = "notBreaching"

  metric_query {
    id          = "error_rate"
    expression  = "errors / total * 100"
    label       = "Error Rate %"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
        TargetGroup  = aws_lb_target_group.app.arn_suffix
      }
    }
  }

  metric_query {
    id = "total"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
        TargetGroup  = aws_lb_target_group.app.arn_suffix
      }
    }
  }

  alarm_description = <<-EOT
    Payment Service 5xx Error Rate > 10%
    Symptom: Users are receiving HTTP 5xx errors from the Payment Service.
    Runbook: https://wiki.company.com/runbooks/payment-service/5xx-errors
    Dashboard: https://monitoring.company.com/d/payment-service
    Escalation: payment-oncall@company.com → platform-team@company.com
  EOT

  alarm_actions = [aws_sns_topic.payment_oncall.arn]
  ok_actions    = [aws_sns_topic.payment_oncall.arn]
}

# Latency alert (symptom: slow response times for users)
resource "aws_cloudwatch_metric_alarm" "payment_latency_p99" {
  alarm_name          = "payment-service-p99-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  datapoints_to_alarm = 2
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  extended_statistic  = "p99"
  threshold           = 0.5 # 500ms

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
    TargetGroup  = aws_lb_target_group.app.arn_suffix
  }

  alarm_description = <<-EOT
    Payment Service p99 Latency > 500ms
    Symptom: 1% of users experience response times > 500ms (SLO violation).
    Runbook: https://wiki.company.com/runbooks/payment-service/high-latency
  EOT

  alarm_actions = [aws_sns_topic.payment_oncall.arn]
}
```
### Step 4: Perform an Alert Audit

```bash
#!/bin/bash
# alert-audit.sh – check existing alerts for quality

# List all CloudWatch alarms. Note: --state-value accepts only a single
# value, so omit it to cover OK, ALARM, and INSUFFICIENT_DATA at once.
aws cloudwatch describe-alarms \
  --query 'MetricAlarms[].{
    Name: AlarmName,
    Metric: MetricName,
    Description: AlarmDescription,
    Actions: AlarmActions
  }' \
  --output table

# For each alarm, check:
# 1. Is the metric symptom-based? (not CPU, memory)
# 2. Does the description contain a runbook URL?
# 3. Are there AlarmActions (SNS topic)?
# 4. Was the alarm triggered in the last 90 days? Was it actionable?
```
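Checks 2 and 3 are mechanical and can be automated. A minimal sketch (plain Python over the `describe-alarms` response shape; the sample data is illustrative, field names follow the CloudWatch API):

```python
# Flag CloudWatch alarms that violate basic quality rules:
# missing runbook URL in the description, or no alarm actions at all.

def audit_alarms(alarms: list[dict]) -> list[tuple[str, str]]:
    """Return (alarm_name, problem) pairs for alarms failing the checks."""
    problems = []
    for a in alarms:
        desc = a.get("AlarmDescription") or ""
        if "runbook" not in desc.lower():
            problems.append((a["AlarmName"], "no runbook URL in description"))
        if not a.get("AlarmActions"):
            problems.append((a["AlarmName"], "no alarm actions (silent alarm)"))
    return problems

# With boto3 this would be fed from the real API, e.g.:
#   alarms = boto3.client("cloudwatch").describe_alarms()["MetricAlarms"]
sample = [
    {"AlarmName": "good-alarm",
     "AlarmDescription": "Runbook: https://wiki.company.com/runbooks/...",
     "AlarmActions": ["arn:aws:sns:eu-central-1:123456789012:oncall"]},
    {"AlarmName": "silent-alarm", "AlarmDescription": "", "AlarmActions": []},
]
for name, problem in audit_alarms(sample):
    print(f"{name}: {problem}")
```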
## Common Anti-Patterns

| Anti-Pattern | Problem |
|---|---|
| CPU > 80% alert | Not symptom-based; high CPU does not necessarily mean user impact |
| Alert without runbook URL | On-call engineer wakes up at 3am and does not know what to do |
| Alerts without alarm actions (silent alarms) | Alarms fire but nobody is notified |
| Thresholds set too low | Constant false positives train engineers to ignore alerts |
| No OK action | Engineer does not know when the problem is resolved; manual dashboard monitoring required |
| All alerts with the same severity | No distinction between "service down" and "anomaly detected" |
## Metrics

- Pages per shift: target < 5 pages per on-call shift (8h)
- False positive rate: % of pages without an actionable response (target: < 10%)
- Alert actionability rate: % of pages that led to an action (target: > 90%)
- MTTR with vs. without runbook: comparison of incident duration (from postmortems)
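The first three metrics fall out of a simple paging log; a minimal sketch (plain Python, the record format is an assumption) that computes them:

```python
# Compute alert-noise metrics from a list of page records.
# Each record marks whether the page led to an actionable response.

def alert_metrics(pages: list[dict], shifts: int) -> dict:
    """Pages per shift, false positive rate, and actionability rate."""
    actionable = sum(1 for p in pages if p["actionable"])
    total = len(pages)
    return {
        "pages_per_shift": total / shifts,
        "false_positive_rate": (total - actionable) / total,
        "actionability_rate": actionable / total,
    }

# One week of on-call (21 shifts of 8h), 12 pages, 11 of them actionable:
# ~0.57 pages/shift, ~8.3% false positives, ~91.7% actionability
pages = [{"actionable": True}] * 11 + [{"actionable": False}]
print(alert_metrics(pages, shifts=21))
```

Tracked weekly, this is the "alert noise metric" from the target state above.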
## Maturity Levels

| Level | Characteristics |
|---|---|
| Level 1 | No alerts or exclusively infrastructure metrics (CPU, memory). High alert noise. |
| Level 2 | HTTP 5xx and service availability alerts. No runbooks linked. |
| Level 3 | All alerts symptom-based with runbook URLs. SLOs defined. Alert noise < 10/shift. |
| Level 4 | Burn-rate alerts for all SLOs. Alert noise metric tracked and reported. |
| Level 5 | Alert-as-Code. Automatic anomaly detection. Alert coverage report: all services covered. |