WAF-OPS-040 – Alerting on Symptoms, Not Causes

Pillar: Operational Excellence | Severity: High | Category: Alerting | Automatable: High

Description

All production alerts MUST be based on user-visible symptoms (error rate, latency, availability) rather than internal causes (CPU, memory). Every alert that pages on-call engineers MUST be actionable and MUST include a runbook URL. Alert fatigue MUST be actively measured and reduced.

Rationale

Cause-based alerting generates noise: a CPU alert above 80% does not necessarily mean that users are affected. Symptom-based alerting ensures that engineers are paged only when users are actually affected. Alert fatigue is a leading cause of on-call burnout and of critical alerts being overlooked.

Threat Context

Risk	Description
Alert fatigue	Too many non-actionable alerts train engineers to ignore or snooze alerts.
Missed user impact	Only infrastructure metrics are monitored: the service is completely down, but no alert fires.
On-call burnout	Regular 3 a.m. pages for non-actionable CPU alerts wear down the on-call rotation.
Missing runbooks	An engineer is woken at 3 a.m. without a runbook and has to diagnose by intuition.

Requirement

All critical alerts MUST be symptom-based (error rate, latency, availability)
Every paging alert MUST include a runbook URL in its description/annotation
SLOs MUST be defined for all critical services
An alert-noise metric MUST be tracked (target: < 10 pages/week/engineer)
Quarterly alert audit: non-actionable alerts are removed or muted
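
The alert-noise metric from the list above can be derived from a page log. The following is a minimal sketch, assuming a hypothetical export of (ISO date, engineer) pairs from the paging tool; the function names and the log format are illustrative and not part of this control.

```python
from collections import Counter
from datetime import date


def pages_per_week_per_engineer(pages):
    """Count pages per (ISO year, ISO week, engineer).

    `pages` is an iterable of (iso_date_string, engineer) tuples,
    a hypothetical export format assumed for this sketch.
    """
    counts = Counter()
    for day, engineer in pages:
        iso = date.fromisoformat(day).isocalendar()
        counts[(iso[0], iso[1], engineer)] += 1
    return counts


def over_target(counts, target=10):
    """Return the (year, week, engineer) keys that miss the '< target' goal."""
    return sorted(key for key, n in counts.items() if n >= target)


if __name__ == "__main__":
    # Illustrative data only: ten pages for one engineer, three for another.
    log = [("2024-03-04", "alice")] * 10 + [("2024-03-05", "bob")] * 3
    for key in over_target(pages_per_week_per_engineer(log)):
        print("over target:", key)
```

Counting per ISO week keeps the metric aligned with the weekly target; a rotation with a different shift length would need a different grouping key.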

Implementation Guide

Define SLOs: availability (e.g. 99.9%), p99 latency (e.g. < 500 ms), error rate (e.g. < 0.1%)
Configure burn-rate alerts: page immediately on fast burn; open a ticket on slow burn
Run an alert audit: for every existing alert, ask: symptom-based? actionable? runbook URL?
Runbook URLs in alerts: alarm_description (CloudWatch), runbook_url annotation (Prometheus), description (Azure Monitor)
Track alert noise: PagerDuty/OpsGenie analytics; track pages per shift
Quarterly review: disable or delete alerts that led to no action in the last 90 days
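
The fast-burn/slow-burn split in the steps above rests on simple arithmetic: the burn rate is the observed error ratio divided by the error budget (1 minus the SLO). The sketch below assumes a single evaluation window and uses 14.4 and 1.0 as thresholds; these values are an assumption borrowed from commonly cited SRE guidance, not something this control prescribes, and production setups usually combine a short and a long window.

```python
def burn_rate(error_ratio, slo):
    """Return how many times faster than sustainable the error budget is spent.

    error_ratio: failed requests / total requests in the window (0..1)
    slo:         availability target as a fraction, e.g. 0.999
    """
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 1.0 to leave an error budget")
    return error_ratio / budget


def classify(error_ratio, slo, fast_threshold=14.4, slow_threshold=1.0):
    """Map a burn rate to an action: 'page', 'ticket', or 'ok'.

    The thresholds are assumptions for this sketch; tune them per service.
    """
    rate = burn_rate(error_ratio, slo)
    if rate >= fast_threshold:
        return "page"
    if rate >= slow_threshold:
        return "ticket"
    return "ok"


if __name__ == "__main__":
    # 2% errors against a 99.9% SLO burns the budget roughly 20x too fast.
    print(classify(0.02, 0.999))
```

With a 99.9% SLO the budget is 0.1%, so an error ratio of 0.2% burns the budget about twice as fast as sustainable and would become a ticket rather than a page under these thresholds.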

Maturity Levels

Level	Name	Criteria
1	No alerting or noisy alerting	No alerts, or alerts exclusively on infrastructure metrics (CPU, memory). High alert noise.
2	Basic service alerts	HTTP 5xx and service-availability alerts configured. No runbooks linked.
3	Symptom-based with runbooks	All critical alerts are symptom-based. Runbook URLs in all alerts. SLOs defined.
4	Burn-rate alerting	Burn-rate alerts for all SLOs. Alert-noise metric < 10 per shift. Quarterly review.
5	Automated alert optimization	Alerts as code. ML-based anomaly detection. 100% alert coverage for critical services.

Terraform Checks

waf-ops-040.tf.aws.cloudwatch-alarm-symptom-based

Checks: the CloudWatch alarm has a non-empty alarm_description containing a runbook URL, and alarm_actions is set.

# Compliant
resource "aws_cloudwatch_metric_alarm" "errors" {
  alarm_name        = "payment-5xx-rate"
  # ALB target 5xx count; "5XXError" is not a metric in AWS/ApplicationELB
  metric_name       = "HTTPCode_Target_5XX_Count"
  namespace         = "AWS/ApplicationELB"
  alarm_description = "Payment 5xx > 10/min. Runbook: https://wiki/runbooks/5xx"
  alarm_actions     = [aws_sns_topic.oncall.arn]
  /* ... */
}

# Non-compliant
resource "aws_cloudwatch_metric_alarm" "cpu" {
  alarm_name  = "high-cpu"
  metric_name = "CPUUtilization"
  namespace   = "AWS/EC2"
  threshold   = 80
  # No alarm_description with a runbook URL
  # No alarm_actions
  # WAF-OPS-040 violation
}
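
The logic of this check can be sketched in a few lines. This is an illustration of the rule as described, not the actual checker: it assumes the alarm's attributes are already available as a plain dict (for example, extracted from a Terraform plan), and the function name and the URL pattern are hypothetical.

```python
import re

# Assumption: any http(s) URL in the description counts as a runbook link.
RUNBOOK_URL = re.compile(r"https?://\S+")


def check_cloudwatch_alarm(alarm):
    """Return a list of findings; an empty list means the alarm passes."""
    findings = []
    description = (alarm.get("alarm_description") or "").strip()
    if not description:
        findings.append("alarm_description is empty")
    elif not RUNBOOK_URL.search(description):
        findings.append("alarm_description has no runbook URL")
    if not alarm.get("alarm_actions"):
        findings.append("alarm_actions is empty")
    return findings


if __name__ == "__main__":
    cpu_alarm = {"alarm_name": "high-cpu", "threshold": 80}
    for finding in check_cloudwatch_alarm(cpu_alarm):
        print(cpu_alarm["alarm_name"], "->", finding)
```

Note that such a check only covers the description and the actions; whether the metric itself is symptom-based still needs a review or a separate allow-list of metrics.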

waf-ops-040.tf.azurerm.monitor-alert-symptom

Checks: the Azure Monitor alert has a description with a runbook reference and an action block referencing an Action Group.

# Compliant
resource "azurerm_monitor_metric_alert" "error_rate" {
  name                = "payment-error-rate-high"
  resource_group_name = azurerm_resource_group.main.name
  scopes              = [azurerm_application_insights.main.id]
  description         = "Payment error rate > 1%. Runbook: https://wiki/runbooks/payment-errors"
  action {
    action_group_id = azurerm_monitor_action_group.oncall.id
  }
  /* criteria block ... */
}

Remediation: add a runbook URL to alarm_description and point alarm_actions at an SNS topic (for Azure Monitor: description and an action block with an Action Group). Use symptom-based metrics (error rate, latency) instead of CPU/memory.

Evidence

Type	Required	Description
Config	✅ Required	Alert definitions with symptom-based metrics and runbook URLs.
Governance	✅ Required	SLO definitions for all critical services (availability, latency, and error-rate targets).
Process	Optional	Monthly alert-noise report (pages per week, actionability rate).
Config	Optional	Runbook index with proof of coverage for all paging alerts.

Regulatory Mapping

Framework	Controls
ISO/IEC 20000-1:2018	8.2.3 – Change management; 8.3.4 – Release management; 8.4.1 – Service delivery; 8.4.2 – Service reporting
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Operational Excellence Pillar – Prepare; Operational Excellence Pillar – Deploy; Operational Excellence Pillar – Monitor; Operational Excellence Pillar – Improve
DORA	DORA 2024 – Technical practices; DORA 2024 – Organizational culture
SOC 2 Type II	CC4.1 – Monitoring activities; CC7.1 – Infrastructure and software monitoring; CC7.2 – Evaluation of system changes
Google SRE Book	Chapter 2 – SRE: The role of an SRE; Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating Toil
PCI DSS v4.0	Req 6.4 – Secure development lifecycle; Req 6.5 – Secure coding practices; Req 6.6 – Application security
FinOps Foundation	Core Module – Financial accountability; Management Layer – Cost governance
BSI C5:2020	OPS-01 – Operational monitoring; OPS-02 – Operational control; OPS-03 – Operational capacity; OPS-04 – Change management
NIST SP 800-53	CM-1 – Configuration management policy; CM-2 – Configuration management; CM-6 – Configuration settings; CM-8 – Information system integration; CA-7 – Continuous monitoring
NIST CSF 2.0	GV.PO – Policy; RC.RP – Recovery planning; DE.CM – Continuous monitoring
TISAX	Information security – Change management
ANSSI SecNumCloud	Domain – Change management
BIO	BIO – Veranderingenbeheer (change management)
ENS High	op.exp.6 – Gestión de cambios (change management)
UK NCSC CAF	A4 – Policy and assurance; A5 – Continual improvement
CMMC 2.0	CM.L2-3.4.1 – Establish baseline configurations; CA.L2-3.12.1 – Periodically assess security controls
IRAP	ISM – Change management
CCCS PBMM	CM-2 – Baseline configuration; CA-7 – Continuous monitoring
MAS TRM	Ch.3 – Technology risk governance; Ch.9 – Change management
ISMAP	Operational excellence and continuous improvement
FISC	Operational measures – Change management

Best Practice

Alerting on symptoms instead of causes
