WAF-PERF-050 – Performance Monitoring & SLO Definition

Pillar: Performance Efficiency | Severity: High | Category: Observability | Automatable: Medium–High

Description

All production services MUST have Service Level Objectives (SLOs) defined for latency (P95, P99), error rate, and availability. Service Level Indicators (SLIs) MUST be instrumented and measured continuously. Alerting MUST be based on the SLO burn rate, not on absolute averages. SLOs MUST be reviewed quarterly.

Averages lie. P99 shows what real users experience in the tail.
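
To make the point about averages concrete, here is a minimal sketch that compares the mean with P50 and P99 on a synthetic sample. The latency values and the nearest-rank `percentile` helper are illustrative assumptions, not measurements from any real service.

```python
import math
from statistics import mean


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 980 fast requests and 20 very slow ones.
latencies_ms = [50] * 980 + [5000] * 20

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean={avg:.0f} ms  p50={p50} ms  p99={p99} ms")
# prints: mean=149 ms  p50=50 ms  p99=5000 ms
```

An alert on the mean with a 500 ms threshold would stay silent here, although 2% of requests take five seconds.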

Rationale

Without SLOs there is no objective criterion for "good performance", and teams argue subjectively. Error budgets give that discussion a quantitative basis: as long as budget remains, the team can deploy new features; once the budget is exhausted, stabilization work takes priority. Alerting on averages masks tail-latency problems that severely affect 1% of users.
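
The error-budget reasoning can be sketched in a few lines. The example below assumes a request-based SLO; the names `BudgetStatus` and `error_budget_status` and the request counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float    # failures the SLO tolerates in the period
    consumed_fraction: float   # 1.0 means the budget is fully spent
    features_allowed: bool     # True while budget remains


def error_budget_status(slo_target, total_requests, failed_requests):
    """Compare observed failures with the number of failures the SLO tolerates."""
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return BudgetStatus(allowed, consumed, consumed < 1.0)


# Hypothetical month: 2,000,000 requests against a 99.9% SLO.
healthy = error_budget_status(0.999, 2_000_000, 1_500)
exhausted = error_budget_status(0.999, 2_000_000, 2_500)

print(f"healthy:   {healthy.consumed_fraction:.0%} consumed, deploy features: {healthy.features_allowed}")
print(f"exhausted: {exhausted.consumed_fraction:.0%} consumed, deploy features: {exhausted.features_allowed}")
```

The boolean is deliberately crude; a real error-budget policy would likely also describe who decides and which exceptions apply.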

Threat Context

Risk	Description
Gradual degradation goes unnoticed	Without P99 monitoring, services can get slower over weeks without anyone noticing.
SLA violations	Without internal SLOs, external SLAs cannot be monitored or their fulfilment demonstrated.
Tail latency masked	Average = 50 ms, P99 = 5000 ms: the average looks fine, but 1% of users suffer.
Missing basis for deployment decisions	Without an error budget there is no objective criterion for "when do we pause feature work".

Requirements

SLOs MUST be defined for all production services (P95 and P99 latency, error rate, availability)
SLIs MUST be instrumented and measured continuously (not just sampled)
SLO burn-rate alerting MUST be configured (not just static thresholds)
SLOs MUST be reviewed quarterly and adjusted where necessary
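
Why burn rate rather than a static threshold: the burn rate normalizes the observed error rate by the error budget, so the same error rate means different things for different SLOs. The sketch below uses the common definition (observed error rate divided by the budget `1 - SLO`); the function name and the sample numbers are assumptions for illustration.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being consumed."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


# The same observed error rate of 0.5% measured against two different SLOs.
strict = burn_rate(error_rate=0.005, slo_target=0.999)   # budget is 0.1%
relaxed = burn_rate(error_rate=0.005, slo_target=0.99)   # budget is 1%

print(f"99.9% SLO: burn rate {strict:.1f}x")
print(f"99%   SLO: burn rate {relaxed:.1f}x")
```

A burn rate above 1 means the budget runs out before the SLO period ends. A static "error rate > 1%" threshold would not fire for either service, although the stricter one is consuming its budget about five times too fast.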

Implementation Guidance

Create the SLO document: docs/slos/<service>.yml with SLI definition, SLO target, and error-budget policy
Instrument SLIs: set up an APM tool (X-Ray, Application Insights, Cloud Trace, Prometheus)
Configure percentile alerts: CloudWatch p99, Application Insights percentile queries
Calculate the error budget: a 99.9% SLO leaves a 0.1% error budget, i.e. 43.2 min of downtime per 30-day month
Set up multi-window burn-rate alerts: 1h/6h/24h windows, following the Google SRE methodology
Build a dashboard: SLO compliance, error-budget status, burn-rate trend
Schedule a quarterly review: adjust SLOs when targets are no longer representative
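
The error-budget and burn-rate steps above can be sketched as follows. The 30-day period matches the 43.2 min figure; the budget fractions per window (2%, 5%, 10%) are an assumed policy, not values taken from this control, and all function names are hypothetical.

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period, consistent with 43.2 min at 99.9%


def downtime_budget_minutes(slo_target, period_hours=SLO_PERIOD_HOURS):
    """Minutes of full outage the SLO tolerates within one period."""
    return (1.0 - slo_target) * period_hours * 60


def burn_rate_threshold(budget_fraction, window_hours, period_hours=SLO_PERIOD_HOURS):
    """Burn rate that consumes budget_fraction of the whole budget within window_hours."""
    return budget_fraction * period_hours / window_hours


# Assumed policy: alert when 2% / 5% / 10% of the budget burns within 1h / 6h / 24h.
WINDOWS = {1: 0.02, 6: 0.05, 24: 0.10}


def firing_windows(observed_burn_rates):
    """Return the window lengths (hours) whose observed burn rate reaches its threshold."""
    return [
        hours
        for hours, fraction in WINDOWS.items()
        if observed_burn_rates[hours] >= burn_rate_threshold(fraction, hours)
    ]


print(round(downtime_budget_minutes(0.999), 1))  # 43.2
print(firing_windows({1: 20.0, 6: 2.0, 24: 1.0}))  # [1]
```

This is a simplification: multi-window schemes usually also pair each long window with a shorter one, so that an alert resets quickly once the problem is fixed.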

Maturity Levels

Level	Name	Criteria
1	No SLO	Availability monitoring only (up/down); no latency baselines; incidents discovered by users.
2	Informal targets	Latency is collected but no formal SLOs exist; alerting on averages; no error budgets.
3	Formal SLOs	SLOs documented; P99 alerting; SLIs instrumented; error budgets calculated.
4	Error budget management	Deployment gates on budget exhaustion; SLOs covered in quarterly engineering reviews.
5	Predictive SLO management	Burn-rate prediction; automatic capacity adjustment when a budget breach is imminent.
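
As an illustration of what burn-rate prediction at level 5 could start from, the sketch below linearly projects when the remaining budget runs out at the current burn rate. It assumes a constant burn rate and a 30-day SLO period; the function name is hypothetical, and a production forecast would likely use a trend model rather than a straight line.

```python
def hours_until_exhaustion(budget_remaining_fraction, current_burn_rate, period_hours=30 * 24):
    """Linear projection: hours until the remaining error budget is used up."""
    if current_burn_rate <= 0:
        return float("inf")  # nothing is being consumed
    return budget_remaining_fraction * period_hours / current_burn_rate


# Hypothetical situation: 40% of the budget is left and the service burns at 6x.
remaining_hours = hours_until_exhaustion(0.4, 6.0)
print(f"budget exhausted in about {remaining_hours:.0f} hours")
```

At a burn rate of exactly 1 and a full budget, the projection equals the SLO period itself, which is a quick sanity check for the formula.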

Terraform Checks

waf-perf-050.tf.aws.cloudwatch-latency-alarm

Checks: CloudWatch alarms must have alarm actions, multiple evaluation periods, and a description.

Compliant:

resource "aws_cloudwatch_metric_alarm" "p99" {
  alarm_name          = "payment-api-p99"
  alarm_description   = "P99 > 500ms. Runbook: https://wiki/perf"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"   # reported in seconds
  extended_statistic  = "p99"                  # percentiles need extended_statistic, not statistic
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0.5                    # 0.5 s = 500 ms
  period              = 60
  evaluation_periods  = 3
  alarm_actions       = [aws_sns_topic.alerts.arn]
  # dimensions (e.g. LoadBalancer) omitted for brevity
}

Non-Compliant:

resource "aws_cloudwatch_metric_alarm" "latency" {
  alarm_name          = "latency"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  statistic           = "Average"   # average instead of p99
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1.0
  period              = 60
  evaluation_periods  = 1           # a single period reacts to one-off spikes
  # No alarm_actions
  # No alarm_description
  # WAF-PERF-050 violation
}

Remediation: Switch to the p99 statistic (extended_statistic = "p99"); set alarm_actions to an SNS topic; use evaluation_periods >= 2; add an alarm_description with a runbook link.
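
The rule and its remediation can be expressed as a small validation function. This is a sketch of the logic only, not the actual implementation of the check: it assumes the alarm's attributes are already available as a plain dictionary (how they are extracted from a Terraform plan is out of scope), and `check_latency_alarm` is a hypothetical name.

```python
import re

# Matches percentile statistics such as "p95", "p99" or "p99.9".
PERCENTILE = re.compile(r"^p\d{1,2}(\.\d+)?$")


def check_latency_alarm(alarm):
    """Return a list of WAF-PERF-050 findings for one alarm; an empty list means compliant."""
    findings = []
    if not alarm.get("alarm_actions"):
        findings.append("alarm_actions is missing or empty")
    if int(alarm.get("evaluation_periods") or 0) < 2:
        findings.append("evaluation_periods must be >= 2")
    if not alarm.get("alarm_description"):
        findings.append("alarm_description is missing")
    stat = alarm.get("extended_statistic") or alarm.get("statistic") or ""
    if not PERCENTILE.match(stat):
        findings.append(f"statistic '{stat}' is not a percentile")
    return findings


compliant = {
    "alarm_name": "payment-api-p99",
    "alarm_description": "P99 > 500ms. Runbook: https://wiki/perf",
    "extended_statistic": "p99",
    "evaluation_periods": 3,
    "alarm_actions": ["arn:aws:sns:eu-central-1:123456789012:alerts"],  # placeholder ARN
}
non_compliant = {"alarm_name": "latency", "statistic": "Average", "evaluation_periods": 1}

print(check_latency_alarm(compliant))
for finding in check_latency_alarm(non_compliant):
    print("-", finding)
```

Returning a list of findings instead of a boolean keeps the output usable as remediation hints in a CI report.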

Evidence

Type	Required	Description
Governance	✅ Required	SLO document for all production services (SLI, SLO target, error-budget policy).
Config	✅ Required	Monitoring/APM configuration with SLI instrumentation and SLO alerting.
Config	Optional	SLO compliance dashboard with historical trends.
Process	Optional	Minutes of the quarterly SLO review meeting.

Regulatory Mapping

Framework	Controls
ISO/IEC 25010:2011	8.3.2 – Performance efficiency; 8.3.2.1 – Time behaviour; 8.3.2.2 – Resource utilisation; 8.3.2.3 – Capacity
AWS Well-Architected Framework	Performance Efficiency Pillar – Select the right resource types and sizes
Azure Well-Architected Framework	Performance Efficiency – Choose the right resources
Google Cloud Architecture Framework	Performance optimization – Right-size your instances
TOGAF 10	ADM Phase B – Business architecture; ADM Phase C – Application architecture
DORA	DORA 2024 – Technical practices; DORA 2024 – Performance monitoring
ISO/IEC 29119	4.4.3 – Test design techniques; 4.5.3 – Test execution
ISO/IEC 12207	8.2.2.3 – Design and development of software
ITIL 4	SVS – Service value system; DP – Design principle
BSI C5:2020	OPS-01 – Operational monitoring; OPS-02 – Operational control
CIS Controls v8	CIS 8 – Continuous Vulnerability Management
NIST SP 800-53	RA-1 – Security assessment policy; RA-2 – Security assessment controls
NIST CSF 2.0	DE.CM – Continuous monitoring; DE.AE – Anomaly detection
FedRAMP	RA-2, RA-5 (Moderate/High baseline)
SOC 2 Type II	CC6.1 – Logical access security software; CC7.1 – Infrastructure and software monitoring
TISAX	Information security – Performance monitoring
ANSSI SecNumCloud	Domain – Performance monitoring
BIO	BIO – Prestatiedoelstellingen
ENS High	op.exp.2 – Configuración de seguridad
UK NCSC CAF	B4 – System security; B5 – System performance
CMMC 2.0	RA.L2-3.8.1 – Automated monitoring
IRAP	ISM – Performance monitoring
CCCS PBMM	RA-2 – Security assessment controls; RA-5 – Security assessments
MAS TRM	Ch.5 – Technology risk governance
ISMAP	Performance monitoring and validation
FISC	Technical measures – Performance monitoring
