WAF-OPS-030 – Observability Stack Configured

Pillar: Operational Excellence | Severity: High | Category: Observability | Automatable: High

Description

Every production workload MUST emit structured logs, expose metrics, and support distributed tracing. A centralized observability stack MUST be configured. Logs MUST be structured as JSON, include trace IDs, and have a defined retention policy.

Rationale

Systems that cannot be observed cannot be operated reliably. During an incident, engineers must be able to determine within minutes what happened, why, and when. Structured logging enables automated parsing. Metrics enable trend analysis. Distributed tracing enables root-cause isolation in microservices. Without all three pillars, MTTR stays high and post-incident learning is limited.

Threat Context

Risk	Description
Blind incidents	Without structured logs, fault diagnosis takes hours instead of minutes.
Silent failures	Without metrics, failures are discovered through user reports rather than through alerting.
Root cause not isolable	Without distributed tracing, the cause of a failure in a microservice architecture cannot be isolated.
Security blind spots	Without central log aggregation, security incidents cannot be detected.

Requirements

All services MUST emit structured JSON logs containing trace_id, request_id, and service
Distributed tracing MUST be configured and instrumented
RED metrics (Rate, Errors, Duration) MUST be exported for every service
Log retention MUST be at least 30 days (recommended: 90 days for application logs, 365 days for audit logs)
Sensitive data (PII, credentials) MUST NOT appear in logs
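
The retention requirement can be expressed directly in Terraform. The following is a minimal sketch, not a prescribed module: the variable name `log_retention_days` and the audit log group name are hypothetical, and the 90-day and 365-day values are simply the recommended figures from the list above. The validation block rejects any value below the 30-day minimum at plan time.

```hcl
# Hypothetical variable: maps log group names to a retention period in days.
variable "log_retention_days" {
  type = map(number)
  default = {
    "/aws/ecs/payment-service"       = 90  # application logs (recommended)
    "/aws/ecs/payment-service/audit" = 365 # audit logs (recommended, name assumed)
  }

  validation {
    # Fail the plan if any log group would be retained for less than 30 days.
    condition     = alltrue([for days in values(var.log_retention_days) : days >= 30])
    error_message = "WAF-OPS-030: every log group must retain logs for at least 30 days."
  }
}

# One log group per map entry, each with its own retention period.
resource "aws_cloudwatch_log_group" "service" {
  for_each = var.log_retention_days

  name              = each.key
  retention_in_days = each.value
}
```

Note that this check would also reject a value of 0, which CloudWatch interprets as "never expire". Whether unlimited retention should count as compliant is a policy decision this sketch does not settle.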

Implementation Guide

Set up structured logging: use a JSON logging framework with trace_id, service, level, and timestamp as mandatory fields
Integrate the OpenTelemetry SDK: auto-instrumentation for HTTP clients and servers; trace ID propagation
Configure log groups with retention: CloudWatch log groups in Terraform with retention_in_days
Export RED metrics: request rate, error rate, and duration (p50/p95/p99) as custom metrics
Create dashboards: a service health dashboard showing all RED metrics
Enable X-Ray or Jaeger: configure the tracing backend; tune the sampling rate (5-10% under load)
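
Parts of the metrics and tracing steps above can be covered in Terraform as well. The sketch below rests on several assumptions: services log JSON with a `level` field whose error value is `ERROR`, a log group named `/aws/ecs/payment-service` already exists, and the metric namespace `PaymentService`, the metric name, and the rule name are hypothetical. The metric filter derives an error count from the logs (the "Errors" part of RED); the sampling rule sets a 5% fixed rate, the lower end of the range mentioned above.

```hcl
# Counts log events whose JSON "level" field equals "ERROR" (assumed field and value).
resource "aws_cloudwatch_log_metric_filter" "errors" {
  name           = "payment-service-errors"
  log_group_name = "/aws/ecs/payment-service" # assumed to exist already
  pattern        = "{ $.level = \"ERROR\" }"

  metric_transformation {
    name          = "ErrorCount"     # hypothetical metric name
    namespace     = "PaymentService" # hypothetical namespace
    value         = "1"              # each matching event adds 1
    default_value = "0"              # emit 0 when nothing matches
  }
}

# Traces up to 1 request per second via the reservoir, then 5% of the remainder.
resource "aws_xray_sampling_rule" "payment_service" {
  rule_name      = "payment-service-5pct"
  priority       = 1000
  version        = 1
  reservoir_size = 1
  fixed_rate     = 0.05
  host           = "*"
  http_method    = "*"
  url_path       = "*"
  service_name   = "payment-service"
  service_type   = "*"
  resource_arn   = "*"
}
```

Rate and duration are usually better emitted by the application or an OpenTelemetry collector than derived from logs, so this filter is only one possible source for the error signal.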

Maturity Levels

Level	Name	Criteria
1	Unstructured logging	Text logs to stdout; no central aggregation; no tracing; no monitoring.
2	Centralized logs	Logs aggregated centrally; basic dashboards; no distributed tracing.
3	All three pillars configured	Structured logs with trace ID; distributed tracing; RED metrics; log retention policy.
4	SLO-based alerting	Burn-rate alerts; tuned sampling; automatic anomaly detection.
5	OpenTelemetry platform	Vendor-agnostic; full correlation of logs, traces, and metrics; observability as a product.

Terraform Checks

waf-ops-030.tf.aws.cloudwatch-log-group-retention

Checks: CloudWatch log groups have a retention policy (at least 30 days).

# Compliant
resource "aws_cloudwatch_log_group" "app" {
  name              = "/aws/ecs/payment-service"
  retention_in_days = 90
  tags = {
    environment = "production"
  }
}

# Non-compliant
resource "aws_cloudwatch_log_group" "app" {
  name = "/aws/ecs/payment-service"
  # No retention_in_days: logs are kept indefinitely
  # WAF-OPS-030 violation
}

waf-ops-030.tf.aws.xray-tracing-enabled

Checks: Lambda functions have X-Ray tracing set to "Active".

# Compliant
resource "aws_lambda_function" "processor" {
  function_name = "payment-processor"
  # Other required arguments (role, runtime, handler, ...) omitted for brevity
  tracing_config {
    mode = "Active" # not PassThrough
  }
}

Remediation: Set retention_in_days >= 30 on all aws_cloudwatch_log_group resources. Add tracing_config { mode = "Active" } to Lambda functions.
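
Active tracing only produces traces if the function's execution role is allowed to write to X-Ray. As a hedged sketch, the snippet below attaches the AWS managed policy `AWSXRayDaemonWriteAccess` to an existing role; the variable `lambda_role_name` is a hypothetical placeholder for that role's name and is not part of the original check.

```hcl
# Hypothetical input: name of the existing Lambda execution role.
variable "lambda_role_name" {
  type = string
}

# Grants the role permission to send trace segments and telemetry to X-Ray.
resource "aws_iam_role_policy_attachment" "xray_write" {
  role       = var.lambda_role_name
  policy_arn = "arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess"
}
```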

Evidence

Type	Required	Description
Config	✅ Required	Log group configuration with defined retention policies in Terraform.
IaC	✅ Required	Tracing configuration (X-Ray, OTel Collector) as an IaC resource.
Process	Optional	Dashboard screenshot showing RED metrics for critical services.
Config	Optional	OpenTelemetry SDK configuration in application code (sampling configuration).

Regulatory Mapping

Framework	Controls
ISO/IEC 20000-1:2018	8.2.3 – Change management; 8.3.4 – Release management; 10.2.2 – Financial management
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain
AWS Well-Architected Framework	Operational Excellence Pillar – Prepare; Operational Excellence Pillar – Deploy
DORA	DORA 2024 – Technical practices; DORA 2024 – Organizational culture
SOC 2 Type II	CC4.1 – Monitoring activities; CC7.1 – Infrastructure and software monitoring
Google SRE Book	Chapter 2 – SRE: The role of an SRE; Chapter 3 – Service Level Objectives
PCI DSS v4.0	Req 6.4 – Secure development lifecycle; Req 6.5 – Secure coding practices
FinOps Foundation	Core Module – Financial accountability; Management Layer – Cost governance
BSI C5:2020	OPS-01 – Operational monitoring; OPS-02 – Operational control; OPS-03 – Operational capacity
NIST SP 800-53	CM-1 – Configuration management policy; CM-2 – Configuration management
NIST CSF 2.0	GV.PO – Policy; RC.RP – Recovery planning; DE.CM – Continuous monitoring
TISAX	Information security – Change management
ANSSI SecNumCloud	Domain – Change management
BIO	BIO – Veranderingenbeheer
ENS High	op.exp.6 – Gestión de cambios
UK NCSC CAF	A4 – Policy and assurance; A5 – Continual improvement
CMMC 2.0	CM.L2-3.4.1 – Establish baseline configurations
IRAP	ISM – Change management
CCCS PBMM	CM-2 – Baseline configuration; CA-7 – Continuous monitoring
MAS TRM	Ch.3 – Technology risk governance; Ch.9 – Change management
ISMAP	Operational excellence and continuous improvement
FISC	Operational measures – Change management
