WAF-REL-020 – Health Checks & Readiness Probes Configured

Pillar: Reliability | Severity: High | Category: Health Monitoring | Automatable: High

Description

All production services MUST expose health check endpoints and configure readiness and liveness probes. Load balancers MUST use health checks with explicitly configured paths, intervals, and thresholds; relying on cloud provider defaults is not permitted.

No deployment may go out without working health checks. This is a non-negotiable prerequisite for automatic failover and zero-downtime deployments.

Rationale

Without health checks, faulty instances continue to receive traffic. Without a readiness probe, Kubernetes sends traffic to pods that are not yet ready or that have already failed. Load balancers without an explicit health check configuration fall back to defaults that are often too lenient (30s interval, no path validation).

Threat Context

Risk	Description
Traffic to faulty instances	Without a health check, the LB keeps routing requests to instances that return errors.
Undetected deadlock	Without a liveness probe, a deadlocked process keeps running indefinitely and blocks resources.
Premature traffic	Without a readiness probe, an uninitialized pod receives traffic and produces errors.
Defaults too lenient	Cloud provider defaults are often a 30s interval and 3 failures, which is too slow for fast recovery.

Requirement

All services MUST:

Expose a /health/live endpoint (liveness: the process is alive)
Expose a /health/ready endpoint (readiness: able to serve traffic, dependencies OK)
Kubernetes: configure readinessProbe and livenessProbe with a measured initialDelaySeconds
Load balancer health checks: set an explicit path, interval, timeout, and healthy/unhealthy threshold
Use no cloud provider defaults for health check configuration

Implementation Guide

Implement endpoints: /health/live (process only), /health/ready (checks dependencies)
Measure initialDelaySeconds: measure the service's startup time and add a buffer
Configure intervals: interval=15s (periodSeconds in Kubernetes), timeout=5s, failureThreshold=3
ALB health check: path=/health/ready, interval=15, matcher=200
Liveness ONLY for process liveness: do not check external dependencies in the liveness probe
Test: simulate a health check failure in staging and observe the behavior
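
The Kubernetes part of the steps above could be expressed in Terraform roughly as follows. This is a minimal sketch, assuming the `kubernetes_deployment` resource of the HashiCorp Kubernetes provider is used to manage the workload. The deployment name, labels, image, replica count, and the `initial_delay_seconds` values are placeholders rather than values from this control; the delay should be replaced by the measured startup time plus a buffer.

```hcl
# Sketch only: names, image, replicas, and delay values are assumptions.
resource "kubernetes_deployment" "payment_api" {
  metadata {
    name = "payment-api"
  }

  spec {
    replicas = 2

    selector {
      match_labels = {
        app = "payment-api"
      }
    }

    template {
      metadata {
        labels = {
          app = "payment-api"
        }
      }

      spec {
        container {
          name  = "payment-api"
          image = "registry.example.com/payment-api:1.0.0" # hypothetical image

          port {
            container_port = 8080
          }

          # Readiness gates traffic; the endpoint checks dependencies.
          readiness_probe {
            http_get {
              path = "/health/ready"
              port = 8080
            }
            initial_delay_seconds = 20 # assumed: measured startup time plus buffer
            period_seconds        = 15
            timeout_seconds       = 5
            failure_threshold     = 3
          }

          # Liveness is a process-only check with no external dependencies.
          liveness_probe {
            http_get {
              path = "/health/live"
              port = 8080
            }
            initial_delay_seconds = 20 # assumed: measured startup time plus buffer
            period_seconds        = 15
            timeout_seconds       = 5
            failure_threshold     = 3
          }
        }
      }
    }
  }
}
```

With `period_seconds = 15` and `failure_threshold = 3`, a pod that stops answering is taken out of rotation after three consecutive failed probes, which is on the order of 45 seconds.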

Maturity Levels

Level	Name	Criteria
1	No health checks	No probes; the LB uses a TCP ping as its health check.
2	Basic LB health check	ALB health check configured on "/"; no Kubernetes probes.
3	ReadinessProbe + LivenessProbe	Both probes with measured delays; the LB checks /health/ready; failures trigger alerts.
4	Deep health checks	Readiness checks real dependencies; StartupProbe for slow-starting services.
5	Synthetic monitoring	External validation of the health endpoints; health check latency as an SLI.
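
For the StartupProbe mentioned at level 4, a sketch along the following lines may help. It again assumes the `kubernetes_deployment` resource of the HashiCorp Kubernetes provider; the service name, image, and all threshold values are illustrative assumptions. Kubernetes holds back liveness and readiness probing until the startup probe has succeeded, so a slow-starting service is not restarted while it is still initializing.

```hcl
# Sketch only: names, image, and thresholds are assumptions for illustration.
resource "kubernetes_deployment" "slow_service" {
  metadata {
    name = "slow-service"
  }

  spec {
    selector {
      match_labels = {
        app = "slow-service"
      }
    }

    template {
      metadata {
        labels = {
          app = "slow-service"
        }
      }

      spec {
        container {
          name  = "slow-service"
          image = "registry.example.com/slow-service:1.0.0" # hypothetical image

          # Startup: allows up to 30 x 10s = 300s for initialization.
          startup_probe {
            http_get {
              path = "/health/live"
              port = 8080
            }
            period_seconds    = 10
            failure_threshold = 30
          }

          # Liveness starts only after the startup probe has succeeded,
          # so no large initial delay is needed here.
          liveness_probe {
            http_get {
              path = "/health/live"
              port = 8080
            }
            period_seconds    = 15
            timeout_seconds   = 5
            failure_threshold = 3
          }
        }
      }
    }
  }
}
```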

Terraform Checks

waf-rel-020.tf.aws.alb-target-group-health-check

Checks: The ALB target group has an explicit health_check block with path, interval, and thresholds.

Compliant:

resource "aws_lb_target_group" "api" {
  name     = "payment-api-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    path                = "/health/ready"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}

Non-Compliant:

resource "aws_lb_target_group" "api" {
  name     = "payment-api-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
  # No health_check block:
  # cloud provider defaults are used
  # WAF-REL-020 violation
}

Remediation: Add a health_check block with explicit path, interval, timeout, healthy_threshold, and unhealthy_threshold.
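
One possible way to keep defaults from slipping back in is to pass the health check settings through input variables that have no default value, so every caller has to set them explicitly. The sketch below is an assumption, not part of the check itself: the variable names, the `/health/` prefix rule, and the 5 to 15 second range are illustrative choices, and `startswith` requires Terraform 1.3 or later.

```hcl
# Sketch only: variable names and the allowed ranges are assumptions.
variable "vpc_id" {
  type        = string
  description = "VPC that hosts the target group."
}

variable "health_check_path" {
  type        = string
  description = "Health check path; no default, so callers must set it."

  validation {
    condition     = startswith(var.health_check_path, "/health/")
    error_message = "The health check path must point at a dedicated /health/ endpoint."
  }
}

variable "health_check_interval" {
  type        = number
  description = "Seconds between health checks; no default on purpose."

  validation {
    condition     = var.health_check_interval >= 5 && var.health_check_interval <= 15
    error_message = "The health check interval must be between 5 and 15 seconds."
  }
}

resource "aws_lb_target_group" "api" {
  name     = "payment-api-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    path                = var.health_check_path
    interval            = var.health_check_interval
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
```

Because the variables carry no default, `terraform plan` fails until a path and an interval are supplied, and the validation blocks reject values outside the assumed limits.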

Evidence

Type	Required	Description
IaC	✅ Required	Terraform or Kubernetes manifests with readiness and liveness probe configuration.
Config	✅ Required	Load balancer health check configuration with explicit path and thresholds.
Process	Optional	Test results: health check failure simulated in staging and documented.

Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents; A.5.27 – Learning from information security incidents; A.5.28 – Collection of evidence; A.8.16 – Technology use identification and monitoring; A.8.21 – Telecommunications and network security
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor; Reliability Pillar – Improve
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring; Chapter 7 – Emergency response
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management; SIM-03 – Emergency response
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools and systems
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity; DSS04.02.01 – Manage incidents
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incidentmanagement; BIO – Bedrijfscontinuïteit
ENS High	op.exp.7 – Gestión de incidentes; op.exp.8 – Gestión de la continuidad del negocio
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response

Best Practice

Health Checks & Readiness/Liveness Probes →
