WAF-REL-050 – Circuit Breaker & Timeout Configuration

Pillar: Reliability | Severity: High | Category: Resilience Patterns | Automatable: High

Description

All outbound HTTP/gRPC calls MUST define explicit timeout values. Critical service dependencies MUST implement circuit breakers. Retry logic MUST use exponential backoff with jitter. Connection pools MUST define maximum sizes. No service may use default timeouts (undefined or infinite) for external calls.
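The rule against undefined or infinite timeouts can be enforced with a small guard in application code. The following is a minimal sketch under stated assumptions, not a prescribed implementation: `TimeoutPolicy` and `validate_timeout` are hypothetical names, and the 60-second upper bound is an assumed example value that this control does not define.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


def validate_timeout(name: str, value: Optional[float], upper_bound: float = 60.0) -> float:
    """Return the timeout in seconds if it is explicit, finite and positive."""
    if value is None:
        # None usually means "use the library default", which may be infinite.
        raise ValueError(f"{name}: timeout is undefined")
    if math.isnan(value) or math.isinf(value):
        raise ValueError(f"{name}: timeout must be finite")
    if value <= 0 or value > upper_bound:
        # upper_bound is an assumed sanity limit, not a value from the control.
        raise ValueError(f"{name}: timeout must be in (0, {upper_bound}] seconds")
    return float(value)


@dataclass(frozen=True)
class TimeoutPolicy:
    """Connect and read timeouts kept separate, as the control requires."""

    connect_seconds: Optional[float]
    read_seconds: Optional[float]

    def __post_init__(self) -> None:
        # Reject the policy at construction time, before any call is made.
        validate_timeout("connect_seconds", self.connect_seconds)
        validate_timeout("read_seconds", self.read_seconds)

    def as_tuple(self) -> Tuple[float, float]:
        return (float(self.connect_seconds), float(self.read_seconds))
```

The `(connect, read)` tuple shape matches what HTTP clients such as `requests` accept for their `timeout` argument, so a validated policy could be passed on directly.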

Rationale

Cascading failures are a leading cause of large cloud outages. Without circuit breakers, a slow dependency gradually exhausts thread pools and connection pools until the calling service itself fails and the cascade continues upstream. Explicit timeouts prevent threads from waiting indefinitely on unresponsive services.

Threat Context

Risk	Description
Thread Pool Exhaustion	A slow external API keeps all handler threads waiting → the service is fully blocked.
Connection Pool Depletion	A shared DB connection pool is exhausted → all dependent services fail.
Retry Storm	1000 clients retry in lockstep without jitter → each retry wave hits the degraded service as a synchronized spike of 1000 requests.
Optional dependency takes down the main service	A non-critical enrichment API without a circuit breaker → total outage instead of the loss of one feature.
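Thread pool exhaustion and the optional-dependency risk can both be limited by capping how many callers may wait on one dependency at a time. Below is a minimal bulkhead sketch using only the Python standard library. `Bulkhead`, `BulkheadFullError`, `enrich`, and the limit of 2 are hypothetical names and values chosen for illustration.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class BulkheadFullError(RuntimeError):
    """Raised when a dependency's concurrency budget is used up."""


class Bulkhead:
    """Caps concurrent calls to one dependency class."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Non-blocking acquire: fail fast instead of parking another thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            yield
        finally:
            self._slots.release()


# One bulkhead per dependency class keeps a slow optional API from
# consuming the capacity that critical calls need.
enrichment_bulkhead = Bulkhead("enrichment-api", max_concurrent=2)


def enrich(record: Dict[str, object]) -> Dict[str, object]:
    """Return the record with enrichment, or unchanged if the bulkhead is full."""
    try:
        with enrichment_bulkhead.slot():
            # Placeholder for the real (timeout-bounded) enrichment call.
            return {**record, "enriched": True}
    except BulkheadFullError:
        # Degrade: lose the feature, keep the main request alive.
        return record
```

Rejecting immediately instead of queueing is a design choice in this sketch. A bounded wait would also work, as long as the wait itself has an explicit timeout.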

Requirements

Explicit timeouts for all outbound calls (connect and read configured separately)
Circuit breakers for all critical synchronous dependencies
Retry: at most 3 attempts, exponential backoff (100 ms → 200 ms → 400 ms), jitter ±50%
Connection pool: maximum size defined per dependency class
Bulkhead: separate resource pools for different dependency classes
Load balancer: explicit idle_timeout, no provider default
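The retry requirement can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: `backoff_delay` and `call_with_retry` are hypothetical helpers, and the sketch counts the initial call as attempt 1, so three attempts involve at most two waits. The source does not say whether "3 attempts" includes the first call.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(
    retry_index: int,
    initial_delay: float = 0.1,
    multiplier: float = 2.0,
    jitter: float = 0.5,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number retry_index (0-based)."""
    source = rng if rng is not None else random
    base = initial_delay * multiplier ** retry_index  # 0.1, 0.2, 0.4, ...
    # jitter=0.5 spreads the delay uniformly over base * [0.5, 1.5].
    return base * (1.0 + source.uniform(-jitter, jitter))


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call func up to max_attempts times, backing off between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Real code should catch only transient errors (timeouts, 503s).
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
    raise AssertionError("unreachable")
```

Because each client draws its own random factor, clients that failed at the same moment no longer retry at the same moment, which is what defuses the retry storm described above.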

Implementation Guide

Timeout audit: check all outbound HTTP clients for explicit timeout values
Configure circuit breakers: Resilience4j, pybreaker, or service mesh outlierDetection
Configure retry: maxAttempts=3, initialDelay=100ms, multiplier=2, jitter=0.5
Connection pools: separate HTTP client instances per dependency
ALB idle_timeout: tune explicitly to API latency (typically 30 s for REST APIs)
Chaos test: validate circuit breakers through latency injection
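To show what the libraries named above manage internally, here is a hand-rolled sketch of the closed / open / half-open state machine. It does not reproduce the API of Resilience4j or pybreaker. `CircuitBreaker`, `CircuitOpenError`, and the threshold and timeout defaults are hypothetical. The sketch is also not thread-safe and lets every caller through while half-open, which a production breaker would restrict.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # allow a probe call through
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed half-open probe, or too many consecutive failures,
            # (re)opens the circuit and restarts the reset timer.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Injecting the clock lets a test drive the open to half-open transition without waiting. In a latency-injection chaos test, the observable signal would be calls failing fast with `CircuitOpenError` instead of piling up behind the slow dependency.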

Maturity Levels

Level	Name	Criteria
1	No Timeouts	Default/infinite timeouts for external calls.
2	Timeouts Configured	Connect and read timeouts defined for external HTTP calls.
3	Circuit Breaker + Retry	Circuit breakers for all critical dependencies; retry with backoff and jitter; connection pools.
4	Bulkheads + Service Mesh	Bulkhead isolation; Istio/Linkerd manages circuit breaking declaratively; chaos tests.
5	Adaptive Thresholds	Circuit breaker thresholds auto-tuned; request hedging; complete resilience matrix.

Terraform Checks

waf-rel-050.tf.aws.alb-idle-timeout

Checks: the ALB sets an explicit idle_timeout instead of relying on the provider default.

Compliant:

resource "aws_lb" "api" {
  name               = "payment-api-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
  idle_timeout       = 30 # set explicitly
  tags               = var.mandatory_tags
}

Non-compliant:

resource "aws_lb" "api" {
  name               = "payment-api-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
  # No idle_timeout set:
  # the AWS default of 60s applies
  # WAF-REL-050 violation
}

Remediation: set idle_timeout explicitly. REST APIs: 30 s; file uploads: 300 s.
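The logic of this check can be approximated with a small text scan. This is a rough sketch and not the actual implementation of `waf-rel-050.tf.aws.alb-idle-timeout`: the function names are hypothetical, the comment stripping is naive (it would also cut `#` or `//` inside string literals), and it does not distinguish load balancer types. A real check would more likely evaluate parsed HCL or the JSON output of `terraform show -json`.

```python
import re
from typing import List

# Matches the opening line of an aws_lb resource and captures its name.
RESOURCE_HEADER = re.compile(r'resource\s+"aws_lb"\s+"([^"]+)"\s*\{')
IDLE_TIMEOUT = re.compile(r"\bidle_timeout\s*=")


def strip_comments(hcl: str) -> str:
    """Naively drop '#' and '//' comments so they cannot satisfy the check."""
    return re.sub(r"(#|//).*", "", hcl)


def find_block_end(text: str, open_brace: int) -> int:
    """Return the index of the brace that closes the block opened at open_brace."""
    depth = 0
    for i in range(open_brace, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    return len(text)  # unbalanced input: treat the rest as the block


def albs_missing_idle_timeout(hcl: str) -> List[str]:
    """Names of aws_lb resources that do not assign idle_timeout."""
    text = strip_comments(hcl)
    missing = []
    for match in RESOURCE_HEADER.finditer(text):
        start = match.end() - 1  # position of the opening brace
        body = text[start:find_block_end(text, start)]
        if not IDLE_TIMEOUT.search(body):
            missing.append(match.group(1))
    return missing
```

Run against the two resources shown above, the function would be expected to return an empty list for the compliant one and `["api"]` for the non-compliant one, since the mention of idle_timeout in its comment is stripped before matching.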

Evidence

Type	Required	Description
IaC	✅ Required	Terraform or service mesh configuration with timeout and circuit breaker settings.
Config	✅ Required	Application configuration files with explicit timeout values for all dependencies.
Process	Optional	Latency injection test results documenting circuit breaker activation.

Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents; A.5.27 – Learning from information security incidents; A.5.28 – Collection of evidence; A.8.16 – Technology use identification and monitoring; A.8.21 – Telecommunications and network security
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain; CW – Continual improvement
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor; Reliability Pillar – Improve
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring; Chapter 7 – Emergency response
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management; SIM-03 – Emergency response
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools and systems
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity; DSS04.02.01 – Manage incidents
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incidentmanagement; BIO – Bedrijfscontinuïteit
ENS High	op.exp.7 – Gestión de incidentes; op.exp.8 – Gestión de la continuidad del negocio
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response

Best Practice

Circuit Breaker, Timeouts & Bulkheads
