WAF-REL-050 – Circuit Breaker & Timeout Configuration

Pillar: Reliability | Severity: High | Category: Resilience Patterns | Automatable: High

Description

All outgoing HTTP/gRPC calls MUST define explicit timeout values. Critical service dependencies MUST implement circuit breakers. Retry logic MUST use exponential backoff with jitter. Connection pools MUST define maximum sizes. No service may use default timeouts (undefined or infinite) for external calls.
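As an illustration of "explicit timeouts", the following is a minimal sketch using only the Python standard library. The function name `fetch` and the two default values are assumptions made for this example, not values prescribed by this control. Note that the read timeout bounds each individual socket operation, not the total duration of the call.

```python
import http.client
import socket

# Hypothetical defaults for illustration; tune per dependency.
CONNECT_TIMEOUT_S = 1.0
READ_TIMEOUT_S = 2.0


def fetch(host, port, path,
          connect_timeout=CONNECT_TIMEOUT_S, read_timeout=READ_TIMEOUT_S):
    """GET http://host:port/path with separate connect and read timeouts."""
    conn = http.client.HTTPConnection(host, port, timeout=connect_timeout)
    try:
        conn.connect()                      # bounded by connect_timeout
        conn.sock.settimeout(read_timeout)  # bounds each later socket read/write
        conn.request("GET", path)
        response = conn.getresponse()       # raises socket.timeout if the peer stalls
        return response.status, response.read()
    finally:
        conn.close()
```

A caller would catch `socket.timeout` (an alias of `TimeoutError` on Python 3.10 and later) and treat it as a failed call rather than letting the handler thread wait.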

Rationale

Cascading failures are a leading cause of major cloud outages. Without circuit breakers, a slow dependency gradually exhausts the caller's thread and connection pools until the calling service itself fails, and the failure then propagates to its own callers. Explicit timeouts prevent threads and connections from waiting indefinitely on unresponsive services.

Threat Context

Risk	Description
Thread Pool Exhaustion	Slow external API leaves all handler threads waiting → service completely blocked.
Connection Pool Depletion	Shared DB connection pool is exhausted → all dependent services fail.
Retry Storm	1000 clients retry in lockstep without jitter → synchronized load spikes hit the already degraded service.
Optional Dep Brings Down Main Service	Non-critical enrichment API without circuit breaker → total failure instead of feature loss.

Requirement

Explicit timeouts for all outgoing calls (connect + read separately)
Circuit breaker for all critical synchronous dependencies
Retry: maximum 3 attempts, exponential backoff (100ms → 200ms → 400ms), jitter ±50%
Connection pool: maximum size per dependency class defined
Bulkhead: separate resource pools for different dependency classes
Load balancer: explicit idle_timeout – no provider default
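The retry requirement above can be sketched as follows. This is an illustrative sketch, not a reference implementation: the names `backoff_delays` and `call_with_retry` are hypothetical, and the set of retryable exceptions is an assumption. It reads the 100ms → 200ms → 400ms sequence as three waits (up to four calls in total); pass `retries=2` if "maximum 3 attempts" is meant to include the first call.

```python
import random
import time


def backoff_delays(retries=3, initial_delay=0.1, multiplier=2.0, jitter=0.5,
                   rng=random.random):
    """Yield one sleep duration (seconds) per retry, randomized by +/- jitter."""
    for n in range(retries):
        base = initial_delay * multiplier ** n  # 0.1, 0.2, 0.4 with the defaults
        # rng() is in [0, 1); map it to a factor in [1 - jitter, 1 + jitter).
        factor = 1.0 + jitter * (2.0 * rng() - 1.0)
        yield base * factor


def call_with_retry(operation, retryable=(TimeoutError, ConnectionError),
                    sleep=time.sleep, **backoff):
    """Call operation(), retrying retryable errors with jittered backoff."""
    delays = backoff_delays(**backoff)
    while True:
        try:
            return operation()
        except retryable:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the last error
            sleep(delay)
```

Because each client draws its own random factor, clients that failed at the same moment no longer retry at the same moment, which is what spreads out a retry storm.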

Implementation Guidance

Timeout audit: Check all outgoing HTTP clients for explicit timeout values
Configure circuit breaker: Resilience4j, pybreaker, or service mesh outlierDetection
Configure retry: maxAttempts=3, initialDelay=100ms, multiplier=2, jitter=0.5
Connection pools: Separate HTTP client instances per dependency
ALB idle_timeout: Set explicitly to match API latency (typically 30s for REST APIs)
Chaos test: Validate circuit breaker through latency injection
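To make the circuit breaker behaviour concrete, here is a deliberately small, single-threaded sketch of the closed → open → half-open cycle. It is not how Resilience4j or pybreaker are implemented; the class name, the consecutive-failure threshold, and the `reset_timeout` parameter are assumptions chosen for illustration. A production breaker would also need locking and would usually limit half-open traffic to a single probe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures to open
        self.reset_timeout = reset_timeout          # seconds before a probe is allowed
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            # Fail fast instead of tying up a thread on a known-bad dependency.
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()     # open, or re-open after a failed probe
            raise
        self._failures = 0
        self._opened_at = None                      # success closes the circuit
        return result
```

For an optional dependency, the caller can catch `CircuitOpenError` and return a degraded response (for example, the record without enrichment), so that the feature is lost rather than the whole service.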

Maturity Levels

Level	Name	Criteria
1	No Timeouts	Default/infinite timeouts for external calls.
2	Timeouts Configured	Connect and read timeouts defined for external HTTP calls.
3	Circuit Breaker + Retry	CB for all critical deps; retry with backoff and jitter; connection pools.
4	Bulkheads + Service Mesh	Bulkhead isolation; Istio/Linkerd manages CB declaratively; chaos tests.
5	Adaptive Thresholds	CB thresholds auto-tuned; request hedging; complete resilience matrix.

Terraform Checks

waf-rel-050.tf.aws.alb-idle-timeout

Checks: ALB has explicit idle_timeout – no provider default.

Compliant:

resource "aws_lb" "api" {
  name               = "payment-api-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
  idle_timeout       = 30 # Explicitly set
  tags               = var.mandatory_tags
}

Non-Compliant:

resource "aws_lb" "api" {
  name               = "payment-api-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
  # No idle_timeout: the AWS default of 60s is used
  # WAF-REL-050 violation
}

Remediation: Set idle_timeout explicitly. REST APIs: 30s; file uploads: 300s.
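One way such a check could be automated is sketched below. It assumes the input is the parsed output of `terraform show -json` for a plan, and that the `configuration` section lists only attributes actually written in the configuration under `expressions`; the planned values cannot be used for this, because the provider default would already be filled in there. The function name is hypothetical, and the sketch inspects only the root module, not child modules.

```python
def albs_without_explicit_idle_timeout(plan):
    """Return addresses of load balancer resources that do not set idle_timeout.

    `plan` is the dict obtained by parsing `terraform show -json <planfile>`.
    Only resources declared in the root module are inspected (assumption).
    """
    resources = (
        plan.get("configuration", {})
            .get("root_module", {})
            .get("resources", [])
    )
    violations = []
    for resource in resources:
        # aws_alb is an alias of aws_lb in the AWS provider.
        if resource.get("type") not in ("aws_lb", "aws_alb"):
            continue
        if "idle_timeout" not in resource.get("expressions", {}):
            violations.append(resource.get("address"))
    return violations
```

A CI step could load the plan with `json.load` and fail the build when the returned list is non-empty.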

Evidence

Type	Required	Description
IaC	✅ Required	Terraform or service mesh configuration with timeout and circuit breaker settings.
Config	✅ Required	Application configuration files with explicit timeout values for all dependencies.
Process	Optional	Latency injection test results with circuit breaker activation documented.

Related Controls

Regulatory Mapping

Framework	Controls
ISO/IEC 27001:2022	A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents
ITIL 4	SVS – Service value system; DP – Design principle; OV – Operation value chain
AWS Well-Architected Framework	Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor
SRE Book (Google)	Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring
CNCF Cloud Native Security	SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials
BSI C5:2022	SIM-01 – Security incident management; SIM-02 – Security information and event management
GDPR	Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach
NIST SP 800-161	SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls
DORA	Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools
COBIT 2019	DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity
TISAX	Information security – Incident response
ANSSI SecNumCloud	Domain – Incident response; Domain – Business continuity
BIO	BIO – Incident management; BIO – Business continuity
ENS High	op.exp.7 – Incident management; op.exp.8 – Business continuity management
UK NCSC CAF	D1 – Response and recovery planning; D2 – Lessons learned
CMMC 2.0	IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents
IRAP	ISM – Incident management; ISM – Business continuity
CCCS PBMM	IR-4 – Incident handling; IR-8 – Incident response plan
MAS TRM	Ch.10 – Security incident management; Ch.11 – Business continuity
ISMAP	Reliability and incident management
FISC	Operational measures – Incident response
