WAF-REL-050 – Circuit Breaker & Timeout Configuration
Description
All outgoing HTTP/gRPC calls MUST define explicit timeout values. Critical service dependencies MUST implement circuit breakers. Retry logic MUST use exponential backoff with jitter. Connection pools MUST define maximum sizes. No service may use default timeouts (undefined or infinite) for external calls.
Rationale
Cascading failures are a leading cause of major cloud outages. Without circuit breakers, a slow dependency gradually exhausts the thread pools and connection pools of its callers until the dependent service itself fails, and the cascade continues upstream. Explicit timeouts prevent worker threads from waiting indefinitely on unresponsive services.
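As an illustration of the last point, the following minimal sketch sets separate connect and read timeouts on a raw socket using only the Python standard library. The function name `fetch_status_line` and the default values of 1 s and 2 s are assumptions made for this example, not values prescribed by this control.

```python
import socket


def fetch_status_line(host: str, port: int,
                      connect_timeout: float = 1.0,
                      read_timeout: float = 2.0) -> str:
    """Send a minimal HTTP/1.1 GET and return the response status line.

    The connect timeout bounds the TCP handshake; the read timeout bounds
    each later blocking send/recv call on the same socket.
    """
    # create_connection applies `timeout` to the connection attempt.
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        # Switch to the read timeout for the request/response exchange.
        sock.settimeout(read_timeout)
        request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        # Raises socket.timeout if the peer sends nothing within read_timeout.
        data = sock.recv(4096)
    return data.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
```

With both values set, a handler thread is released after a bounded wait instead of blocking for as long as the dependency stays silent. Most HTTP client libraries expose equivalent settings; the exact option names depend on the library.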
Threat Context
| Risk | Description |
|---|---|
| Thread Pool Exhaustion | Slow external API leaves all handler threads waiting → service completely blocked. |
| Connection Pool Depletion | Shared DB connection pool is exhausted → all dependent services fail. |
| Retry Storm | 1000 clients retry synchronously without jitter → 1000x load spike on degraded service. |
| Optional Dep Brings Down Main Service | Non-critical enrichment API without circuit breaker → total failure instead of feature loss. |
Requirement
- Explicit timeouts for all outgoing calls (connect + read separately)
- Circuit breaker for all critical synchronous dependencies
- Retry: maximum 3 attempts, exponential backoff (100ms → 200ms → 400ms), jitter ±50%
- Connection pool: maximum size per dependency class defined
- Bulkhead: separate resource pools for different dependency classes
- Load balancer: explicit `idle_timeout` – no provider default
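The retry requirement can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `jittered_delay` and `call_with_retry` are hypothetical, and the sketch reads "maximum 3 attempts" as up to three retries after the initial call, which is what makes the 400 ms step reachable. If the limit is meant to include the first call, pass `max_retries=2`.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def jittered_delay(retry_index: int, initial_delay: float = 0.1,
                   multiplier: float = 2.0, jitter: float = 0.5,
                   rng: Optional[random.Random] = None) -> float:
    """Return the delay in seconds before retry number `retry_index` (0-based).

    The base delay grows as 0.1 s, 0.2 s, 0.4 s and is then scaled by a
    random factor in [1 - jitter, 1 + jitter], i.e. ±50% by default.
    """
    base = initial_delay * multiplier ** retry_index
    factor = (rng or random).uniform(1.0 - jitter, 1.0 + jitter)
    return base * factor


def call_with_retry(operation: Callable[[], T], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Optional[random.Random] = None,
                    retry_on: Tuple[Type[BaseException], ...] = (
                        TimeoutError, ConnectionError)) -> T:
    """Call `operation`, retrying on transient errors with jittered backoff."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for retry_index in range(max_retries + 1):
        try:
            return operation()
        except retry_on:
            if retry_index == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(jittered_delay(retry_index, rng=rng))
    raise AssertionError("unreachable")
```

Because each client draws its own random factor, clients that failed at the same moment spread their retries over a window instead of hitting the degraded service simultaneously.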
Implementation Guidance
- Timeout audit: Check all outgoing HTTP clients for explicit timeout values
- Configure circuit breaker: Resilience4j, pybreaker, or service mesh `outlierDetection`
- Configure retry: `maxAttempts=3`, `initialDelay=100ms`, `multiplier=2`, `jitter=0.5`
- Connection pools: Separate HTTP client instances per dependency
- ALB `idle_timeout`: Set explicitly to match API latency (typically 30s for REST APIs)
- Chaos test: Validate circuit breaker through latency injection
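To make the circuit breaker behaviour concrete, here is a hand-rolled sketch of the closed, open, and half-open states. It does not reproduce the API of Resilience4j, pybreaker, or any service mesh; the class name, the `failure_threshold` and `reset_timeout` parameters, and their defaults are assumptions for illustration. The sketch is not thread-safe and lets every call through while half-open, which a production library would restrict.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call after the cool-down
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call while half-open, or too many consecutive
            # failures while closed, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A latency injection test can then wrap a dependency stub that raises a timeout, drive `call` until `state` reports `"open"`, and confirm that further calls fail fast with `CircuitOpenError` rather than occupying a handler thread. Injecting a fake `clock` keeps such a test fast and deterministic.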
Maturity Levels
| Level | Name | Criteria |
|---|---|---|
| 1 | No Timeouts | Default/infinite timeouts for external calls. |
| 2 | Timeouts Configured | Connect and read timeouts defined for external HTTP calls. |
| 3 | Circuit Breaker + Retry | CB for all critical deps; retry with backoff and jitter; connection pools. |
| 4 | Bulkheads + Service Mesh | Bulkhead isolation; Istio/Linkerd manages CB declaratively; chaos tests. |
| 5 | Adaptive Thresholds | CB thresholds auto-tuned; request hedging; complete resilience matrix. |
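Bulkhead isolation, the level 4 criterion, can be approximated in application code with one bounded semaphore per dependency class. The sketch below is illustrative only: the `Bulkhead` class, the `payments` and `enrichment` pools, and their limits of 20 and 2 are hypothetical. A full pool rejects immediately instead of queueing, so an overloaded optional dependency costs a feature rather than the whole service.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator


class BulkheadFullError(RuntimeError):
    """Raised when a dependency class has no free slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Non-blocking acquire: reject at once rather than queue callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical dependency classes with independent limits.
payments = Bulkhead("payments", max_concurrent=20)
enrichment = Bulkhead("enrichment", max_concurrent=2)


def enrich(order: dict, call_enrichment_api: Callable[[dict], object]) -> dict:
    """Return the order with optional extras, degrading when the pool is full."""
    try:
        with enrichment.slot():
            return {**order, "extras": call_enrichment_api(order)}
    except BulkheadFullError:
        # Feature loss instead of total failure.
        return {**order, "extras": None}
```

Saturating `enrichment` leaves `payments` untouched, because the two classes never compete for the same slots.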
Terraform Checks
waf-rel-050.tf.aws.alb-idle-timeout
Checks: ALB has explicit idle_timeout – no provider default.
| Compliant | Non-Compliant |
|---|---|
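| `idle_timeout` is set explicitly on the load balancer resource. | `idle_timeout` is omitted, so the provider default applies. |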
Remediation: Set idle_timeout explicitly. REST APIs: 30s; file uploads: 300s.
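One possible way to automate this check outside of a policy engine is to inspect a plan exported with `terraform show -json`. The sketch below is not the implementation of `waf-rel-050.tf.aws.alb-idle-timeout`. It assumes that the `configuration` section of the plan JSON lists, under `expressions`, only the attributes written in the configuration, and it does not distinguish application from network load balancers. The function names are hypothetical.

```python
from typing import Iterator, List, Tuple


def _iter_resources(module: dict, prefix: str = "") -> Iterator[Tuple[str, dict]]:
    """Yield (label, resource) for a module and all nested module calls."""
    for resource in module.get("resources", []):
        yield f"{prefix}{resource['type']}.{resource['name']}", resource
    for name, call in module.get("module_calls", {}).items():
        yield from _iter_resources(call.get("module", {}),
                                   f"{prefix}module.{name}.")


def albs_without_idle_timeout(plan: dict) -> List[str]:
    """Return labels of load balancer resources that never set idle_timeout."""
    root = plan.get("configuration", {}).get("root_module", {})
    return [
        label
        for label, resource in _iter_resources(root)
        if resource.get("type") in ("aws_lb", "aws_alb")
        and "idle_timeout" not in resource.get("expressions", {})
    ]
```

Feeding the function the result of `json.load` on the exported plan file yields the list of non-compliant resources; an empty list means every load balancer in the configuration sets the attribute.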
Evidence
| Type | Required | Description |
|---|---|---|
| IaC | ✅ Required | Terraform or service mesh configuration with timeout and circuit breaker settings. |
| Config | ✅ Required | Application configuration files with explicit timeout values for all dependencies. |
| Process | Optional | Latency injection test results with circuit breaker activation documented. |
Regulatory Mapping
| Framework | Controls |
|---|---|
| ISO/IEC 27001:2022 | A.5.15 – Threat intelligence; A.5.16 – Threat classification; A.5.24 – Information security incident management; A.5.25 – Assessment and decision on information security events; A.5.26 – Response to information security incidents |
| ITIL 4 | SVS – Service value system; DP – Design principle; OV – Operation value chain |
| AWS Well-Architected Framework | Reliability Pillar – Prepare; Reliability Pillar – Deploy; Reliability Pillar – Monitor |
| SRE Book (Google) | Chapter 4 – Service Level Objectives; Chapter 5 – Eliminating toil; Chapter 6 – Monitoring |
| CNCF Cloud Native Security | SLSA – Supply chain Levels for Software Artifacts; SBOM – Software Bill of Materials |
| BSI C5:2022 | SIM-01 – Security incident management; SIM-02 – Security information and event management |
| GDPR | Art. 32 – Security of processing; Art. 33 – Breach notification; Art. 34 – Communication of breach |
| NIST SP 800-161 | SR-1 – Supply chain risk management; SR-2 – Supplier agreements; SR-3 – Supply chain controls |
| DORA | Art. 9 – Protection and prevention; Art. 13 – ICT incident reporting; Art. 17 – Testing of ICT tools |
| COBIT 2019 | DSS04.01.01 – Ensure service availability; DSS04.01.02 – Ensure service capacity |
| TISAX | Information security – Incident response |
| ANSSI SecNumCloud | Domain – Incident response; Domain – Business continuity |
| BIO | BIO – Incident management; BIO – Business continuity |
| ENS High | op.exp.7 – Incident management; op.exp.8 – Business continuity management |
| UK NCSC CAF | D1 – Response and recovery planning; D2 – Lessons learned |
| CMMC 2.0 | IR.L2-3.6.1 – Establish incident handling capability; IR.L2-3.6.2 – Track, document and report incidents |
| IRAP | ISM – Incident management; ISM – Business continuity |
| CCCS PBMM | IR-4 – Incident handling; IR-8 – Incident response plan |
| MAS TRM | Ch.10 – Security incident management; Ch.11 – Business continuity |
| ISMAP | Reliability and incident management |
| FISC | Operational measures – Incident response |