Best Practice: Circuit Breaker, Timeouts & Bulkheads
Context
Cascading failures are among the most common causes of major cloud outages. Without a circuit breaker, a slow or failing dependency gradually exhausts the thread pools, connection pools, and request queues of the dependent service until that service fails as well.
Common problems without structured resilience patterns:
- A slow external API leaves all API handler threads waiting for a timeout
- Database connection pool is exhausted and blocks all services sharing the same pool
- Retry storms: 1000 clients simultaneously retry their failed requests
- An optional enrichment service fails and brings down the entire main service
Related Controls
- WAF-REL-050 – Circuit Breaker & Timeout Configuration
- WAF-REL-080 – Dependency & Upstream Resilience Management
Target State
- Every outgoing call has explicit timeouts
- Critical dependencies have circuit breakers with defined thresholds
- Retry logic prevents storms through exponential backoff with jitter
- Bulkheads isolate different dependency classes into separate resource pools
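The combination of explicit timeouts and bulkheads can be sketched in a few lines of asyncio. This is a minimal illustration, not a production implementation: the `Bulkhead` class, `BulkheadFullError`, and the parameter names are hypothetical, and a semaphore is assumed to be an acceptable stand-in for a dedicated resource pool.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slot within the wait budget."""


class Bulkhead:
    """Caps concurrent calls to one dependency class (hypothetical helper)."""

    def __init__(self, name: str, max_concurrent: int, max_wait: float):
        self.name = name
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._max_wait = max_wait

    async def run(self, coro_factory, timeout: float):
        # Wait only briefly for a slot so callers fail fast instead of queueing.
        try:
            await asyncio.wait_for(self._semaphore.acquire(), self._max_wait)
        except asyncio.TimeoutError:
            raise BulkheadFullError(f"bulkhead {self.name} is full") from None
        try:
            # Explicit timeout on the outgoing call itself.
            return await asyncio.wait_for(coro_factory(), timeout)
        finally:
            # Always free the slot, even if the call failed or timed out.
            self._semaphore.release()
```

Creating one `Bulkhead` instance per dependency class (for example one for the payment gateway and one for an optional enrichment service) means a slow optional dependency can only exhaust its own slots. Create the instances inside the running event loop.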
Technical Implementation
Python: Circuit Breaker with pybreaker
import asyncio
import logging
import random

import httpx
import pybreaker


class ServiceUnavailableError(Exception):
    """Raised when the dependency is unavailable or its circuit is open."""


class ServiceTimeoutError(Exception):
    """Raised when the dependency does not answer within the timeout."""


# Event listener for logging (pybreaker listeners subclass CircuitBreakerListener)
class StateChangeLogger(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        logging.warning(
            "CircuitBreaker[%s]: %s -> %s", cb.name, old_state.name, new_state.name
        )


# Configure circuit breaker
payment_gateway_cb = pybreaker.CircuitBreaker(
    fail_max=5,        # After 5 consecutive failures: OPEN
    reset_timeout=30,  # After 30s: HALF-OPEN (one test request)
    name="payment-gateway",
    listeners=[StateChangeLogger()],
)


async def charge_card(card_token: str, amount: float) -> dict:
    """Payment with circuit breaker and timeout."""
    try:
        # Circuit breaker wrapping + timeout.
        # Note: pybreaker's call_async builds on its optional Tornado support,
        # so the tornado package is assumed to be installed here.
        async with httpx.AsyncClient(timeout=httpx.Timeout(3.0)) as client:
            response = await payment_gateway_cb.call_async(
                client.post,
                "https://payment-gateway.example.com/charge",
                json={"card_token": card_token, "amount": amount},
            )
            return response.json()
    except pybreaker.CircuitBreakerError as exc:
        # Circuit is OPEN: immediate rejection, no wait time
        logging.warning("Payment gateway circuit open – fast failing")
        raise ServiceUnavailableError("Payment gateway temporarily unavailable") from exc
    except httpx.TimeoutException as exc:
        # Timeout exceeded
        raise ServiceTimeoutError("Payment gateway timeout after 3s") from exc


# Retry with exponential backoff + jitter
async def charge_with_retry(card_token: str, amount: float) -> dict:
    max_attempts = 3
    for attempt in range(max_attempts):
        try:
            return await charge_card(card_token, amount)
        except ServiceTimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff (1s, 2s, ...) plus up to 1s of random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)
    # Defensive fallback; the loop above always returns or raises
    raise ServiceUnavailableError("Max retry attempts exceeded")
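To make the CLOSED, OPEN, and HALF-OPEN transitions referenced in the comments above concrete, the following is a minimal, library-independent sketch of the state machine. It is not how pybreaker is implemented; `MiniCircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical names chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is OPEN."""


class MiniCircuitBreaker:
    """Illustrative consecutive-failure circuit breaker (not thread-safe)."""

    def __init__(self, fail_max=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half-open"  # reset timeout elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # A failed trial call or too many consecutive failures opens the circuit.
        if self.state == "half-open" or self._failures >= self.fail_max:
            self.state = "open"
            self._opened_at = self._clock()

    def _on_success(self):
        # Any success closes the circuit and resets the failure counter.
        self._failures = 0
        self.state = "closed"
```

While the circuit is OPEN, callers get an immediate `CircuitOpenError` instead of waiting for a timeout, which is what keeps handler threads and connection pools free.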
Java/Spring Boot: Resilience4j
# application.yml
resilience4j:
  circuitbreaker:
    instances:
      payment-gateway:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 2
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 30s
        failureRateThreshold: 50        # 50% error rate → OPEN
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 80
  retry:
    instances:
      payment-gateway:
        maxAttempts: 3
        waitDuration: 100ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true      # Assumed necessary for randomizedWaitFactor to apply; verify for your Resilience4j version
        randomizedWaitFactor: 0.5       # Jitter ±50%
  bulkhead:
    instances:
      payment-gateway:
        maxConcurrentCalls: 10          # Max. concurrent calls
        maxWaitDuration: 100ms
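// PaymentService.java
import java.math.BigDecimal;
import org.springframework.stereotype.Service;
import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;

@Service
public class PaymentService {
    // Application-specific collaborators, injected by Spring
    private final PaymentGatewayClient paymentGatewayClient;
    private final OfflineQueue offlineQueue;

    public PaymentService(PaymentGatewayClient paymentGatewayClient, OfflineQueue offlineQueue) {
        this.paymentGatewayClient = paymentGatewayClient;
        this.offlineQueue = offlineQueue;
    }

    @CircuitBreaker(name = "payment-gateway", fallbackMethod = "paymentFallback")
    @Retry(name = "payment-gateway")
    @Bulkhead(name = "payment-gateway")
    public ChargeResult chargeCard(String cardToken, BigDecimal amount) {
        return paymentGatewayClient.charge(cardToken, amount);
    }

    // Fallback runs only when the circuit breaker rejects the call
    public ChargeResult paymentFallback(String cardToken, BigDecimal amount,
                                        CallNotPermittedException ex) {
        // Circuit open: offline queue for later processing
        offlineQueue.enqueue(cardToken, amount);
        return ChargeResult.queued("Payment queued for processing");
    }
}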
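How `slidingWindowSize`, `minimumNumberOfCalls`, and `failureRateThreshold` interact is easier to see in code. The following Python sketch models a count-based sliding window under the settings above. It illustrates the idea only and is not Resilience4j's implementation; `FailureRateWindow` and its method names are hypothetical.

```python
from collections import deque


class FailureRateWindow:
    """Count-based sliding window over the most recent call outcomes."""

    def __init__(self, window_size=10, minimum_calls=5, threshold_percent=50.0):
        # deque with maxlen drops the oldest outcome automatically.
        self._outcomes = deque(maxlen=window_size)  # True means the call failed
        self._minimum_calls = minimum_calls
        self._threshold = threshold_percent

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return 100.0 * sum(self._outcomes) / len(self._outcomes)

    def should_open(self) -> bool:
        # No decision until enough calls have been observed.
        if len(self._outcomes) < self._minimum_calls:
            return False
        return self.failure_rate() >= self._threshold
```

With these values, four failures in a row do not open the circuit because fewer than five calls were recorded; the decision is only made from the fifth call onward.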
Istio: Service Mesh Circuit Breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-gateway
  namespace: payment
spec:
  host: payment-gateway.payment.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 3s
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 1000
        maxRetries: 3                     # Max. outstanding retries to the destination
        # retryOn ("5xx,gateway-error,connect-failure,retriable-4xx") and
        # retryRemoteStatuses ("500,502,503") are not DestinationRule fields;
        # retry conditions are set per route in a VirtualService (http[].retries).
    outlierDetection:
      consecutive5xxErrors: 5             # 5 errors → ejection
      interval: 10s                       # Evaluation window
      baseEjectionTime: 30s               # Minimum ejection time
      maxEjectionPercent: 50              # Eject max. 50% of hosts
      splitExternalLocalOriginErrors: true
Terraform: ALB with Timeout
resource "aws_lb" "api" {
  name               = "payment-api-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
  idle_timeout       = 30 # For REST APIs: 30s; not the default 60s
  tags               = var.mandatory_tags
}

resource "aws_lb_target_group" "api" {
  name                 = "payment-api-tg"
  port                 = 8080
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  deregistration_delay = 30 # Graceful shutdown window

  health_check {
    enabled             = true
    path                = "/health/ready"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
Typical Anti-Patterns
- A single circuit breaker guards all calls, reads included: too aggressive, because write failures then also block reads. Configure read and write calls separately
- Retry without jitter: all clients retry at the same time → retry storm
- Reset timeout too short: the circuit switches to HALF-OPEN before the dependency has recovered → further failures
- Connection pool shared for all DBs: one slow query exhausts the pool for all other services
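The effect of jitter on a retry storm can be shown with a small comparison. The sketch below contrasts a fixed exponential backoff with a "full jitter" variant; the function names and the `base` and `cap` defaults are assumptions for illustration, not values taken from any library.

```python
import random


def fixed_backoff(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Exponential backoff without jitter: every client gets the same delay."""
    return min(cap, base * (2 ** attempt))


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng=random) -> float:
    """Full jitter: a random delay between 0 and the exponential ceiling."""
    return rng.uniform(0.0, fixed_backoff(attempt, base, cap))


if __name__ == "__main__":
    clients = 1000
    without = {fixed_backoff(2) for _ in range(clients)}
    with_jitter = {backoff_delay(2) for _ in range(clients)}
    # Without jitter all clients share one retry instant; with jitter the
    # retries are spread across the whole backoff interval.
    print(f"distinct retry instants without jitter: {len(without)}")
    print(f"distinct retry instants with jitter:    {len(with_jitter)}")
```

Without jitter, the 1000 clients from the retry-storm scenario all come back at the same moment; with jitter, their retries are spread over the interval up to the backoff ceiling.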
Metrics
- Circuit Breaker Open Rate: Number of minutes per hour in OPEN state (target: < 1 min/h)
- Timeout Rate: % of calls that time out (target: < 0.1%)
- Retry Rate: % of calls that were retried at least once (target: < 5%)
- Bulkhead Rejection Rate: % of calls rejected by the bulkhead
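As a rough illustration of how these four metrics could be derived from raw counters for a one-hour window, consider the sketch below. The `CallStats` fields and the `resilience_metrics` function are hypothetical; in practice the counters would come from whatever metrics backend is in use.

```python
from dataclasses import dataclass


@dataclass
class CallStats:
    """Raw counters for one dependency over a one-hour window (assumed shape)."""
    total_calls: int
    timed_out: int
    retried: int            # calls retried at least once
    bulkhead_rejected: int
    open_seconds: float     # time the circuit spent in OPEN


def resilience_metrics(stats: CallStats) -> dict:
    def pct(part: int) -> float:
        # Guard against division by zero when no calls were made.
        return 100.0 * part / stats.total_calls if stats.total_calls else 0.0

    return {
        "open_minutes_per_hour": stats.open_seconds / 60.0,
        "timeout_rate_pct": pct(stats.timed_out),
        "retry_rate_pct": pct(stats.retried),
        "bulkhead_rejection_rate_pct": pct(stats.bulkhead_rejected),
    }


def within_targets(metrics: dict) -> bool:
    # Targets as listed above; no target is defined for bulkhead rejections.
    return (
        metrics["open_minutes_per_hour"] < 1.0
        and metrics["timeout_rate_pct"] < 0.1
        and metrics["retry_rate_pct"] < 5.0
    )
```

For example, 5 timeouts in 10,000 calls give a timeout rate of 0.05%, which is within the 0.1% target.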
Maturity Level
Level 1 – No timeouts, no circuit breakers
Level 2 – Basic timeouts configured
Level 3 – Circuit breakers for all critical dependencies; retry with backoff
Level 4 – Bulkheads per dependency class; service mesh manages CB declaratively
Level 5 – Adaptive thresholds; request hedging for latency-critical paths
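Request hedging, named at Level 5, means sending a second copy of a request when the first has not answered within a short delay, and using whichever response arrives first. A minimal asyncio sketch follows; `hedged_call`, `make_call`, and `hedge_after` are hypothetical names, and the approach is assumed to be applied only to idempotent calls, since the request may run twice.

```python
import asyncio


async def hedged_call(make_call, hedge_after: float):
    """Run make_call(); if it is still pending after hedge_after seconds,
    start a second attempt and return whichever finishes first."""
    primary = asyncio.create_task(make_call())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        # The primary answered in time: no hedge request needed.
        return primary.result()

    backup = asyncio.create_task(make_call())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel the slower attempt so it does not keep holding resources.
    for task in pending:
        task.cancel()
    return done.pop().result()
```

A common choice for `hedge_after` is a high latency percentile of the dependency, so that only a small share of requests triggers a second attempt and the extra load stays bounded.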