Best Practice: Circuit Breaker, Timeouts & Bulkheads

Context

Cascading failures are a frequent cause of major cloud outages. Without a circuit breaker, a slow or failing dependency gradually exhausts the thread pools, connection pools and request queues of the calling service until that service fails as well.

Common problems without structured resilience patterns:

A slow external API leaves all API handler threads waiting for a timeout
Database connection pool is exhausted and blocks all services sharing the same pool
Retry storms: 1000 clients simultaneously retry their failed requests
An optional enrichment service fails and brings down the entire main service

Related Controls

WAF-REL-050 – Circuit Breaker & Timeout Configuration
WAF-REL-080 – Dependency & Upstream Resilience Management

Target State

Every outgoing call has explicit timeouts
Critical dependencies have circuit breakers with defined thresholds
Retry logic prevents storms through exponential backoff with jitter
Bulkheads isolate different dependency classes into separate resource pools
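
To make the circuit breaker item above concrete, here is a minimal sketch of the CLOSED, OPEN and HALF-OPEN state machine. The class `SimpleCircuitBreaker`, the exception `CircuitOpenError` and the injectable `clock` parameter are illustrative assumptions, not part of any library. The sketch is not thread-safe and is meant for understanding, not for production use.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Illustrative breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, fail_max=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock            # injectable so tests need not sleep
        self.state = "CLOSED"
        self.failures = 0             # consecutive failures
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "HALF_OPEN"  # let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures,
            # (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.fail_max:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure counter.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

While the circuit is OPEN, callers get `CircuitOpenError` immediately instead of waiting for a timeout, which is what keeps thread and connection pools from filling up.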

Technical Implementation

Python: Circuit Breaker with pybreaker

import asyncio
import logging
import random

import httpx
import pybreaker


class ServiceUnavailableError(Exception):
    """Dependency is unavailable (circuit open or retries exhausted)."""


class ServiceTimeoutError(Exception):
    """Dependency did not answer within the configured timeout."""


# Event listener for logging
class LogStateChange(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        logging.warning(
            "CircuitBreaker[%s]: %s -> %s", cb.name, old_state.name, new_state.name
        )


# Configure circuit breaker
payment_gateway_cb = pybreaker.CircuitBreaker(
    fail_max=5,          # After 5 consecutive failures: OPEN
    reset_timeout=30,    # After 30s: HALF-OPEN (one test request)
    name="payment-gateway",
    listeners=[LogStateChange()],
)


def _post_charge(card_token: str, amount: float) -> dict:
    """Blocking HTTP call with an explicit 3s timeout."""
    with httpx.Client(timeout=httpx.Timeout(3.0)) as client:
        response = client.post(
            "https://payment-gateway.example.com/charge",
            json={"card_token": card_token, "amount": amount},
        )
        if response.status_code >= 500:
            # Raise so the breaker counts server errors as failures
            response.raise_for_status()
        return response.json()


async def charge_card(card_token: str, amount: float) -> dict:
    """Payment with circuit breaker and timeout."""
    try:
        # pybreaker's call_async() is built on Tornado, so under asyncio the
        # synchronous breaker call runs in a worker thread instead.
        return await asyncio.to_thread(
            payment_gateway_cb.call, _post_charge, card_token, amount
        )

    except pybreaker.CircuitBreakerError:
        # Circuit is OPEN: immediate rejection, no wait time
        logging.warning("Payment gateway circuit open, fast failing")
        raise ServiceUnavailableError("Payment gateway temporarily unavailable")

    except httpx.TimeoutException:
        # Timeout exceeded
        raise ServiceTimeoutError("Payment gateway timeout after 3s")


# Retry with exponential backoff + jitter
# Note: retrying a charge is only safe if the gateway deduplicates requests,
# for example through an idempotency key.
async def charge_with_retry(card_token: str, amount: float) -> dict:
    max_attempts = 3
    for attempt in range(max_attempts):
        try:
            return await charge_card(card_token, amount)
        except ServiceTimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff (1s, 2s, ...) plus up to 1s of random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)
    raise ServiceUnavailableError("Max retry attempts exceeded")
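
The Python example above covers timeouts, circuit breaking and retries but not bulkheads. A minimal sketch of a semaphore-based bulkhead for asyncio code could look like the following. `AsyncBulkhead` and `BulkheadFullError` are hypothetical names, and the two parameters are chosen to mirror `maxConcurrentCalls` and `maxWaitDuration` from the Resilience4j configuration.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when no slot frees up within max_wait seconds."""


class AsyncBulkhead:
    """Caps concurrent calls to one dependency (semaphore bulkhead).

    Create the instance inside a running event loop.
    """

    def __init__(self, max_concurrent_calls: int, max_wait: float):
        self._slots = asyncio.Semaphore(max_concurrent_calls)
        self._max_wait = max_wait

    async def run(self, coro_func, *args, **kwargs):
        try:
            # Wait at most max_wait seconds for a free slot, then reject.
            await asyncio.wait_for(self._slots.acquire(), timeout=self._max_wait)
        except asyncio.TimeoutError:
            raise BulkheadFullError("bulkhead full, call rejected") from None
        try:
            return await coro_func(*args, **kwargs)
        finally:
            # Always free the slot, also when the call fails.
            self._slots.release()
```

One bulkhead instance per dependency class keeps a slow dependency from occupying every worker: once its slots are taken, further calls are rejected after `max_wait` while calls to other dependencies proceed.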

Java/Spring Boot: Resilience4j

# application.yml
resilience4j:
  circuitbreaker:
    instances:
      payment-gateway:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 2
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 30s
        failureRateThreshold: 50       # 50% error rate → OPEN
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 80
  retry:
    instances:
      payment-gateway:
        maxAttempts: 3
        waitDuration: 100ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true     # Needed for randomizedWaitFactor to take effect
        randomizedWaitFactor: 0.5      # Jitter ±50%
  bulkhead:
    instances:
      payment-gateway:
        maxConcurrentCalls: 10         # Max. concurrent calls
        maxWaitDuration: 100ms

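// PaymentService.java
import java.math.BigDecimal;

import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    // Application-specific collaborators (types assumed here), injected by Spring
    private final PaymentGatewayClient paymentGatewayClient;
    private final OfflineQueue offlineQueue;

    public PaymentService(PaymentGatewayClient paymentGatewayClient, OfflineQueue offlineQueue) {
        this.paymentGatewayClient = paymentGatewayClient;
        this.offlineQueue = offlineQueue;
    }

    @CircuitBreaker(name = "payment-gateway", fallbackMethod = "paymentFallback")
    @Retry(name = "payment-gateway")
    @Bulkhead(name = "payment-gateway")
    public ChargeResult chargeCard(String cardToken, BigDecimal amount) {
        return paymentGatewayClient.charge(cardToken, amount);
    }

    public ChargeResult paymentFallback(String cardToken, BigDecimal amount,
                                         CallNotPermittedException ex) {
        // Circuit open: offline queue for later processing (other exceptions propagate)
        offlineQueue.enqueue(cardToken, amount);
        return ChargeResult.queued("Payment queued for processing");
    }
}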

Istio: Service Mesh Circuit Breaking

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-gateway
  namespace: payment
spec:
  host: payment-gateway.payment.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 3s
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 1000
        maxRetries: 3                 # Max. retries outstanding to the host at a time
        # Retry conditions (retryOn, attempts, perTryTimeout) are not
        # DestinationRule fields; set them under http[].retries in a VirtualService.

    outlierDetection:
      consecutive5xxErrors: 5         # 5 errors → ejection
      interval: 10s                   # Evaluation window
      baseEjectionTime: 30s           # Minimum ejection time
      maxEjectionPercent: 50          # Eject max. 50% of hosts
      splitExternalLocalOriginErrors: true

Terraform: ALB with Timeout

resource "aws_lb" "api" {
  name               = "payment-api-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
  idle_timeout       = 30    # For REST APIs: 30s; not the default 60s

  tags = var.mandatory_tags
}

resource "aws_lb_target_group" "api" {
  name                 = "payment-api-tg"
  port                 = 8080
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  deregistration_delay = 30  # Graceful shutdown window

  health_check {
    enabled             = true
    path                = "/health/ready"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}

Typical Anti-Patterns

One circuit breaker shared by all calls, reads and writes alike: Too coarse, because failing writes also block healthy reads; configure write and read calls separately
Retry without jitter: All clients retry at the same time → retry storm
Reset timeout too short: Circuit switches to HALF-OPEN too quickly → further failures
Connection pool shared for all DBs: One slow query exhausts the pool for all other services
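
The "retry without jitter" anti-pattern can be illustrated with a small simulation. It assumes 1000 clients that all fail at t=0 and compares how many retries land in the same 100 ms window with and without jitter. The jitter scheme (backoff plus up to one second of random delay) follows the Python retry example; the function names and the fixed seed are illustrative assumptions.

```python
import random
from collections import Counter


def retry_times(clients: int, attempt: int, jitter: bool, rng: random.Random) -> list:
    """Delay in seconds before retry number `attempt`, one entry per client."""
    base = 2 ** attempt                        # exponential backoff: 1s, 2s, 4s, ...
    if not jitter:
        return [float(base)] * clients         # everyone wakes at the same instant
    # Additive jitter: spread the retries over one extra second.
    return [base + rng.uniform(0, 1) for _ in range(clients)]


def peak_load(times: list, window: float = 0.1) -> int:
    """Largest number of retries that fall into one `window`-second bucket."""
    buckets = Counter(int(t / window) for t in times)
    return max(buckets.values())


rng = random.Random(42)  # fixed seed so the run is repeatable
print(peak_load(retry_times(1000, 1, jitter=False, rng=rng)))  # 1000
print(peak_load(retry_times(1000, 1, jitter=True, rng=rng)))   # a fraction of that
```

Without jitter the recovering dependency receives all 1000 retries in a single window; with jitter the same retries are spread across roughly ten windows.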

Metrics

Circuit Breaker Open Rate: Number of minutes per hour in OPEN state (target: < 1 min/h)
Timeout Rate: % of calls that time out (target: < 0.1%)
Retry Rate: % of calls that were retried at least once (target: < 5%)
Bulkhead Rejection Rate: % of calls rejected by the bulkhead
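
As a rough sketch of how these metrics could be derived from raw counters, the following assumes a service exports hourly totals. `HourlyCounters`, its field names and the helper functions are hypothetical; the target values are the ones stated in the list above, and the bulkhead rejection rate has no stated target.

```python
from dataclasses import dataclass

# Target ceilings from the metric list; bulkhead rejections have none stated.
TARGETS = {
    "open_minutes_per_hour": 1.0,
    "timeout_rate_pct": 0.1,
    "retry_rate_pct": 5.0,
}


@dataclass
class HourlyCounters:
    """Hypothetical raw counters a service might export for one hour."""
    calls: int                  # attempted outgoing calls
    timeouts: int
    retried_calls: int          # calls retried at least once
    bulkhead_rejections: int
    open_seconds: float         # time the breaker spent in OPEN


def resilience_metrics(c: HourlyCounters) -> dict:
    """Turn raw counters into the four metrics listed above."""
    if c.calls == 0:
        raise ValueError("no calls recorded in this window")
    return {
        "open_minutes_per_hour": c.open_seconds / 60,
        "timeout_rate_pct": 100 * c.timeouts / c.calls,
        "retry_rate_pct": 100 * c.retried_calls / c.calls,
        "bulkhead_rejection_rate_pct": 100 * c.bulkhead_rejections / c.calls,
    }


def target_violations(metrics: dict) -> list:
    """Names of metrics that meet or exceed their target ceiling."""
    return [name for name, limit in TARGETS.items() if metrics[name] >= limit]
```

For example, `target_violations(resilience_metrics(HourlyCounters(1000, 5, 10, 0, 120.0)))` flags the open time (2 min/h) and the timeout rate (0.5%) but not the retry rate (1%).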

Maturity Level

Level 1 – No timeouts, no circuit breakers
Level 2 – Basic timeouts configured
Level 3 – Circuit breakers for all critical dependencies; retry with backoff
Level 4 – Bulkheads per dependency class; service mesh manages CB declaratively
Level 5 – Adaptive thresholds; request hedging for latency-critical paths
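
Request hedging, named at Level 5, means sending a second copy of a slow request and using whichever response arrives first. A minimal asyncio sketch follows; `hedged_call` and `hedge_delay` are illustrative names. It assumes the request is idempotent, so it suits read paths and should not be applied to operations such as payment charges.

```python
import asyncio


async def hedged_call(coro_func, hedge_delay: float):
    """Run coro_func(); if it has not finished after hedge_delay seconds,
    start a second attempt and return whichever attempt finishes first."""
    primary = asyncio.create_task(coro_func())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()       # fast path: no hedge needed

    hedge = asyncio.create_task(coro_func())
    done, pending = await asyncio.wait(
        {primary, hedge}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()                 # drop the slower attempt
    # If the first attempt to finish failed, its exception is raised here.
    return done.pop().result()
```

A common choice is to set `hedge_delay` near a high latency percentile of the dependency, so that only a small share of requests is duplicated.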