WAF++

Best Practice: Circuit Breaker, Timeouts & Bulkheads

Context

Cascading failures are among the most common causes of major cloud outages. Without a circuit breaker, a slow or failing dependency gradually exhausts the thread pools, connection pools and request queues of its callers until they fail as well.

Common problems without structured resilience patterns:

  • A slow external API leaves all API handler threads waiting for a timeout

  • Database connection pool is exhausted and blocks all services sharing the same pool

  • Retry storms: 1000 clients simultaneously retry their failed requests

  • An optional enrichment service fails and brings down the entire main service

Related requirements:

  • WAF-REL-050 – Circuit Breaker & Timeout Configuration

  • WAF-REL-080 – Dependency & Upstream Resilience Management

Target State

  • Every outgoing call has explicit timeouts

  • Critical dependencies have circuit breakers with defined thresholds

  • Retry logic prevents storms through exponential backoff with jitter

  • Bulkheads isolate different dependency classes into separate resource pools

Technical Implementation

Python: Circuit Breaker with pybreaker

import asyncio
import logging
import random

import httpx
import pybreaker

class ServiceUnavailableError(Exception):
    """Dependency is unavailable – fail fast."""

class ServiceTimeoutError(Exception):
    """Dependency did not respond within the deadline."""

# pybreaker reports state transitions via listener objects
class LoggingListener(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        logging.warning("CircuitBreaker[%s]: %s -> %s",
                        cb.name, old_state.name, new_state.name)

# Configure circuit breaker
payment_gateway_cb = pybreaker.CircuitBreaker(
    fail_max=5,          # After 5 consecutive failures: OPEN
    reset_timeout=30,    # After 30s: HALF-OPEN (one test request)
    name="payment-gateway",
    listeners=[LoggingListener()],
)

async def charge_card(card_token: str, amount: float) -> dict:
    """Payment with circuit breaker and explicit timeout."""
    try:
        # Circuit breaker wrapping + timeout
        async with httpx.AsyncClient(timeout=httpx.Timeout(3.0)) as client:
            response = await payment_gateway_cb.call_async(
                client.post,
                "https://payment-gateway.example.com/charge",
                json={"card_token": card_token, "amount": amount},
            )
            return response.json()

    except pybreaker.CircuitBreakerError:
        # Circuit is OPEN: immediate rejection, no wait time
        logging.warning("Payment gateway circuit open – fast failing")
        raise ServiceUnavailableError("Payment gateway temporarily unavailable")

    except httpx.TimeoutException:
        # Timeout exceeded; the breaker also counts this as a failure
        raise ServiceTimeoutError("Payment gateway timeout after 3s")

# Retry with exponential backoff + jitter
async def charge_with_retry(card_token: str, amount: float) -> dict:
    max_attempts = 3
    for attempt in range(max_attempts):
        try:
            return await charge_card(card_token, amount)
        except ServiceTimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff plus random jitter to avoid retry storms
            wait = (2 ** attempt) + random.random()
            await asyncio.sleep(wait)
    raise ServiceUnavailableError("Max retry attempts exceeded")
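
The target state also calls for bulkheads, which the example above does not cover. A minimal sketch of the pattern in Python using asyncio.Semaphore – class name, pool sizes and the fail-fast wait are illustrative, not a fixed API:

```python
import asyncio

class BulkheadFullError(Exception):
    """Raised when a dependency's resource pool is saturated."""

class Bulkhead:
    """Caps concurrent calls to one dependency class (bulkhead pattern)."""

    def __init__(self, max_concurrent: int, max_wait_s: float = 0.1):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._max_wait_s = max_wait_s

    async def run(self, coro):
        # Fail fast instead of queueing unboundedly when the pool is full
        try:
            await asyncio.wait_for(self._sem.acquire(), timeout=self._max_wait_s)
        except asyncio.TimeoutError:
            coro.close()  # discard the un-awaited coroutine cleanly
            raise BulkheadFullError("bulkhead saturated")
        try:
            return await coro
        finally:
            self._sem.release()

# Separate pools per dependency class: a slow optional enrichment
# service can no longer starve the critical payment pool.
payment_bulkhead = Bulkhead(max_concurrent=10)
enrichment_bulkhead = Bulkhead(max_concurrent=5)
```

Usage: `await payment_bulkhead.run(charge_card(token, amount))` – a saturated pool rejects immediately instead of letting callers pile up.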

Java/Spring Boot: Resilience4j

# application.yml
resilience4j:
  circuitbreaker:
    instances:
      payment-gateway:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 2
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 30s
        failureRateThreshold: 50       # 50% error rate → OPEN
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 80
  retry:
    instances:
      payment-gateway:
        maxAttempts: 3
        waitDuration: 100ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        randomizedWaitFactor: 0.5      # Jitter ±50%
  bulkhead:
    instances:
      payment-gateway:
        maxConcurrentCalls: 10         # Max. concurrent calls
        maxWaitDuration: 100ms

// PaymentService.java
@Service
public class PaymentService {

    // Injected collaborators (type names illustrative)
    private final PaymentGatewayClient paymentGatewayClient;
    private final OfflinePaymentQueue offlineQueue;

    public PaymentService(PaymentGatewayClient paymentGatewayClient,
                          OfflinePaymentQueue offlineQueue) {
        this.paymentGatewayClient = paymentGatewayClient;
        this.offlineQueue = offlineQueue;
    }

    @CircuitBreaker(name = "payment-gateway", fallbackMethod = "paymentFallback")
    @Retry(name = "payment-gateway")
    @Bulkhead(name = "payment-gateway")
    public ChargeResult chargeCard(String cardToken, BigDecimal amount) {
        return paymentGatewayClient.charge(cardToken, amount);
    }

    public ChargeResult paymentFallback(String cardToken, BigDecimal amount,
                                         CallNotPermittedException ex) {
        // Circuit open: offline queue for later processing
        offlineQueue.enqueue(cardToken, amount);
        return ChargeResult.queued("Payment queued for processing");
    }
}

Istio: Service Mesh Circuit Breaking

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-gateway
  namespace: payment
spec:
  host: payment-gateway.payment.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 3s
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 1000
        maxRetries: 3    # Retry budget; the retry policy itself belongs in a VirtualService

    outlierDetection:
      consecutive5xxErrors: 5         # 5 errors → ejection
      interval: 10s                   # Evaluation window
      baseEjectionTime: 30s           # Minimum ejection time
      maxEjectionPercent: 50          # Eject max. 50% of hosts
      splitExternalLocalOriginErrors: true
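
The DestinationRule covers connection limits and outlier detection; the HTTP retry policy itself (conditions, per-try timeout) is defined in a VirtualService. A minimal sketch – host, timeouts and retry conditions are illustrative values:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-gateway
  namespace: payment
spec:
  hosts:
    - payment-gateway.payment.svc.cluster.local
  http:
    - route:
        - destination:
            host: payment-gateway.payment.svc.cluster.local
      timeout: 10s               # Overall request deadline across retries
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: "5xx,connect-failure,retriable-4xx"
```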

Terraform: ALB with Timeout

resource "aws_lb" "api" {
  name               = "payment-api-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
  idle_timeout       = 30    # For REST APIs: 30s; not the default 60s

  tags = var.mandatory_tags
}

resource "aws_lb_target_group" "api" {
  name                 = "payment-api-tg"
  port                 = 8080
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  deregistration_delay = 30  # Graceful shutdown window

  health_check {
    enabled             = true
    path                = "/health/ready"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}

Typical Anti-Patterns

  • A single circuit breaker shared by all calls, reads and writes alike: too coarse-grained – a failing read path opens the circuit for writes; configure separate breakers for read and write operations

  • Retry without jitter: All clients retry at the same time → retry storm

  • Reset timeout too short: Circuit switches to HALF-OPEN too quickly → further failures

  • Connection pool shared for all DBs: One slow query exhausts the pool for all other services
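
The retry-storm anti-pattern is avoided with "full jitter" backoff: instead of every client sleeping exactly 2^attempt seconds, each draws a random wait from the whole backoff window, decorrelating the retries. A sketch – base and cap values are illustrative:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full-jitter backoff: random wait in [0, min(cap, base * 2**attempt)].

    Deterministic exponential backoff keeps failed clients synchronized;
    sampling uniformly from the window spreads their retries out.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```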

Metrics

  • Circuit Breaker Open Rate: Number of minutes per hour in OPEN state (target: < 1 min/h)

  • Timeout Rate: % of calls that time out (target: < 0.1%)

  • Retry Rate: % of calls that were retried at least once (target: < 5%)

  • Bulkhead Rejection Rate: % of calls rejected by the bulkhead
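
Assuming the Resilience4j setup above exports metrics via Micrometer/Prometheus, the first and third metric can be derived with queries along these lines (metric and label names follow resilience4j-micrometer conventions – verify against your exporter):

```promql
# Minutes per hour in OPEN state (target: < 1 min/h)
avg_over_time(resilience4j_circuitbreaker_state{name="payment-gateway", state="open"}[1h]) * 60

# Share of calls that needed at least one retry (target: < 5%)
sum(rate(resilience4j_retry_calls_total{name="payment-gateway", kind=~".*_with_retry"}[5m]))
  / sum(rate(resilience4j_retry_calls_total{name="payment-gateway"}[5m]))
```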

Maturity Level

Level 1 – No timeouts, no circuit breakers
Level 2 – Basic timeouts configured
Level 3 – Circuit breakers for all critical dependencies; retry with backoff
Level 4 – Bulkheads per dependency class; service mesh manages CB declaratively
Level 5 – Adaptive thresholds; request hedging for latency-critical paths