Circuit Breaker & Bulkhead
Circuit Breaker fails fast when downstream is broken, preventing cascading failures. Bulkhead isolates resources per service, preventing one slow dependency from starving others.
Circuit Breaker State Machine
Circuit breaker prevents cascading failures by failing fast and periodically testing recovery.
```text
              threshold exceeded          recovery timeout expires
  [CLOSED] ───────────────────→ [OPEN] ───────────────────────→ [HALF-OPEN]
   (pass)                    (fail fast)                        (test probe)
     ↑                           ↑                                  │   │
     │                           │          test fails              │   │
     │                           └──────────────────────────────────┘   │
     │                                      test succeeds               │
     └──────────────────────────────────────────────────────────────────┘
```
| State | Behavior | Transition |
|---|---|---|
| Closed | Pass requests through; count failures | → Open if error rate > threshold, or timeouts > threshold, or consecutive failures > min_calls |
| Open | Reject all requests immediately (fail fast); return error or fallback | → Half-Open after recovery_timeout expires |
| Half-Open | Allow limited test requests to verify recovery | → Closed if tests succeed; → Open if tests fail |
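The transition rules in the table can be expressed as a small pure function. This is a minimal sketch; the state names, parameter names, and defaults are illustrative rather than from any specific library:

```python
# Minimal sketch of the transition table as a pure function.
# All names and thresholds here are illustrative.

def next_state(state, *, error_rate=0.0, calls=0, min_calls=10,
               threshold=0.5, timeout_expired=False, probe_ok=None):
    if state == "CLOSED":
        # Trip only once enough traffic has been observed
        if calls >= min_calls and error_rate > threshold:
            return "OPEN"
        return "CLOSED"
    if state == "OPEN":
        # After the recovery timeout, allow test probes
        return "HALF_OPEN" if timeout_expired else "OPEN"
    if state == "HALF_OPEN":
        # Probe result decides: success closes, failure reopens
        if probe_ok is None:
            return "HALF_OPEN"
        return "CLOSED" if probe_ok else "OPEN"
    raise ValueError(f"unknown state: {state}")
```

Keeping the transition logic pure like this makes the state machine trivially unit-testable, independent of clocks and counters.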
Circuit Breaker Configuration
| Parameter | Example | Effect | Rationale |
|---|---|---|---|
| Failure threshold | 50% error rate | Switch to Open | Fail fast after threshold breached |
| Minimum calls | 10 requests | Ignore threshold until min reached | Avoid flapping on low traffic |
| Recovery timeout | 60 seconds | Wait before attempting Half-Open | Give downstream time to recover |
| Half-Open max calls | 3 requests | Max tests to verify recovery | Balance speed vs. safety |
| Slow call threshold | p99 latency > 2s | Count as failure | Fail fast on slow, not just errors |
| Slow call duration %ile | p99 (vs. p50) | Use high percentile | Avoid false positives from occasional slow requests |
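One way to read the table: a call counts as a failure if it either errored or exceeded the slow-call threshold, and the breaker trips only when the window has enough samples. A hedged sketch with the table's example values (names and the window representation are illustrative):

```python
# Illustrative sketch: classify a call as failed if it errored or was slow,
# then decide whether the observed window should trip the breaker.

SLOW_CALL_SECONDS = 2.0   # example slow-call threshold from the table
FAILURE_THRESHOLD = 0.5   # 50% error rate
MIN_CALLS = 10            # ignore the threshold below this volume

def is_failure(errored: bool, duration_s: float) -> bool:
    return errored or duration_s > SLOW_CALL_SECONDS

def should_trip(window):
    """window: list of (errored, duration_s) samples."""
    if len(window) < MIN_CALLS:
        return False  # avoid flapping on low traffic
    failures = sum(is_failure(e, d) for e, d in window)
    return failures / len(window) > FAILURE_THRESHOLD
```

Note that slow successful calls count toward tripping, which is exactly the "fail fast on slow, not just errors" rationale above.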
State Transitions in Action
```python
import time


class CircuitBreakerOpenException(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """
    Prevent cascading failures by failing fast.
    States: CLOSED (pass), OPEN (fail-fast), HALF_OPEN (test).
    """
    def __init__(self, failure_threshold=0.5, min_calls=10, timeout_seconds=60):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.timeout_seconds = timeout_seconds
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        """Execute function; update circuit breaker state."""
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout_seconds:
                self.state = "HALF_OPEN"
                self.failure_count = 0    # Reset counters for the test probes
                self.success_count = 0
            else:
                raise CircuitBreakerOpenException("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.success_count += 1
        if self.state == "HALF_OPEN" and self.success_count >= 2:
            self.state = "CLOSED"   # Recovery confirmed
            self.failure_count = 0
            self.success_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        total = self.failure_count + self.success_count
        if self.state == "CLOSED" and total >= self.min_calls:
            if self.failure_count / total > self.failure_threshold:
                self.state = "OPEN"   # Fail fast
        elif self.state == "HALF_OPEN":
            self.state = "OPEN"       # Test probe failed, reopen
```
Bulkhead Pattern
Bulkhead isolates resources (threads, connections) per service, preventing one slow dependency from exhausting shared resources.
Thread Pool Isolation
Each external service gets its own thread pool. If ServiceB is slow, its thread pool saturates, but ServiceA and ServiceC continue.
```text
Request Handler
 ├─ Thread Pool 1 (10 threads) → ServiceA
 ├─ Thread Pool 2 (5 threads)  → ServiceB (slow)
 └─ Thread Pool 3 (10 threads) → ServiceC

If ServiceB's thread pool saturates:
 • ServiceA threads continue
 • ServiceC threads continue
 • ServiceB requests are queued or rejected (fail fast)
```
Configuration:
```python
from concurrent.futures import ThreadPoolExecutor

# One pool per downstream service. (Note: the standard-library executor's work
# queue is unbounded; bounding it requires a custom executor or a semaphore.)
executor_pool_a = ThreadPoolExecutor(max_workers=10, thread_name_prefix="ServiceA")
executor_pool_b = ThreadPoolExecutor(max_workers=5, thread_name_prefix="ServiceB")   # Smaller pool
executor_pool_c = ThreadPoolExecutor(max_workers=10, thread_name_prefix="ServiceC")
```
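The isolation property can be demonstrated end to end with the standard library. A minimal runnable sketch (service names, sleep times, and pool sizes are illustrative): saturating ServiceB's pool leaves ServiceA's pool untouched.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# One executor per downstream service: saturating B cannot steal A's threads.
pool_a = ThreadPoolExecutor(max_workers=10, thread_name_prefix="ServiceA")
pool_b = ThreadPoolExecutor(max_workers=2, thread_name_prefix="ServiceB")

def call_service_a():
    return "a-ok"          # fast dependency

def call_service_b():
    time.sleep(0.3)        # slow dependency holds its threads
    return "b-ok"

# Saturate ServiceB's pool with more work than it has threads...
b_futures = [pool_b.submit(call_service_b) for _ in range(4)]

# ...ServiceA requests still complete immediately on their own pool.
start = time.time()
a_result = pool_a.submit(call_service_a).result()
a_latency = time.time() - start
```

With a single shared pool, the slow ServiceB calls would occupy all workers and the ServiceA call would wait behind them; separate pools keep its latency unaffected.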
Semaphore Isolation
A lightweight alternative to thread pools: the caller's own thread blocks on semaphore acquire, so no extra threads are created.
```python
import threading

import requests

# threading.Semaphore takes the permit count as a positional argument
semaphore_service_a = threading.Semaphore(20)  # max 20 concurrent calls

def call_service_a():
    with semaphore_service_a:            # Acquire a permit
        # If all 20 permits are taken, the caller blocks here
        result = requests.get("http://service-a/...")
    # Permit released automatically when the with-block exits
    return result
```
Pros: no thread-pool overhead; simpler; a natural fit for async/non-blocking code.
Cons: the caller's thread blocks on acquire, and the protected call cannot be timed out or cancelled independently of the caller (unlike a thread pool).
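If blocking callers is unacceptable, the same semaphore can be used fail-fast: try to acquire without blocking and shed load immediately when the bulkhead is full. A sketch (the capacity and the `{"status": "rejected"}` response shape are illustrative):

```python
import threading

sem = threading.Semaphore(2)  # bulkhead capacity (illustrative value)

def call_with_bulkhead(func):
    # Non-blocking acquire: shed load instead of queueing callers
    if not sem.acquire(blocking=False):
        return {"status": "rejected"}  # fail fast (or invoke a fallback)
    try:
        return func()
    finally:
        sem.release()
```

Rejecting immediately turns a saturated bulkhead into a fast, explicit error that upstream code can handle, rather than an invisible queue of blocked threads.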
Resilience Strategy: Timeout + Retry + Jitter + Fallback
Combine multiple patterns for robust failure handling.
```python
import random
import time

import requests


def resilient_call(service_name: str, request_func, fallback_func=None):
    """
    Make a resilient call with timeout, retry, jitter, and fallback.
    Assumes circuit_breaker is a dict of per-service CircuitBreaker instances.
    """
    cb = circuit_breaker[service_name]
    max_retries = 3
    base_delay = 0.1      # 100 ms
    jitter_factor = 0.1   # 10% jitter

    for attempt in range(max_retries):
        try:
            # Execute with circuit breaker; timeout is forwarded to request_func
            return cb.call(request_func, timeout=2.0)
        except CircuitBreakerOpenException:
            # Circuit is open; use fallback if available
            if fallback_func:
                return fallback_func()
            raise
        except requests.exceptions.Timeout:
            if attempt < max_retries - 1:
                # Exponential backoff + jitter
                delay = base_delay * (2 ** attempt)
                jitter = delay * jitter_factor * random.random()
                time.sleep(delay + jitter)
            else:
                raise
        # Any other exception propagates immediately: don't retry

    raise RuntimeError(f"All {max_retries} attempts failed")


# Usage
def call_payment_service(timeout=2.0):
    return requests.post("http://payment-service/charge",
                         json=charge_request, timeout=timeout)

def fallback_payment():
    # Queue the charge for asynchronous processing
    queue.put(charge_request)
    return {"status": "queued"}

result = resilient_call("payment-service", call_payment_service, fallback_payment)
```
How Real Systems Use This
Netflix: Hystrix (Circuit Breaker + Bulkhead)
Architecture: Each service call wrapped in Hystrix command with circuit breaker + thread pool bulkhead.
Configuration:
```python
# Python-style pseudocode modeled on Netflix's Hystrix (Hystrix itself is a Java library).
# Each downstream service has its own command.
class GetOrderCommand(HystrixCommand):
    def __init__(self, order_id):
        super().__init__(
            group_key="OrderService",
            command_key="GetOrder",
            thread_pool_key="OrderService",
            thread_pool=ThreadPoolConfig(core_size=20, queue_size=100),
            circuit_breaker=CircuitBreakerConfig(
                error_threshold_percentage=50,
                request_volume_threshold=20,
                sleep_window_in_milliseconds=5000
            )
        )
        self.order_id = order_id

    def run(self):
        return requests.get(f"http://order-service/orders/{self.order_id}").json()

    def get_fallback(self):
        # Return a cached version if available
        return cache.get(f"order:{self.order_id}", None)

# Usage
result = GetOrderCommand(order_id=123).execute()
```
Result: One slow or failing service cannot bring down the entire Netflix platform. 300+ services call each other with confidence.
AWS: DynamoDB Rate Limiting + Backoff
Pattern: Client-side exponential backoff + jitter when receiving ProvisionedThroughputExceededException.
Why: DynamoDB’s rate limiting can cascade if all clients retry immediately. Jitter prevents thundering herd.
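A widely cited variant from AWS's guidance is "full jitter": instead of adding a small random offset to a fixed backoff, draw the entire delay uniformly from [0, capped backoff], which spreads retries across the whole window. A minimal sketch (`base` and `cap` values are illustrative):

```python
import random

def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 20.0) -> float:
    """AWS-style 'full jitter': sleep a uniform random amount
    between 0 and the capped exponential backoff."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Compared with a fixed 10% jitter, full jitter decorrelates clients much more aggressively, at the cost of sometimes retrying almost immediately.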
```python
import random
import time

from botocore.exceptions import ClientError


def write_to_dynamodb(table, item, max_retries=3):
    """Write with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            table.put_item(Item=item)
            return
        except ClientError as e:
            if e.response['Error']['Code'] != 'ProvisionedThroughputExceededException':
                raise   # Don't retry non-throttling errors
            if attempt < max_retries - 1:
                delay = 2 ** attempt                    # 1s, 2s, 4s
                jitter = delay * 0.1 * random.random()  # up to 10% extra
                time.sleep(delay + jitter)
            else:
                raise
```
Google Cloud: Fallback and Graceful Degradation
Pattern: When downstream unavailable, serve stale data or partial results.
Example:
```python
def get_recommendations():
    """Get recommendations, with fallback to popular items."""
    try:
        # Try the ML model first
        return ml_model.recommend(user_id, timeout=1.0)
    except (TimeoutException, ServiceDownException):
        # Fall back to popular items (much cheaper)
        return get_popular_items()
```
Result: User sees recommendations even if ML service is down.
References
- Release It! Design and Deploy Production-Ready Software – Michael Nygard (2nd ed., 2018) – Definitive book on circuit breakers, bulkheads, timeouts; includes Hystrix case study.
- Hystrix: Latency and Fault Tolerance for Distributed Systems – Netflix (Archived) – Original circuit breaker library documentation.
- ByteByteGo – Circuit Breaker Pattern – Visual explanation of state transitions.
- Building Microservices – Sam Newman, Ch. 8 – Coverage of resilience patterns in microservices.
- AWS Best Practices for Exponential Backoff – Jitter + backoff strategy for distributed systems.