Circuit Breaker & Bulkhead

Circuit Breaker fails fast when downstream is broken, preventing cascading failures. Bulkhead isolates resources per service, preventing one slow dependency from starving others.

Circuit Breaker State Machine

Circuit breaker prevents cascading failures by failing fast and periodically testing recovery.

         threshold exceeded          recovery timeout expires         tests succeed
[CLOSED] ──────────────────→ [OPEN] ───────────────────────→ [HALF-OPEN] ──────→ [CLOSED]
(pass)                       (fail fast)                     (test probe)        (success)
                                ↑                                 │
                                └─────────────────────────────────┘
                                          test fails
| State | Behavior | Transition |
|-------|----------|------------|
| Closed | Pass requests through; count failures | → Open if error rate > threshold or timeout rate > threshold, evaluated once at least min_calls requests have been observed |
| Open | Reject all requests immediately (fail fast); return error or fallback | → Half-Open after recovery_timeout expires |
| Half-Open | Allow limited test requests to verify recovery | → Closed if tests succeed; → Open if tests fail |

Circuit Breaker Configuration

| Parameter | Example | Effect | Rationale |
|-----------|---------|--------|-----------|
| Failure threshold | 50% error rate | Switch to Open | Fail fast after threshold breached |
| Minimum calls | 10 requests | Ignore threshold until min reached | Avoid flapping on low traffic |
| Recovery timeout | 60 seconds | Wait before attempting Half-Open | Give downstream time to recover |
| Half-Open max calls | 3 requests | Max tests to verify recovery | Balance speed vs. safety |
| Slow call threshold | p99 latency > 2s | Count as failure | Fail fast on slow, not just errors |
| Slow call duration percentile | p99 (vs. p50) | Use high percentile | Avoid false positives from occasional slow requests |
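The slow-call settings above can be sketched as a latency-based failure check: calls that succeed but exceed a duration threshold still count toward opening the circuit. This is a minimal sketch; the constant names and the simple list-based window are illustrative, not from any particular library.

```python
import time

SLOW_CALL_THRESHOLD = 2.0   # seconds; calls slower than this count as failures
SLOW_CALL_RATE_LIMIT = 0.5  # open the circuit if >50% of calls are slow

def timed_call(func, durations):
    """Run func, record its latency, and report whether it counted as 'slow'."""
    start = time.monotonic()
    result = func()
    elapsed = time.monotonic() - start
    durations.append(elapsed)
    return result, elapsed > SLOW_CALL_THRESHOLD

def should_open(durations, min_calls=10):
    """Trip the breaker when the slow-call rate exceeds the limit."""
    if len(durations) < min_calls:
        return False  # avoid flapping on low traffic
    slow = sum(1 for d in durations if d > SLOW_CALL_THRESHOLD)
    return slow / len(durations) > SLOW_CALL_RATE_LIMIT
```

A production breaker would use a sliding time window rather than an ever-growing list, but the decision rule is the same.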

State Transitions in Action

import time


class CircuitBreakerOpenException(Exception):
    """Raised when the circuit is OPEN and calls are rejected."""


class CircuitBreaker:
    """
    Prevent cascading failures by failing fast.
    States: CLOSED (pass), OPEN (fail-fast), HALF_OPEN (test).
    """

    def __init__(self, failure_threshold=0.5, min_calls=10, timeout_seconds=60):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.timeout_seconds = timeout_seconds
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        """Execute function; update circuit breaker state."""
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout_seconds:
                self.state = "HALF_OPEN"
                self.failure_count = 0  # Reset counters for the test window
                self.success_count = 0
            else:
                raise CircuitBreakerOpenException("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.success_count += 1
        if self.state == "HALF_OPEN" and self.success_count >= 2:
            self.state = "CLOSED"  # Recovery confirmed
            self.failure_count = 0
            self.success_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        total_calls = self.failure_count + self.success_count
        error_rate = self.failure_count / total_calls

        # Only evaluate the threshold once enough calls have been observed
        if self.state == "CLOSED" and total_calls >= self.min_calls:
            if error_rate > self.failure_threshold:
                self.state = "OPEN"  # Fail fast

        elif self.state == "HALF_OPEN":
            self.state = "OPEN"  # Test failed, reopen

Bulkhead Pattern

Bulkhead isolates resources (threads, connections) per service, preventing one slow dependency from exhausting shared resources.

Thread Pool Isolation

Each external service gets its own thread pool. If ServiceB is slow, its thread pool saturates, but ServiceA and ServiceC continue.

Request Handler
├─ Thread Pool 1 (10 threads) → ServiceA
├─ Thread Pool 2 (5 threads) → ServiceB (slow)
└─ Thread Pool 3 (10 threads) → ServiceC

If ServiceB thread pool saturates:
  ServiceA threads continue
  ServiceC threads continue
  ServiceB requests queued or rejected (fail fast)

Configuration:

from concurrent.futures import ThreadPoolExecutor

# Note: the stdlib ThreadPoolExecutor has an unbounded queue; a bounded,
# fail-fast queue needs a wrapper or a third-party pool.
executor_pool_a = ThreadPoolExecutor(max_workers=10, thread_name_prefix="ServiceA")
executor_pool_b = ThreadPoolExecutor(max_workers=5, thread_name_prefix="ServiceB")  # Smaller
executor_pool_c = ThreadPoolExecutor(max_workers=10, thread_name_prefix="ServiceC")
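A bulkhead should reject work once the pool saturates rather than queue it indefinitely. One way to get reject-on-full behavior on top of the stdlib executor is a semaphore-guarded wrapper; this is a sketch, and the `BulkheadExecutor` name and sizes are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadExecutor:
    """Thread-pool bulkhead with a bounded queue: rejects instead of queuing forever."""

    def __init__(self, max_workers, queue_size, name):
        self._executor = ThreadPoolExecutor(max_workers=max_workers,
                                            thread_name_prefix=name)
        # Permits cover both running and queued tasks.
        self._permits = threading.Semaphore(max_workers + queue_size)

    def submit(self, fn, *args, **kwargs):
        if not self._permits.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting (fail fast)")
        future = self._executor.submit(fn, *args, **kwargs)
        # Return the permit once the task finishes (success or failure).
        future.add_done_callback(lambda _: self._permits.release())
        return future


pool_b = BulkheadExecutor(max_workers=5, queue_size=20, name="ServiceB")
```

Rejected submissions surface immediately as an error the caller can map to a fallback, which is exactly the "queued or rejected (fail fast)" behavior described above.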

Semaphore Isolation

A lightweight alternative to thread pools: the call runs on the caller's own thread, which blocks on the semaphore acquire (no extra threads are created).

import threading
import requests

semaphore_service_a = threading.Semaphore(20)  # At most 20 concurrent calls

def call_service_a():
    with semaphore_service_a:  # Acquire permit; blocks if 20 calls are in flight
        result = requests.get("http://service-a/...")
    # Permit released when the with-block exits
    return result

Pros: No thread pool overhead; simpler; well suited to high-volume, fast, or in-memory calls.

Cons: The call runs on (and blocks) the caller's thread, so the bulkhead itself cannot enforce timeouts; poorly suited to slow network calls.
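When blocking the caller is unacceptable, the semaphore can instead reject immediately, turning the bulkhead into a fail-fast guard. This is a sketch; the `SemaphoreBulkhead` class and its method names are illustrative.

```python
import threading


class SemaphoreBulkhead:
    """Semaphore bulkhead that rejects instead of blocking the caller."""

    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        if not self._sem.acquire(blocking=False):  # don't block: fail fast
            raise RuntimeError("bulkhead full")
        try:
            return func(*args, **kwargs)
        finally:
            self._sem.release()
```

The caller catches the rejection and falls back, just as it would for an open circuit breaker.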

Resilience Strategy: Timeout + Retry + Jitter + Fallback

Combine multiple patterns for robust failure handling.

import random
import time

def resilient_call(service_name: str, request_func, fallback_func=None):
    """
    Make a resilient call with timeout, retry, jitter, fallback.
    """
    cb = circuit_breaker[service_name]  # Per-service CircuitBreaker registry
    max_retries = 3
    base_delay = 0.1  # 100ms
    jitter_factor = 0.1  # 10% jitter

    for attempt in range(max_retries):
        try:
            # Execute with circuit breaker; timeout is forwarded to request_func
            return cb.call(request_func, timeout=2.0)

        except CircuitBreakerOpenException:
            # Circuit is open; use fallback
            if fallback_func:
                return fallback_func()
            raise

        except TimeoutException:  # the transport's timeout error, e.g. requests.exceptions.Timeout
            if attempt < max_retries - 1:
                # Exponential backoff + jitter
                delay = base_delay * (2 ** attempt)
                jitter = delay * jitter_factor * random.random()
                time.sleep(delay + jitter)
            else:
                raise

        except Exception:
            # Other errors: don't retry
            raise

    raise RuntimeError(f"All {max_retries} attempts failed")

# Usage
def call_payment_service():
    return requests.post("http://payment-service/charge", json=charge_request)

def fallback_payment():
    # Queue charge for async processing
    queue.put(charge_request)
    return {"status": "queued"}

result = resilient_call("payment-service", call_payment_service, fallback_payment)

How Real Systems Use This

Netflix: Hystrix (Circuit Breaker + Bulkhead)

Architecture: Each service call wrapped in Hystrix command with circuit breaker + thread pool bulkhead.

Configuration (illustrative, Python-flavored pseudocode; Hystrix itself is a Java library):

# Each downstream service has its own command
class GetOrderCommand(HystrixCommand):
    def __init__(self, order_id):
        super().__init__(
            group_key="OrderService",
            command_key="GetOrder",
            thread_pool_key="OrderService",
            thread_pool=ThreadPoolConfig(core_size=20, queue_size=100),
            circuit_breaker=CircuitBreakerConfig(
                error_threshold_percentage=50,
                request_volume_threshold=20,
                sleep_window_in_milliseconds=5000
            )
        )
        self.order_id = order_id

    def run(self):
        return requests.get(f"http://order-service/orders/{self.order_id}").json()

    def get_fallback(self):
        # Return cached version if available
        return cache.get(f"order:{self.order_id}", None)

# Usage
result = GetOrderCommand(order_id=123).execute()

Result: One slow or failing service cannot bring down the entire Netflix platform; 300+ services call each other with confidence. (Hystrix has since entered maintenance mode, and Netflix recommends Resilience4j for new projects, but the circuit-breaker + bulkhead design is unchanged.)

AWS: DynamoDB Rate Limiting + Backoff

Pattern: Client-side exponential backoff + jitter when receiving ProvisionedThroughputExceededException.

Why: DynamoDB’s rate limiting can cascade if all clients retry immediately. Jitter prevents thundering herd.

import random
import time

from botocore.exceptions import ClientError

def write_to_dynamodb(table, item, max_retries=3):
    """Write with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            table.put_item(Item=item)
            return
        except ClientError as e:
            if e.response['Error']['Code'] != 'ProvisionedThroughputExceededException':
                raise  # Only throttling errors are retryable here
            if attempt < max_retries - 1:
                delay = 2 ** attempt  # 1s, 2s, 4s
                jitter = delay * 0.1 * random.random()
                time.sleep(delay + jitter)
            else:
                raise

Google Cloud: Fallback and Graceful Degradation

Pattern: When downstream unavailable, serve stale data or partial results.

Example:

def get_recommendations(user_id):
    """Get recommendations, with fallback to popular items."""
    try:
        # Try the ML model first
        return ml_model.recommend(user_id, timeout=1.0)
    except (TimeoutException, ServiceDownException):
        # Fall back to popular items (much cheaper)
        return get_popular_items()

Result: User sees recommendations even if ML service is down.

This post is licensed under CC BY 4.0 by the author.