Circuit Breaker & Bulkhead
Circuit Breaker fails fast when downstream is broken, preventing cascading failures. Bulkhead isolates resources per service, preventing one slow dependency from starving others.
Circuit Breaker State Machine
Circuit breaker prevents cascading failures by failing fast and periodically testing recovery.
```text
              threshold exceeded          recovery timeout expires
  [CLOSED] ───────────────────→ [OPEN] ───────────────────────→ [HALF-OPEN]
   (pass)                    (fail fast)                        (test probe)
     ↑                           ↑                                  │   │
     │                           │          test fails              │   │
     │                           └──────────────────────────────────┘   │
     │                                      test succeeds               │
     └──────────────────────────────────────────────────────────────────┘
```
| State | Behavior | Transition |
|---|---|---|
| Closed | Pass requests through; count failures | → Open if error rate > threshold, or timeouts > threshold, or consecutive failures > min_calls |
| Open | Reject all requests immediately (fail fast); return error or fallback | → Half-Open after recovery_timeout expires |
| Half-Open | Allow limited test requests to verify recovery | → Closed if tests succeed; → Open if tests fail |
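The transition rules in the table can be expressed as a small pure function. This is a minimal sketch; the state names, parameter names, and defaults are illustrative rather than from any specific library:

```python
# Minimal sketch of the transition table as a pure function.
# All names and thresholds here are illustrative.

def next_state(state, *, error_rate=0.0, calls=0, min_calls=10,
               threshold=0.5, timeout_expired=False, probe_ok=None):
    if state == "CLOSED":
        # Trip only once enough traffic has been observed
        if calls >= min_calls and error_rate > threshold:
            return "OPEN"
        return "CLOSED"
    if state == "OPEN":
        # After the recovery timeout, allow test probes
        return "HALF_OPEN" if timeout_expired else "OPEN"
    if state == "HALF_OPEN":
        # Probe result decides: success closes, failure reopens
        if probe_ok is None:
            return "HALF_OPEN"
        return "CLOSED" if probe_ok else "OPEN"
    raise ValueError(f"unknown state: {state}")
```

Keeping the transition logic pure like this makes the state machine trivially unit-testable, independent of clocks and counters.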
Circuit Breaker Configuration
| Parameter | Example | Effect | Rationale |
|---|---|---|---|
| Failure threshold | 50% error rate | Switch to Open | Fail fast after threshold breached |
| Minimum calls | 10 requests | Ignore threshold until min reached | Avoid flapping on low traffic |
| Recovery timeout | 60 seconds | Wait before attempting Half-Open | Give downstream time to recover |
| Half-Open max calls | 3 requests | Max tests to verify recovery | Balance speed vs. safety |
| Slow call threshold | p99 latency > 2s | Count as failure | Fail fast on slow, not just errors |
| Slow call duration %ile | p99 (vs. p50) | Use high percentile | Avoid false positives from occasional slow requests |
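One way to read the table: a call counts as a failure if it either errored or exceeded the slow-call threshold, and the breaker trips only when the window has enough samples. A hedged sketch with the table's example values (names and the window representation are illustrative):

```python
# Illustrative sketch: classify a call as failed if it errored or was slow,
# then decide whether the observed window should trip the breaker.

SLOW_CALL_SECONDS = 2.0   # example slow-call threshold from the table
FAILURE_THRESHOLD = 0.5   # 50% error rate
MIN_CALLS = 10            # ignore the threshold below this volume

def is_failure(errored: bool, duration_s: float) -> bool:
    return errored or duration_s > SLOW_CALL_SECONDS

def should_trip(window):
    """window: list of (errored, duration_s) samples."""
    if len(window) < MIN_CALLS:
        return False  # avoid flapping on low traffic
    failures = sum(is_failure(e, d) for e, d in window)
    return failures / len(window) > FAILURE_THRESHOLD
```

Note that slow successful calls count toward tripping, which is exactly the "fail fast on slow, not just errors" rationale above.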
State Transitions in Action
```python
import time


class CircuitBreakerOpenException(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """
    Prevent cascading failures by failing fast.
    States: CLOSED (pass), OPEN (fail-fast), HALF_OPEN (test).
    """
    def __init__(self, failure_threshold=0.5, min_calls=10, timeout_seconds=60):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.timeout_seconds = timeout_seconds
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        """Execute function; update circuit breaker state."""
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout_seconds:
                self.state = "HALF_OPEN"
                self.failure_count = 0    # Reset counters for the test probes
                self.success_count = 0
            else:
                raise CircuitBreakerOpenException("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.success_count += 1
        if self.state == "HALF_OPEN" and self.success_count >= 2:
            self.state = "CLOSED"   # Recovery confirmed
            self.failure_count = 0
            self.success_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        total = self.failure_count + self.success_count
        if self.state == "CLOSED" and total >= self.min_calls:
            if self.failure_count / total > self.failure_threshold:
                self.state = "OPEN"   # Fail fast
        elif self.state == "HALF_OPEN":
            self.state = "OPEN"       # Test probe failed, reopen
```
Bulkhead Pattern
Bulkhead isolates resources (threads, connections) per service, preventing one slow dependency from exhausting shared resources.
Thread Pool Isolation
Each external service gets its own thread pool. If ServiceB is slow, its thread pool saturates, but ServiceA and ServiceC continue.
```text
Request Handler
 ├─ Thread Pool 1 (10 threads) → ServiceA
 ├─ Thread Pool 2 (5 threads)  → ServiceB (slow)
 └─ Thread Pool 3 (10 threads) → ServiceC

If ServiceB's thread pool saturates:
 • ServiceA threads continue
 • ServiceC threads continue
 • ServiceB requests are queued or rejected (fail fast)
```
Configuration:
```python
from concurrent.futures import ThreadPoolExecutor

# One pool per downstream service. (Note: the standard-library executor's work
# queue is unbounded; bounding it requires a custom executor or a semaphore.)
executor_pool_a = ThreadPoolExecutor(max_workers=10, thread_name_prefix="ServiceA")
executor_pool_b = ThreadPoolExecutor(max_workers=5, thread_name_prefix="ServiceB")   # Smaller pool
executor_pool_c = ThreadPoolExecutor(max_workers=10, thread_name_prefix="ServiceC")
```
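The isolation property can be demonstrated end to end with the standard library. A minimal runnable sketch (service names, sleep times, and pool sizes are illustrative): saturating ServiceB's pool leaves ServiceA's pool untouched.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# One executor per downstream service: saturating B cannot steal A's threads.
pool_a = ThreadPoolExecutor(max_workers=10, thread_name_prefix="ServiceA")
pool_b = ThreadPoolExecutor(max_workers=2, thread_name_prefix="ServiceB")

def call_service_a():
    return "a-ok"          # fast dependency

def call_service_b():
    time.sleep(0.3)        # slow dependency holds its threads
    return "b-ok"

# Saturate ServiceB's pool with more work than it has threads...
b_futures = [pool_b.submit(call_service_b) for _ in range(4)]

# ...ServiceA requests still complete immediately on their own pool.
start = time.time()
a_result = pool_a.submit(call_service_a).result()
a_latency = time.time() - start
```

With a single shared pool, the slow ServiceB calls would occupy all workers and the ServiceA call would wait behind them; separate pools keep its latency unaffected.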
Semaphore Isolation
A lightweight alternative to thread pools: the caller's own thread blocks on semaphore acquire, so no extra threads are created.
```python
import threading

import requests

# threading.Semaphore takes the permit count as a positional argument
semaphore_service_a = threading.Semaphore(20)  # max 20 concurrent calls

def call_service_a():
    with semaphore_service_a:            # Acquire a permit
        # If all 20 permits are taken, the caller blocks here
        result = requests.get("http://service-a/...")
    # Permit released automatically when the with-block exits
    return result
```
Pros: no thread-pool overhead; simpler; a natural fit for async/non-blocking code.
Cons: the caller's thread blocks on acquire, and the protected call cannot be timed out or cancelled independently of the caller (unlike a thread pool).
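If blocking callers is unacceptable, the same semaphore can be used fail-fast: try to acquire without blocking and shed load immediately when the bulkhead is full. A sketch (the capacity and the `{"status": "rejected"}` response shape are illustrative):

```python
import threading

sem = threading.Semaphore(2)  # bulkhead capacity (illustrative value)

def call_with_bulkhead(func):
    # Non-blocking acquire: shed load instead of queueing callers
    if not sem.acquire(blocking=False):
        return {"status": "rejected"}  # fail fast (or invoke a fallback)
    try:
        return func()
    finally:
        sem.release()
```

Rejecting immediately turns a saturated bulkhead into a fast, explicit error that upstream code can handle, rather than an invisible queue of blocked threads.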
Resilience Strategy: Timeout + Retry + Jitter + Fallback
Combine multiple patterns for robust failure handling.
```python
import random
import time

import requests


def resilient_call(service_name: str, request_func, fallback_func=None):
    """
    Make a resilient call with timeout, retry, jitter, and fallback.
    Assumes circuit_breaker is a dict of per-service CircuitBreaker instances.
    """
    cb = circuit_breaker[service_name]
    max_retries = 3
    base_delay = 0.1      # 100 ms
    jitter_factor = 0.1   # 10% jitter

    for attempt in range(max_retries):
        try:
            # Execute with circuit breaker; timeout is forwarded to request_func
            return cb.call(request_func, timeout=2.0)
        except CircuitBreakerOpenException:
            # Circuit is open; use fallback if available
            if fallback_func:
                return fallback_func()
            raise
        except requests.exceptions.Timeout:
            if attempt < max_retries - 1:
                # Exponential backoff + jitter
                delay = base_delay * (2 ** attempt)
                jitter = delay * jitter_factor * random.random()
                time.sleep(delay + jitter)
            else:
                raise
        # Any other exception propagates immediately: don't retry

    raise RuntimeError(f"All {max_retries} attempts failed")


# Usage
def call_payment_service(timeout=2.0):
    return requests.post("http://payment-service/charge",
                         json=charge_request, timeout=timeout)

def fallback_payment():
    # Queue the charge for asynchronous processing
    queue.put(charge_request)
    return {"status": "queued"}

result = resilient_call("payment-service", call_payment_service, fallback_payment)
```
How Real Systems Use This
Netflix: Hystrix (Circuit Breaker + Bulkhead)
Architecture: Each service call wrapped in Hystrix command with circuit breaker + thread pool bulkhead.
Configuration:
```python
# Python-style pseudocode modeled on Netflix's Hystrix (Hystrix itself is a Java library).
# Each downstream service has its own command.
class GetOrderCommand(HystrixCommand):
    def __init__(self, order_id):
        super().__init__(
            group_key="OrderService",
            command_key="GetOrder",
            thread_pool_key="OrderService",
            thread_pool=ThreadPoolConfig(core_size=20, queue_size=100),
            circuit_breaker=CircuitBreakerConfig(
                error_threshold_percentage=50,
                request_volume_threshold=20,
                sleep_window_in_milliseconds=5000
            )
        )
        self.order_id = order_id

    def run(self):
        return requests.get(f"http://order-service/orders/{self.order_id}").json()

    def get_fallback(self):
        # Return a cached version if available
        return cache.get(f"order:{self.order_id}", None)

# Usage
result = GetOrderCommand(order_id=123).execute()
```
Result: One slow or failing service cannot bring down the entire Netflix platform. 300+ services call each other with confidence.
AWS: DynamoDB Rate Limiting + Backoff
Pattern: Client-side exponential backoff + jitter when receiving ProvisionedThroughputExceededException.
Why: DynamoDB’s rate limiting can cascade if all clients retry immediately. Jitter prevents thundering herd.
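A widely cited variant from AWS's guidance is "full jitter": instead of adding a small random offset to a fixed backoff, draw the entire delay uniformly from [0, capped backoff], which spreads retries across the whole window. A minimal sketch (`base` and `cap` values are illustrative):

```python
import random

def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 20.0) -> float:
    """AWS-style 'full jitter': sleep a uniform random amount
    between 0 and the capped exponential backoff."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Compared with a fixed 10% jitter, full jitter decorrelates clients much more aggressively, at the cost of sometimes retrying almost immediately.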
```python
import random
import time

from botocore.exceptions import ClientError


def write_to_dynamodb(table, item, max_retries=3):
    """Write with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            table.put_item(Item=item)
            return
        except ClientError as e:
            if e.response['Error']['Code'] != 'ProvisionedThroughputExceededException':
                raise   # Don't retry non-throttling errors
            if attempt < max_retries - 1:
                delay = 2 ** attempt                    # 1s, 2s, 4s
                jitter = delay * 0.1 * random.random()  # up to 10% extra
                time.sleep(delay + jitter)
            else:
                raise
```
Google Cloud: Fallback and Graceful Degradation
Pattern: When downstream unavailable, serve stale data or partial results.
Example:
```python
def get_recommendations():
    """Get recommendations, with fallback to popular items."""
    try:
        # Try the ML model first
        return ml_model.recommend(user_id, timeout=1.0)
    except (TimeoutException, ServiceDownException):
        # Fall back to popular items (much cheaper)
        return get_popular_items()
```
Result: User sees recommendations even if ML service is down.
References
- Release It! Design and Deploy Production-Ready Software – Michael Nygard (2nd ed., 2018) – Definitive book on circuit breakers, bulkheads, timeouts; includes Hystrix case study.
- Hystrix: Latency and Fault Tolerance for Distributed Systems – Netflix (Archived) – Original circuit breaker library documentation.
- ByteByteGo – Circuit Breaker Pattern – Visual explanation of state transitions.
- Building Microservices – Sam Newman, Ch. 8 – Coverage of resilience patterns in microservices.
- AWS Best Practices for Exponential Backoff – Jitter + backoff strategy for distributed systems.