Model Deployment
Safely shipping models to production: packaging, staged rollout (shadow mode → canary → A/B test), infrastructure, and strategies to minimize risk and measure impact.
The Deployment Challenge
Most ML failures occur after deployment, not during development. Common causes:
- Model serves on different hardware (different numeric precision)
- Production data differs from training data (concept drift, distribution shift)
- Dependencies change (Python version, library updates)
- Features computed differently at serving time
- Silent failures (model returns wrong predictions, nobody notices for weeks)
Goal: Ship models confidently with measurable business impact.
Deployment Modes
Choose based on latency budget and business requirements:
| Mode | Latency | Cost | Complexity | Best For | Risk |
|---|---|---|---|---|---|
| Real-time API | <100ms | High | Medium | Personalization (ads, recommendations) | High latency impacts UX |
| Batch (scheduled) | hours | Low | Low | Offline predictions (daily scores, email) | Stale predictions |
| Streaming | <1s | Medium | High | Time-sensitive (fraud, anomalies) | Infrastructure complexity |
| Edge (on-device) | <10ms | Low | High | Mobile, IoT (no network calls) | Model size, device power |
| Embedded (library) | <1ms | Low | Low | Simple models (feature transforms) | Limited to simple logic |
Real-Time API (Most Common)
```
User Request → Load Balancer → API Server (Flask/FastAPI)
        ↓
  Feature Engineering
        ↓
  Model Inference (TF Serving/BentoML)
        ↓
  Format Response
        ↓
  Return Prediction (<100ms)
```
Pros: fresh predictions, flexible model updates.
Cons: network latency, infrastructure cost, operational complexity.
Batch Processing
```
Scheduled Job (e.g., 8am daily)
        ↓
Fetch data from data lake (S3, BigQuery)
        ↓
Run feature engineering at scale (Spark, Dataflow)
        ↓
Batch inference (thousands of samples at once)
        ↓
Write results to database
        ↓
Application reads cached predictions
```
Pros: low cost, efficient batch inference, easy to debug.
Cons: stale predictions (scored at 8am, served until 8am the next day).
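The batch flow above can be sketched in a few lines. This is a minimal illustration, not a production job: `fetch_rows`, `predict_batch`, and the in-memory result list are stubs standing in for a data-lake read, a real model, and a database write.

```python
# Minimal sketch of a scheduled batch-scoring job. The data source, model,
# and sink are stubbed; in production these would be e.g. BigQuery, a loaded
# model artifact, and a results table.

def fetch_rows():
    """Stub for reading the day's records from the data lake."""
    return [{"user_id": 1, "age": 30}, {"user_id": 2, "age": 45},
            {"user_id": 3, "age": 22}]

def predict_batch(feature_rows):
    """Stub model: scores a whole batch at once (vectorized in practice)."""
    return [round(min(row["age"] / 100, 1.0), 2) for row in feature_rows]

def run_batch_job(chunk_size=2):
    """Score all rows in chunks and collect (id, score) pairs for writing."""
    rows = fetch_rows()
    results = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        scores = predict_batch(chunk)
        results.extend((r["user_id"], s) for r, s in zip(chunk, scores))
    return results  # in production: write to a database, not return

print(run_batch_job())  # [(1, 0.3), (2, 0.45), (3, 0.22)]
```

Chunking matters because real jobs score millions of rows: it bounds memory while still amortizing per-call overhead.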
Streaming (Apache Kafka)
```
Events stream in (Kafka)
        ↓
Real-time feature computation (Flink, Kafka Streams)
        ↓
Model inference on each event
        ↓
Route to action (fraud = decline, normal = approve)
```
Pros: fresh predictions for every event.
Cons: infrastructure complexity, harder to debug.
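To make the per-event routing concrete, here is a hedged sketch with the stream and the fraud model simulated in plain Python; a real pipeline would consume from Kafka and compute features in Flink or Kafka Streams. The `fraud_score` model and the 0.9 threshold are illustrative assumptions.

```python
# Sketch of per-event streaming inference: score each event as it arrives
# and route it to an action. Model and threshold are stand-ins.

def fraud_score(event):
    """Stub model: larger amounts look riskier."""
    return min(event["amount"] / 10_000, 1.0)

def route(event, threshold=0.9):
    """Score one event and route it (fraud = decline, normal = approve)."""
    return "decline" if fraud_score(event) >= threshold else "approve"

# Simulated event stream standing in for a Kafka consumer loop
events = [{"id": 1, "amount": 50}, {"id": 2, "amount": 9500}]
decisions = [(e["id"], route(e)) for e in events]
print(decisions)  # [(1, 'approve'), (2, 'decline')]
```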
Model Serialization
Models must be saved in a format reproducible across environments.
| Format | Framework | Pros | Cons | Use Case |
|---|---|---|---|---|
| SavedModel | TensorFlow | Standard, optimized | TF-only | TensorFlow models |
| ONNX | Framework-agnostic | Works everywhere | Some ops unsupported | Cross-framework deployment |
| PyTorch JIT | PyTorch | Fast inference | PyTorch-only | PyTorch models |
| Pickle | scikit-learn | Simple | Security risk, Python-only | Sklearn, tree models |
| PMML | Standard | Interoperable | Verbose | Classical ML (trees, linear) |
```python
# TensorFlow
model.save('my_model')  # Saves: saved_model.pb, variables/, assets/

# PyTorch
torch.jit.script(model).save('model.pt')

# scikit-learn (pickle — avoid for untrusted sources)
import joblib
joblib.dump(model, 'model.pkl')

# ONNX (framework-agnostic; sklearn conversion lives in the skl2onnx package)
import onnx
from skl2onnx import to_onnx
from skl2onnx.common.data_types import FloatTensorType
onnx_model = to_onnx(sklearn_model,
                     initial_types=[('input', FloatTensorType([None, 2]))])
onnx.save_model(onnx_model, 'model.onnx')
```
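Whatever format you pick, it is worth verifying the round trip before shipping: serialize, reload, and check that predictions match. The sketch below uses `pickle` and a tiny hand-written model class as a stand-in for a real sklearn or TF model.

```python
# Round-trip check: the reloaded model must reproduce the original
# predictions. TinyModel is an illustrative stub, not a real framework model.
import pickle

class TinyModel:
    def __init__(self, weight, bias):
        self.weight, self.bias = weight, bias

    def predict(self, xs):
        return [self.weight * x + self.bias for x in xs]

model = TinyModel(weight=2.0, bias=1.0)
blob = pickle.dumps(model)        # serialize (joblib.dump in practice)
restored = pickle.loads(blob)     # deserialize in the "serving" process

sample = [0.0, 1.5, 3.0]
assert restored.predict(sample) == model.predict(sample)
print(restored.predict(sample))  # [1.0, 4.0, 7.0]
```

The same check applies across environments: run it in the serving container, not just on the training machine, to catch version and hardware mismatches early.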
Staged Rollout Strategy
Never deploy a new model directly to 100% of traffic: one bad model can break production.
1. Shadow Mode (Risk: Zero)
New model runs in parallel, predictions NOT used by users:
```
User Request
     ↓
[Current Model] → prediction served to the user
[New Model]     → runs in background, prediction only logged
     ↓
Log: current_pred=0.7, new_pred=0.65, user_clicked=1
(compare after the fact)
```
Duration: 1–2 weeks. Metrics:
- Are predictions similar? (correlation > 0.9)
- Any obvious errors? (predictions outside expected range)
- Performance metrics similar? (AUC ~same)
Decision: If metrics look good, proceed to canary. If not, investigate.
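The shadow-mode checklist can be automated over the logged prediction pairs. This is a minimal sketch in plain Python; the 0.9 correlation threshold and the [0, 1] expected range follow the checklist above, and the sample predictions are made up.

```python
# Shadow-mode comparison: Pearson correlation between current and new
# predictions, plus a sanity check for out-of-range values.
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def shadow_report(current_preds, new_preds):
    corr = pearson(current_preds, new_preds)
    out_of_range = sum(1 for p in new_preds if not 0.0 <= p <= 1.0)
    return {"correlation": round(corr, 3),
            "out_of_range": out_of_range,
            "proceed_to_canary": corr > 0.9 and out_of_range == 0}

current = [0.70, 0.20, 0.55, 0.90, 0.10]  # logged current-model predictions
new = [0.65, 0.25, 0.50, 0.88, 0.12]      # logged new-model predictions
print(shadow_report(current, new))
```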
2. Canary Release (Risk: Low)
New model serves small traffic percentage (1–5%), monitor closely:
```
Traffic Distribution:
- 99% Current Model (proven)
-  1% New Model (test)

Monitor:
- Error rate    (> 0.1% → roll back)
- Latency P99   (> 2x baseline → investigate)
- User feedback (complaints?)
```
Duration: 1–7 days. Decision: If errors low and latency OK, increase to 10%, then 50%.
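The monitoring rules above reduce to a small decision function. The thresholds (0.1% error rate, 2x baseline p99) come straight from the checklist; everything else is an assumed sketch.

```python
# Canary health gate: roll back on high error rate, investigate on a
# latency regression, otherwise ramp traffic.

def canary_decision(canary_error_rate, canary_p99_ms, baseline_p99_ms):
    if canary_error_rate > 0.001:          # error rate > 0.1%
        return "rollback"
    if canary_p99_ms > 2 * baseline_p99_ms:  # p99 more than 2x baseline
        return "investigate"
    return "increase_traffic"

assert canary_decision(0.0005, 80, 75) == "increase_traffic"
assert canary_decision(0.002, 80, 75) == "rollback"
assert canary_decision(0.0005, 200, 75) == "investigate"
print("canary gate checks passed")
```

In practice this gate runs on a schedule against metrics from the monitoring system, and the rollback branch triggers automation rather than returning a string.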
3. A/B Test (Risk: Measured)
50/50 split: Half users see current model, half see new model.
```
Randomization: bucket = hash(user_id) % 100
               if bucket < 50: use_new_model = True

Measure for 1 week:
- Business metrics: CTR, revenue, engagement
- Model metrics: AUC, latency
- Statistical significance: t-test, p < 0.05

Decision:
- New model better → 100% rollout
- No difference    → keep current (less risk)
- New model worse  → don't ship (analyze why)
```
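A sketch of the two mechanical pieces: deterministic bucketing and a significance test. Note one assumption: `hashlib.md5` replaces Python's built-in `hash()`, which is randomized per process for strings and therefore unsuitable for stable assignment. A two-proportion z-test on CTR stands in for the generic t-test mentioned above; the click counts are made up.

```python
# 50/50 bucketing plus a two-proportion z-test on click-through rates.
import hashlib
import math

def bucket(user_id):
    """Stable 50/50 assignment from a hash of the user id."""
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return "new_model" if h % 100 < 50 else "current_model"

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic for H0: both arms have the same CTR."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p = (clicks_a + clicks_b) / (n_a + n_b)      # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# |z| > 1.96 corresponds to p < 0.05 (two-sided)
z = two_proportion_z(clicks_a=5000, n_a=100_000, clicks_b=5400, n_b=100_000)
print(round(z, 2), "significant" if abs(z) > 1.96 else "not significant")
```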
Serving Frameworks
TensorFlow Serving
Optimized for TensorFlow models, handles versioning:
```python
# Save model (TF Serving expects a numeric version subdirectory)
model.save('models/ctr_model/1')
```

```shell
# Run TF Serving (containerized); the REST API listens on port 8501
docker run -p 8501:8501 -v /models:/models \
    tensorflow/serving:latest \
    --model_config_file=/models/models.conf
```

```python
# Client prediction via the REST API
import requests
response = requests.post(
    'http://localhost:8501/v1/models/ctr_model:predict',
    json={'instances': [{'age': 30, 'income': 100000}]})
```
BentoML (Framework-Agnostic)
Works with any framework (PyTorch, scikit-learn, XGBoost):
```python
# service.py (BentoML 1.x Service/runner API)
import bentoml
from bentoml.io import JSON

runner = bentoml.sklearn.get("ctr_model:latest").to_runner()
svc = bentoml.Service("ctr_predictor", runners=[runner])

@svc.api(input=JSON(), output=JSON())
def predict(request: dict) -> dict:
    features = [[request["age"], request["income"]]]
    return {"prediction": float(runner.predict.run(features)[0])}
```

```shell
# Serve locally
bentoml serve service:svc --port 3000
```

```python
# Client
import requests
response = requests.post('http://localhost:3000/predict',
                         json={'age': 30, 'income': 100000})
print(response.json())  # e.g. {'prediction': 0.725}
```
KServe (Kubernetes-Native)
Serverless model serving on Kubernetes:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ctr-predictor
spec:
  predictor:
    # When the spec is updated (e.g. storageUri pointed at ctr_model_v2),
    # the new revision receives 10% of traffic as a canary.
    canaryTrafficPercent: 10
    sklearn:
      storageUri: gs://my-bucket/ctr_model
      resources:
        requests:
          cpu: 100m
          memory: 1Gi
        limits:
          cpu: 200m
          memory: 2Gi
```
Latency Optimization
Staying under the latency budget is critical for real-time systems. Strategies:
Batching
Inference on multiple samples at once (amortizes overhead):
```
Per-request:  50ms (preprocessing 5ms + model 40ms + postprocessing 5ms)
Batch of 32:  40ms total (preprocessing 5ms + model 30ms + postprocessing 5ms)
Amortized:    40ms / 32 ≈ 1.3ms per sample

Implementation: buffer incoming requests, predict once per batch,
return results in arrival order.
```
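The buffering step can be sketched as follows. This is a simplified synchronous version: a real server collects requests concurrently and also flushes on a timeout (e.g. every few milliseconds), and `predict_batch` is a stub for a vectorized model call.

```python
# Micro-batching sketch: group requests, one model call per batch, results
# returned in arrival order.

def predict_batch(features):
    """Stub vectorized model: one pass over the whole batch."""
    return [sum(f) for f in features]

def serve(requests, max_batch=32):
    """Score all pending requests in batches of at most max_batch."""
    results = []
    for start in range(0, len(requests), max_batch):
        batch = requests[start:start + max_batch]
        results.extend(predict_batch(batch))  # one model call per batch
    return results

reqs = [[1, 2], [3, 4], [5, 6]]
print(serve(reqs, max_batch=2))  # [3, 7, 11]
```

The batch size is a latency/throughput trade-off: larger batches amortize more overhead but make the first request in the batch wait longer.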
Model Compression
Reduce model size with minimal accuracy loss:
Quantization: 32-bit floats → 8-bit integers (4x smaller, 2x faster)
```python
# TensorFlow Lite post-training quantization
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 4MB → 1MB, inference ~2x faster
```
Pruning: remove unnecessary weights (models have redundancy)
Distillation: train a small "student" model on a large "teacher" model's predictions
Caching
Cache predictions for identical requests (e.g., popular items queried repeatedly):
```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def predict(user_id, item_id):
    return model.predict([[user_id, item_id]])

# First call: ~50ms (cache miss, runs inference)
# Second call, same user/item: ~1ms (cache hit, returns immediately)
```
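One caveat: `lru_cache` never expires entries, so cached predictions can go stale after a model update or a feature change. A minimal TTL wrapper (a hypothetical helper, not a library API) is sketched below; in production a shared store like Redis with a TTL plays this role.

```python
# Minimal TTL cache: entries older than ttl_seconds are recomputed.
# The injectable clock makes expiry testable without sleeping.
import time

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl, self.clock, self.store = ttl_seconds, clock, {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]                  # fresh cache hit
        value = compute()                  # miss or expired: recompute
        self.store[key] = (value, now)
        return value

cache = TTLCache(ttl_seconds=60)
calls = []

def infer():
    calls.append(1)   # count how often real inference runs
    return 0.7

cache.get_or_compute(("u1", "i9"), infer)  # miss → runs inference
cache.get_or_compute(("u1", "i9"), infer)  # hit → cached value
print(len(calls))  # 1
```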
Infrastructure
Containerization (Docker)
Ensure reproducibility across dev, staging, production:
```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY model.pkl .
COPY app.py .
EXPOSE 5000
CMD ["python", "app.py"]
```

```shell
docker build -t ctr-predictor:v1 .
docker run -p 5000:5000 ctr-predictor:v1
```
Orchestration (Kubernetes)
Manage replicas, autoscaling, rolling updates:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ctr-predictor
spec:
  replicas: 3          # Run 3 replicas
  selector:
    matchLabels:
      app: ctr-predictor
  template:
    metadata:
      labels:
        app: ctr-predictor
    spec:
      containers:
        - name: predictor
          image: ctr-predictor:v1
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ctr-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ctr-predictor
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale up when average CPU > 70%
```
Key Properties by Deployment Mode
| Property | Real-Time | Batch | Streaming | Edge |
|---|---|---|---|---|
| Latency | 50–200ms | hours | <1s | <10ms |
| Cost/QPS | $0.001–0.01/prediction | $0.0001/prediction | $0.0005/prediction | Free (device) |
| Model Update | Hot swap (KServe) | Daily/weekly | Hourly | Manual update |
| Monitoring | Complex (per-request) | Simple (batch stats) | Complex (stream state) | Simple (offline) |
| Data Freshness | Fresh (real-time) | Stale (up to 24h) | Fresh | Fresh |
Implementation Example: Flask API with Canary Deployment
```python
# app.py
from flask import Flask, request, jsonify
import numpy as np
import mlflow.pyfunc
import hashlib

app = Flask(__name__)

# Load models from the registry
current_model = mlflow.pyfunc.load_model('models:/ctr_model/production')
canary_model = mlflow.pyfunc.load_model('models:/ctr_model/staging')

CANARY_TRAFFIC_PERCENT = 5  # Route 5% of traffic to the new model


def get_model(user_id):
    """Route to current or canary model based on user hash."""
    hash_val = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    if (hash_val % 100) < CANARY_TRAFFIC_PERCENT:
        return 'canary', canary_model
    return 'current', current_model


@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    user_id = data['user_id']
    features = np.array([[data['age'], data['income']]])

    model_name, model = get_model(user_id)
    prediction = model.predict(features)[0]

    # Log for monitoring
    log_request(user_id, model_name, prediction, features)

    return jsonify({
        'prediction': float(prediction),
        'model': model_name
    })


def log_request(user_id, model, prediction, features):
    """Log predictions for monitoring and canary evaluation."""
    # Write to Kafka/BigQuery for analysis
    pass


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
```shell
# Build container
docker build -t ctr-predictor:v1 .

# Deploy to Kubernetes with canary
kubectl apply -f deployment.yaml

# Monitor canary
kubectl logs -l app=ctr-predictor --tail=100 | grep canary

# Gradually increase CANARY_TRAFFIC_PERCENT (5% → 10% → 50% → 100%)
```
How Real Companies Use This
Netflix’s Weekly Model Deployment (200+ Models Per Week): Netflix deploys 200+ ML models per week across personalization, search ranking, and content delivery systems. Deployment pipeline: model registry → canary (1% traffic for 24 hours) → shadow mode (run old + new model in parallel, log both predictions) → full rollout. Model versioning: maintains last 5 versions in production for instant rollback if issues emerge. A/B testing: mandatory for all recommendation changes, measuring watch hours (business metric) over 2 weeks. Serving latency SLA: <50ms p99 for personalization API (latency >100ms causes user frustration). Infrastructure: Kubernetes-based with auto-scaling (scale up to 1k replicas during peak hours, down to 100 during off-peak). Model compression: quantization (float32 → int8) reduces model size from 500MB to 125MB, cutting serving cost 60%. Failure handling: if canary shows error rate >0.1% or latency >2x baseline, automatic rollback triggers. Timeline from approval to 100% deployment: 5–7 days.
Airbnb’s Smart Pricing Deployment with Revenue Guards: Airbnb’s Smart Pricing model deployment focuses on protecting host and platform revenue. Deployment strategy: blue/green (run current and new model in parallel, switch traffic once new model validated). Canary: new pricing model applied to 1% of listings (~1k properties) for 2 weeks, monitoring revenue per listing. Shadow mode: for weeks 3–4, show both old and new prices to listings (internally), measure guest acceptance. Full rollout: once revenue-per-listing improves >2% with <0.5% host complaints, roll out to 100k listings. Serving latency SLA: <50ms p99 for pricing API. Model compression: quantization + pruning reduce model from 200MB to 40MB. Monitoring: revenue per listing tracked hourly, automatic rollback if drops >2% within 4 hours. A/B testing: concurrent test on 100k listings (50k new model, 50k old) for final validation before full rollout. Total deployment timeline: 8–12 weeks (2 weeks canary + 2 weeks shadow + 2 weeks A/B + 2 weeks gradual rollout).
Google’s TensorFlow Serving at 1M+ QPS: Google deploys ranking models via TensorFlow Serving handling 1M+ inference requests/second across Search, Maps, Gmail, YouTube. Deployment stages: model passes validation gates (metrics must not regress vs champion), A/B test configuration is declarative (YAML), rollback takes <30 seconds. Serving infrastructure: distributed load balancing across 10k+ servers, batch inference (collect 32 requests, infer together for 2x throughput), model quantization (int8 for mobile). Latency optimization: caching (popular queries’ results cached for 1 hour), hardware acceleration (TPU for large BERT models). Model versioning: 5 recent versions always in memory for A/B testing. A/B testing strategy: 1% canary (1% traffic) for 1 week, then 5% for 1 week, then full rollout if metrics improve. Failure handling: automated rollback if error rate exceeds threshold. Deployment frequency: 100+ new ranking models per week, with full pipeline automation.
Lyft’s Model Deployment with Flyte Orchestration: Lyft uses Flyte (open-source workflow orchestration) for ML deployment, handling 10M+ trip predictions daily (ETA, driver assignment, surge pricing). Deployment pipeline: model registry (MLflow) → containerization (Docker) → orchestration (Kubernetes) → monitoring (Prometheus + Grafana). Canary strategy: new model tested on 1% of rides for 4 hours, monitoring latency (p99 <200ms), accuracy (compared to driver feedback), cost (inference cost tracked). Shadow mode: new model runs on all traffic for 24 hours, predictions logged but not used (comparison with old model). A/B test: 50/50 split on 100k rides for 7 days, measuring ride completion rate, driver satisfaction. Serving latency: <100ms p99 critical (real-time driver matching). Model size: quantized models (10–50MB) enable edge deployment on mobile. Failure handling: automatic rollback if driver acceptance rate drops >5%. Timeline: 2–4 weeks per model deployment.
Stripe’s Fraud Model Deployment with Immediate Rollback: Stripe deploys fraud detection models with hair-trigger rollback due to high false positive cost. Deployment: blue/green with continuous monitoring. Canary: new fraud model on 0.1% of transactions (~10k txns/hour, enough to evaluate false positive rate) for 24 hours. SLA: false positive rate <0.1% (1 in 1000 legitimate transactions flagged). Shadow mode: 1 week running new model on all traffic, comparing predictions. A/B test: 50/50 split for 3 days, measuring decline rate (% of legitimate txns declined by new model) and fraud catch rate. Serving latency: <50ms p99 (payment approval waits for decision). Automatic rollback: triggered if false positive rate spikes 2x baseline or latency exceeds 100ms. Infrastructure: redundant serving (multi-region) for disaster recovery. Timeline: 2–4 weeks from approval to full deployment, with continuous monitoring thereafter.
References
- Designing Machine Learning Systems (Chip Huyen) — Comprehensive deployment chapter
- Hidden Technical Debt in Machine Learning Systems (Google, 2015) — Why ML systems fail
- TensorFlow Serving Documentation — Production TF model serving
- BentoML Documentation — Framework-agnostic serving
- KServe Documentation — Kubernetes-native serving
- Chip Huyen: ML Systems Design – Deployment — Deployment strategies
- Introduction to MLOps (Burkov) — Operations perspective
- Continuous Integration/Deployment for Machine Learning (Sculley et al.) — MLOps best practices