Model Deployment

Safely shipping models to production: packaging, staged rollout (shadow mode → canary → A/B test), infrastructure, and strategies to minimize risk and measure impact.

The Deployment Challenge

Most ML failures occur after deployment, not during development. Common causes:

  • Model serves on different hardware (different numeric precision)
  • Production data differs from training data (concept drift, distribution shift)
  • Dependencies change (Python version, library updates)
  • Features computed differently at serving time
  • Silent failures (model returns wrong predictions, nobody notices for weeks)

Goal: Ship models confidently with measurable business impact.
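As a concrete illustration of catching the "features computed differently at serving time" failure, a minimal skew check can compare summary statistics of a feature as logged at serving time against the training set. This is a sketch; the relative-tolerance threshold is illustrative, not a standard value:

```python
import statistics

def skew_alert(train_values, serving_values, rel_tol=0.1):
    """Flag training/serving skew: True if the serving-time mean of a
    feature drifts more than rel_tol (relative) from its training mean."""
    m_train = statistics.mean(train_values)
    m_serve = statistics.mean(serving_values)
    return abs(m_serve - m_train) > rel_tol * abs(m_train)
```

In practice the same idea extends to variance, quantiles, and category frequencies, running as a scheduled job over logged serving features.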


Deployment Modes

Choose based on latency budget and business requirements:

| Mode | Latency | Cost | Complexity | Best For | Risk |
|---|---|---|---|---|---|
| Real-time API | <100ms | High | Medium | Personalization (ads, recommendations) | High latency impacts UX |
| Batch (scheduled) | Hours | Low | Low | Offline predictions (daily scores, email) | Stale predictions |
| Streaming | <1s | Medium | High | Time-sensitive (fraud, anomalies) | Infrastructure complexity |
| Edge (on-device) | <10ms | Low | High | Mobile, IoT (no network calls) | Model size, device power |
| Embedded (library) | <1ms | Low | Low | Simple models (feature transforms) | Limited to simple logic |

Real-Time API (Most Common)

User Request → Load Balancer → API Server (Flask/FastAPI)
                                    ↓
                            Feature Engineering
                                    ↓
                            Model Inference (TF Serving/BentoML)
                                    ↓
                            Format Response
                                    ↓
                            Return Prediction (<100ms)

Pros: Fresh predictions, flexible model updates
Cons: Network latency, infrastructure cost, operational complexity

Batch Processing

Scheduled Job (e.g., 8am daily)
    ↓
Fetch data from data lake (S3, BigQuery)
    ↓
Run feature engineering at scale (Spark, Dataflow)
    ↓
Batch inference (thousands of samples at once)
    ↓
Write results to database
    ↓
Application reads cached predictions

Pros: Low cost, efficient batch inference, easy to debug
Cons: Predictions go stale (made at 8am, served until 8am the next day)
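The chunked-inference step in the pipeline above can be sketched as a plain helper (illustrative; a real job would read from S3/BigQuery, call the model in chunks, and write scores to a database):

```python
def batch_score(rows, predict_fn, batch_size=1000):
    """Run predict_fn over rows in fixed-size chunks.
    predict_fn takes a list of feature rows and returns one score per row;
    chunking amortizes per-call overhead and bounds memory use."""
    scores = []
    for i in range(0, len(rows), batch_size):
        scores.extend(predict_fn(rows[i:i + batch_size]))
    return scores
```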

Streaming (Apache Kafka)

Events stream in (Kafka)
    ↓
Real-time feature computation (Flink, Kafka Streams)
    ↓
Model inference on each event
    ↓
Route to action (fraud = decline, normal = approve)

Pros: Fresh predictions for every event
Cons: Infrastructure complexity, harder to debug
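The routing step at the end of the stream can be isolated as a pure function, which keeps the Kafka wiring thin and testable. A sketch: the topic name and the `score_event` model call are placeholders, and the commented wiring assumes the kafka-python client and a running broker:

```python
def route_event(fraud_score, decline_threshold=0.9):
    """Map a model score to an action for the downstream consumer."""
    return "decline" if fraud_score >= decline_threshold else "approve"

# Wiring sketch (requires a broker and the kafka-python package):
# from kafka import KafkaConsumer
# consumer = KafkaConsumer('transactions', bootstrap_servers='localhost:9092')
# for msg in consumer:
#     action = route_event(score_event(msg.value))  # score_event: your model
```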


Model Serialization

Models must be saved in a format reproducible across environments.

| Format | Framework | Pros | Cons | Use Case |
|---|---|---|---|---|
| SavedModel | TensorFlow | Standard, optimized | TF-only | TensorFlow models |
| ONNX | Framework-agnostic | Works everywhere | Some ops unsupported | Cross-framework deployment |
| PyTorch JIT | PyTorch | Fast inference | PyTorch-only | PyTorch models |
| Pickle | scikit-learn | Simple | Security risk, Python-only | Sklearn, tree models |
| PMML | Standard | Interoperable | Verbose | Classical ML (trees, linear) |
# TensorFlow
model.save('my_model')  # Saves: saved_model.pb, variables, assets

# PyTorch
torch.jit.script(model).save('model.pt')

# scikit-learn (pickle — avoid for untrusted sources)
import joblib
joblib.dump(model, 'model.pkl')

# ONNX (framework-agnostic; sklearn models convert via the skl2onnx package)
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

onnx_model = convert_sklearn(
    sklearn_model,
    initial_types=[('input', FloatTensorType([None, n_features]))])  # n_features = input columns
with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())

Staged Rollout Strategy

Never deploy a new model directly to 100% of traffic: a bad model can break production for every user at once.

1. Shadow Mode (Risk: Zero)

New model runs in parallel, predictions NOT used by users:

User Request
    ↓
[Current Model] → Used to serve user
[New Model]    → Runs in background, predictions logged
    ↓
Log: current_pred=0.7, new_pred=0.65, user_clicked=1
     (compared after the fact)

Duration: 1–2 weeks. Metrics:

  • Are predictions similar? (correlation > 0.9)
  • Any obvious errors? (predictions outside expected range)
  • Performance metrics similar? (AUC ~same)

Decision: If metrics look good, proceed to canary. If not, investigate.
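The shadow-mode checks above can be automated over the logged prediction pairs. A stdlib-only sketch, with thresholds mirroring the bullets above:

```python
def pearson_corr(xs, ys):
    """Pearson correlation between two equal-length prediction lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

def shadow_ok(current_preds, new_preds, min_corr=0.9, lo=0.0, hi=1.0):
    """Pass shadow mode if new predictions stay in the expected range and
    track the current model's predictions closely."""
    in_range = all(lo <= p <= hi for p in new_preds)
    return in_range and pearson_corr(current_preds, new_preds) >= min_corr
```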

2. Canary Release (Risk: Low)

New model serves small traffic percentage (1–5%), monitor closely:

Traffic Distribution:
  - 99% Current Model (proven)
  - 1% New Model (test)

Monitor:
  - Error rate (> 0.1% → roll back)
  - Latency P99 (> 2x → investigate)
  - User feedback (complaints?)

Duration: 1–7 days. Decision: If errors low and latency OK, increase to 10%, then 50%.
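The monitoring rules above reduce to a small decision function; a sketch with thresholds taken directly from the block above:

```python
def canary_decision(errors, requests, baseline_p99_ms, canary_p99_ms,
                    max_error_rate=0.001, max_latency_ratio=2.0):
    """Apply the canary gates: error rate > 0.1% -> roll back,
    p99 latency > 2x baseline -> investigate, otherwise promote."""
    if requests and errors / requests > max_error_rate:
        return 'rollback'
    if canary_p99_ms > max_latency_ratio * baseline_p99_ms:
        return 'investigate'
    return 'promote'
```

In a real deployment this runs on a schedule against metrics from the monitoring system, and 'promote' means bumping the canary traffic percentage, not an instant full rollout.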

3. A/B Test (Risk: Measured)

50/50 split: Half users see current model, half see new model.

Randomization: bucket = stable_hash(user_id) % 100   # stable hash, e.g. md5
               if bucket < 50: use_new_model = True

Measure for 1 week:
  - Business metrics: CTR, revenue, engagement
  - Model metrics: AUC, latency
  - Statistical significance: t-test, p < 0.05

Decision:
  - New model better → 100% rollout
  - No difference → keep current (less risk)
  - New model worse → don't ship (analyze why)
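For rate metrics like CTR, the significance check is often a two-proportion z-test rather than a t-test. A stdlib-only sketch using the normal approximation:

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in CTR between control (a)
    and treatment (b); returns (z, p_value)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

Ship only if p < 0.05 and the direction favors the new model (z > 0 here, since b is treatment).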

Serving Frameworks

TensorFlow Serving

Optimized for TensorFlow models, handles versioning:

# Save model as version 1
model.save('models/ctr_model/1')

# Run TF Serving (containerized; port 8501 serves the REST API, 8500 gRPC)
docker run -p 8501:8501 -v /models:/models \
  tensorflow/serving:latest \
  --model_config_file=/models/models.conf

# Client prediction (REST)
import requests
response = requests.post('http://localhost:8501/v1/models/ctr_model:predict',
  json={'instances': [{'age': 30, 'income': 100000}]})

BentoML (Framework-Agnostic)

Works with any framework (PyTorch, scikit-learn, XGBoost):

import bentoml
import numpy as np

@bentoml.service
class CTRPredictorService:
    def __init__(self):
        # Load a model previously saved with bentoml.sklearn.save_model
        self.model = bentoml.sklearn.load_model("ctr_model:latest")

    @bentoml.api
    def predict(self, age: int, income: int) -> dict:
        features = np.array([[age, income]])
        return {"prediction": float(self.model.predict(features)[0])}

# Deploy (service.py contains the class above)
bentoml serve service:CTRPredictorService --port 3000

# Client
import requests
response = requests.post('http://localhost:3000/predict',
  json={'age': 30, 'income': 100000})
print(response.json())  # e.g. {'prediction': 0.725}

KServe (Kubernetes-Native)

Serverless model serving on Kubernetes:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ctr-predictor
spec:
  predictor:
    # v1beta1 canary rollout: point storageUri at the new model and set
    # canaryTrafficPercent; KServe keeps the previous revision serving
    # the remaining 90% of traffic.
    canaryTrafficPercent: 10
    sklearn:
      storageUri: gs://my-bucket/ctr_model_v2
      resources:
        requests:
          cpu: 100m
          memory: 1Gi
        limits:
          cpu: 200m
          memory: 2Gi

Latency Optimization

Staying within the latency budget is critical for real-time systems. Strategies:

Batching

Inference on multiple samples at once (amortizes overhead):

# Per-request: 50ms (preprocessing 5ms + model 40ms + postprocessing 5ms)
# Batch of 32: 52ms total (preprocessing 5ms + model 42ms + postprocessing 5ms)
# Amortized latency: 52ms / 32 ≈ 1.6ms per sample

# Implementation: collect requests, predict as one batch, return in order
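A minimal micro-batcher that flushes on size or deadline might look like this (single-threaded sketch; production servers do the same with a background thread or async loop, as TF Serving's batching does):

```python
import time

class MicroBatcher:
    """Collect requests and flush when the batch is full or a deadline passes.
    infer_fn is assumed to take a list of feature rows and return one
    prediction per row, in order."""
    def __init__(self, infer_fn, max_batch=32, max_wait_ms=5):
        self.infer_fn = infer_fn
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.pending = []
        self.deadline = None

    def submit(self, features):
        """Queue one request; returns batch predictions on flush, else None."""
        if not self.pending:
            self.deadline = time.monotonic() + self.max_wait
        self.pending.append(features)
        if len(self.pending) >= self.max_batch or time.monotonic() >= self.deadline:
            return self.flush()
        return None

    def flush(self):
        # Hand the whole batch to the model in one call.
        batch, self.pending = self.pending, []
        return self.infer_fn(batch)
```

The max_wait deadline bounds the worst-case latency a request pays for waiting on batch-mates.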

Model Compression

Reduce model size without sacrificing accuracy:

Quantization: 32-bit floats → 8-bit integers (4x smaller, 2x faster)

# TensorFlow Lite
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 4MB → 1MB, inference 2x faster

Pruning: Remove unnecessary weights (models have redundancy)
Distillation: Train a small “student” model on a large “teacher” model’s predictions
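Distillation's soft-target objective is compact enough to sketch directly. Pure Python for clarity; the temperature T softens both distributions so the student learns from the teacher's relative confidences, not just its argmax:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives a softer distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's and student's softened outputs;
    minimized when the student matches the teacher's distribution."""
    teacher_probs = softmax(teacher_logits, T)
    student_probs = softmax(student_logits, T)
    return -sum(p_t * math.log(p_s)
                for p_t, p_s in zip(teacher_probs, student_probs))
```

In full distillation training this term is usually mixed with the ordinary hard-label loss.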

Caching

Cache predictions for identical requests (e.g., popular items queried repeatedly):

from functools import lru_cache

@lru_cache(maxsize=100000)
def predict(user_id, item_id):
    return model.predict([[user_id, item_id]])

# First call: 50ms (cache miss, runs inference)
# Second call same user/item: 1ms (cache hit, return immediately)
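One caveat with lru_cache: entries never expire, so after a model update the cache can keep serving predictions from the old model. A TTL cache bounds that staleness (a minimal sketch; production systems would typically use Redis or memcached with a TTL instead):

```python
import time

class TTLCache:
    """Dict-backed cache whose entries expire after ttl_seconds,
    so a stale prediction can't outlive a model update indefinitely."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, key):
        hit = self.store.get(key)
        if hit is None:
            return None
        value, expires = hit
        if time.monotonic() > expires:
            del self.store[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)
```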

Infrastructure

Containerization (Docker)

Ensure reproducibility across dev, staging, production:

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY model.pkl .
COPY app.py .

EXPOSE 5000
CMD ["python", "app.py"]
docker build -t ctr-predictor:v1 .
docker run -p 5000:5000 ctr-predictor:v1

Orchestration (Kubernetes)

Manage replicas, autoscaling, rolling updates:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ctr-predictor
spec:
  replicas: 3  # Run 3 replicas
  selector:
    matchLabels:
      app: ctr-predictor
  template:
    metadata:
      labels:
        app: ctr-predictor
    spec:
      containers:
      - name: predictor
        image: ctr-predictor:v1
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 1Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ctr-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ctr-predictor
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale if CPU > 70%

Key Properties by Deployment Mode

| Property | Real-Time | Batch | Streaming | Edge |
|---|---|---|---|---|
| Latency | 50–200ms | Hours | <1s | <10ms |
| Cost/QPS | $0.001–0.01/prediction | $0.0001/prediction | $0.0005/prediction | Free (on device) |
| Model Update | Hot swap (KServe) | Daily/weekly | Hourly | Manual update |
| Monitoring | Complex (per-request) | Simple (batch stats) | Complex (stream state) | Simple (offline) |
| Data Freshness | Fresh (real-time) | Stale (up to 24h) | Fresh | Fresh |

Implementation Example: Flask API with Canary Deployment

# app.py
from flask import Flask, request, jsonify
import numpy as np
import mlflow.pyfunc
import hashlib

app = Flask(__name__)

# Load models
current_model = mlflow.pyfunc.load_model('models:/ctr_model/production')
canary_model = mlflow.pyfunc.load_model('models:/ctr_model/staging')

CANARY_TRAFFIC_PERCENT = 5  # Route 5% to new model

def get_model(user_id):
    """Route to current or canary model based on user hash."""
    hash_val = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    if (hash_val % 100) < CANARY_TRAFFIC_PERCENT:
        return 'canary', canary_model
    return 'current', current_model

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    user_id = data['user_id']
    features = np.array([[data['age'], data['income']]])

    model_name, model = get_model(user_id)
    prediction = model.predict(features)[0]

    # Log for monitoring
    log_request(user_id, model_name, prediction, features)

    return jsonify({
        'prediction': float(prediction),
        'model': model_name
    })

def log_request(user_id, model, prediction, features):
    """Log predictions for monitoring and canary evaluation."""
    # Write to Kafka/BigQuery for analysis
    pass

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
# Build container
docker build -t ctr-predictor:v1 .

# Deploy to Kubernetes with canary
kubectl apply -f deployment.yaml

# Monitor canary
kubectl logs -l app=ctr-predictor --tail=100 | grep canary
# Gradually increase CANARY_TRAFFIC_PERCENT (5% → 10% → 50% → 100%)

How Real Companies Use This

Netflix’s Weekly Model Deployment (200+ Models Per Week): Netflix deploys 200+ ML models per week across personalization, search ranking, and content delivery systems. Deployment pipeline: model registry → canary (1% traffic for 24 hours) → shadow mode (run old + new model in parallel, log both predictions) → full rollout. Model versioning: maintains last 5 versions in production for instant rollback if issues emerge. A/B testing: mandatory for all recommendation changes, measuring watch hours (business metric) over 2 weeks. Serving latency SLA: <50ms p99 for personalization API (latency >100ms causes user frustration). Infrastructure: Kubernetes-based with auto-scaling (scale up to 1k replicas during peak hours, down to 100 during off-peak). Model compression: quantization (float32 → int8) reduces model size from 500MB to 125MB, cutting serving cost 60%. Failure handling: if canary shows error rate >0.1% or latency >2x baseline, automatic rollback triggers. Timeline from approval to 100% deployment: 5–7 days.

Airbnb’s Smart Pricing Deployment with Revenue Guards: Airbnb’s Smart Pricing model deployment focuses on protecting host and platform revenue. Deployment strategy: blue/green (run current and new model in parallel, switch traffic once new model validated). Canary: new pricing model applied to 1% of listings (~1k properties) for 2 weeks, monitoring revenue per listing. Shadow mode: for weeks 3–4, show both old and new prices to listings (internally), measure guest acceptance. Full rollout: once revenue-per-listing improves >2% with <0.5% host complaints, roll out to 100k listings. Serving latency SLA: <50ms p99 for pricing API. Model compression: quantization + pruning reduce model from 200MB to 40MB. Monitoring: revenue per listing tracked hourly, automatic rollback if drops >2% within 4 hours. A/B testing: concurrent test on 100k listings (50k new model, 50k old) for final validation before full rollout. Total deployment timeline: 8–12 weeks (2 weeks canary + 2 weeks shadow + 2 weeks A/B + 2 weeks gradual rollout).

Google’s TensorFlow Serving at 1M+ QPS: Google deploys ranking models via TensorFlow Serving handling 1M+ inference requests/second across Search, Maps, Gmail, YouTube. Deployment stages: model passes validation gates (metrics must not regress vs champion), A/B test configuration is declarative (YAML), rollback takes <30 seconds. Serving infrastructure: distributed load balancing across 10k+ servers, batch inference (collect 32 requests, infer together for 2x throughput), model quantization (int8 for mobile). Latency optimization: caching (popular queries’ results cached for 1 hour), hardware acceleration (TPU for large BERT models). Model versioning: 5 recent versions always in memory for A/B testing. A/B testing strategy: 1% canary (1% traffic) for 1 week, then 5% for 1 week, then full rollout if metrics improve. Failure handling: automated rollback if error rate exceeds threshold. Deployment frequency: 100+ new ranking models per week, with full pipeline automation.

Lyft’s Model Deployment with Flyte Orchestration: Lyft uses Flyte (open-source workflow orchestration) for ML deployment, handling 10M+ trip predictions daily (ETA, driver assignment, surge pricing). Deployment pipeline: model registry (MLflow) → containerization (Docker) → orchestration (Kubernetes) → monitoring (Prometheus + Grafana). Canary strategy: new model tested on 1% of rides for 4 hours, monitoring latency (p99 <200ms), accuracy (compared to driver feedback), cost (inference cost tracked). Shadow mode: new model runs on all traffic for 24 hours, predictions logged but not used (comparison with old model). A/B test: 50/50 split on 100k rides for 7 days, measuring ride completion rate, driver satisfaction. Serving latency: <100ms p99 critical (real-time driver matching). Model size: quantized models (10–50MB) enable edge deployment on mobile. Failure handling: automatic rollback if driver acceptance rate drops >5%. Timeline: 2–4 weeks per model deployment.

Stripe’s Fraud Model Deployment with Immediate Rollback: Stripe deploys fraud detection models with hair-trigger rollback due to high false positive cost. Deployment: blue/green with continuous monitoring. Canary: new fraud model on 0.1% of transactions (~10k txns/hour, enough to evaluate false positive rate) for 24 hours. SLA: false positive rate <0.1% (1 in 1000 legitimate transactions flagged). Shadow mode: 1 week running new model on all traffic, comparing predictions. A/B test: 50/50 split for 3 days, measuring decline rate (% of legitimate txns declined by new model) and fraud catch rate. Serving latency: <50ms p99 (payment approval waits for decision). Automatic rollback: triggered if false positive rate spikes 2x baseline or latency exceeds 100ms. Infrastructure: redundant serving (multi-region) for disaster recovery. Timeline: 2–4 weeks from approval to full deployment, with continuous monitoring thereafter.


This post is licensed under CC BY 4.0 by the author.