Model Deployment
Safely shipping models to production: packaging, staged rollout (shadow mode → canary → A/B test), infrastructure, and strategies to minimize risk and measure impact.
The Deployment Challenge
Most ML failures occur after deployment, not during development. Common causes:
- Model serves on different hardware (different numeric precision)
- Production data differs from training data (concept drift, distribution shift)
- Dependencies change (Python version, library updates)
- Features computed differently at serving time
- Silent failures (model returns wrong predictions, nobody notices for weeks)
Goal: Ship models confidently with measurable business impact.
Deployment Modes
Choose based on latency budget and business requirements:
| Mode | Latency | Cost | Complexity | Best For | Risk |
|---|---|---|---|---|---|
| Real-time API | <100ms | High | Medium | Personalization (ads, recommendations) | High latency impacts UX |
| Batch (scheduled) | hours | Low | Low | Offline predictions (daily scores, email) | Stale predictions |
| Streaming | <1s | Medium | High | Time-sensitive (fraud, anomalies) | Infrastructure complexity |
| Edge (on-device) | <10ms | Low | High | Mobile, IoT (no network calls) | Model size, device power |
| Embedded (library) | <1ms | Low | Low | Simple models (feature transforms) | Limited to simple logic |
Real-Time API (Most Common)
```
User Request → Load Balancer → API Server (Flask/FastAPI)
        ↓
  Feature Engineering
        ↓
  Model Inference (TF Serving/BentoML)
        ↓
  Format Response
        ↓
  Return Prediction (<100ms)
```
Pros: fresh predictions, flexible model updates.
Cons: network latency, infrastructure cost, operational complexity.
Batch Processing
```
Scheduled Job (e.g., 8am daily)
        ↓
Fetch data from data lake (S3, BigQuery)
        ↓
Run feature engineering at scale (Spark, Dataflow)
        ↓
Batch inference (thousands of samples at once)
        ↓
Write results to database
        ↓
Application reads cached predictions
```
Pros: low cost, efficient batch inference, easy to debug.
Cons: stale predictions (scored at 8am, served until 8am the next day).
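The batch flow above can be sketched in a few lines. This is a minimal illustration, not a production job: `fetch_rows`, `predict_batch`, and the in-memory result list are stubs standing in for a data-lake read, a real model, and a database write.

```python
# Minimal sketch of a scheduled batch-scoring job. The data source, model,
# and sink are stubbed; in production these would be e.g. BigQuery, a loaded
# model artifact, and a results table.

def fetch_rows():
    """Stub for reading the day's records from the data lake."""
    return [{"user_id": 1, "age": 30}, {"user_id": 2, "age": 45},
            {"user_id": 3, "age": 22}]

def predict_batch(feature_rows):
    """Stub model: scores a whole batch at once (vectorized in practice)."""
    return [round(min(row["age"] / 100, 1.0), 2) for row in feature_rows]

def run_batch_job(chunk_size=2):
    """Score all rows in chunks and collect (id, score) pairs for writing."""
    rows = fetch_rows()
    results = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        scores = predict_batch(chunk)
        results.extend((r["user_id"], s) for r, s in zip(chunk, scores))
    return results  # in production: write to a database, not return

print(run_batch_job())  # [(1, 0.3), (2, 0.45), (3, 0.22)]
```

Chunking matters because real jobs score millions of rows: it bounds memory while still amortizing per-call overhead.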
Streaming (Apache Kafka)
```
Events stream in (Kafka)
        ↓
Real-time feature computation (Flink, Kafka Streams)
        ↓
Model inference on each event
        ↓
Route to action (fraud = decline, normal = approve)
```
Pros: fresh predictions for every event.
Cons: infrastructure complexity, harder to debug.
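To make the per-event routing concrete, here is a hedged sketch with the stream and the fraud model simulated in plain Python; a real pipeline would consume from Kafka and compute features in Flink or Kafka Streams. The `fraud_score` model and the 0.9 threshold are illustrative assumptions.

```python
# Sketch of per-event streaming inference: score each event as it arrives
# and route it to an action. Model and threshold are stand-ins.

def fraud_score(event):
    """Stub model: larger amounts look riskier."""
    return min(event["amount"] / 10_000, 1.0)

def route(event, threshold=0.9):
    """Score one event and route it (fraud = decline, normal = approve)."""
    return "decline" if fraud_score(event) >= threshold else "approve"

# Simulated event stream standing in for a Kafka consumer loop
events = [{"id": 1, "amount": 50}, {"id": 2, "amount": 9500}]
decisions = [(e["id"], route(e)) for e in events]
print(decisions)  # [(1, 'approve'), (2, 'decline')]
```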
Model Serialization
Models must be saved in a format reproducible across environments.
| Format | Framework | Pros | Cons | Use Case |
|---|---|---|---|---|
| SavedModel | TensorFlow | Standard, optimized | TF-only | TensorFlow models |
| ONNX | Framework-agnostic | Works everywhere | Some ops unsupported | Cross-framework deployment |
| PyTorch JIT | PyTorch | Fast inference | PyTorch-only | PyTorch models |
| Pickle | scikit-learn | Simple | Security risk, Python-only | Sklearn, tree models |
| PMML | Standard | Interoperable | Verbose | Classical ML (trees, linear) |
```python
# TensorFlow
model.save('my_model')  # Saves: saved_model.pb, variables/, assets/

# PyTorch
torch.jit.script(model).save('model.pt')

# scikit-learn (pickle — avoid for untrusted sources)
import joblib
joblib.dump(model, 'model.pkl')

# ONNX (framework-agnostic; sklearn conversion lives in the skl2onnx package)
import onnx
from skl2onnx import to_onnx
from skl2onnx.common.data_types import FloatTensorType
onnx_model = to_onnx(sklearn_model,
                     initial_types=[('input', FloatTensorType([None, 2]))])
onnx.save_model(onnx_model, 'model.onnx')
```
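Whatever format you pick, it is worth verifying the round trip before shipping: serialize, reload, and check that predictions match. The sketch below uses `pickle` and a tiny hand-written model class as a stand-in for a real sklearn or TF model.

```python
# Round-trip check: the reloaded model must reproduce the original
# predictions. TinyModel is an illustrative stub, not a real framework model.
import pickle

class TinyModel:
    def __init__(self, weight, bias):
        self.weight, self.bias = weight, bias

    def predict(self, xs):
        return [self.weight * x + self.bias for x in xs]

model = TinyModel(weight=2.0, bias=1.0)
blob = pickle.dumps(model)        # serialize (joblib.dump in practice)
restored = pickle.loads(blob)     # deserialize in the "serving" process

sample = [0.0, 1.5, 3.0]
assert restored.predict(sample) == model.predict(sample)
print(restored.predict(sample))  # [1.0, 4.0, 7.0]
```

The same check applies across environments: run it in the serving container, not just on the training machine, to catch version and hardware mismatches early.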
Staged Rollout Strategy
Never deploy a new model directly to 100% of traffic: one bad model can break production.
1. Shadow Mode (Risk: Zero)
New model runs in parallel, predictions NOT used by users:
```
User Request
     ↓
[Current Model] → prediction served to the user
[New Model]     → runs in background, prediction only logged
     ↓
Log: current_pred=0.7, new_pred=0.65, user_clicked=1
(compare after the fact)
```
Duration: 1–2 weeks. Metrics:
- Are predictions similar? (correlation > 0.9)
- Any obvious errors? (predictions outside expected range)
- Performance metrics similar? (AUC ~same)
Decision: If metrics look good, proceed to canary. If not, investigate.
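The shadow-mode checklist can be automated over the logged prediction pairs. This is a minimal sketch in plain Python; the 0.9 correlation threshold and the [0, 1] expected range follow the checklist above, and the sample predictions are made up.

```python
# Shadow-mode comparison: Pearson correlation between current and new
# predictions, plus a sanity check for out-of-range values.
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def shadow_report(current_preds, new_preds):
    corr = pearson(current_preds, new_preds)
    out_of_range = sum(1 for p in new_preds if not 0.0 <= p <= 1.0)
    return {"correlation": round(corr, 3),
            "out_of_range": out_of_range,
            "proceed_to_canary": corr > 0.9 and out_of_range == 0}

current = [0.70, 0.20, 0.55, 0.90, 0.10]  # logged current-model predictions
new = [0.65, 0.25, 0.50, 0.88, 0.12]      # logged new-model predictions
print(shadow_report(current, new))
```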
2. Canary Release (Risk: Low)
New model serves small traffic percentage (1–5%), monitor closely:
```
Traffic Distribution:
- 99% Current Model (proven)
-  1% New Model (test)

Monitor:
- Error rate    (> 0.1% → roll back)
- Latency P99   (> 2x baseline → investigate)
- User feedback (complaints?)
```
Duration: 1–7 days. Decision: If errors low and latency OK, increase to 10%, then 50%.
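The monitoring rules above reduce to a small decision function. The thresholds (0.1% error rate, 2x baseline p99) come straight from the checklist; everything else is an assumed sketch.

```python
# Canary health gate: roll back on high error rate, investigate on a
# latency regression, otherwise ramp traffic.

def canary_decision(canary_error_rate, canary_p99_ms, baseline_p99_ms):
    if canary_error_rate > 0.001:          # error rate > 0.1%
        return "rollback"
    if canary_p99_ms > 2 * baseline_p99_ms:  # p99 more than 2x baseline
        return "investigate"
    return "increase_traffic"

assert canary_decision(0.0005, 80, 75) == "increase_traffic"
assert canary_decision(0.002, 80, 75) == "rollback"
assert canary_decision(0.0005, 200, 75) == "investigate"
print("canary gate checks passed")
```

In practice this gate runs on a schedule against metrics from the monitoring system, and the rollback branch triggers automation rather than returning a string.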
3. A/B Test (Risk: Measured)
50/50 split: Half users see current model, half see new model.
```
Randomization: bucket = hash(user_id) % 100
               if bucket < 50: use_new_model = True

Measure for 1 week:
- Business metrics: CTR, revenue, engagement
- Model metrics: AUC, latency
- Statistical significance: t-test, p < 0.05

Decision:
- New model better → 100% rollout
- No difference    → keep current (less risk)
- New model worse  → don't ship (analyze why)
```
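A sketch of the two mechanical pieces: deterministic bucketing and a significance test. Note one assumption: `hashlib.md5` replaces Python's built-in `hash()`, which is randomized per process for strings and therefore unsuitable for stable assignment. A two-proportion z-test on CTR stands in for the generic t-test mentioned above; the click counts are made up.

```python
# 50/50 bucketing plus a two-proportion z-test on click-through rates.
import hashlib
import math

def bucket(user_id):
    """Stable 50/50 assignment from a hash of the user id."""
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return "new_model" if h % 100 < 50 else "current_model"

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic for H0: both arms have the same CTR."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p = (clicks_a + clicks_b) / (n_a + n_b)      # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# |z| > 1.96 corresponds to p < 0.05 (two-sided)
z = two_proportion_z(clicks_a=5000, n_a=100_000, clicks_b=5400, n_b=100_000)
print(round(z, 2), "significant" if abs(z) > 1.96 else "not significant")
```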
Serving Frameworks
TensorFlow Serving
Optimized for TensorFlow models, handles versioning:
```python
# Save model (TF Serving expects a numeric version subdirectory)
model.save('models/ctr_model/1')
```

```shell
# Run TF Serving (containerized); the REST API listens on port 8501
docker run -p 8501:8501 -v /models:/models \
    tensorflow/serving:latest \
    --model_config_file=/models/models.conf
```

```python
# Client prediction via the REST API
import requests
response = requests.post(
    'http://localhost:8501/v1/models/ctr_model:predict',
    json={'instances': [{'age': 30, 'income': 100000}]})
```
BentoML (Framework-Agnostic)
Works with any framework (PyTorch, scikit-learn, XGBoost):
```python
# service.py (BentoML 1.x Service/runner API)
import bentoml
from bentoml.io import JSON

runner = bentoml.sklearn.get("ctr_model:latest").to_runner()
svc = bentoml.Service("ctr_predictor", runners=[runner])

@svc.api(input=JSON(), output=JSON())
def predict(request: dict) -> dict:
    features = [[request["age"], request["income"]]]
    return {"prediction": float(runner.predict.run(features)[0])}
```

```shell
# Serve locally
bentoml serve service:svc --port 3000
```

```python
# Client
import requests
response = requests.post('http://localhost:3000/predict',
                         json={'age': 30, 'income': 100000})
print(response.json())  # e.g. {'prediction': 0.725}
```
KServe (Kubernetes-Native)
Serverless model serving on Kubernetes:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ctr-predictor
spec:
  predictor:
    # When the spec is updated (e.g. storageUri pointed at ctr_model_v2),
    # the new revision receives 10% of traffic as a canary.
    canaryTrafficPercent: 10
    sklearn:
      storageUri: gs://my-bucket/ctr_model
      resources:
        requests:
          cpu: 100m
          memory: 1Gi
        limits:
          cpu: 200m
          memory: 2Gi
```
Latency Optimization
Staying under the latency budget is critical for real-time systems. Strategies:
Batching
Inference on multiple samples at once (amortizes overhead):
```
Per-request:  50ms (preprocessing 5ms + model 40ms + postprocessing 5ms)
Batch of 32:  40ms total (preprocessing 5ms + model 30ms + postprocessing 5ms)
Amortized:    40ms / 32 ≈ 1.3ms per sample

Implementation: buffer incoming requests, predict once per batch,
return results in arrival order.
```
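The buffering step can be sketched as follows. This is a simplified synchronous version: a real server collects requests concurrently and also flushes on a timeout (e.g. every few milliseconds), and `predict_batch` is a stub for a vectorized model call.

```python
# Micro-batching sketch: group requests, one model call per batch, results
# returned in arrival order.

def predict_batch(features):
    """Stub vectorized model: one pass over the whole batch."""
    return [sum(f) for f in features]

def serve(requests, max_batch=32):
    """Score all pending requests in batches of at most max_batch."""
    results = []
    for start in range(0, len(requests), max_batch):
        batch = requests[start:start + max_batch]
        results.extend(predict_batch(batch))  # one model call per batch
    return results

reqs = [[1, 2], [3, 4], [5, 6]]
print(serve(reqs, max_batch=2))  # [3, 7, 11]
```

The batch size is a latency/throughput trade-off: larger batches amortize more overhead but make the first request in the batch wait longer.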
Model Compression
Reduce model size with minimal accuracy loss:
Quantization: 32-bit floats → 8-bit integers (4x smaller, 2x faster)
```python
# TensorFlow Lite post-training quantization
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 4MB → 1MB, inference ~2x faster
```
Pruning: remove unnecessary weights (models have redundancy)
Distillation: train a small "student" model on a large "teacher" model's predictions
Caching
Cache predictions for identical requests (e.g., popular items queried repeatedly):
```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def predict(user_id, item_id):
    return model.predict([[user_id, item_id]])

# First call: ~50ms (cache miss, runs inference)
# Second call, same user/item: ~1ms (cache hit, returns immediately)
```
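One caveat: `lru_cache` never expires entries, so cached predictions can go stale after a model update or a feature change. A minimal TTL wrapper (a hypothetical helper, not a library API) is sketched below; in production a shared store like Redis with a TTL plays this role.

```python
# Minimal TTL cache: entries older than ttl_seconds are recomputed.
# The injectable clock makes expiry testable without sleeping.
import time

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl, self.clock, self.store = ttl_seconds, clock, {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]                  # fresh cache hit
        value = compute()                  # miss or expired: recompute
        self.store[key] = (value, now)
        return value

cache = TTLCache(ttl_seconds=60)
calls = []

def infer():
    calls.append(1)   # count how often real inference runs
    return 0.7

cache.get_or_compute(("u1", "i9"), infer)  # miss → runs inference
cache.get_or_compute(("u1", "i9"), infer)  # hit → cached value
print(len(calls))  # 1
```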
Infrastructure
Containerization (Docker)
Ensure reproducibility across dev, staging, production:
```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY model.pkl .
COPY app.py .
EXPOSE 5000
CMD ["python", "app.py"]
```

```shell
docker build -t ctr-predictor:v1 .
docker run -p 5000:5000 ctr-predictor:v1
```
Orchestration (Kubernetes)
Manage replicas, autoscaling, rolling updates:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ctr-predictor
spec:
  replicas: 3          # Run 3 replicas
  selector:
    matchLabels:
      app: ctr-predictor
  template:
    metadata:
      labels:
        app: ctr-predictor
    spec:
      containers:
        - name: predictor
          image: ctr-predictor:v1
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ctr-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ctr-predictor
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale up when average CPU > 70%
```
Key Properties by Deployment Mode
| Property | Real-Time | Batch | Streaming | Edge |
|---|---|---|---|---|
| Latency | 50–200ms | hours | <1s | <10ms |
| Cost/QPS | $0.001–0.01/prediction | $0.0001/prediction | $0.0005/prediction | Free (device) |
| Model Update | Hot swap (KServe) | Daily/weekly | Hourly | Manual update |
| Monitoring | Complex (per-request) | Simple (batch stats) | Complex (stream state) | Simple (offline) |
| Data Freshness | Fresh (real-time) | Stale (up to 24h) | Fresh | Fresh |
Implementation Example: Flask API with Canary Deployment
```python
# app.py
from flask import Flask, request, jsonify
import numpy as np
import mlflow.pyfunc
import hashlib

app = Flask(__name__)

# Load models from the registry
current_model = mlflow.pyfunc.load_model('models:/ctr_model/production')
canary_model = mlflow.pyfunc.load_model('models:/ctr_model/staging')

CANARY_TRAFFIC_PERCENT = 5  # Route 5% of traffic to the new model


def get_model(user_id):
    """Route to current or canary model based on user hash."""
    hash_val = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    if (hash_val % 100) < CANARY_TRAFFIC_PERCENT:
        return 'canary', canary_model
    return 'current', current_model


@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    user_id = data['user_id']
    features = np.array([[data['age'], data['income']]])

    model_name, model = get_model(user_id)
    prediction = model.predict(features)[0]

    # Log for monitoring
    log_request(user_id, model_name, prediction, features)

    return jsonify({
        'prediction': float(prediction),
        'model': model_name
    })


def log_request(user_id, model, prediction, features):
    """Log predictions for monitoring and canary evaluation."""
    # Write to Kafka/BigQuery for analysis
    pass


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
```shell
# Build container
docker build -t ctr-predictor:v1 .

# Deploy to Kubernetes with canary
kubectl apply -f deployment.yaml

# Monitor canary
kubectl logs -l app=ctr-predictor --tail=100 | grep canary

# Gradually increase CANARY_TRAFFIC_PERCENT (5% → 10% → 50% → 100%)
```
How Real Companies Use This
Netflix’s Weekly Model Deployment (200+ Models Per Week): Netflix deploys 200+ ML models per week across personalization, search ranking, and content delivery systems. Deployment pipeline: model registry → canary (1% traffic for 24 hours) → shadow mode (run old + new model in parallel, log both predictions) → full rollout. Model versioning: maintains last 5 versions in production for instant rollback if issues emerge. A/B testing: mandatory for all recommendation changes, measuring watch hours (business metric) over 2 weeks. Serving latency SLA: <50ms p99 for personalization API (latency >100ms causes user frustration). Infrastructure: Kubernetes-based with auto-scaling (scale up to 1k replicas during peak hours, down to 100 during off-peak). Model compression: quantization (float32 → int8) reduces model size from 500MB to 125MB, cutting serving cost 60%. Failure handling: if canary shows error rate >0.1% or latency >2x baseline, automatic rollback triggers. Timeline from approval to 100% deployment: 5–7 days.
Airbnb’s Smart Pricing Deployment with Revenue Guards: Airbnb’s Smart Pricing model deployment focuses on protecting host and platform revenue. Deployment strategy: blue/green (run current and new model in parallel, switch traffic once new model validated). Canary: new pricing model applied to 1% of listings (~1k properties) for 2 weeks, monitoring revenue per listing. Shadow mode: for weeks 3–4, show both old and new prices to listings (internally), measure guest acceptance. Full rollout: once revenue-per-listing improves >2% with <0.5% host complaints, roll out to 100k listings. Serving latency SLA: <50ms p99 for pricing API. Model compression: quantization + pruning reduce model from 200MB to 40MB. Monitoring: revenue per listing tracked hourly, automatic rollback if drops >2% within 4 hours. A/B testing: concurrent test on 100k listings (50k new model, 50k old) for final validation before full rollout. Total deployment timeline: 8–12 weeks (2 weeks canary + 2 weeks shadow + 2 weeks A/B + 2 weeks gradual rollout).
Google’s TensorFlow Serving at 1M+ QPS: Google deploys ranking models via TensorFlow Serving handling 1M+ inference requests/second across Search, Maps, Gmail, YouTube. Deployment stages: model passes validation gates (metrics must not regress vs champion), A/B test configuration is declarative (YAML), rollback takes <30 seconds. Serving infrastructure: distributed load balancing across 10k+ servers, batch inference (collect 32 requests, infer together for 2x throughput), model quantization (int8 for mobile). Latency optimization: caching (popular queries’ results cached for 1 hour), hardware acceleration (TPU for large BERT models). Model versioning: 5 recent versions always in memory for A/B testing. A/B testing strategy: 1% canary (1% traffic) for 1 week, then 5% for 1 week, then full rollout if metrics improve. Failure handling: automated rollback if error rate exceeds threshold. Deployment frequency: 100+ new ranking models per week, with full pipeline automation.
Lyft’s Model Deployment with Flyte Orchestration: Lyft uses Flyte (open-source workflow orchestration) for ML deployment, handling 10M+ trip predictions daily (ETA, driver assignment, surge pricing). Deployment pipeline: model registry (MLflow) → containerization (Docker) → orchestration (Kubernetes) → monitoring (Prometheus + Grafana). Canary strategy: new model tested on 1% of rides for 4 hours, monitoring latency (p99 <200ms), accuracy (compared to driver feedback), cost (inference cost tracked). Shadow mode: new model runs on all traffic for 24 hours, predictions logged but not used (comparison with old model). A/B test: 50/50 split on 100k rides for 7 days, measuring ride completion rate, driver satisfaction. Serving latency: <100ms p99 critical (real-time driver matching). Model size: quantized models (10–50MB) enable edge deployment on mobile. Failure handling: automatic rollback if driver acceptance rate drops >5%. Timeline: 2–4 weeks per model deployment.
Stripe’s Fraud Model Deployment with Immediate Rollback: Stripe deploys fraud detection models with hair-trigger rollback due to high false positive cost. Deployment: blue/green with continuous monitoring. Canary: new fraud model on 0.1% of transactions (~10k txns/hour, enough to evaluate false positive rate) for 24 hours. SLA: false positive rate <0.1% (1 in 1000 legitimate transactions flagged). Shadow mode: 1 week running new model on all traffic, comparing predictions. A/B test: 50/50 split for 3 days, measuring decline rate (% of legitimate txns declined by new model) and fraud catch rate. Serving latency: <50ms p99 (payment approval waits for decision). Automatic rollback: triggered if false positive rate spikes 2x baseline or latency exceeds 100ms. Infrastructure: redundant serving (multi-region) for disaster recovery. Timeline: 2–4 weeks from approval to full deployment, with continuous monitoring thereafter.
References
- Designing Machine Learning Systems (Chip Huyen) — Comprehensive deployment chapter
- Hidden Technical Debt in Machine Learning Systems (Google, 2015) — Why ML systems fail
- TensorFlow Serving Documentation — Production TF model serving
- BentoML Documentation — Framework-agnostic serving
- KServe Documentation — Kubernetes-native serving
- Chip Huyen: ML Systems Design – Deployment — Deployment strategies
- Introduction to MLOps (Burkov) — Operations perspective
- Continuous Integration/Deployment for Machine Learning (Sculley et al.) — MLOps best practices