Model Monitoring

The difference between a deployed model and a production ML system: continuous measurement of model and data health, drift detection, automated alerts, and feedback loops that trigger retraining before users are impacted.

The Harsh Reality

Most ML failures happen silently. A model's performance degrades gradually, and nobody notices for weeks. By then, the business impact is substantial:

Timeline:
Jan 1: Model deployed, AUC = 0.75 (baseline)
Jan 7: User behavior changes (competitor entered market)
       Model's AUC degrades to 0.73 (nobody noticed)
Jan 21: Alert fires, degradation detected
        AUC now 0.68 (0.07 below baseline, roughly a 9% relative drop)
        Estimated $500k revenue loss in 3 weeks

Cause: No monitoring system in place

Solution: Automated monitoring with SLA enforcement (must alert within 1 hour of degradation).


Monitoring Stack

A production ML system requires monitoring at four layers:

Application Layer (User-facing):
  - Business metrics (CTR, revenue, engagement)
  - User experience (page load time, conversion rate)

ML Model Layer (Model performance):
  - Accuracy, precision, recall (classification)
  - RMSE, MAE (regression)
  - Latency (P50, P99 percentiles)

Data Layer (Input/output distributions):
  - Feature statistics (mean, std, min, max)
  - Missing values, outliers
  - Label distribution changes

System Layer (Infrastructure):
  - Predictions per second (throughput)
  - Error rate (exceptions, timeouts)
  - Resource usage (CPU, GPU, memory)
  - Cost per prediction
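
The data layer is usually the cheapest place to start. A minimal sketch of the per-feature statistics listed above, using pandas (column names are hypothetical):

```python
import pandas as pd

def feature_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per-feature health statistics for the data layer."""
    return pd.DataFrame({
        "mean": df.mean(numeric_only=True),
        "std": df.std(numeric_only=True),
        "min": df.min(numeric_only=True),
        "max": df.max(numeric_only=True),
        "missing_pct": df.isna().mean(),
    })

# Hypothetical hourly batch of serving-time features
batch = pd.DataFrame({
    "user_age": [23, 31, None, 45],
    "session_len": [12.0, 7.5, 9.9, 3.2],
})
print(feature_stats(batch))
```

Comparing this table against the same statistics computed on the training set is often the first drift signal a team ships.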

Data Drift: Statistical Tests

Data drift = Input distribution (X) changes from training distribution.

Kolmogorov-Smirnov (KS) Test

Measures the maximum difference between two empirical cumulative distribution functions (0 = identical, 1 = completely different):

from scipy.stats import ks_2samp

# Training distribution (reference)
X_train_feature = training_data['user_age']

# Current prediction window (e.g., past 7 days)
X_current_feature = recent_data['user_age']

# KS test
ks_statistic, p_value = ks_2samp(X_train_feature, X_current_feature)

if ks_statistic > 0.1:  # Threshold: >10% difference triggers alert
    print(f"DATA DRIFT DETECTED: age distribution shifted (KS={ks_statistic:.3f})")

Interpretation:

  • KS < 0.05: No drift (distributions stable)
  • KS 0.05–0.1: Minor drift (monitor)
  • KS > 0.1: Significant drift (retrain likely needed)
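
To see these thresholds in action, a self-contained example with synthetic data (the shift size and sample counts are arbitrary):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=35, scale=10, size=10_000)  # training-time feature
stable = rng.normal(loc=35, scale=10, size=10_000)     # same distribution
shifted = rng.normal(loc=42, scale=10, size=10_000)    # mean moved by 7

ks_stable, _ = ks_2samp(reference, stable)
ks_shifted, _ = ks_2samp(reference, shifted)

print(f"stable:  KS={ks_stable:.3f}")   # typically well under 0.05
print(f"shifted: KS={ks_shifted:.3f}")  # well over 0.1 -> retrain likely
```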

Population Stability Index (PSI)

Measures change in a feature's distribution over time using binned frequencies (similar intent to the KS test, different formula):

import numpy as np

def calculate_psi(expected, actual, bins=10):
    """
    PSI = sum((actual_pct - expected_pct) * ln(actual_pct / expected_pct))
    Interpretation:
      PSI < 0.1: No significant change
      PSI 0.1-0.25: Minor change
      PSI > 0.25: Significant change (investigate)
    """
    # Bin edges must come from the reference distribution so both
    # samples are binned identically
    edges = np.histogram_bin_edges(expected, bins=bins)

    def percents(x):
        counts, _ = np.histogram(x, bins=edges)
        # Small epsilon avoids log(0) and division by zero for empty bins
        return np.clip(counts / counts.sum(), 1e-6, None)

    expected_pct = percents(expected)
    actual_pct = percents(actual)

    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

# Monitor all features
for feature in features:
    psi = calculate_psi(
        training_data[feature],
        recent_data[feature]
    )
    if psi > 0.25:
        print(f"PSI ALERT: {feature} shifted (PSI={psi:.3f})")

Concept Drift: Performance-Based Detection

Concept drift = Target distribution (y) or decision boundary changes from training.

Example: User preferences change (competitor entered, product changed).

Strategy 1: Accuracy Monitoring

# Compute accuracy on labeled samples from past week
from datetime import datetime, timedelta

one_week_ago = datetime.now() - timedelta(days=7)
recent_labels = get_labels(after=one_week_ago)  # Feedback collected

predictions = model.predict(recent_labels['X'])
current_accuracy = (predictions == recent_labels['y']).mean()

baseline_accuracy = 0.95  # From development/validation
threshold_drop = 0.05    # Alert if drops >5%

if current_accuracy < baseline_accuracy - threshold_drop:
    print(f"CONCEPT DRIFT: Accuracy dropped from {baseline_accuracy} to {current_accuracy}")
    trigger_retraining()

Challenge: Labeled data may be sparse. Some feedback arrives slowly.
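
One practical guard: only alert when the labeled sample is large enough that the observed drop exceeds sampling noise. A sketch using a normal-approximation confidence interval (the 500-label minimum is an arbitrary choice):

```python
import math

def accuracy_drop_significant(n_labels, current_acc, baseline_acc,
                              min_labels=500, z=1.96):
    """Alert only if the drop exceeds the noise of the accuracy estimate."""
    if n_labels < min_labels:
        return False  # too few labels to tell drift from noise
    stderr = math.sqrt(current_acc * (1 - current_acc) / n_labels)
    # Even the optimistic end of the confidence interval is below baseline
    return current_acc + z * stderr < baseline_acc

print(accuracy_drop_significant(100, 0.90, 0.95))    # False: too noisy
print(accuracy_drop_significant(2_000, 0.90, 0.95))  # True: drop is real
```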

Strategy 2: Distribution Change Detection on Predictions

Track prediction distribution (without needing labels):

# Training distribution (reference)
train_predictions = model.predict_proba(X_train)[:, 1]  # Probability of positive class

# Current predictions (recent week)
current_predictions = model.predict_proba(X_recent)[:, 1]

# Check if distributions differ (e.g., KS test)
ks_stat, p_value = ks_2samp(train_predictions, current_predictions)

if ks_stat > 0.15:
    print(f"PREDICTION DRIFT: model's output distribution changed (KS={ks_stat:.3f})")

Advantage: Works without labels (no feedback delay).
Disadvantage: Doesn’t directly measure accuracy.

Strategy 3: Autoencoder Anomaly Detection

Unsupervised drift detection (no labels or predictions needed):

# Pseudocode: train an autoencoder on the training data.
# The autoencoder learns to reconstruct "normal" inputs, so its
# reconstruction error measures how unusual an input is.

autoencoder = train_autoencoder(X_train)  # illustrative helper

# Monitor mean reconstruction error on current data
current_reconstruction_error = compute_reconstruction_error(autoencoder, X_recent)
baseline_error = compute_reconstruction_error(autoencoder, X_train)

if current_reconstruction_error > baseline_error * 1.5:  # 50% higher
    print(f"ANOMALIES DETECTED: unusual input patterns (error={current_reconstruction_error:.3f})")
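
A real autoencoder needs a deep-learning stack, but the idea can be demonstrated with PCA, which acts as a linear autoencoder: project inputs onto a few components, reconstruct, and measure the error. A numpy-only stand-in (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
# "Normal" training data: two strongly correlated features
base = rng.normal(size=(5_000, 1))
X_train = np.hstack([base, 2 * base + rng.normal(scale=0.1, size=(5_000, 1))])

# Fit a 1-component linear "autoencoder" via SVD (PCA by hand)
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
pc = Vt[:1]  # top principal direction

def reconstruction_error(X):
    X_hat = (X - mu) @ pc.T @ pc + mu  # encode then decode
    return float(np.mean((X - X_hat) ** 2))

baseline_error = reconstruction_error(X_train)

# Drifted data: the correlation the model learned no longer holds
X_drifted = rng.normal(size=(5_000, 2)) * [1.0, 2.0]
current_error = reconstruction_error(X_drifted)

if current_error > baseline_error * 1.5:  # same 50% rule as above
    print(f"ANOMALIES DETECTED (error {current_error:.3f} vs baseline {baseline_error:.3f})")
```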

Prediction Drift: Silent Failures

Prediction drift = Model makes wrong predictions, but nobody notices (yet).

Detection methods:

  1. User corrections (explicit feedback): “This recommendation sucks”
  2. Implicit signals (behavioral feedback): User ignores recommendation, clicks competitor’s
  3. Delayed labels (ground truth): Days/weeks later, we know if prediction was right
import numpy as np

# Example: CTR prediction model
# User was shown ad (model predicted click probability 0.7)
# User didn't click (actual = 0)
# Record: prediction_error = |0.7 - 0| = 0.7

# Aggregate over recent predictions
errors = []
for record in get_recent_predictions(days=7):
    pred = record['predicted_ctr']
    actual = record['user_clicked']
    error = abs(pred - actual)
    errors.append(error)

mean_error = np.mean(errors)  # mean absolute error (rough calibration proxy)
baseline_mean_error = 0.35    # measured at deployment time (illustrative)
if mean_error > baseline_mean_error * 1.2:  # 20% worse
    print(f"PREDICTION DRIFT: MAE {mean_error:.3f} vs baseline {baseline_mean_error:.3f}")

Retraining Triggers & Strategies

When to retrain, and how:

Retraining Strategy Matrix

Trigger            Frequency             Cost         Risk         When to Use
Scheduled          Daily/Weekly/Monthly  Predictable  Low (known)  Baseline; prevents drift accumulation
Performance-based  On accuracy drop      Variable     Medium       Production systems; continuous improvement
Data drift-based   When drift detected   Variable     Medium       Proactive; catch issues early
Manual             Ad-hoc                Low          High         Debugging, hotfixes

Best practice: Combine scheduled + drift-based (weekly retrain + alert on drift)

Implementation: Automated Retraining Pipeline

# Monitoring job (runs daily)
def monitoring_job():
    """Check model health, trigger retrain if needed."""

    # 1. Check for data drift
    current_data = fetch_recent_data(days=7)
    ks_stat = check_drift(current_data)

    # 2. Check for performance degradation
    recent_labels = fetch_labeled_data(days=7)
    current_accuracy = evaluate(recent_labels)

    # 3. Decide: retrain? Collect every reason that applies
    reasons = []

    if ks_stat > 0.1:
        reasons.append(f"Data drift detected (KS={ks_stat:.3f})")

    if current_accuracy < baseline_accuracy - 0.05:
        reasons.append(f"Accuracy dropped to {current_accuracy:.3f}")

    should_retrain = bool(reasons)

    # 4. Trigger retrain if needed
    if should_retrain:
        print(f"Triggering retrain: {'; '.join(reasons)}")
        trigger_retraining_pipeline()
    else:
        print(f"Model health OK (accuracy={current_accuracy}, drift={ks_stat})")

# Retraining job (runs when triggered)
def retraining_pipeline():
    """Retrain model with latest data."""

    # 1. Collect new data (all data since last retrain)
    all_data = fetch_training_data(since_last_retrain=True)

    # 2. Retrain (hyperparameters may change)
    new_model = train_model(all_data)

    # 3. Validate on held-out test set
    test_accuracy = evaluate(new_model, test_data)

    # 4. Compare: is new model better?
    if test_accuracy > current_model_accuracy:
        print(f"New model better ({test_accuracy} > {current_model_accuracy})")
        deploy_model(new_model)  # Canary to 5% → evaluate → full rollout
    else:
        print(f"New model worse; keeping current")

Monitoring Architecture

Simple (Small Company)

Application
    ↓
Log predictions to database (BigQuery)
    ↓
Daily batch job:
  - Compute accuracy (using labels from feedback)
  - Detect drift (KS test)
  - Alert if issues (email to team)
    ↓
If alert: Manual retrain, evaluate, deploy

Tools: BigQuery + Python script + Airflow + email alerts

Complex (Large Company)

Application
    ↓
Stream predictions (Kafka)
    ↓
Real-time monitoring (Flink/Spark Streaming):
  - Compute rolling accuracy (windowed)
  - Detect drift (continuous)
  - Predict P(need retraining) using ML
    ↓
Triggering pipeline (orchestrated):
  - Automated data collection
  - Model retraining (parallel grid search)
  - Validation
  - Canary deployment (if passes)
    ↓
Dashboards (Grafana, Datadog):
  - Real-time metrics
  - Alerting via Slack
  - Rollback on critical failure

Tools: Kafka + Flink + MLflow + Kubernetes + Grafana + PagerDuty


Monitoring Dashboard (Example)

Click Prediction Model — Production Dashboard

  Model Performance (past 7 days):
  - Accuracy:      95.2% (baseline: 95.5%)
  - AUC:           0.752 (baseline: 0.760)
  - Latency P99:   52ms (budget: 100ms)
  - Predictions:   12.3B/day (typical: 10B–15B)

  Data Quality:
  - Data drift (KS): 0.08 (threshold: 0.1)
  - Missing values:  0.2% (typical: 0–0.5%)
  - Class balance:   96% neg, 4% pos

  Retraining Status:
  - Last retrain:   2 days ago
  - Next scheduled: Tomorrow 2am (weekly)
  - Retrains/month: 4 (scheduled 4 + drift 0)

  Alerts: NONE

Feedback Loops

Models improve when the serving system is integrated with feedback collection:

Prediction (model says: "recommend movie X")
    ↓
User action (watched? liked? ignored?)
    ↓
Implicit signal (engagement = feedback)
    ↓
Store as labeled example:
  X = {features}, y = {user_engaged}
    ↓
Use for next retraining

Types of feedback:

  • Explicit: User ratings, reviews (“5 stars”)
  • Implicit: Clicks, views, dwell time (how long user engages)
  • Delayed: Ground truth arrives days/weeks later
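
The loop above boils down to joining each logged prediction with the outcome that arrives later. A minimal in-memory sketch (production systems use a feature store or event log; all names here are hypothetical):

```python
from datetime import datetime, timezone

prediction_log = {}     # request_id -> features + prediction, awaiting feedback
training_examples = []  # completed (X, y) pairs for the next retrain

def log_prediction(request_id, features, score):
    prediction_log[request_id] = {
        "features": features,
        "score": score,
        "ts": datetime.now(timezone.utc),
    }

def log_outcome(request_id, engaged):
    """Implicit feedback arrives later; join it back to the prediction."""
    record = prediction_log.pop(request_id, None)
    if record is not None:
        training_examples.append((record["features"], int(engaged)))

log_prediction("req-1", {"user_age": 34, "genre": "sci-fi"}, score=0.71)
log_outcome("req-1", engaged=True)  # user watched the recommendation
print(training_examples)  # [({'user_age': 34, 'genre': 'sci-fi'}, 1)]
```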

SLA Enforcement

Define and monitor model SLAs (Service Level Agreements):

# Define SLA: (threshold, direction) per metric
SLA = {
    'accuracy':     (0.95,  'min'),   # must be >= 95%
    'latency_p99':  (100,   'max'),   # must be < 100 ms
    'availability': (0.999, 'min'),   # must be >= 99.9% uptime
    'drift':        (0.1,   'max'),   # data drift KS must be < 0.1
}

# Monitor (hourly)
def check_sla():
    metrics = {
        'accuracy': compute_accuracy(past_1h),
        'latency_p99': compute_latency_p99(past_1h),
        'availability': compute_availability(past_1h),
        'drift': compute_drift(past_1h),
    }

    violations = []
    for metric, (threshold, direction) in SLA.items():
        value = metrics[metric]
        # 'min' metrics must stay above the threshold; 'max' metrics below it
        breached = value < threshold if direction == 'min' else value > threshold
        if breached:
            violations.append(f"{metric}: {value} violates {direction} {threshold}")

    if violations:
        send_alert(violations, severity='CRITICAL')
        page_oncall()  # wake someone up
    else:
        log_ok()

Implementation Example: Monitoring Pipeline (Python + Airflow)

# monitoring_dag.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'ml-team',
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'model_monitoring',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),  # Airflow requires a start_date
    schedule_interval='0 * * * *'     # hourly
)

def check_data_drift():
    """Monitor data drift using KS test."""
    current_data = fetch_recent_data(hours=1)
    ks_stat = compute_ks_statistic(current_data)

    if ks_stat > 0.1:
        send_alert(f'Data drift detected: KS={ks_stat}')

def check_performance():
    """Monitor model accuracy."""
    recent_labels = fetch_recent_labels(hours=1)
    accuracy = evaluate_model(recent_labels)

    if accuracy < 0.95:
        send_alert(f'Accuracy dropped: {accuracy}')

def trigger_retrain_if_needed():
    """Decide: should we retrain?"""
    drift = compute_drift()
    accuracy = compute_accuracy()

    if drift > 0.1 or accuracy < 0.95:
        trigger_airflow_job('retraining_pipeline')

# Define DAG tasks
check_drift_task = PythonOperator(task_id='check_drift', python_callable=check_data_drift, dag=dag)
check_perf_task = PythonOperator(task_id='check_performance', python_callable=check_performance, dag=dag)
retrain_task = PythonOperator(task_id='trigger_retrain', python_callable=trigger_retrain_if_needed, dag=dag)

# Define dependencies
check_drift_task >> retrain_task
check_perf_task >> retrain_task

How Real Companies Use This

Uber’s Michelangelo Monitoring (10,000+ Predictions Per Second): Uber monitors 10,000+ model predictions/second across ETA, surge pricing, driver matching, and fraud detection systems. Key metrics monitored: feature drift (Population Stability Index per feature, threshold >0.2 triggers alert), prediction distribution shift (KS-test, p<0.05 signals drift), model accuracy (ground truth arrives 24–48 hours post-trip completion from user ratings). Retraining triggers: automatic retraining when PSI >0.25 on any top-10 feature (e.g., driver acceptance rate changes). Alert hierarchy: minor drift (<0.1 PSI) → log and monitor; moderate drift (0.1–0.25 PSI) → alert ops team; major drift (>0.25) → auto-retrain. Monitoring dashboards: Grafana showing real-time accuracy, latency, feature statistics. SLA enforcement: ETA model must maintain <2-minute RMSE 99% of the time; if breached, retrain within 4 hours. Feedback loops: user ratings (1–5 stars) fed back into retraining pipeline daily, allowing model to adapt to changing traffic patterns.

Lyft’s Manifold Observability Tool (5M+ Daily Trips): Lyft developed Manifold (open-source) for ML observability, monitoring feature distributions and model predictions across 50+ production models. Key signals: feature staleness (if feature pipeline stalls >30 minutes, features marked stale), data completeness (if missing values exceed 1%, alert fires), prediction distribution (histogram comparison with baseline). Concept drift detection: KS-test on model outputs (predictions) comparing current 7-day window to 90-day baseline. Retraining trigger: if KS-test p-value <0.05 (significant distribution shift), manual review queue populated for data scientists (auto-retraining not used due to fraud risk). Post-incident response: if monitoring detects anomaly, inference service falls back to simpler heuristic model (fallback latency: <20ms). Retraining pipeline: completes in <4 hours for quick recovery. Dashboards: Datadog integration showing metric trends, anomaly detection flags, alert history. SLA: alerting latency <5 minutes (detect issues before major user impact).

Stripe’s Fraud Model Monitoring (5M Transactions Daily, Adversarial Drift): Stripe monitors 5M+ transaction predictions daily with focus on false positive rate (FPR) SLA: <0.1% (catch 99.9% of legitimate transactions). Key metrics: fraud detection rate (% of actual fraud caught, target 95%+), false positive rate (legitimate txns incorrectly flagged, SLA <0.1%), model latency (p99 <50ms). Ground truth arrives 30–90 days post-transaction (chargeback notices). Drift detection: temporal holdout validation (train on T-90 to T-30 days, validate on T-30 to T days) simulates production data shift. Concept drift: tracked via precision-recall curve (as fraud patterns evolve, curve shifts). Alert triggers: if false positive rate spikes 3x baseline within 1 day, page oncall engineer. When FPR breaches SLA, model rolls back to previous version automatically. Retraining: triggered weekly on fresh 90-day window (moving window captures latest fraud patterns). Adversarial considerations: fraudsters adapt to model, so monthly architecture experiments run to stay ahead. Monitoring dashboard: internal tool showing FPR hourly, fraud loss trending, model version performance.

LinkedIn’s Feed Ranking Monitoring (Concept Drift in Member Preferences): LinkedIn monitors feed ranking models for concept drift daily, tracking NDCG (offline) and click-through rate (online) across member segments. Key monitored signals: member engagement distribution (do members still click on recommended posts?), content type shifts (are video posts more engaging than text?), temporal signals (time-of-day patterns change seasonally). Drift detection: prediction distribution tracked via histogram comparison (KS-test). Ground truth: delayed 7+ days (need time to accumulate engagement on older posts). Retraining triggers: weekly scheduled retrain on 14-day rolling window + on-demand if engagement drops >5% vs baseline. A/B test monitoring: new ranking variants continuously tested on 1–10% of members, comparing watch time and member satisfaction. Fallback strategy: if new model underperforms, revert to previous within 1 hour. Segment-level monitoring: performance tracked separately by country (engagement patterns differ), device type (mobile vs desktop preferences diverge), member tenure (new vs established member behaviors differ). Alert SLA: engagement drop detected and escalated <2 hours.

Netflix’s Recommendation Monitoring with Churn Prevention: Netflix monitors recommendation model performance focusing on watch hours (business metric) and NDCG (ML metric). Key signals: member watch time distribution (are members finding content to watch?), new-to-catalog discovery rate (are recommendations introducing members to fresh content?), churn rate (do members with poor recommendations leave?). Concept drift: seasonal patterns (holiday viewing, summer blockbusters) detected via trend analysis. Ground truth: available after 7 days (time for member to discover and watch). Retraining triggers: weekly scheduled (capture seasonal trends) + on-demand if watch hours drop >2% vs 4-week baseline. A/B testing: new recommendation models tested on 5–10% of 250M members, measuring watch hours over 2 weeks before full rollout. Fallback: if watch hours decline, revert to previous model within 4 hours. Feedback loops: member skips (quick exits from recommendations), watches (engagement signal), ratings (explicit feedback) all feed into retraining. Monitoring dashboard: Tableau showing watch hours trending, segment-level performance (by country, device), model version comparison. SLA: watch hours protected via alert if drops >2% YoY.


This post is licensed under CC BY 4.0 by the author.