Model Development

Systematic experimentation and iteration: Train candidate algorithms, tune hyperparameters, validate robustly, analyze failures, compare, and select the best model for production.

The Experimentation Loop

Model development is rapid iteration:

```text
Baseline model (simple, interpretable)
         ↓
Interpret results (what works? what fails?)
         ↓
Hypothesis (feature engineering, algorithm change)
         ↓
Experiment (train new model, compare)
         ↓
Better? → YES → update best model
         ↓
         NO  → try a different hypothesis
```

Each experiment takes hours to days; companies run 100+ experiments per week at scale.
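The loop above can be sketched in a few lines. This is a minimal illustration with made-up names (`run_experiment_loop`, the hypothesis list, and the `evaluate` callback are all hypothetical), not a real framework:

```python
def run_experiment_loop(baseline, hypotheses, evaluate):
    """Minimal sketch of the experimentation loop:
    keep whichever candidate scores best on the validation metric."""
    best_name, best_score = "baseline", evaluate(baseline)
    for name, candidate in hypotheses:
        score = evaluate(candidate)
        if score > best_score:  # Better? → YES → update best model
            best_name, best_score = name, score
        # NO → move on to the next hypothesis
    return best_name, best_score


# Toy usage: "models" are just their validation AUCs here
name, score = run_experiment_loop(
    0.68,
    [("add_features", 0.71), ("deeper_trees", 0.74), ("neural_net", 0.72)],
    evaluate=lambda m: m,
)
print(name, score)  # deeper_trees 0.74
```

In practice `evaluate` would train the candidate and return a cross-validated metric; the point is that only the current best survives each iteration.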


Algorithm Selection

Golden Rule: Start simple, add complexity only when needed.

Algorithm by Problem Type

| Problem Type | Algorithms | Data Size | Training Time | Interpretability |
| --- | --- | --- | --- | --- |
| Binary classification | Log Reg, Decision Tree, XGBoost, Neural Net | 1k–1M | sec–min | Tree > Log Reg > NN |
| Multiclass classification | Same as binary | 1k–1M | sec–min | Same |
| Regression | Linear, Ridge, XGBoost, Neural Net | 1k–1M | sec–min | Linear > XGBoost |
| Ranking | LambdaMART, Neural Ranker (LTR) | 100k–10M | min–hour | LambdaMART |
| Clustering | K-Means, DBSCAN, Gaussian Mixture | 1k–1M | sec | K-Means |
| Anomaly detection | Isolation Forest, Autoencoders | 10k–1M | min–hour | Isolation Forest |

Algorithm Selection Criteria

Data Size:

  • Small (<10k): Linear models (logistic regression) or tree-based (avoid overfitting)
  • Medium (10k–1M): Gradient boosting (XGBoost, LightGBM) — best all-around
  • Large (>1M): Neural networks (leverage large data) or ensemble methods

Latency Budget:

  • <10ms: Tree-based (fast inference)
  • 50–200ms: Gradient boosting or simple neural net
  • >1s: Complex neural nets, ensembles

Data Type:

  • Tabular: XGBoost, gradient boosting
  • Images: Convolutional neural networks (ResNet, EfficientNet)
  • Text: Transformers (BERT, GPT), gradient boosting with text features
  • Time-series: RNNs (LSTM), Transformers, ARIMA

Interpretability Requirement:

  • High (finance, healthcare): Linear models, decision trees, SHAP explanations
  • Medium: XGBoost (feature importance available)
  • Low: Neural networks, ensembles
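The criteria above can be collapsed into a rough first-pass heuristic. A minimal sketch, where the function name, thresholds, and return strings are all illustrative starting points rather than hard rules or a real API:

```python
def suggest_model_family(n_samples, latency_budget_ms, needs_interpretability=False):
    """First-pass algorithm choice from the rules of thumb above.
    Hard constraints (interpretability, latency) filter first;
    data size then picks within what remains."""
    if needs_interpretability:
        return "linear model or shallow decision tree"
    if latency_budget_ms < 10:
        return "tree-based model (fast inference)"
    if n_samples < 10_000:
        return "linear model or shallow tree (avoid overfitting)"
    if n_samples <= 1_000_000:
        return "gradient boosting (XGBoost / LightGBM)"
    return "neural network or ensemble"


print(suggest_model_family(500_000, latency_budget_ms=100))
# gradient boosting (XGBoost / LightGBM)
```

Note the ordering: constraints you cannot trade away (regulatory interpretability, a serving SLA) come before the data-size preferences, which are only defaults.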

Baseline Model

Always establish a baseline. Baselines answer the question: “Is our fancy algorithm actually better than a simple one?”

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Baseline 1: logistic regression (linear, interpretable)
lr_baseline = LogisticRegression(max_iter=1000)
lr_baseline.fit(X_train, y_train)
lr_auc = roc_auc_score(y_val, lr_baseline.predict_proba(X_val)[:, 1])
print(f"Baseline (Logistic Regression) AUC: {lr_auc:.3f}")

# Baseline 2: decision tree (non-linear, still simple)
dt_baseline = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_baseline.fit(X_train, y_train)
dt_auc = roc_auc_score(y_val, dt_baseline.predict_proba(X_val)[:, 1])
print(f"Baseline (Decision Tree) AUC: {dt_auc:.3f}")

# Record the stronger baseline for comparison
baseline_auc = max(lr_auc, dt_auc)
print(f"\nBaseline AUC to beat: {baseline_auc:.3f}")
```

Cross-Validation

Cross-validation protects against overfitting to a single validation split. It’s not optional; it’s mandatory.

K-Fold Cross-Validation

```python
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold

# Standard k-fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# Output like "CV AUC: 0.720 +/- 0.025" means the model averages 0.720 AUC,
# varying by about 0.025 across folds

# Stratified k-fold (preserves class balance per fold; use for imbalanced data)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
```

Time-Series Cross-Validation

For time-series data, random CV leaks future information into training folds. Use temporal splits:

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X_train):
    X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]
    # Train on the past, validate on the future
```

Hyperparameter Tuning

Hyperparameters are knobs that control model complexity (learning rate, tree depth, regularization).

Tuning Strategies

Grid Search (Exhaustive):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.5],
    'n_estimators': [100, 200, 500]
}

grid_search = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1  # parallelize across cores
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
```

Random Search (Faster, good for large spaces):

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    'max_depth': [3, 5, 7, 10, 15],
    'learning_rate': [0.001, 0.01, 0.1, 0.5],
    'min_child_weight': [1, 3, 5, 7],
    'subsample': [0.5, 0.7, 0.9, 1.0]
}

random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_dist,
    n_iter=20,  # try 20 random combinations
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")
```

Bayesian Optimization (Smart, efficient):

```python
from optuna import create_study
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.5),
        'n_estimators': trial.suggest_int('n_estimators', 50, 500)
    }
    model = XGBClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()

study = create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(f"Best AUC: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")
```

Error Analysis

After training, analyze where the model fails. This drives next experiments.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report

# Predictions on the validation set
y_pred = model.predict(X_val)
y_pred_proba = model.predict_proba(X_val)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_val, y_pred)
print(f"""
Confusion Matrix:
               Predicted No | Predicted Yes
Actual No:     {cm[0, 0]:7d} |      {cm[0, 1]:7d}
Actual Yes:    {cm[1, 0]:7d} |      {cm[1, 1]:7d}
""")

# Classification report (precision, recall, F1 per class)
print(classification_report(y_val, y_pred))

# Error analysis: where does the model fail most?
errors_df = X_val.copy()
errors_df['actual'] = y_val.values
errors_df['predicted'] = y_pred
errors_df['correct'] = (y_val.values == y_pred)

print("\nAccuracy by user age group:")
print(errors_df.groupby(pd.cut(errors_df['user_age'], bins=[0, 25, 35, 50, 100]))['correct'].mean())

print("\nAccuracy by item category:")
print(errors_df.groupby('item_category')['correct'].mean().sort_values())
# Identify underperforming segments → collect more data or try a different approach
```

Learning Curves

Learning curves reveal if model needs more data or regularization:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training AUC')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation AUC')
plt.xlabel('Training Set Size')
plt.ylabel('AUC')
plt.legend()
plt.title('Learning Curve')
plt.show()

# Interpretation:
# - Large gap between train/val: overfitting → regularize
# - Both curves low: underfitting → richer features or a more complex model
#   (more data alone won't help here)
# - Both curves high and close: good! More data yields only marginal gains
```

Experiment Tracking

Problem: Without logging, you lose track of what worked.

Solution: MLflow (or Weights & Biases, Neptune)

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Start an experiment run
mlflow.start_run(run_name="xgb_v2_increased_depth")

# Log parameters
mlflow.log_params({
    'algorithm': 'XGBoost',
    'max_depth': 8,
    'learning_rate': 0.05,
    'n_estimators': 200
})

# Train and log metrics
model = XGBClassifier(max_depth=8, learning_rate=0.05, n_estimators=200)
model.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

mlflow.log_metrics({
    'train_auc': train_auc,
    'val_auc': val_auc,
    'test_auc': test_auc
})

# Log the model itself
mlflow.sklearn.log_model(model, 'model')

# Log artifacts (plots, reports)
mlflow.log_artifact('learning_curve.png')

mlflow.end_run()

# Later: view all experiments
mlflow.search_runs(experiment_names=['my_project'])
# Compare: XGB v1 (AUC 0.72) vs XGB v2 (AUC 0.75) → v2 is better
```

Model Comparison

After N experiments, select the best model:

| Model | Algorithm | Train AUC | Val AUC | Test AUC | Latency | Training Time | Complexity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Logistic Reg | 0.680 | 0.672 | 0.675 | 1ms | 10s | Low |
| v1 | XGBoost (d=5) | 0.710 | 0.705 | 0.708 | 15ms | 1m | Medium |
| v2 | XGBoost (d=8) | 0.740 | 0.722 | 0.720 | 20ms | 2m | Medium |
| v3 | Neural Net (128-64) | 0.745 | 0.715 | 0.712 | 80ms | 10m | High |
| v4 | Ensemble (XGB+NN) | 0.750 | 0.725 | 0.723 | 90ms | 12m | Very High |

Decision: v2 wins (test AUC 0.720 at 20ms). v3 overfits (large train–val gap despite better training scores); v4’s marginal gain (+0.003 test AUC) doesn’t justify 4.5× the latency and much higher complexity.
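That decision rule can be made explicit: filter by the hard serving constraint first, then maximize validation AUC. A sketch with the numbers copied from the comparison table (the 50ms budget is an assumption for illustration):

```python
experiments = [
    # name, validation AUC, test AUC, inference latency — from the table above
    {"name": "Baseline", "val_auc": 0.672, "test_auc": 0.675, "latency_ms": 1},
    {"name": "v1",       "val_auc": 0.705, "test_auc": 0.708, "latency_ms": 15},
    {"name": "v2",       "val_auc": 0.722, "test_auc": 0.720, "latency_ms": 20},
    {"name": "v3",       "val_auc": 0.715, "test_auc": 0.712, "latency_ms": 80},
    {"name": "v4",       "val_auc": 0.725, "test_auc": 0.723, "latency_ms": 90},
]

LATENCY_BUDGET_MS = 50  # assumed serving budget

# Hard constraint first, then maximize validation AUC among survivors.
# Selecting on validation (not test) AUC keeps the test set an unbiased
# final check rather than a selection criterion.
candidates = [e for e in experiments if e["latency_ms"] <= LATENCY_BUDGET_MS]
best = max(candidates, key=lambda e: e["val_auc"])
print(best["name"])  # v2
```

v4 has the highest AUC overall, but it never enters the candidate pool: constraints prune before metrics rank.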


Model Card & Documentation

Before deployment, document the model:

```markdown
# Model Card: Click Prediction v2

## Overview
XGBoost model predicting ad click probability.
Optimized for AUC (ranking quality); latency budget <50ms.

## Performance (Test Set)
- AUC: 0.720
- Precision@1%: 0.45 (45% of top predictions are correct)
- Recall@1%: 0.62 (catches 62% of clicks in the top 1%)

## Training Details
- Algorithm: XGBoost
- Hyperparameters: max_depth=8, learning_rate=0.05, n_estimators=200
- Training data: 80% of 100M user-item interactions (2025-01-01 to 2025-11-30)
- Training time: 2 hours on 8 GPUs

## Limitations
- Trained on web traffic; mobile app performance unknown
- Performance varies by user segment (±5% by age group)
- Requires fresh user history; new users get baseline recommendations

## Fairness & Bias
- No demographic bias detected (age and gender show <2% AUC gap)
- Bias audit: quarterly, on demographic slices

## Next Steps
- A/B test against baseline (20% traffic)
- Monitor for data drift (weekly dashboard)
- Retrain monthly with new feedback
```

Key Properties by Algorithm

| Algorithm | Training Time | Inference Time | Data Needs | Interpretability | Robustness |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | sec | <1ms | 1k examples | Excellent | Excellent |
| Decision Tree | sec | <1ms | 1k examples | Excellent | Poor |
| Random Forest | min | 10ms | 10k examples | Good | Good |
| XGBoost | min–hour | 20ms | 10k examples | Medium (SHAP) | Excellent |
| Neural Network | hour | 10–100ms | 100k examples | Poor (black box) | Variable |
| Transformer | hours | 100–1000ms | 1M examples | Poor | Variable |

Implementation Example: Full Model Development

```python
import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, GridSearchCV
from xgboost import XGBClassifier

# Set the experiment
mlflow.set_experiment('click_prediction')

# Baseline
print("=== Baseline: Logistic Regression ===")
lr = LogisticRegression(max_iter=1000)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {lr_scores.mean():.3f}")
baseline_auc = lr_scores.mean()

# Experiment 1: XGBoost with default params
print("\n=== Exp 1: XGBoost (default) ===")
with mlflow.start_run(run_name="xgb_default"):
    xgb = XGBClassifier(random_state=42, n_jobs=-1)
    xgb_scores = cross_val_score(xgb, X_train, y_train, cv=5, scoring='roc_auc')
    mlflow.log_metric('cv_auc', xgb_scores.mean())
    print(f"CV AUC: {xgb_scores.mean():.3f}")
    if xgb_scores.mean() > baseline_auc:
        print("Improvement over baseline!")

# Experiment 2: XGBoost with tuned hyperparameters
print("\n=== Exp 2: XGBoost (tuned) ===")
with mlflow.start_run(run_name="xgb_tuned"):
    param_grid = {
        'max_depth': [5, 7, 8],
        'learning_rate': [0.05, 0.1],
        'n_estimators': [200, 300]
    }

    grid_search = GridSearchCV(
        XGBClassifier(random_state=42, n_jobs=-1),
        param_grid,
        cv=5,
        scoring='roc_auc',
        n_jobs=-1
    )
    grid_search.fit(X_train, y_train)

    mlflow.log_params(grid_search.best_params_)
    mlflow.log_metric('cv_auc', grid_search.best_score_)
    print(f"Best CV AUC: {grid_search.best_score_:.3f}")
    print(f"Best params: {grid_search.best_params_}")

    # Evaluate on the held-out test set
    best_model = grid_search.best_estimator_
    test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
    mlflow.log_metric('test_auc', test_auc)
    print(f"Test AUC: {test_auc:.3f}")

# Compare experiments (search_runs exposes run names as 'tags.mlflow.runName')
print("\n=== Experiment Summary ===")
runs = mlflow.search_runs(experiment_names=['click_prediction'])
print(runs[['tags.mlflow.runName', 'metrics.cv_auc', 'metrics.test_auc']]
      .sort_values('metrics.test_auc', ascending=False))
```

How Real Companies Use This

Google Search Ranking (1000+ Experiments Per Year): Google’s search ranking system involves 1000+ A/B experiments yearly to improve NDCG (ranking quality). Model development cycle: teams propose ranking hypotheses, train candidate models on curated 10k query-document pairs (labels from human raters), evaluate offline using LambdaMART (learn-to-rank algorithm). Baseline: previous ranking model. Hyperparameter tuning via Vizier (Bayesian optimization) across 20+ dimensions (learning rate, tree depth, regularization). Single training run: 4–6 hours on 10k TPUs. Algorithm selection: start with LambdaMART (gradient boosting, interpretable feature importance), advance to BERT-based rankers for complex query understanding. Error analysis: NDCG computed per query type (navigational, informational, commercial) to detect segment failures. Experiment tracking: MLflow-equivalent system logs 1000s of runs, enabling comparison across teams. A/B testing: only models passing offline threshold (NDCG +0.5%) advance to online testing (1% traffic for 2 weeks). Key learning: 90% of experiments show no improvement; persistence is critical.

Meta’s Recommendation System Development (50+ Models in Parallel): Meta’s Feed ranking involves 50+ deep learning models operating in concert (embedding models, pairwise rankers, aggregation models). Model development at scale: teams experiment with novel architectures weekly (DLRM enhancements, multi-task learning variations). Baseline: previous ensemble. Data scale: 1TB+ daily user interactions (impressions, engagements). Hyperparameter tuning: HyperOpt searching 20+ dimensions (embedding dimensions: 32–512, MLP layer sizes, learning rates: 0.0001–0.1, dropout: 0.0–0.5). Single training: 30 minutes on 128 A100 GPUs. Error analysis: segment-level performance (tracked separately for users by age, geography, device) revealed cold-start users underperform (new users have sparse features). Solution: content-based features added as fallback for cold users. Experiment tracking: every model version logged with offline metrics (NDCG, coverage) and online metrics (CTR, watch time). A/B testing mandatory: new models tested on 1% of 3B DAU before broader rollout.

Spotify’s Podcast Recommendation (50+ Architectures Over 6 Months): Spotify’s podcast recommendation team experimented with 50+ model architectures over 6 months to improve podcast discovery. Baseline: collaborative filtering (existing system). Data: 1B+ podcast listens per week. Candidates: neural collaborative filtering, graph neural networks, two-tower deep networks, content-based (episode transcript embeddings). Hyperparameter tuning: Random search (50 trials) for each architecture across learning rate, embedding dimension, loss function. Training per trial: 2 hours on 4 GPUs. Cross-validation: stratified by user (new vs returning), podcast (new vs established) to detect overfitting to popular content. Error analysis: classical CF performs well on popular podcasts, neural networks excel on niche podcasts. Winner: two-tower model (separate embeddings for user, podcast; learned via contrastive loss). Performance: 8% better precision@20 vs baseline. A/B test: 10% traffic for 3 weeks, measuring podcast save rate and listen completion.

DoorDash Delivery Time Estimation (Complex Features, Continuous Iteration): DoorDash’s model development for delivery time prediction evolved from simple linear regression (baseline RMSE 15 min) to gradient boosting (RMSE 8 min) to neural networks (RMSE 7 min). Experimentation: 100+ trials per month testing feature engineering, hyperparameter choices, ensemble strategies. Baseline: historical average delivery time per restaurant. Algorithm selection: XGBoost chosen over neural networks due to 3x faster inference (serving latency <50ms critical). Hyperparameter tuning: Bayesian optimization (Optuna) over max_depth (3–15), learning_rate (0.01–0.5), subsample (0.5–1.0). Training: 2 hours on 8 GPUs per trial. Error analysis: performance varies by restaurant type (Chinese takeout: 5% error; fine dining: 20% error due to prep time variation). Segment-level training: separate models for delivery vs pickup (very different time distributions). Cross-validation: time-series split (train on past 60 days, validate on next 14 days) to avoid future leakage. Experiment tracking: 10k runs logged in MLflow, enabling team to see which combinations work.

Netflix’s Recommendation Algorithm Research (Incremental Improvements Over Years): Netflix’s algorithm development for personalized ranking is decades-long (starting with Cinematch in 2006). Each year: 50–100 experiments exploring collaborative filtering refinements, content-based embeddings, contextual factors. Baseline: previous champion model. Data: 500B+ ratings and implicit signals (plays, pauses, skips). Algorithm selection: gradient boosting (stable, interpretable) vs neural CF vs graph neural networks (GNNs to model social influence). Hyperparameter tuning: random search (50 trials) for each candidate, focusing on regularization (model generalizes beyond training data). Training: 6 hours on 16 GPUs per trial. Error analysis: NDCG per member segment (by country, device, subscription tier) revealed mobile users prefer shorter movies (time-constrained). Cold-start strategies: new members receive popularity-based recommendations until enough behavior accumulated. A/B testing: new models tested on 5–10% of 250M members, measuring watch hours (business metric) not just NDCG (ML metric). Learning: marginal improvements (1–2%) compound over years.


This post is licensed under CC BY 4.0 by the author.