Model Development
Systematic experimentation and iteration: Train candidate algorithms, tune hyperparameters, validate robustly, analyze failures, compare, and select the best model for production.
The Experimentation Loop
Model development is rapid iteration:
```
Baseline Model (simple, interpretable)
        ↓
Interpret results (what works? what fails?)
        ↓
Hypothesis (feature engineering, algorithm change)
        ↓
Experiment (train new model, compare)
        ↓
Better? → YES → Update best model
        ↓
        NO → Try different hypothesis
```
Each experiment takes hours to days; companies run 100+ experiments per week at scale.
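The loop above can be sketched as a simple driver that keeps the best model seen so far; the candidate list and synthetic data here are hypothetical stand-ins for your own experiments:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# Candidate hypotheses, ordered simple -> complex
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf_shallow": RandomForestClassifier(max_depth=3, random_state=42),
    "rf_deep": RandomForestClassifier(max_depth=10, random_state=42),
}

best_name, best_auc = None, 0.0
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    if auc > best_auc:  # Better? -> YES -> update best model
        best_name, best_auc = name, auc
    print(f"{name}: CV AUC {auc:.3f} (best so far: {best_name})")

print(f"Winner: {best_name} with CV AUC {best_auc:.3f}")
```

In practice each iteration of this loop is a tracked experiment (see Experiment Tracking below), not three lines in a dict.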
Algorithm Selection
Golden Rule: Start simple, add complexity only when needed.
Algorithm by Problem Type
| Problem Type | Algorithms | Data Size | Training Time | Interpretability |
|---|---|---|---|---|
| Binary Classification | Log Reg, Decision Tree, XGBoost, Neural Net | 1k–1M | sec–min | Tree > Log Reg > NN |
| Multiclass Classification | Same as binary | 1k–1M | sec–min | Same |
| Regression | Linear, Ridge, XGBoost, Neural Net | 1k–1M | sec–min | Linear > XGBoost |
| Ranking | LambdaMART, Neural Ranker (LTR) | 100k–10M | min–hour | LambdaMART |
| Clustering | K-Means, DBSCAN, Gaussian Mixture | 1k–1M | sec | K-Means |
| Anomaly Detection | Isolation Forest, Autoencoders | 10k–1M | min–hour | Isolation Forest |
Algorithm Selection Criteria
Data Size:
- Small (<10k): Linear models (logistic regression) or shallow trees (high-capacity models overfit small data)
- Medium (10k–1M): Gradient boosting (XGBoost, LightGBM) — best all-around
- Large (>1M): Neural networks (leverage large data) or ensemble methods
Latency Budget:
- <10ms: Tree-based (fast inference)
- 50–200ms: Gradient boosting or simple neural net
- >1s: Complex neural nets, ensembles
Data Type:
- Tabular: XGBoost, gradient boosting
- Images: Convolutional neural networks (ResNet, EfficientNet)
- Text: Transformers (BERT, GPT), gradient boosting with text features
- Time-series: RNNs (LSTM), Transformers, ARIMA
Interpretability Requirement:
- High (finance, healthcare): Linear models, decision trees, SHAP explanations
- Medium: XGBoost (feature importance available)
- Low: Neural networks, ensembles
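The criteria above can be folded into a rough rule-of-thumb helper. This is a sketch of the heuristics listed, not a definitive decision procedure; the thresholds mirror the bullets:

```python
def suggest_algorithm(n_rows: int, latency_ms: float, needs_interpretability: bool) -> str:
    """Rough heuristic combining data size, latency budget, and interpretability."""
    if needs_interpretability:
        return "linear model / decision tree (+ SHAP explanations)"
    if latency_ms < 10:
        return "tree-based (fast inference)"
    if n_rows < 10_000:
        return "linear model or shallow tree"
    if n_rows <= 1_000_000:
        return "gradient boosting (XGBoost / LightGBM)"
    return "neural network or ensemble"

print(suggest_algorithm(50_000, 100, False))     # medium data, relaxed latency
print(suggest_algorithm(5_000_000, 500, False))  # large data
```

Data type (tabular vs images vs text) usually overrides all of these: CNNs/Transformers for images and text regardless of size.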
Baseline Model
Always establish a baseline first. A baseline answers: “Is our fancy algorithm actually better than a simple one?”
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Baseline 1: Logistic regression (linear, interpretable)
lr_baseline = LogisticRegression(max_iter=1000)
lr_baseline.fit(X_train, y_train)
lr_auc = roc_auc_score(y_val, lr_baseline.predict_proba(X_val)[:, 1])
print(f"Baseline (Logistic Regression) AUC: {lr_auc:.3f}")

# Baseline 2: Decision tree (non-linear, simple)
dt_baseline = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_baseline.fit(X_train, y_train)
dt_auc = roc_auc_score(y_val, dt_baseline.predict_proba(X_val)[:, 1])
print(f"Baseline (Decision Tree) AUC: {dt_auc:.3f}")

# Record the baseline to beat
baseline_auc = max(lr_auc, dt_auc)
print(f"\nBaseline AUC to beat: {baseline_auc:.3f}")
```
Cross-Validation
Cross-validation protects against overfitting to a single validation split. It’s not optional; it’s mandatory.
K-Fold Cross-Validation
```python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Standard k-fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# Output like "CV AUC: 0.720 +/- 0.025" means the model averages 0.720 AUC,
# with a standard deviation of 0.025 across folds

# Stratified k-fold (preserves class balance per fold; use for imbalanced data)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
```
Time-Series Cross-Validation
For time-series, don’t use random CV — shuffling leaks future information into training. Use temporal splits:
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X_train):
    X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[test_idx]
    # Train on the past, validate on the future
```
Hyperparameter Tuning
Hyperparameters are knobs that control model complexity (learning rate, tree depth, regularization).
Tuning Strategies
Grid Search (Exhaustive):
```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.5],
    'n_estimators': [100, 200, 500]
}
grid_search = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1  # Use all cores in parallel
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
```
Random Search (Faster, good for large spaces):
```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    'max_depth': [3, 5, 7, 10, 15],
    'learning_rate': [0.001, 0.01, 0.1, 0.5],
    'min_child_weight': [1, 3, 5, 7],
    'subsample': [0.5, 0.7, 0.9, 1.0]
}
random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_dist,
    n_iter=20,  # Try 20 random combinations instead of all 320
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
```
Bayesian Optimization (Smart, efficient):
```python
from optuna import create_study
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.5),
        'n_estimators': trial.suggest_int('n_estimators', 50, 500)
    }
    model = XGBClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()

study = create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(f"Best AUC: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")
```
Error Analysis
After training, analyze where the model fails. This drives next experiments.
```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Predictions on validation set
y_pred = model.predict(X_val)
y_pred_proba = model.predict_proba(X_val)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_val, y_pred)
print(f"""
Confusion Matrix:
             Predicted No | Predicted Yes
Actual No:  {cm[0, 0]:7d} | {cm[0, 1]:7d}
Actual Yes: {cm[1, 0]:7d} | {cm[1, 1]:7d}
""")

# Classification report (precision, recall, F1 per class)
print(classification_report(y_val, y_pred))

# Error analysis: segment performance — where does the model fail most?
errors_df = X_val.copy()
errors_df['actual'] = y_val.values
errors_df['predicted'] = y_pred
errors_df['correct'] = (y_val.values == y_pred)

print("\nAccuracy by user age group:")
print(errors_df.groupby(pd.cut(errors_df['user_age'], bins=[0, 25, 35, 50, 100]))['correct'].mean())

print("\nAccuracy by item category:")
print(errors_df.groupby('item_category')['correct'].mean().sort_values())

# Identify underperforming segments → retrain with more data or a different approach
```
Learning Curves
Learning curves reveal whether the model needs more data or more regularization:
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training AUC')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation AUC')
plt.xlabel('Training Set Size')
plt.ylabel('AUC')
plt.legend()
plt.title('Learning Curve')
plt.show()

# Interpretation:
# - Large gap between train and val curves: overfitting → regularize or add data
# - Both curves low: underfitting → add features or use a more complex model
# - Both curves high and close: good! More data yields only marginal gains
```
Experiment Tracking
Problem: Without logging, you lose track of what worked.
Solution: MLflow (or Weights & Biases, Neptune)
```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Start experiment run
mlflow.start_run(run_name="xgb_v2_increased_depth")

# Log parameters
mlflow.log_params({
    'algorithm': 'XGBoost',
    'max_depth': 8,
    'learning_rate': 0.05,
    'n_estimators': 200
})

# Train and log metrics
model = XGBClassifier(max_depth=8, learning_rate=0.05, n_estimators=200)
model.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

mlflow.log_metrics({
    'train_auc': train_auc,
    'val_auc': val_auc,
    'test_auc': test_auc
})

# Log the model itself
mlflow.sklearn.log_model(model, 'model')

# Log artifacts (plots, reports)
mlflow.log_artifact('learning_curve.png')

mlflow.end_run()

# Later: view all experiments
mlflow.search_runs(experiment_names=['my_project'])
# Compare: XGB v1 (AUC 0.72) vs XGB v2 (AUC 0.75) → v2 is better
```
Model Comparison
After N experiments, select the best model:
| Model | Algorithm | Train AUC | Val AUC | Test AUC | Latency | Training Time | Complexity |
|---|---|---|---|---|---|---|---|
| Baseline | Logistic Reg | 0.680 | 0.672 | 0.675 | 1ms | 10s | Low |
| v1 | XGBoost (d=5) | 0.710 | 0.705 | 0.708 | 15ms | 1m | Medium |
| v2 | XGBoost (d=8) | 0.740 | 0.722 | 0.720 | 20ms | 2m | Medium |
| v3 | Neural Net (128-64) | 0.745 | 0.715 | 0.712 | 80ms | 10m | High |
| v4 | Ensemble (XGB+NN) | 0.750 | 0.725 | 0.723 | 90ms | 12m | Very High |
Decision: v2 wins — best test AUC within a reasonable latency budget (0.720 at 20ms). v3 overfits (train 0.745 vs test 0.712), and v4’s marginal test gain (+0.003) doesn’t justify 4.5x the latency.
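That selection rule — best held-out score subject to latency and overfitting constraints — can be sketched with pandas. The numbers come from the table above; the 50ms budget and 0.025 gap threshold are illustrative assumptions:

```python
import pandas as pd

# The comparison table from above, as data
runs = pd.DataFrame({
    "model":      ["baseline", "v1", "v2", "v3", "v4"],
    "train_auc":  [0.680, 0.710, 0.740, 0.745, 0.750],
    "test_auc":   [0.675, 0.708, 0.720, 0.712, 0.723],
    "latency_ms": [1, 15, 20, 80, 90],
})
runs["overfit_gap"] = runs["train_auc"] - runs["test_auc"]

# Constraints: latency budget 50ms, train/test gap under 0.025 (assumed thresholds)
eligible = runs[(runs["latency_ms"] <= 50) & (runs["overfit_gap"] < 0.025)]

# Among eligible models, pick the best test AUC
winner = eligible.loc[eligible["test_auc"].idxmax()]
print(winner["model"])  # v2
```

Encoding the rule as code makes the trade-off explicit and repeatable across future experiment rounds.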
Model Card & Documentation
Before deployment, document the model:
```markdown
# Model Card: Click Prediction v2

## Overview
XGBoost model predicting ad click probability.
Optimized for AUC (ranking quality); latency budget <50ms.

## Performance (Test Set)
- AUC: 0.720
- Precision@1%: 0.45 (45% of the top 1% of predictions are actual clicks)
- Recall@1%: 0.62 (the top 1% of predictions captures 62% of all clicks)

## Training Details
- Algorithm: XGBoost
- Hyperparameters: max_depth=8, learning_rate=0.05, n_estimators=200
- Training data: 80% of 100M user-item interactions (2025-01-01 to 2025-11-30)
- Training time: 2 hours on 8 GPUs

## Limitations
- Model trained on web traffic; mobile app performance unknown
- Performance varies by user segment (±5% by age group)
- Requires fresh user history; new users get baseline recommendations

## Fairness & Bias
- No demographic bias detected (age, gender have <2% AUC gap)
- Bias audit: every quarter on demographic slices

## Next Steps
- A/B test against baseline (20% traffic)
- Monitor for data drift (weekly dashboard)
- Retrain monthly with new feedback
```
Key Properties by Algorithm
| Algorithm | Training Time | Inference Time | Data Needs | Interpretability | Robustness |
|---|---|---|---|---|---|
| Logistic Regression | sec | <1ms | 1k examples | Excellent | Excellent |
| Decision Tree | sec | <1ms | 1k examples | Excellent | Poor |
| Random Forest | min | 10ms | 10k examples | Good | Good |
| XGBoost | min–hour | 20ms | 10k examples | Medium (SHAP) | Excellent |
| Neural Network | hour | 10–100ms | 100k examples | Poor (black box) | Variable |
| Transformer | hours | 100–1000ms | 1M examples | Poor | Variable |
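Inference-time figures like those in the table are easy to verify empirically for your own model. A minimal timing sketch on synthetic data (the model and sizes are arbitrary stand-ins):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Time repeated single-row predictions: the cost online serving pays per request
row = X[:1]
n = 1_000
start = time.perf_counter()
for _ in range(n):
    model.predict_proba(row)
latency_ms = (time.perf_counter() - start) / n * 1_000
print(f"Mean single-row latency: {latency_ms:.3f} ms")
```

Note that batch prediction amortizes per-call overhead, so benchmark with the batch size you'll actually serve.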
Implementation Example: Full Model Development
```python
import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier

# Set experiment
mlflow.set_experiment('click_prediction')

# Baseline
print("=== Baseline: Logistic Regression ===")
lr = LogisticRegression(max_iter=1000)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {lr_scores.mean():.3f}")
baseline_auc = lr_scores.mean()

# Experiment 1: XGBoost with default params
print("\n=== Exp 1: XGBoost (default) ===")
with mlflow.start_run(run_name="xgb_default"):
    xgb = XGBClassifier(random_state=42, n_jobs=-1)
    xgb_scores = cross_val_score(xgb, X_train, y_train, cv=5, scoring='roc_auc')
    mlflow.log_metric('cv_auc', xgb_scores.mean())
    print(f"CV AUC: {xgb_scores.mean():.3f}")
    if xgb_scores.mean() > baseline_auc:
        print("Improvement over baseline!")

# Experiment 2: XGBoost with tuned hyperparameters
print("\n=== Exp 2: XGBoost (tuned) ===")
with mlflow.start_run(run_name="xgb_tuned"):
    param_grid = {
        'max_depth': [5, 7, 8],
        'learning_rate': [0.05, 0.1],
        'n_estimators': [200, 300]
    }
    grid_search = GridSearchCV(
        XGBClassifier(random_state=42, n_jobs=-1),
        param_grid,
        cv=5,
        scoring='roc_auc',
        n_jobs=-1
    )
    grid_search.fit(X_train, y_train)
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_metric('cv_auc', grid_search.best_score_)
    print(f"Best CV AUC: {grid_search.best_score_:.3f}")
    print(f"Best params: {grid_search.best_params_}")

    # Evaluate on the held-out test set
    best_model = grid_search.best_estimator_
    test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
    mlflow.log_metric('test_auc', test_auc)
    print(f"Test AUC: {test_auc:.3f}")

# Compare experiments (run names live in the tags.mlflow.runName column)
print("\n=== Experiment Summary ===")
runs = mlflow.search_runs(experiment_names=['click_prediction'])
print(runs[['tags.mlflow.runName', 'metrics.cv_auc', 'metrics.test_auc']]
      .sort_values('metrics.test_auc', ascending=False))
```
How Real Companies Use This
Google Search Ranking (1000+ Experiments Per Year): Google’s search ranking system involves 1000+ A/B experiments yearly to improve NDCG (ranking quality). Model development cycle: teams propose ranking hypotheses, train candidate models on curated 10k query-document pairs (labels from human raters), evaluate offline using LambdaMART (learn-to-rank algorithm). Baseline: previous ranking model. Hyperparameter tuning via Vizier (Bayesian optimization) across 20+ dimensions (learning rate, tree depth, regularization). Single training run: 4–6 hours on 10k TPUs. Algorithm selection: start with LambdaMART (gradient boosting, interpretable feature importance), advance to BERT-based rankers for complex query understanding. Error analysis: NDCG computed per query type (navigational, informational, commercial) to detect segment failures. Experiment tracking: MLflow-equivalent system logs 1000s of runs, enabling comparison across teams. A/B testing: only models passing offline threshold (NDCG +0.5%) advance to online testing (1% traffic for 2 weeks). Key learning: 90% of experiments show no improvement; persistence is critical.
Meta’s Recommendation System Development (50+ Models in Parallel): Meta’s Feed ranking involves 50+ deep learning models operating in concert (embedding models, pairwise rankers, aggregation models). Model development at scale: teams experiment with novel architectures weekly (DLRM enhancements, multi-task learning variations). Baseline: previous ensemble. Data scale: 1TB+ daily user interactions (impressions, engagements). Hyperparameter tuning: HyperOpt searching 20+ dimensions (embedding dimensions: 32–512, MLP layer sizes, learning rates: 0.0001–0.1, dropout: 0.0–0.5). Single training: 30 minutes on 128 A100 GPUs. Error analysis: segment-level performance (tracked separately for users by age, geography, device) revealed cold-start users underperform (new users have sparse features). Solution: content-based features added as fallback for cold users. Experiment tracking: every model version logged with offline metrics (NDCG, coverage) and online metrics (CTR, watch time). A/B testing mandatory: new models tested on 1% of 3B DAU before broader rollout.
Spotify’s Podcast Recommendation (50+ Architectures Over 6 Months): Spotify’s podcast recommendation team experimented with 50+ model architectures over 6 months to improve podcast discovery. Baseline: collaborative filtering (existing system). Data: 1B+ podcast listens per week. Candidates: neural collaborative filtering, graph neural networks, two-tower deep networks, content-based (episode transcript embeddings). Hyperparameter tuning: Random search (50 trials) for each architecture across learning rate, embedding dimension, loss function. Training per trial: 2 hours on 4 GPUs. Cross-validation: stratified by user (new vs returning), podcast (new vs established) to detect overfitting to popular content. Error analysis: classical CF performs well on popular podcasts, neural networks excel on niche podcasts. Winner: two-tower model (separate embeddings for user, podcast; learned via contrastive loss). Performance: 8% better precision@20 vs baseline. A/B test: 10% traffic for 3 weeks, measuring podcast save rate and listen completion.
DoorDash Delivery Time Estimation (Complex Features, Continuous Iteration): DoorDash’s model development for delivery time prediction evolved from simple linear regression (baseline RMSE 15 min) to gradient boosting (RMSE 8 min) to neural networks (RMSE 7 min). Experimentation: 100+ trials per month testing feature engineering, hyperparameter choices, ensemble strategies. Baseline: historical average delivery time per restaurant. Algorithm selection: XGBoost chosen over neural networks due to 3x faster inference (serving latency <50ms critical). Hyperparameter tuning: Bayesian optimization (Optuna) over max_depth (3–15), learning_rate (0.01–0.5), subsample (0.5–1.0). Training: 2 hours on 8 GPUs per trial. Error analysis: performance varies by restaurant type (Chinese takeout: 5% error; fine dining: 20% error due to prep time variation). Segment-level training: separate models for delivery vs pickup (very different time distributions). Cross-validation: time-series split (train on past 60 days, validate on next 14 days) to avoid future leakage. Experiment tracking: 10k runs logged in MLflow, enabling team to see which combinations work.
Netflix’s Recommendation Algorithm Research (Incremental Improvements Over Years): Netflix’s algorithm development for personalized ranking is decades-long (starting with Cinematch in 2006). Each year: 50–100 experiments exploring collaborative filtering refinements, content-based embeddings, contextual factors. Baseline: previous champion model. Data: 500B+ ratings and implicit signals (plays, pauses, skips). Algorithm selection: gradient boosting (stable, interpretable) vs neural CF vs graph neural networks (GNNs to model social influence). Hyperparameter tuning: random search (50 trials) for each candidate, focusing on regularization (model generalizes beyond training data). Training: 6 hours on 16 GPUs per trial. Error analysis: NDCG per member segment (by country, device, subscription tier) revealed mobile users prefer shorter movies (time-constrained). Cold-start strategies: new members receive popularity-based recommendations until enough behavior accumulated. A/B testing: new models tested on 5–10% of 250M members, measuring watch hours (business metric) not just NDCG (ML metric). Learning: marginal improvements (1–2%) compound over years.
References
- Hands-On Machine Learning (Aurélien Géron) — Algorithms, tuning, validation
- Hyperparameter Optimization with Bayesian Optimization (Snoek et al., 2012) — Theory behind Bayesian tuning
- Optuna Documentation — Modern hyperparameter optimizer
- MLflow Tracking — Experiment management
- XGBoost Tutorial (Tianqi Chen) — Best practices
- The Hundred-Page Machine Learning Book (Andriy Burkov) — Concise reference on model selection
- SHAP Documentation — Model explainability