Model Development
Systematic experimentation and iteration: Train candidate algorithms, tune hyperparameters, validate robustly, analyze failures, compare, and select the best model for production.
The Experimentation Loop
Model development is rapid iteration:
```
Baseline Model (simple, interpretable)
        ↓
Interpret results (what works? what fails?)
        ↓
Hypothesis (feature engineering, algorithm change)
        ↓
Experiment (train new model, compare)
        ↓
Better? → YES → Update best model
        ↓
        NO → Try different hypothesis
```
Each experiment takes hours to days; companies run 100+ experiments per week at scale.
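The loop above can be sketched as a simple driver that keeps the best model seen so far; the candidate list and synthetic data here are hypothetical stand-ins for your own experiments:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# Candidate hypotheses, ordered simple -> complex
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf_shallow": RandomForestClassifier(max_depth=3, random_state=42),
    "rf_deep": RandomForestClassifier(max_depth=10, random_state=42),
}

best_name, best_auc = None, 0.0
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    if auc > best_auc:  # Better? -> YES -> update best model
        best_name, best_auc = name, auc
    print(f"{name}: CV AUC {auc:.3f} (best so far: {best_name})")

print(f"Winner: {best_name} with CV AUC {best_auc:.3f}")
```

In practice each iteration of this loop is a tracked experiment (see Experiment Tracking below), not three lines in a dict.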
Algorithm Selection
Golden Rule: Start simple, add complexity only when needed.
Algorithm by Problem Type
| Problem Type | Algorithms | Data Size | Training Time | Interpretability |
|---|---|---|---|---|
| Binary Classification | Log Reg, Decision Tree, XGBoost, Neural Net | 1k–1M | sec–min | Tree > Log Reg > NN |
| Multiclass Classification | Same as binary | 1k–1M | sec–min | Same |
| Regression | Linear, Ridge, XGBoost, Neural Net | 1k–1M | sec–min | Linear > XGBoost |
| Ranking | LambdaMART, Neural Ranker (LTR) | 100k–10M | min–hour | LambdaMART |
| Clustering | K-Means, DBSCAN, Gaussian Mixture | 1k–1M | sec | K-Means |
| Anomaly Detection | Isolation Forest, Autoencoders | 10k–1M | min–hour | Isolation Forest |
Algorithm Selection Criteria
Data Size:
- Small (<10k): Linear models (logistic regression) or shallow trees (high-capacity models overfit small data)
- Medium (10k–1M): Gradient boosting (XGBoost, LightGBM) — best all-around
- Large (>1M): Neural networks (leverage large data) or ensemble methods
Latency Budget:
- <10ms: Tree-based (fast inference)
- 50–200ms: Gradient boosting or simple neural net
- >1s: Complex neural nets, ensembles
Data Type:
- Tabular: XGBoost, gradient boosting
- Images: Convolutional neural networks (ResNet, EfficientNet)
- Text: Transformers (BERT, GPT), gradient boosting with text features
- Time-series: RNNs (LSTM), Transformers, ARIMA
Interpretability Requirement:
- High (finance, healthcare): Linear models, decision trees, SHAP explanations
- Medium: XGBoost (feature importance available)
- Low: Neural networks, ensembles
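The criteria above can be folded into a rough rule-of-thumb helper. This is a sketch of the heuristics listed, not a definitive decision procedure; the thresholds mirror the bullets:

```python
def suggest_algorithm(n_rows: int, latency_ms: float, needs_interpretability: bool) -> str:
    """Rough heuristic combining data size, latency budget, and interpretability."""
    if needs_interpretability:
        return "linear model / decision tree (+ SHAP explanations)"
    if latency_ms < 10:
        return "tree-based (fast inference)"
    if n_rows < 10_000:
        return "linear model or shallow tree"
    if n_rows <= 1_000_000:
        return "gradient boosting (XGBoost / LightGBM)"
    return "neural network or ensemble"

print(suggest_algorithm(50_000, 100, False))     # medium data, relaxed latency
print(suggest_algorithm(5_000_000, 500, False))  # large data
```

Data type (tabular vs images vs text) usually overrides all of these: CNNs/Transformers for images and text regardless of size.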
Baseline Model
Always establish a baseline first. A baseline answers: “Is our fancy algorithm actually better than a simple one?”
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Baseline 1: Logistic regression (linear, interpretable)
lr_baseline = LogisticRegression(max_iter=1000)
lr_baseline.fit(X_train, y_train)
lr_auc = roc_auc_score(y_val, lr_baseline.predict_proba(X_val)[:, 1])
print(f"Baseline (Logistic Regression) AUC: {lr_auc:.3f}")

# Baseline 2: Decision tree (non-linear, simple)
dt_baseline = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_baseline.fit(X_train, y_train)
dt_auc = roc_auc_score(y_val, dt_baseline.predict_proba(X_val)[:, 1])
print(f"Baseline (Decision Tree) AUC: {dt_auc:.3f}")

# Record the baseline to beat
baseline_auc = max(lr_auc, dt_auc)
print(f"\nBaseline AUC to beat: {baseline_auc:.3f}")
```
Cross-Validation
Cross-validation protects against overfitting to a single validation split. It’s not optional; it’s mandatory.
K-Fold Cross-Validation
```python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Standard k-fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# Output like "CV AUC: 0.720 +/- 0.025" means the model averages 0.720 AUC,
# with a standard deviation of 0.025 across folds

# Stratified k-fold (preserves class balance per fold; use for imbalanced data)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
```
Time-Series Cross-Validation
For time-series, don’t use random CV — shuffling leaks future information into training. Use temporal splits:
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X_train):
    X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[test_idx]
    # Train on the past, validate on the future
```
Hyperparameter Tuning
Hyperparameters are knobs that control model complexity (learning rate, tree depth, regularization).
Tuning Strategies
Grid Search (Exhaustive):
```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.5],
    'n_estimators': [100, 200, 500]
}
grid_search = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1  # Use all cores in parallel
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
```
Random Search (Faster, good for large spaces):
```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    'max_depth': [3, 5, 7, 10, 15],
    'learning_rate': [0.001, 0.01, 0.1, 0.5],
    'min_child_weight': [1, 3, 5, 7],
    'subsample': [0.5, 0.7, 0.9, 1.0]
}
random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_dist,
    n_iter=20,  # Try 20 random combinations instead of all 320
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
```
Bayesian Optimization (Smart, efficient):
```python
from optuna import create_study
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.5),
        'n_estimators': trial.suggest_int('n_estimators', 50, 500)
    }
    model = XGBClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()

study = create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(f"Best AUC: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")
```
Error Analysis
After training, analyze where the model fails. This drives next experiments.
```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Predictions on validation set
y_pred = model.predict(X_val)
y_pred_proba = model.predict_proba(X_val)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_val, y_pred)
print(f"""
Confusion Matrix:
             Predicted No | Predicted Yes
Actual No:  {cm[0, 0]:7d} | {cm[0, 1]:7d}
Actual Yes: {cm[1, 0]:7d} | {cm[1, 1]:7d}
""")

# Classification report (precision, recall, F1 per class)
print(classification_report(y_val, y_pred))

# Error analysis: segment performance — where does the model fail most?
errors_df = X_val.copy()
errors_df['actual'] = y_val.values
errors_df['predicted'] = y_pred
errors_df['correct'] = (y_val.values == y_pred)

print("\nAccuracy by user age group:")
print(errors_df.groupby(pd.cut(errors_df['user_age'], bins=[0, 25, 35, 50, 100]))['correct'].mean())

print("\nAccuracy by item category:")
print(errors_df.groupby('item_category')['correct'].mean().sort_values())

# Identify underperforming segments → retrain with more data or a different approach
```
Learning Curves
Learning curves reveal whether the model needs more data or more regularization:
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training AUC')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation AUC')
plt.xlabel('Training Set Size')
plt.ylabel('AUC')
plt.legend()
plt.title('Learning Curve')
plt.show()

# Interpretation:
# - Large gap between train and val curves: overfitting → regularize or add data
# - Both curves low: underfitting → add features or use a more complex model
# - Both curves high and close: good! More data yields only marginal gains
```
Experiment Tracking
Problem: Without logging, you lose track of what worked.
Solution: MLflow (or Weights & Biases, Neptune)
```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Start experiment run
mlflow.start_run(run_name="xgb_v2_increased_depth")

# Log parameters
mlflow.log_params({
    'algorithm': 'XGBoost',
    'max_depth': 8,
    'learning_rate': 0.05,
    'n_estimators': 200
})

# Train and log metrics
model = XGBClassifier(max_depth=8, learning_rate=0.05, n_estimators=200)
model.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

mlflow.log_metrics({
    'train_auc': train_auc,
    'val_auc': val_auc,
    'test_auc': test_auc
})

# Log the model itself
mlflow.sklearn.log_model(model, 'model')

# Log artifacts (plots, reports)
mlflow.log_artifact('learning_curve.png')

mlflow.end_run()

# Later: view all experiments
mlflow.search_runs(experiment_names=['my_project'])
# Compare: XGB v1 (AUC 0.72) vs XGB v2 (AUC 0.75) → v2 is better
```
Model Comparison
After N experiments, select the best model:
| Model | Algorithm | Train AUC | Val AUC | Test AUC | Latency | Training Time | Complexity |
|---|---|---|---|---|---|---|---|
| Baseline | Logistic Reg | 0.680 | 0.672 | 0.675 | 1ms | 10s | Low |
| v1 | XGBoost (d=5) | 0.710 | 0.705 | 0.708 | 15ms | 1m | Medium |
| v2 | XGBoost (d=8) | 0.740 | 0.722 | 0.720 | 20ms | 2m | Medium |
| v3 | Neural Net (128-64) | 0.745 | 0.715 | 0.712 | 80ms | 10m | High |
| v4 | Ensemble (XGB+NN) | 0.750 | 0.725 | 0.723 | 90ms | 12m | Very High |
Decision: v2 wins — best test AUC within a reasonable latency budget (0.720 at 20ms). v3 overfits (train 0.745 vs test 0.712), and v4’s marginal test gain (+0.003) doesn’t justify 4.5x the latency.
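That selection rule — best held-out score subject to latency and overfitting constraints — can be sketched with pandas. The numbers come from the table above; the 50ms budget and 0.025 gap threshold are illustrative assumptions:

```python
import pandas as pd

# The comparison table from above, as data
runs = pd.DataFrame({
    "model":      ["baseline", "v1", "v2", "v3", "v4"],
    "train_auc":  [0.680, 0.710, 0.740, 0.745, 0.750],
    "test_auc":   [0.675, 0.708, 0.720, 0.712, 0.723],
    "latency_ms": [1, 15, 20, 80, 90],
})
runs["overfit_gap"] = runs["train_auc"] - runs["test_auc"]

# Constraints: latency budget 50ms, train/test gap under 0.025 (assumed thresholds)
eligible = runs[(runs["latency_ms"] <= 50) & (runs["overfit_gap"] < 0.025)]

# Among eligible models, pick the best test AUC
winner = eligible.loc[eligible["test_auc"].idxmax()]
print(winner["model"])  # v2
```

Encoding the rule as code makes the trade-off explicit and repeatable across future experiment rounds.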
Model Card & Documentation
Before deployment, document the model:
```markdown
# Model Card: Click Prediction v2

## Overview
XGBoost model predicting ad click probability.
Optimized for AUC (ranking quality); latency budget <50ms.

## Performance (Test Set)
- AUC: 0.720
- Precision@1%: 0.45 (45% of the top 1% of predictions are actual clicks)
- Recall@1%: 0.62 (the top 1% of predictions captures 62% of all clicks)

## Training Details
- Algorithm: XGBoost
- Hyperparameters: max_depth=8, learning_rate=0.05, n_estimators=200
- Training data: 80% of 100M user-item interactions (2025-01-01 to 2025-11-30)
- Training time: 2 hours on 8 GPUs

## Limitations
- Model trained on web traffic; mobile app performance unknown
- Performance varies by user segment (±5% by age group)
- Requires fresh user history; new users get baseline recommendations

## Fairness & Bias
- No demographic bias detected (age, gender have <2% AUC gap)
- Bias audit: every quarter on demographic slices

## Next Steps
- A/B test against baseline (20% traffic)
- Monitor for data drift (weekly dashboard)
- Retrain monthly with new feedback
```
Key Properties by Algorithm
| Algorithm | Training Time | Inference Time | Data Needs | Interpretability | Robustness |
|---|---|---|---|---|---|
| Logistic Regression | sec | <1ms | 1k examples | Excellent | Excellent |
| Decision Tree | sec | <1ms | 1k examples | Excellent | Poor |
| Random Forest | min | 10ms | 10k examples | Good | Good |
| XGBoost | min–hour | 20ms | 10k examples | Medium (SHAP) | Excellent |
| Neural Network | hour | 10–100ms | 100k examples | Poor (black box) | Variable |
| Transformer | hours | 100–1000ms | 1M examples | Poor | Variable |
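Inference-time figures like those in the table are easy to verify empirically for your own model. A minimal timing sketch on synthetic data (the model and sizes are arbitrary stand-ins):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Time repeated single-row predictions: the cost online serving pays per request
row = X[:1]
n = 1_000
start = time.perf_counter()
for _ in range(n):
    model.predict_proba(row)
latency_ms = (time.perf_counter() - start) / n * 1_000
print(f"Mean single-row latency: {latency_ms:.3f} ms")
```

Note that batch prediction amortizes per-call overhead, so benchmark with the batch size you'll actually serve.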
Implementation Example: Full Model Development
```python
import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier

# Set experiment
mlflow.set_experiment('click_prediction')

# Baseline
print("=== Baseline: Logistic Regression ===")
lr = LogisticRegression(max_iter=1000)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {lr_scores.mean():.3f}")
baseline_auc = lr_scores.mean()

# Experiment 1: XGBoost with default params
print("\n=== Exp 1: XGBoost (default) ===")
with mlflow.start_run(run_name="xgb_default"):
    xgb = XGBClassifier(random_state=42, n_jobs=-1)
    xgb_scores = cross_val_score(xgb, X_train, y_train, cv=5, scoring='roc_auc')
    mlflow.log_metric('cv_auc', xgb_scores.mean())
    print(f"CV AUC: {xgb_scores.mean():.3f}")
    if xgb_scores.mean() > baseline_auc:
        print("Improvement over baseline!")

# Experiment 2: XGBoost with tuned hyperparameters
print("\n=== Exp 2: XGBoost (tuned) ===")
with mlflow.start_run(run_name="xgb_tuned"):
    param_grid = {
        'max_depth': [5, 7, 8],
        'learning_rate': [0.05, 0.1],
        'n_estimators': [200, 300]
    }
    grid_search = GridSearchCV(
        XGBClassifier(random_state=42, n_jobs=-1),
        param_grid,
        cv=5,
        scoring='roc_auc',
        n_jobs=-1
    )
    grid_search.fit(X_train, y_train)
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_metric('cv_auc', grid_search.best_score_)
    print(f"Best CV AUC: {grid_search.best_score_:.3f}")
    print(f"Best params: {grid_search.best_params_}")

    # Evaluate on the held-out test set
    best_model = grid_search.best_estimator_
    test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
    mlflow.log_metric('test_auc', test_auc)
    print(f"Test AUC: {test_auc:.3f}")

# Compare experiments (run names live in the tags.mlflow.runName column)
print("\n=== Experiment Summary ===")
runs = mlflow.search_runs(experiment_names=['click_prediction'])
print(runs[['tags.mlflow.runName', 'metrics.cv_auc', 'metrics.test_auc']]
      .sort_values('metrics.test_auc', ascending=False))
```
How Real Companies Use This
Google Search Ranking (1000+ Experiments Per Year): Google’s search ranking system involves 1000+ A/B experiments yearly to improve NDCG (ranking quality). Model development cycle: teams propose ranking hypotheses, train candidate models on curated 10k query-document pairs (labels from human raters), evaluate offline using LambdaMART (learn-to-rank algorithm). Baseline: previous ranking model. Hyperparameter tuning via Vizier (Bayesian optimization) across 20+ dimensions (learning rate, tree depth, regularization). Single training run: 4–6 hours on 10k TPUs. Algorithm selection: start with LambdaMART (gradient boosting, interpretable feature importance), advance to BERT-based rankers for complex query understanding. Error analysis: NDCG computed per query type (navigational, informational, commercial) to detect segment failures. Experiment tracking: MLflow-equivalent system logs 1000s of runs, enabling comparison across teams. A/B testing: only models passing offline threshold (NDCG +0.5%) advance to online testing (1% traffic for 2 weeks). Key learning: 90% of experiments show no improvement; persistence is critical.
Meta’s Recommendation System Development (50+ Models in Parallel): Meta’s Feed ranking involves 50+ deep learning models operating in concert (embedding models, pairwise rankers, aggregation models). Model development at scale: teams experiment with novel architectures weekly (DLRM enhancements, multi-task learning variations). Baseline: previous ensemble. Data scale: 1TB+ daily user interactions (impressions, engagements). Hyperparameter tuning: HyperOpt searching 20+ dimensions (embedding dimensions: 32–512, MLP layer sizes, learning rates: 0.0001–0.1, dropout: 0.0–0.5). Single training: 30 minutes on 128 A100 GPUs. Error analysis: segment-level performance (tracked separately for users by age, geography, device) revealed cold-start users underperform (new users have sparse features). Solution: content-based features added as fallback for cold users. Experiment tracking: every model version logged with offline metrics (NDCG, coverage) and online metrics (CTR, watch time). A/B testing mandatory: new models tested on 1% of 3B DAU before broader rollout.
Spotify’s Podcast Recommendation (50+ Architectures Over 6 Months): Spotify’s podcast recommendation team experimented with 50+ model architectures over 6 months to improve podcast discovery. Baseline: collaborative filtering (existing system). Data: 1B+ podcast listens per week. Candidates: neural collaborative filtering, graph neural networks, two-tower deep networks, content-based (episode transcript embeddings). Hyperparameter tuning: Random search (50 trials) for each architecture across learning rate, embedding dimension, loss function. Training per trial: 2 hours on 4 GPUs. Cross-validation: stratified by user (new vs returning), podcast (new vs established) to detect overfitting to popular content. Error analysis: classical CF performs well on popular podcasts, neural networks excel on niche podcasts. Winner: two-tower model (separate embeddings for user, podcast; learned via contrastive loss). Performance: 8% better precision@20 vs baseline. A/B test: 10% traffic for 3 weeks, measuring podcast save rate and listen completion.
DoorDash Delivery Time Estimation (Complex Features, Continuous Iteration): DoorDash’s model development for delivery time prediction evolved from simple linear regression (baseline RMSE 15 min) to gradient boosting (RMSE 8 min) to neural networks (RMSE 7 min). Experimentation: 100+ trials per month testing feature engineering, hyperparameter choices, ensemble strategies. Baseline: historical average delivery time per restaurant. Algorithm selection: XGBoost chosen over neural networks due to 3x faster inference (serving latency <50ms critical). Hyperparameter tuning: Bayesian optimization (Optuna) over max_depth (3–15), learning_rate (0.01–0.5), subsample (0.5–1.0). Training: 2 hours on 8 GPUs per trial. Error analysis: performance varies by restaurant type (Chinese takeout: 5% error; fine dining: 20% error due to prep time variation). Segment-level training: separate models for delivery vs pickup (very different time distributions). Cross-validation: time-series split (train on past 60 days, validate on next 14 days) to avoid future leakage. Experiment tracking: 10k runs logged in MLflow, enabling team to see which combinations work.
Netflix’s Recommendation Algorithm Research (Incremental Improvements Over Years): Netflix’s algorithm development for personalized ranking is decades-long (starting with Cinematch in 2006). Each year: 50–100 experiments exploring collaborative filtering refinements, content-based embeddings, contextual factors. Baseline: previous champion model. Data: 500B+ ratings and implicit signals (plays, pauses, skips). Algorithm selection: gradient boosting (stable, interpretable) vs neural CF vs graph neural networks (GNNs to model social influence). Hyperparameter tuning: random search (50 trials) for each candidate, focusing on regularization (model generalizes beyond training data). Training: 6 hours on 16 GPUs per trial. Error analysis: NDCG per member segment (by country, device, subscription tier) revealed mobile users prefer shorter movies (time-constrained). Cold-start strategies: new members receive popularity-based recommendations until enough behavior accumulated. A/B testing: new models tested on 5–10% of 250M members, measuring watch hours (business metric) not just NDCG (ML metric). Learning: marginal improvements (1–2%) compound over years.
References
- Hands-On Machine Learning (Aurélien Géron) — Algorithms, tuning, validation
- Hyperparameter Optimization with Bayesian Optimization (Snoek et al., 2012) — Theory behind Bayesian tuning
- Optuna Documentation — Modern hyperparameter optimizer
- MLflow Tracking — Experiment management
- XGBoost Tutorial (Tianqi Chen) — Best practices
- The Hundred-Page Machine Learning Book (Andriy Burkov) — Concise reference on model selection
- SHAP Documentation — Model explainability