
Evaluation Metrics

How to measure what matters: Choosing the right metric is as important as the algorithm. Wrong metric → wrong optimization → wrong outcomes.

Metric Selection Guide

| Problem Type | Primary Metric | Secondary | When to Use |
|---|---|---|---|
| Binary classification | ROC-AUC or PR-AUC | Precision, Recall, F1 | Imbalanced classes (fraud, disease) |
| Multiclass | Macro F1, Top-1 accuracy | Per-class recall | Multiple classes of unequal importance |
| Regression | RMSE or R² | MAE, MAPE | Continuous prediction (price, demand) |
| Ranking | NDCG, MRR | MAP, HR@k | Top-k results matter (search, recommendations) |
| Clustering | Silhouette, Davies-Bouldin | Homogeneity, Completeness | No labeled data to evaluate |
| NLP | BLEU, ROUGE, METEOR | Perplexity, Exact Match | Machine translation, summarization, QA |
| Object Detection | mAP (mean Average Precision) | Precision/Recall curves | Images with multiple objects |

Classification Metrics

Confusion Matrix Foundation

For binary classification, 4 outcomes:

                Predicted
                Positive  Negative
Actual Positive    TP       FN
       Negative    FP       TN

TP = True Positive   (correct prediction, positive class)
FP = False Positive  (incorrect, predicted positive)
TN = True Negative   (correct, predicted negative)
FN = False Negative  (incorrect, predicted negative)

Key Metrics

Accuracy = (TP + TN) / (TP + FP + FN + TN)

  • Definition: Fraction of correct predictions
  • Use: Balanced classes only
  • Problem: Ignores class imbalance. If 99% negative, always predicting “negative” gets 99% accuracy!

Precision = TP / (TP + FP)

  • Definition: Of predicted positives, how many are correct?
  • Use: Cost of false positives is high (medical screening: a false alarm triggers unnecessary follow-up tests and anxiety)
  • Question: “If model says positive, what’s the probability it’s actually positive?”

Recall = TP / (TP + FN)

  • Definition: Of actual positives, how many did we catch?
  • Use: Cost of false negatives is high (cancer screening: miss = death)
  • Question: “Of all actual positives, what fraction did we find?”

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

  • Definition: Harmonic mean of precision and recall
  • Use: Balanced score when both matter
  • Range: 0 to 1 (1 = perfect)
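
The four counts from the confusion matrix plug directly into these formulas. A minimal sketch with invented counts, showing why accuracy flatters an imbalanced classifier:

```python
# Hypothetical confusion-matrix counts for an imbalanced problem (not from a real model)
tp, fp, fn, tn = 80, 20, 10, 890

accuracy = (tp + tn) / (tp + fp + fn + tn)           # 0.97: looks great...
precision = tp / (tp + fp)                           # 0.80
recall = tp / (tp + fn)                              # ~0.89
f1 = 2 * precision * recall / (precision + recall)   # ~0.84
```

Accuracy (0.97) is inflated by the 890 easy true negatives; precision, recall, and F1 tell the real story.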

ROC-AUC = Area Under the ROC Curve

  • Definition: Plots true positive rate (recall) vs false positive rate across all thresholds
  • Use: Robust to class imbalance, threshold-independent
  • Production: Most common metric for imbalanced problems

PR-AUC = Area Under the Precision-Recall Curve

  • Definition: Plots precision vs recall across thresholds
  • Use: Better than ROC-AUC for highly imbalanced classes
  • When: Positives are rare (fraud: 0.1%, disease: 1%)
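
The contrast shows up clearly on synthetic data. This sketch (all numbers invented) builds a 1%-positive problem with a weakly informative score: ROC-AUC looks healthy while PR-AUC stays close to the tiny base rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% positives
scores = rng.random(10_000) + 0.3 * y_true         # positives get a small score bump

roc = roc_auc_score(y_true, scores)                # well above 0.5
pr = average_precision_score(y_true, scores)       # PR-AUC (average precision): far lower
```

`average_precision_score` is scikit-learn's standard PR-AUC estimate; reporting both metrics is usually the safest choice for rare-positive problems.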

Regression Metrics

Mean Absolute Error (MAE) = (1/n) × Σ|y_actual - y_pred|

  • Definition: Average absolute error
  • Units: Same as y (interpretable)
  • Use: Robust to outliers

Root Mean Squared Error (RMSE) = √((1/n) × Σ(y_actual - y_pred)²)

  • Definition: Square root of average squared error
  • Units: Same as y
  • Use: Penalizes large errors (outliers matter more than MAE)

R² = 1 - (SS_res / SS_tot)

  • Definition: Fraction of variance explained
  • Range: ≤ 1 (1 = perfect fit, 0 = no better than predicting the mean; negative if worse than the mean)
  • Use: Standardized, easy to compare across datasets

Mean Absolute Percentage Error (MAPE) = (1/n) × Σ|(y_actual - y_pred) / y_actual| × 100%

  • Definition: Average percentage error
  • Units: Percentage
  • Use: Comparing across different scales (e.g., small vs large house prices)
  • Problem: Undefined when y_actual = 0
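
All four regression metrics follow directly from their formulas. A small numpy sketch with invented actual/predicted values:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])   # invented actuals
y_pred = np.array([110.0, 190.0, 330.0, 380.0])   # invented predictions

mae = np.mean(np.abs(y_true - y_pred))                       # 17.5
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))              # ~19.36 (> MAE: the 30-unit miss weighs more)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # 0.97
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100     # 7.5 (%); breaks if any y_true == 0
```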

Production Example: Uber predicts delivery time. An RMSE of 2 minutes means predictions miss by about 2 minutes on average; if errors are roughly normally distributed, about 95% of deliveries land within ±4 minutes (two RMSEs) of the estimate.


Ranking / Recommendation Metrics

Precision@k = (# relevant items in top-k) / k

  • Example: Recommending 5 movies, 3 are relevant → Precision@5 = 60%

Recall@k = (# relevant in top-k) / (# total relevant)

  • Example: 10 movies user likes, 6 in top-20 recommendations → Recall@20 = 60%

NDCG@k (Normalized Discounted Cumulative Gain)

  • Definition: Position matters. Relevant at position 1 > position 10
  • Formula: DCG = Σ(rel_i / log₂(i+1)) normalized by ideal ranking
  • Use: More sophisticated than Precision@k
  • Production: Google, Netflix, Amazon use NDCG for ranking evaluation

Hit Rate@k = P(relevant item in top-k)

  • Simpler than recall. Just “was there anything good?”
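
These ranking metrics are short enough to implement from their definitions. A sketch with invented item IDs and relevance grades, using the linear-gain DCG variant from the formula above:

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def dcg_at_k(relevances, k):
    """DCG with linear gains: rel_i / log2(i + 1), positions starting at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(recommended, relevance, k):
    """relevance maps item -> graded relevance (missing items count as 0)."""
    gains = [relevance.get(item, 0) for item in recommended]
    idcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

recommended = ["a", "b", "c", "d", "e"]   # invented ranking
relevance = {"a": 3, "c": 2, "f": 3}      # "f" is relevant but was never recommended

p5 = precision_at_k(recommended, set(relevance), 5)  # 2 of 5 relevant -> 0.4
n5 = ndcg_at_k(recommended, relevance, 5)            # < 1: "f" is missed and "c" sits at rank 3
```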

NLP-Specific Metrics

BLEU (Bilingual Evaluation Understudy)

  • Definition: Compares n-gram overlap between generated and reference text
  • Range: 0 to 1
  • Use: Machine translation quality
  • Caveat: Correlation with human judgment is moderate (0.4–0.6)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • Definition: Compares reference summaries with system summaries
  • Variants: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
  • Use: Summarization quality
  • Production: Google News, Facebook news feed use ROUGE variants

METEOR (Metric for Evaluation of Translation with Explicit Ordering)

  • Definition: More sophisticated than BLEU. Aligns n-grams, accounts for synonyms, paraphrases
  • Use: Machine translation (better correlation with human judgment than BLEU)

Perplexity

  • Definition: 2^(-(1/N) Σ log₂ P(word_i)), the exponentiated average negative log-likelihood
  • Interpretation: “How surprised is the model on average?” Lower is better
  • Use: Language model evaluation
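
A tiny illustration of the formula, using invented per-token probabilities:

```python
import math

token_probs = [0.25, 0.5, 0.125, 0.25, 0.5]   # invented P(word_i | context) values

avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / len(token_probs)
perplexity = 2 ** avg_neg_log2                 # 2**1.8, roughly 3.48
```

A perplexity near 3.5 means the model is, on average, about as uncertain as a uniform choice among ~3.5 words.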

Exact Match (EM) + F1

  • Definition: EM scores 1 only for an exact string match; F1 gives partial credit via token overlap
  • Use: Question-answering evaluation (e.g., SQuAD benchmark)
  • Production: Google Assistant, Amazon Alexa use F1 for intent matching
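
A simplified sketch of both scores in the style of SQuAD evaluation (the official script also strips articles and punctuation; this version only lowercases and splits on whitespace):

```python
def exact_match(prediction, reference):
    """1 if the strings match after lowercasing, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Token-overlap F1: partial credit when answers overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    remaining = list(ref_tokens)
    common = 0
    for tok in pred_tokens:
        if tok in remaining:          # count each reference token at most once
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the eiffel tower", "The Eiffel Tower")   # 1
f1 = token_f1("in the city of paris", "paris")             # 1/3: right answer, too verbose
```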

Clustering Metrics (No Labels)

Silhouette Coefficient

  • Definition: Measures cohesion (points close to the rest of their own cluster) vs separation (clusters far apart)
  • Range: -1 to 1 (1 = well-separated clusters)
  • Formula: (b - a) / max(a, b) where a = avg distance to cluster points, b = avg distance to nearest other cluster
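
The formula applied to a single point, with two invented 1-D clusters:

```python
import numpy as np

cluster_a = np.array([1.0, 1.2, 0.8])   # the point's own cluster
cluster_b = np.array([5.0, 5.5, 4.5])   # nearest other cluster

point = cluster_a[0]
a = np.mean(np.abs(cluster_a[1:] - point))   # avg distance within own cluster: 0.2
b = np.mean(np.abs(cluster_b - point))       # avg distance to other cluster: 4.0
s = (b - a) / max(a, b)                      # 0.95: well separated
```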

Davies-Bouldin Index

  • Definition: Ratio of within-cluster to between-cluster distances
  • Lower is better
  • Use: Evaluation without labels

Calinski-Harabasz Index

  • Definition: Ratio of between-cluster spread to within-cluster spread
  • Higher is better
  • Use: Quick evaluation metric

Production Metrics

Beyond model accuracy, production ML tracks:

Latency - Inference time per request

  • SLA: <100ms for real-time (recommendations, autocomplete)
  • Batch: <1s per 1000 examples

Throughput - Requests per second

  • Stripe: 10K+ fraud detection requests/sec
  • Google: 1M+ searches/sec

Cost

  • Per-inference cost (infrastructure, compute)
  • Per-model cost (training, retraining, maintenance)

Fairness Metrics

  • Disparate Impact: Does model favor one demographic?
  • Calibration: P(positive | score = s) ≈ s

Data Drift

  • Monitors change in input distribution
  • Triggers retraining if drift > threshold
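
One widely used drift score is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against live traffic. A minimal numpy sketch (the data, bin count, and thresholds here are illustrative, and binning conventions vary):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 retrain."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)  # live values outside the reference range are dropped
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 5_000)   # feature distribution at training time
live = rng.normal(0.5, 1.0, 5_000)    # live traffic shifted by half a standard deviation
```

Here `psi(train, train)` is 0 by construction, while `psi(train, live)` lands well above the 0.1 warning level.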

Metric Choice Examples

Fraud Detection (Stripe)

  • Primary: Precision (false positives block legitimate transactions)
  • Secondary: Recall (false negatives = fraud loss)
  • Metric: PR-AUC (data is 99.9% legitimate, 0.1% fraud)
  • Target: 95% precision, 75% recall
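
Hitting a target like "95% precision" means choosing a decision threshold from the precision-recall curve rather than using the default 0.5. A deterministic toy sketch (labels and scores invented, not Stripe's actual system):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.8, 0.6, 0.7, 0.9, 0.95])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Smallest threshold whose precision meets the target (precision[:-1] aligns with thresholds).
# Precision is not monotonic in the threshold, so production code often picks conservatively.
ok = precision[:-1] >= 0.95
threshold = thresholds[ok][0]            # 0.9 on this toy data
recall_at_target = recall[:-1][ok][0]    # 0.5: half the fraud is caught at 95%+ precision
```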

Disease Screening (Healthcare)

  • Primary: Recall (miss = death)
  • Secondary: Precision (false positive = unnecessary treatment)
  • Metric: Sensitivity (recall), Specificity (TNR)
  • Target: >99% recall (catch all positives)

Product Recommendation (Amazon)

  • Primary: Conversion rate (what % buy after recommendation?)
  • Secondary: NDCG@5 (ranking quality)
  • Metric: A/B testing + NDCG
  • Target: Lift >2% vs baseline

Machine Translation (Google Translate)

  • Primary: BLEU, METEOR (correlation with human judgment)
  • Secondary: Inference latency (<500ms)
  • Production: Human raters score translation quality 1-5, correlation with BLEU ~0.5

Implementation

import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, mean_absolute_error, mean_squared_error,
    r2_score, silhouette_score,
)

# Classification
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

accuracy_score(y_true, y_pred)                    # 0.8
precision_score(y_true, y_pred)                   # 1.0
recall_score(y_true, y_pred)                      # 0.67
f1_score(y_true, y_pred)                          # 0.8

# ROC-AUC needs scores/probabilities, not hard labels
y_pred_proba = [0.9, 0.2, 0.4, 0.8, 0.1]
roc_auc_score(y_true, y_pred_proba)               # 1.0 on this toy data

# Regression
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mean_absolute_error(y_true, y_pred)               # 0.5
mean_squared_error(y_true, y_pred)                # 0.375
np.sqrt(mean_squared_error(y_true, y_pred))       # RMSE = 0.612
r2_score(y_true, y_pred)                          # 0.948

# Clustering (no labels): given a feature matrix X and cluster assignments `labels`
# silhouette_score(X, labels)                     # -1 to 1

References

📄 Precision and Recall (Wikipedia)
📄 Receiver Operating Characteristic (Wikipedia)
📄 BLEU: A Method for Automatic Evaluation of Machine Translation (Papineni et al., 2002)
📄 ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)
📖 Scikit-learn Metrics documentation
🎥 Fast.ai: Metrics & Validation

This post is licensed under CC BY 4.0 by the author.