Evaluation Metrics
How to measure what matters: Choosing the right metric is as important as the algorithm. Wrong metric → wrong optimization → wrong outcomes.
Metric Selection Guide
| Problem Type | Primary Metric | Secondary | When to Use |
|---|---|---|---|
| Binary Classification | ROC-AUC or PR-AUC | Precision, Recall, F1 | Imbalanced classes (fraud, disease) |
| Multiclass | Macro F1, Top-1 accuracy | Per-class recall | Multiple unequal importance classes |
| Regression | RMSE or R² | MAE, MAPE | Continuous prediction (price, demand) |
| Ranking | NDCG, MRR | MAP, HR@k | Top-k results matter (search, recommendations) |
| Clustering | Silhouette, Davies-Bouldin | Homogeneity, Completeness | No labeled data to evaluate |
| NLP | BLEU, ROUGE, METEOR | Perplexity, Exact Match | Machine translation, summarization, QA |
| Object Detection | mAP (mean Average Precision) | Precision/Recall curves | Images with multiple objects |
Classification Metrics
Confusion Matrix Foundation
For binary classification, 4 outcomes:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

- TP = True Positive (actual positive, correctly predicted positive)
- FP = False Positive (actual negative, incorrectly predicted positive)
- TN = True Negative (actual negative, correctly predicted negative)
- FN = False Negative (actual positive, incorrectly predicted negative)
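The four cells can be counted with a few lines of plain Python. This is a minimal sketch; libraries such as scikit-learn provide `confusion_matrix` for the same job.

```python
# Count the four confusion-matrix cells from paired label lists (1 = positive).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(tp, fp, tn, fn)  # 2 0 2 1
```

Every metric below (accuracy, precision, recall, F1) is a ratio of these four counts.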
Key Metrics
Accuracy = (TP + TN) / (TP + FP + FN + TN)
- Definition: Fraction of correct predictions
- Use: Balanced classes only
- Problem: Ignores class imbalance. If 99% negative, always predicting “negative” gets 99% accuracy!
Precision = TP / (TP + FP)
- Definition: Of predicted positives, how many are correct?
- Use: Cost of false positives is high (medical diagnosis: false alarm wastes resources)
- Question: “If model says positive, what’s the probability it’s actually positive?”
Recall = TP / (TP + FN)
- Definition: Of actual positives, how many did we catch?
- Use: Cost of false negatives is high (cancer screening: miss = death)
- Question: “Of all actual positives, what fraction did we find?”
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
- Definition: Harmonic mean of precision and recall
- Use: Balanced score when both matter
- Range: 0 to 1 (1 = perfect)
ROC-AUC = Area Under the ROC Curve
- Definition: Plots true positive rate (recall) vs false positive rate across all thresholds
- Use: Robust to class imbalance, threshold-independent
- Production: Most common metric for imbalanced problems
PR-AUC = Area Under the Precision-Recall Curve
- Definition: Plots precision vs recall across thresholds
- Use: Better than ROC-AUC for highly imbalanced classes
- When: Positives are rare (fraud: 0.1%, disease: 1%)
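The ROC-AUC vs PR-AUC gap is easy to see on synthetic data. The sketch below builds a 1%-positive problem and scores it both ways with scikit-learn; the class prior, score distributions, and seed are illustrative assumptions, not real fraud data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic 1%-positive problem: positives receive somewhat higher scores.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
scores = np.where(y == 1,
                  rng.normal(0.7, 0.3, y.size),
                  rng.normal(0.3, 0.3, y.size))

roc = roc_auc_score(y, scores)
pr = average_precision_score(y, scores)  # a standard PR-AUC estimate
print(f"ROC-AUC: {roc:.3f}  PR-AUC: {pr:.3f}")  # PR-AUC comes out far lower
```

The same model looks strong by ROC-AUC but much weaker by PR-AUC, because with rare positives even a small false positive rate swamps precision.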
Regression Metrics
Mean Absolute Error (MAE) = (1/n) × Σ|y_actual − y_pred|
- Definition: Average absolute error
- Units: Same as y (interpretable)
- Use: Robust to outliers
Root Mean Squared Error (RMSE) = √((1/n) × Σ(y_actual - y_pred)²)
- Definition: Square root of average squared error
- Units: Same as y
- Use: Penalizes large errors (outliers matter more than MAE)
R² = 1 - (SS_res / SS_tot)
- Definition: Fraction of variance explained
- Range: ≤ 1 (1 = perfect fit, 0 = no better than predicting the mean, negative = worse than the mean)
- Use: Standardized, easy to compare across datasets
Mean Absolute Percentage Error (MAPE) = (1/n) × Σ(|y_actual − y_pred| / |y_actual|) × 100%
- Definition: Average percentage error
- Units: Percentage
- Use: Comparing across different scales (e.g., small vs large house prices)
- Problem: Undefined when y_actual = 0
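Because MAPE is undefined at y_actual = 0, any implementation must pick a zero-handling policy. A minimal sketch that skips zero targets (one possible policy; others add a small epsilon instead):

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error; skips targets that are exactly 0."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)

# Errors of 10%, 10%, and 5% (the zero target is skipped) average to ~8.33%.
print(mape([100, 200, 0, 400], [110, 180, 5, 380]))
```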
Production Example: Uber-style delivery-time prediction. An RMSE of 2 minutes means typical errors are around two minutes; if errors are roughly normally distributed, about 95% of predictions fall within ±4 minutes (two RMSEs).
Ranking / Recommendation Metrics
Precision@k = (# relevant items in top-k) / k
- Example: Recommending 5 movies, 3 are relevant → Precision@5 = 60%
Recall@k = (# relevant in top-k) / (# total relevant)
- Example: 10 movies user likes, 6 in top-20 recommendations → Recall@20 = 60%
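Both @k metrics are a few lines of plain Python. A sketch using hypothetical movie IDs (the `m1`-style names are made up for illustration):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)

recs = ["m1", "m2", "m3", "m4", "m5"]   # ranked recommendations
liked = {"m1", "m3", "m5", "m9"}        # items the user actually likes

print(precision_at_k(recs, liked, 5))   # 3 of 5 recommended are relevant -> 0.6
print(recall_at_k(recs, liked, 5))      # 3 of 4 relevant were found -> 0.75
```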
NDCG@k (Normalized Discounted Cumulative Gain)
- Definition: Position matters. Relevant at position 1 > position 10
- Formula: DCG = Σ(rel_i / log₂(i+1)) normalized by ideal ranking
- Use: More sophisticated than Precision@k
- Production: Google, Netflix, Amazon use NDCG for ranking evaluation
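The DCG formula above translates directly into code. A sketch with made-up relevance gradings (3 = highly relevant, 0 = irrelevant):

```python
import math

def dcg(relevances):
    # 1-based position i; gain at each position is discounted by log2(i + 1)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 3, 2, 1, 0]))  # already ideally ordered -> 1.0
print(ndcg([3, 2, 3, 0, 1]))  # a rel-3 item demoted to position 3 lowers the score
```

Swapping two items near the top of the list hurts NDCG more than the same swap near the bottom, which is exactly the "position matters" property.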
Hit Rate@k = P(relevant item in top-k)
- Simpler than recall. Just “was there anything good?”
NLP-Specific Metrics
BLEU (Bilingual Evaluation Understudy)
- Definition: Compares n-gram overlap between generated and reference text
- Range: 0 to 1
- Use: Machine translation quality
- Caveat: Correlation with human judgment is moderate (0.4–0.6)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Definition: Compares reference summaries with system summaries
- Variants: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
- Use: Summarization quality
- Production: Google News, Facebook news feed use ROUGE variants
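ROUGE-L reduces to a longest-common-subsequence computation over tokens. A simplified recall-only sketch (real ROUGE tooling also reports precision and an F-measure, and applies stemming):

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    return lcs_len(ref, cand) / len(ref)

# 5 of the 6 reference tokens appear in order in the candidate -> 5/6.
print(rouge_l_recall("the cat sat on the mat", "the cat lay on the mat"))
```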
METEOR (Metric for Evaluation of Translation with Explicit Ordering)
- Definition: More sophisticated than BLEU. Aligns n-grams, accounts for synonyms, paraphrases
- Use: Machine translation (better correlation with human judgment than BLEU)
Perplexity
- Definition: 2^(−(1/N) × Σ log₂ P(wordᵢ)), i.e. the exponentiated average negative log-probability
- Interpretation: “How surprised is the model on average?” Lower is better
- Use: Language model evaluation
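A minimal sketch of the definition, fed the per-word probabilities a model assigned to a held-out sequence:

```python
import math

def perplexity(probabilities):
    """2 raised to the negative average log2 probability per word."""
    n = len(probabilities)
    return 2 ** (-sum(math.log2(p) for p in probabilities) / n)

# A model that assigns every word probability 1/4 has perplexity 4:
# it is, on average, as "surprised" as a uniform choice over 4 words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```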
Exact Match (EM) + F1
- Definition: Perfect match vs best partial match
- Use: Question-answering evaluation (e.g., SQuAD benchmark)
- Production: Google Assistant, Amazon Alexa use F1 for intent matching
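The token-level F1 used in SQuAD-style evaluation can be sketched as follows (the official SQuAD script also strips articles and punctuation; this sketch only lowercases):

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    common = Counter(pred) & Counter(gold)   # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Partial credit: 2 of 2 predicted tokens match, 2 of 3 gold tokens found -> 0.8.
print(token_f1("Barack Obama", "President Barack Obama"))
```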
Clustering Metrics (No Labels)
Silhouette Coefficient
- Definition: Measures cohesion (how close each point is to the rest of its own cluster) vs separation (how far it is from the nearest other cluster)
- Range: -1 to 1 (1 = well-separated clusters)
- Formula: (b - a) / max(a, b) where a = avg distance to cluster points, b = avg distance to nearest other cluster
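scikit-learn computes this score directly. A toy sanity check (the blob coordinates are made up) showing that two tight, well-separated clusters score close to 1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight 2-D blobs far apart: cohesion high, separation high.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

score = silhouette_score(X, labels)
print(score)  # close to 1 for well-separated clusters
```

Shuffling the labels so each "cluster" mixes both blobs drives the score toward -1.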
Davies-Bouldin Index
- Definition: Ratio of within-cluster to between-cluster distances
- Lower is better
- Use: Evaluation without labels
Calinski-Harabasz Index
- Definition: Ratio of between-cluster spread to within-cluster spread
- Higher is better
- Use: Quick evaluation metric
Production Metrics
Beyond model accuracy, production ML tracks:
Latency - Inference time per request
- SLA: <100ms for real-time (recommendations, autocomplete)
- Batch: <1s per 1000 examples
Throughput - Requests per second
- Stripe: 10K+ fraud detection requests/sec
- Google: 1M+ searches/sec
Cost
- Per-inference cost (infrastructure, compute)
- Per-model cost (training, retraining, maintenance)
Fairness Metrics
- Disparate Impact: Does model favor one demographic?
- Calibration: P(positive | score = s) ≈ s (a prediction of 0.8 should be correct about 80% of the time)
Data Drift
- Monitors change in input distribution
- Triggers retraining if drift > threshold
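One common drift check is a two-sample Kolmogorov-Smirnov test per feature, comparing training-time and live distributions. A sketch with synthetic data; the 0.01 p-value threshold and the simulated mean shift are assumptions, not universal settings:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 5000)  # feature distribution at training time
live_feature = rng.normal(0.5, 1.0, 5000)   # live traffic: the mean has shifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # threshold is a policy choice
    print(f"drift detected (KS statistic={stat:.3f}), consider retraining")
```

In production this runs per feature on a schedule; a low p-value (or a large KS statistic) flags the feature for investigation or triggers retraining.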
Metric Choice Examples
Fraud Detection (Stripe)
- Primary: Precision (false positives block legitimate transactions)
- Secondary: Recall (false negatives = fraud loss)
- Metric: PR-AUC (data is 99.9% legitimate, 0.1% fraud)
- Target: 95% precision, 75% recall
Disease Screening (Healthcare)
- Primary: Recall (miss = death)
- Secondary: Precision (false positive = unnecessary treatment)
- Metric: Sensitivity (recall), Specificity (TNR)
- Target: >99% recall (catch all positives)
Product Recommendation (Amazon)
- Primary: Conversion rate (what % buy after recommendation?)
- Secondary: NDCG@5 (ranking quality)
- Metric: A/B testing + NDCG
- Target: Lift >2% vs baseline
Machine Translation (Google Translate)
- Primary: BLEU, METEOR (correlation with human judgment)
- Secondary: Inference latency (<500ms)
- Production: Human raters score translation quality 1-5, correlation with BLEU ~0.5
Implementation
```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, mean_absolute_error,
                             mean_squared_error, r2_score, silhouette_score)

# Classification
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
accuracy_score(y_true, y_pred)   # 0.8
precision_score(y_true, y_pred)  # 1.0
recall_score(y_true, y_pred)     # 0.67
f1_score(y_true, y_pred)         # 0.8

# ROC-AUC needs predicted probabilities (or scores), not hard labels
y_proba = [0.9, 0.2, 0.4, 0.8, 0.1]
roc_auc_score(y_true, y_proba)   # area under the ROC curve

# Regression
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)          # 0.5
mean_squared_error(y_true, y_pred)           # 0.375
np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = 0.612
r2_score(y_true, y_pred)                     # 0.948

# Clustering (no labels): X is a feature matrix, labels are cluster assignments
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
labels = [0, 0, 1, 1]
silhouette_score(X, labels)  # ranges from -1 to 1
```
References
- 📄 Precision and Recall (Wikipedia)
- 📄 Receiver Operating Characteristic (Wikipedia)
- 📄 BLEU: A Method for Automatic Evaluation (Papineni et al., 2002)
- 📄 ROUGE: A Package for Automatic Evaluation (Lin, 2004)
- 📖 Scikit-learn Metrics
- 🎥 Fast.ai: Metrics & Validation