
Evaluation Metrics

How to measure what matters: Choosing the right metric is as important as the algorithm. Wrong metric → wrong optimization → wrong outcomes.

Metric Selection Guide

| Problem Type | Primary Metric | Secondary | When to Use |
|---|---|---|---|
| Binary classification | ROC-AUC or PR-AUC | Precision, Recall, F1 | Imbalanced classes (fraud, disease) |
| Multiclass | Macro F1, Top-1 accuracy | Per-class recall | Multiple classes of unequal importance |
| Regression | RMSE or R² | MAE, MAPE | Continuous prediction (price, demand) |
| Ranking | NDCG, MRR | MAP, HR@k | Top-k results matter (search, recommendations) |
| Clustering | Silhouette, Davies-Bouldin | Homogeneity, Completeness | No labeled data to evaluate |
| NLP | BLEU, ROUGE, METEOR | Perplexity, Exact Match | Machine translation, summarization, QA |
| Object Detection | mAP (mean Average Precision) | Precision/Recall curves | Images with multiple objects |

Classification Metrics

Confusion Matrix Foundation

For binary classification, 4 outcomes:

                Predicted
                Positive  Negative
Actual Positive    TP       FN
       Negative    FP       TN

TP = True Positive   (correct prediction, positive class)
FP = False Positive  (incorrect, predicted positive)
TN = True Negative   (correct, predicted negative)
FN = False Negative  (incorrect, predicted negative)

Key Metrics

Accuracy = (TP + TN) / (TP + FP + FN + TN)

  • Definition: Fraction of correct predictions
  • Use: Balanced classes only
  • Problem: Ignores class imbalance. If 99% negative, always predicting “negative” gets 99% accuracy!

Precision = TP / (TP + FP)

  • Definition: Of predicted positives, how many are correct?
  • Use: Cost of false positives is high (medical screening: a false alarm triggers unnecessary follow-up tests and anxiety)
  • Question: “If model says positive, what’s the probability it’s actually positive?”

Recall = TP / (TP + FN)

  • Definition: Of actual positives, how many did we catch?
  • Use: Cost of false negatives is high (cancer screening: miss = death)
  • Question: “Of all actual positives, what fraction did we find?”

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

  • Definition: Harmonic mean of precision and recall
  • Use: Balanced score when both matter
  • Range: 0 to 1 (1 = perfect)
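
The four counts from the confusion matrix plug directly into these formulas. A minimal sketch with invented counts, showing why accuracy flatters an imbalanced classifier:

```python
# Hypothetical confusion-matrix counts for an imbalanced problem (not from a real model)
tp, fp, fn, tn = 80, 20, 10, 890

accuracy = (tp + tn) / (tp + fp + fn + tn)           # 0.97: looks great...
precision = tp / (tp + fp)                           # 0.80
recall = tp / (tp + fn)                              # ~0.89
f1 = 2 * precision * recall / (precision + recall)   # ~0.84
```

Accuracy (0.97) is inflated by the 890 easy true negatives; precision, recall, and F1 tell the real story.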

ROC-AUC = Area Under the ROC Curve

  • Definition: Plots true positive rate (recall) vs false positive rate across all thresholds
  • Use: Robust to class imbalance, threshold-independent
  • Production: Most common metric for imbalanced problems

PR-AUC = Area Under the Precision-Recall Curve

  • Definition: Plots precision vs recall across thresholds
  • Use: Better than ROC-AUC for highly imbalanced classes
  • When: Positives are rare (fraud: 0.1%, disease: 1%)
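
The contrast shows up clearly on synthetic data. This sketch (all numbers invented) builds a 1%-positive problem with a weakly informative score: ROC-AUC looks healthy while PR-AUC stays close to the tiny base rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% positives
scores = rng.random(10_000) + 0.3 * y_true         # positives get a small score bump

roc = roc_auc_score(y_true, scores)                # well above 0.5
pr = average_precision_score(y_true, scores)       # PR-AUC (average precision): far lower
```

`average_precision_score` is scikit-learn's standard PR-AUC estimate; reporting both metrics is usually the safest choice for rare-positive problems.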

Regression Metrics

Mean Absolute Error (MAE) = (1/n) × Σ|y_actual - y_pred|

  • Definition: Average absolute error
  • Units: Same as y (interpretable)
  • Use: Robust to outliers

Root Mean Squared Error (RMSE) = √((1/n) × Σ(y_actual - y_pred)²)

  • Definition: Square root of average squared error
  • Units: Same as y
  • Use: Penalizes large errors (outliers matter more than MAE)

R² = 1 - (SS_res / SS_tot)

  • Definition: Fraction of variance explained
  • Range: ≤ 1 (1 = perfect fit, 0 = no better than predicting the mean; negative if worse than the mean)
  • Use: Standardized, easy to compare across datasets

Mean Absolute Percentage Error (MAPE) = (1/n) × Σ|(y_actual - y_pred) / y_actual| × 100%

  • Definition: Average percentage error
  • Units: Percentage
  • Use: Comparing across different scales (e.g., small vs large house prices)
  • Problem: Undefined when y_actual = 0
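
All four regression metrics follow directly from their formulas. A small numpy sketch with invented actual/predicted values:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])   # invented actuals
y_pred = np.array([110.0, 190.0, 330.0, 380.0])   # invented predictions

mae = np.mean(np.abs(y_true - y_pred))                       # 17.5
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))              # ~19.36 (> MAE: the 30-unit miss weighs more)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # 0.97
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100     # 7.5 (%); breaks if any y_true == 0
```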

Production Example: Uber predicts delivery time. An RMSE of 2 minutes means predictions miss by about 2 minutes on average; if errors are roughly normally distributed, about 95% of deliveries land within ±4 minutes (two RMSEs) of the estimate.


Ranking / Recommendation Metrics

Precision@k = (# relevant items in top-k) / k

  • Example: Recommending 5 movies, 3 are relevant → Precision@5 = 60%

Recall@k = (# relevant in top-k) / (# total relevant)

  • Example: 10 movies user likes, 6 in top-20 recommendations → Recall@20 = 60%

NDCG@k (Normalized Discounted Cumulative Gain)

  • Definition: Position matters. Relevant at position 1 > position 10
  • Formula: DCG = Σ(rel_i / log₂(i+1)) normalized by ideal ranking
  • Use: More sophisticated than Precision@k
  • Production: Google, Netflix, Amazon use NDCG for ranking evaluation

Hit Rate@k = P(relevant item in top-k)

  • Simpler than recall. Just “was there anything good?”
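
These ranking metrics are short enough to implement from their definitions. A sketch with invented item IDs and relevance grades, using the linear-gain DCG variant from the formula above:

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def dcg_at_k(relevances, k):
    """DCG with linear gains: rel_i / log2(i + 1), positions starting at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(recommended, relevance, k):
    """relevance maps item -> graded relevance (missing items count as 0)."""
    gains = [relevance.get(item, 0) for item in recommended]
    idcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

recommended = ["a", "b", "c", "d", "e"]   # invented ranking
relevance = {"a": 3, "c": 2, "f": 3}      # "f" is relevant but was never recommended

p5 = precision_at_k(recommended, set(relevance), 5)  # 2 of 5 relevant -> 0.4
n5 = ndcg_at_k(recommended, relevance, 5)            # < 1: "f" is missed and "c" sits at rank 3
```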

NLP-Specific Metrics

BLEU (Bilingual Evaluation Understudy)

  • Definition: Compares n-gram overlap between generated and reference text
  • Range: 0 to 1
  • Use: Machine translation quality
  • Caveat: Correlation with human judgment is moderate (0.4–0.6)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • Definition: Compares reference summaries with system summaries
  • Variants: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
  • Use: Summarization quality
  • Production: Google News, Facebook news feed use ROUGE variants

METEOR (Metric for Evaluation of Translation with Explicit Ordering)

  • Definition: More sophisticated than BLEU. Aligns n-grams, accounts for synonyms, paraphrases
  • Use: Machine translation (better correlation with human judgment than BLEU)

Perplexity

  • Definition: 2^(-(1/N) Σ log₂ P(word_i)), the exponentiated average negative log-likelihood
  • Interpretation: “How surprised is the model on average?” Lower is better
  • Use: Language model evaluation
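
A tiny illustration of the formula, using invented per-token probabilities:

```python
import math

token_probs = [0.25, 0.5, 0.125, 0.25, 0.5]   # invented P(word_i | context) values

avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / len(token_probs)
perplexity = 2 ** avg_neg_log2                 # 2**1.8, roughly 3.48
```

A perplexity near 3.5 means the model is, on average, about as uncertain as a uniform choice among ~3.5 words.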

Exact Match (EM) + F1

  • Definition: EM scores 1 only for an exact string match; F1 gives partial credit via token overlap
  • Use: Question-answering evaluation (e.g., SQuAD benchmark)
  • Production: Google Assistant, Amazon Alexa use F1 for intent matching
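
A simplified sketch of both scores in the style of SQuAD evaluation (the official script also strips articles and punctuation; this version only lowercases and splits on whitespace):

```python
def exact_match(prediction, reference):
    """1 if the strings match after lowercasing, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Token-overlap F1: partial credit when answers overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    remaining = list(ref_tokens)
    common = 0
    for tok in pred_tokens:
        if tok in remaining:          # count each reference token at most once
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the eiffel tower", "The Eiffel Tower")   # 1
f1 = token_f1("in the city of paris", "paris")             # 1/3: right answer, too verbose
```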

Clustering Metrics (No Labels)

Silhouette Coefficient

  • Definition: Measures cohesion (points close to the rest of their own cluster) vs separation (clusters far apart)
  • Range: -1 to 1 (1 = well-separated clusters)
  • Formula: (b - a) / max(a, b) where a = avg distance to cluster points, b = avg distance to nearest other cluster
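
The formula applied to a single point, with two invented 1-D clusters:

```python
import numpy as np

cluster_a = np.array([1.0, 1.2, 0.8])   # the point's own cluster
cluster_b = np.array([5.0, 5.5, 4.5])   # nearest other cluster

point = cluster_a[0]
a = np.mean(np.abs(cluster_a[1:] - point))   # avg distance within own cluster: 0.2
b = np.mean(np.abs(cluster_b - point))       # avg distance to other cluster: 4.0
s = (b - a) / max(a, b)                      # 0.95: well separated
```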

Davies-Bouldin Index

  • Definition: Ratio of within-cluster to between-cluster distances
  • Lower is better
  • Use: Evaluation without labels

Calinski-Harabasz Index

  • Definition: Ratio of between-cluster spread to within-cluster spread
  • Higher is better
  • Use: Quick evaluation metric

Production Metrics

Beyond model accuracy, production ML tracks:

Latency - Inference time per request

  • SLA: <100ms for real-time (recommendations, autocomplete)
  • Batch: <1s per 1000 examples

Throughput - Requests per second

  • Stripe: 10K+ fraud detection requests/sec
  • Google: 1M+ searches/sec

Cost

  • Per-inference cost (infrastructure, compute)
  • Per-model cost (training, retraining, maintenance)

Fairness Metrics

  • Disparate Impact: Does model favor one demographic?
  • Calibration: P(positive | score = s) ≈ s

Data Drift

  • Monitors change in input distribution
  • Triggers retraining if drift > threshold
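
One widely used drift score is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against live traffic. A minimal numpy sketch (the data, bin count, and thresholds here are illustrative, and binning conventions vary):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 retrain."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)  # live values outside the reference range are dropped
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 5_000)   # feature distribution at training time
live = rng.normal(0.5, 1.0, 5_000)    # live traffic shifted by half a standard deviation
```

Here `psi(train, train)` is 0 by construction, while `psi(train, live)` lands well above the 0.1 warning level.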

Metric Choice Examples

Fraud Detection (Stripe)

  • Primary: Precision (false positives block legitimate transactions)
  • Secondary: Recall (false negatives = fraud loss)
  • Metric: PR-AUC (data is 99.9% legitimate, 0.1% fraud)
  • Target: 95% precision, 75% recall
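
Hitting a target like "95% precision" means choosing a decision threshold from the precision-recall curve rather than using the default 0.5. A deterministic toy sketch (labels and scores invented, not Stripe's actual system):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.8, 0.6, 0.7, 0.9, 0.95])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Smallest threshold whose precision meets the target (precision[:-1] aligns with thresholds).
# Precision is not monotonic in the threshold, so production code often picks conservatively.
ok = precision[:-1] >= 0.95
threshold = thresholds[ok][0]            # 0.9 on this toy data
recall_at_target = recall[:-1][ok][0]    # 0.5: half the fraud is caught at 95%+ precision
```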

Disease Screening (Healthcare)

  • Primary: Recall (miss = death)
  • Secondary: Precision (false positive = unnecessary treatment)
  • Metric: Sensitivity (recall), Specificity (TNR)
  • Target: >99% recall (catch all positives)

Product Recommendation (Amazon)

  • Primary: Conversion rate (what % buy after recommendation?)
  • Secondary: NDCG@5 (ranking quality)
  • Metric: A/B testing + NDCG
  • Target: Lift >2% vs baseline

Machine Translation (Google Translate)

  • Primary: BLEU, METEOR (correlation with human judgment)
  • Secondary: Inference latency (<500ms)
  • Production: Human raters score translation quality 1-5, correlation with BLEU ~0.5

Implementation

import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, mean_absolute_error, mean_squared_error,
    r2_score, silhouette_score,
)

# Classification
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

accuracy_score(y_true, y_pred)                    # 0.8
precision_score(y_true, y_pred)                   # 1.0
recall_score(y_true, y_pred)                      # 0.67
f1_score(y_true, y_pred)                          # 0.8

# ROC-AUC needs scores/probabilities, not hard labels
y_pred_proba = [0.9, 0.2, 0.4, 0.8, 0.1]
roc_auc_score(y_true, y_pred_proba)               # 1.0 on this toy data

# Regression
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mean_absolute_error(y_true, y_pred)               # 0.5
mean_squared_error(y_true, y_pred)                # 0.375
np.sqrt(mean_squared_error(y_true, y_pred))       # RMSE = 0.612
r2_score(y_true, y_pred)                          # 0.948

# Clustering (no labels): given a feature matrix X and cluster assignments `labels`
# silhouette_score(X, labels)                     # -1 to 1

References

📄 Precision and Recall (Wikipedia)
📄 Receiver Operating Characteristic (Wikipedia)
📄 BLEU: A Method for Automatic Evaluation of Machine Translation (Papineni et al., 2002)
📄 ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)
📖 Scikit-learn Metrics documentation
🎥 Fast.ai: Metrics & Validation

This post is licensed under CC BY 4.0 by the author.