
Machine Learning Models & Learning Paradigms

Choosing the right learning paradigm matters more than choosing a specific algorithm. The paradigm determines what type of data you need and what you can learn.


Learning Paradigms at a Glance

| Paradigm | Data | Problem Type | Best For | Example |
|---|---|---|---|---|
| Supervised | Labeled (X, y) | Prediction | Regression, classification | Predicting house prices, email spam |
| Unsupervised | Unlabeled (X only) | Discovery | Clustering, dimensionality reduction | Customer segmentation, anomaly detection |
| Reinforcement | Reward signals | Sequential decision-making | Control, games | Self-driving cars, game AI (AlphaGo) |
| Semi-supervised | Mostly unlabeled + some labeled | When labeling is expensive | Low-data scenarios | Medical imaging (few labeled scans) |
| Transfer | Pre-trained model | Leverage existing knowledge | Accelerate training | Fine-tuning BERT on a custom task |

1. Supervised Learning

Definition: Learn from labeled examples (X, y pairs) where y is the ground truth. Goal: predict y for new, unseen X.

Classification (Predict Categories)

Definition: Predict which discrete category an instance belongs to.

Common Algorithms:

| Algorithm | Time (training) | Space | Best For | Interpretability | Scalability |
|---|---|---|---|---|---|
| Logistic Regression | O(n × d × iterations) | O(d) | Binary classification, baseline | Very high | Excellent |
| Decision Trees | O(n log n × d) | O(n) | Interpretable rules, small data | Perfect | Poor |
| Random Forest | O(trees × n log n × d) | O(trees × n) | Balanced performance, non-linear | Medium | Good |
| Gradient Boosting (XGBoost) | O(trees × n log n × d) | O(trees × n) | Tabular data, Kaggle competitions | Low | Good |
| SVM | O(n² to n³) | O(support vectors) | High-dimensional, small-medium data | Low | Medium |
| Naive Bayes | O(n × d) | O(classes × d) | Text classification, spam filtering | High | Excellent |
| k-NN | O(n × d) per prediction | O(n × d) | Small datasets, non-parametric | Perfect | Poor |
| Neural Networks (MLP) | O(n × layers × hidden_size) | O(n_params) | Complex non-linear patterns | Very low | Excellent |

Real-World Examples:

  • LinkedIn: Predicts job recommendations for millions of users. Uses ensemble of models (gradient boosting + neural networks).
  • Stripe: Detects fraudulent transactions in real-time. Logistic regression for baseline, gradient boosting for production. <50ms latency required.
  • Zillow: Predicts house values (Zestimate). Gradient boosting over 100+ features. 95% of predictions within 20% of actual price.
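The Naive Bayes row above is the classic spam-filter recipe. A minimal sketch with scikit-learn; the four-document corpus and its labels are invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (made-up data, just to show the API shape)
texts = [
    "win a free prize now", "cheap meds limited offer",      # spam
    "meeting at 10am tomorrow", "please review the report",  # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Bag-of-words counts feed directly into Multinomial Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize offer"]))  # classified as spam
```

In production the same pipeline shape applies, with a TF-IDF vectorizer and millions of messages instead of four.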

Regression (Predict Continuous Values)

Definition: Predict a continuous numerical output (not discrete categories).

Common Algorithms:

| Algorithm | Use Case | Accuracy | Speed |
|---|---|---|---|
| Linear Regression | Simple relationships | Baseline | Fast |
| Polynomial Regression | Non-linear but smooth | Better | Medium |
| Ridge/Lasso | Prevent overfitting | Better | Fast |
| Decision Tree Regression | Non-linear, interpretable | Good | Medium |
| Random Forest Regression | Robust, non-linear | Very good | Good |
| Gradient Boosting Regression | State-of-the-art tabular | Excellent | Good |
| Neural Networks | Complex, high-dimensional | Excellent | Depends on size |
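A quick sketch of the baseline rows in the table, on synthetic data with a known linear signal (the data and coefficients are invented for illustration). Ridge adds an L2 penalty that shrinks coefficients, which is what prevents overfitting when features are many or correlated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # known signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)  # alpha controls shrinkage strength

# R^2 on held-out data; both should be high on this clean linear problem
print(linear.score(X_test, y_test), ridge.score(X_test, y_test))
```

On messier real data, tune `alpha` by cross-validation (e.g., `RidgeCV`) rather than fixing it at 1.0.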

Production Examples:

  • Uber: Predicts delivery time based on traffic, weather, driver location. Gradient boosting. Updates every 5 minutes.
  • Amazon: Demand forecasting—predicts inventory needs. ARIMA + LSTM hybrid for seasonality + trends.
  • Tesla: Predicts remaining battery range. Uses vehicle telemetry + weather + route. Critical for UX.

2. Unsupervised Learning

Definition: Learn from unlabeled data—discover hidden structure or patterns without ground truth.

Clustering

Definition: Group similar instances together. No predefined labels.

| Algorithm | Time | Space | Clusters | Best For |
|---|---|---|---|---|
| K-Means | O(n × k × iterations) | O(n + k) | K (you choose) | Fast, scalable, convex clusters |
| Hierarchical | O(n²) | O(n²) | Any | Dendrograms, visualizable, small data |
| DBSCAN | O(n log n) with spatial index | O(n) | Auto-detected | Non-convex shapes, outlier detection |
| Gaussian Mixture Models | O(n × k × iterations) | O(n + k × d) | Probabilistic | Soft assignments, probabilistic framework |
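K-Means makes you choose K; one standard way to pick it is to sweep K and compare silhouette scores. A sketch on synthetic blobs (the data is generated for illustration; on real data the peak is rarely this clean):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

# Try several K values and score each clustering
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1], higher is better

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The elbow method on inertia is the other common heuristic; silhouette has the advantage of a bounded, comparable scale.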

Real-World Applications:

  • Netflix: Customer segmentation. Clusters users by watch behavior. Personalizes recommendations per segment.
  • Spotify: Song clustering by audio features. Groups similar songs for playlist generation and recommendation.
  • E-commerce: Market segmentation. Groups customers by purchase history, demographics. Enables targeted marketing.
  • Genomics: Gene expression clustering. Groups genes with similar expression patterns to discover functional relationships.

Dimensionality Reduction

Definition: Reduce the number of features while preserving important information.

Algorithms:

| Algorithm | Input | Output | Use |
|---|---|---|---|
| Principal Component Analysis (PCA) | d features | k components | Reduce features for visualization, speed up training |
| t-SNE | d features | 2–3 components | Visualization only (not for prediction) |
| UMAP | d features | k components | Visualization + preserves local structure |
| Autoencoders | d features | k dimensions (learned) | Non-linear dimensionality reduction |
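A PCA sketch: 50 noisy features whose variance actually lives in 3 latent directions (the data is synthetic, constructed so the answer is known). Passing a float to `n_components` tells scikit-learn to keep just enough components to explain that fraction of variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 points in 50-D, generated from only 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

pca = PCA(n_components=0.95)  # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # far fewer than 50 columns
print(pca.explained_variance_ratio_.sum())  # ~0.95 or more
```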

Production Use:

  • Google Images: Reduces high-dimensional image features for fast nearest-neighbor search. PCA preprocesses embeddings.
  • Recommendation Systems: Latent factor models reduce user/item features for efficient collaborative filtering.

Anomaly Detection

Definition: Find unusual patterns or outliers that deviate from normal behavior.

Algorithms:

  • Isolation Forest: Isolates anomalies with an ensemble of random partitioning trees—anomalies take fewer splits to isolate. Fast, doesn’t need to model the normal distribution.
  • Local Outlier Factor (LOF): Compares density of neighbors. Good for local anomalies.
  • Autoencoders: Trains on normal data. High reconstruction error = anomaly.
  • One-Class SVM: Learns boundary around normal data.
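An Isolation Forest sketch on synthetic data: a dense Gaussian cluster plus a handful of far-away points (both invented for illustration). `contamination` is your prior guess at the anomaly rate:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # dense "normal" cluster
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # far-away anomalies
X = np.vstack([normal, outliers])

clf = IsolationForest(contamination=0.02, random_state=0).fit(X)
pred = clf.predict(X)  # +1 = inlier, -1 = anomaly

print((pred == -1).sum())  # roughly contamination * n points flagged
```

In fraud-style settings you would instead rank by `clf.score_samples(X)` and set the alert threshold to match your false-positive budget.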

Examples:

  • Credit Card Fraud: Real-time detection of suspicious transactions. ~0.1% fraud rate. Must catch fraud while minimizing false positives.
  • Manufacturing: Detects equipment failures from sensor data. Predictive maintenance saves millions in downtime.
  • Network Security: Detects intrusions by identifying unusual traffic patterns. NSA, major banks.

3. Reinforcement Learning (RL)

Definition: Learn by interacting with an environment. Receive rewards/penalties for actions. Goal: learn a policy that maximizes cumulative reward.

Key Concepts:

  • Agent: The learner (e.g., robot, game AI)
  • Environment: World the agent acts in (e.g., game, warehouse)
  • State: Current situation
  • Action: What agent can do
  • Reward: Feedback signal (positive for good actions, negative for bad)
  • Policy: Strategy—mapping from state to action

Algorithm Families:

| Type | Algorithms | How | Best For |
|---|---|---|---|
| Value-Based | Q-Learning, DQN | Learn value of each action in each state | Games with discrete actions |
| Policy-Based | Policy Gradient, PPO, Actor-Critic | Learn policy directly | Continuous control (robotics) |
| Model-Based | Monte Carlo Tree Search, AlphaGo | Learn a dynamics model, plan ahead | Games with perfect information |
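Tabular Q-learning (the value-based row) fits in a few lines. A sketch on a made-up 5-state chain: the agent starts at state 0 and earns +1 only by reaching state 4; the temporal-difference update pulls each Q-value toward reward plus the discounted best next value:

```python
import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

def step(s, a):
    """Deterministic chain dynamics: +1 reward on reaching the terminal state 4."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, 4)
    return s2, float(s2 == 4), s2 == 4  # next state, reward, done

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit the current Q-table, sometimes explore
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])  # TD update
        s = s2

policy = Q.argmax(axis=1)  # greedy policy: "right" everywhere except the terminal state
print(policy)
```

DQN is the same update with a neural network replacing the table, which is what makes large state spaces like Atari frames tractable.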

Production Applications:

  • AlphaGo (DeepMind): Defeated world Go champion Lee Sedol. Combines deep learning + tree search + RL. 19×19 board = 10^170 possible states (too large for brute force).
  • Autonomous Vehicles (Waymo): RL agent learns to navigate traffic. Trained in simulation, deployed in real world.
  • Portfolio Optimization (Finance): RL learns trading strategy. Rewards = profit, penalties = risk. Outperforms rule-based strategies.
  • Robotics (Boston Dynamics): Learns bipedal locomotion through RL. Agent receives reward for forward progress, penalty for falling.

4. Transfer Learning

Definition: Leverage knowledge from one task to accelerate learning on a related task.

Pattern:

  1. Pre-train on large dataset (e.g., ImageNet with 1M images)
  2. Fine-tune on smaller, task-specific dataset (e.g., medical images with 500 labeled scans)
  3. Result: High performance with less data and faster training
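The pattern above can be sketched without a deep-learning stack: "pre-train" a small network on a large synthetic source task, freeze its first layer as a feature extractor, and train only a new head on the small target task. Everything here (datasets, layer sizes) is invented to show the shape of the workflow, not real BERT/ResNet fine-tuning:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# 1. "Pre-train" on a large source task
Xs, ys = make_classification(n_samples=5000, n_features=20, n_informative=15, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0).fit(Xs, ys)

def hidden_features(X):
    # Frozen feature extractor: the pre-trained first layer (ReLU activation)
    return np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

# 2. Fine-tune only a small new head on the low-data target task
Xt, yt = make_classification(n_samples=200, n_features=20, n_informative=15, random_state=1)
head = LogisticRegression(max_iter=1000).fit(hidden_features(Xt), yt)

acc = head.score(hidden_features(Xt), yt)  # 3. cheap training, reused representation
```

With PyTorch the same idea is `requires_grad = False` on the backbone parameters and an optimizer over the head only.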

Benefits:

  • ✅ Reduces labeled data requirement by 10–100x
  • ✅ Speeds up training from weeks to hours
  • ✅ Improves performance on small datasets
  • ✅ Enables few-shot and zero-shot learning

Real-World Examples:

  • BERT (Google): Pre-trained on 3.3B words. Fine-tuning on specific NLP task (sentiment analysis, NER, QA) improves accuracy by 5–15% with 1000s of labels instead of millions.
  • ResNet (Computer Vision): Pre-trained on ImageNet (1.2M labeled images, 1000 categories). Fine-tuning on medical imaging task (X-rays: 10K images) achieves 95%+ accuracy vs 75% training from scratch.
  • GPT-4 Fine-tuning (OpenAI): Pre-trained on trillions of tokens. Fine-tuning on customer support corpus (50K examples) creates domain-specific chatbot in days, not months.

5. Semi-Supervised Learning

Definition: Use mostly unlabeled data + small amount of labeled data. Unlabeled data helps improve performance.

Techniques:

  • Self-training: Train on labeled data, then use predictions on unlabeled data as pseudo-labels
  • Consistency regularization: Unlabeled samples should produce consistent predictions under perturbations
  • Generative models: Learn from unlabeled data, then fine-tune with labeled data
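Self-training is built into scikit-learn: mark unlabeled samples with `-1` and wrap any probabilistic classifier in `SelfTrainingClassifier`, which iteratively adds confident predictions as pseudo-labels. A sketch on synthetic data where only ~5% of labels are kept:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Pretend labeling is expensive: hide ~95% of the labels (-1 marks unlabeled)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.05] = -1

# Only pseudo-labels with predicted probability above the threshold are trusted
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)

acc = model.score(X, y)  # evaluated against the full ground truth
```

Lowering `threshold` uses more unlabeled data but risks reinforcing the model's own mistakes, the classic failure mode of pseudo-labeling.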

When It Shines:

  • Medical imaging: Labeling X-rays is expensive ($1–10 per image). Semi-supervised can use 1000s of unlabeled images + 100s of labeled.
  • Natural language: Can leverage massive web text (unlabeled) + small amount of annotation (labeled).

Implementation Patterns

```python
# Supervised learning: labeled (X, y) pairs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_data()  # placeholder: features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Unsupervised learning: features only, no labels
from sklearn.cluster import KMeans

X = load_unlabeled_data()  # placeholder: unlabeled features
model = KMeans(n_clusters=5)
labels = model.fit_predict(X)  # no y needed

# Reinforcement learning: learn from interaction (classic Gym API)
import gym

env = gym.make("CartPole-v1")
for episode in range(1000):
    state = env.reset()
    for step in range(500):
        action = select_action(state)  # placeholder: learned policy
        next_state, reward, done, info = env.step(action)
        update_policy(state, action, reward, next_state)  # placeholder: e.g., a Q-update
        state = next_state
        if done:
            break
```

References

  • 📄 Supervised Learning Overview (scikit-learn docs)
  • 📄 Unsupervised Learning (scikit-learn docs)
  • 📖 Reinforcement Learning: An Introduction (Sutton & Barto) — Classic textbook
  • 📄 Transfer Learning Survey (Zhuang et al., 2020)
  • 🎥 Deep Reinforcement Learning (UC Berkeley CS 285) — Industry standard course
  • 🔗 Scikit-learn Documentation — Comprehensive ML library

This post is licensed under CC BY 4.0 by the author.