Machine Learning Models & Learning Paradigms
Choosing the right learning paradigm matters more than choosing a specific algorithm. The paradigm determines what type of data you need and what you can learn.
Learning Paradigms at a Glance
| Paradigm | Data | Problem Type | Best For | Example |
|---|---|---|---|---|
| Supervised | Labeled (X, y) | Prediction | Regression, classification | Predicting house prices, email spam |
| Unsupervised | Unlabeled (X only) | Discovery | Clustering, dimensionality reduction | Customer segmentation, anomaly detection |
| Reinforcement | Reward signals | Sequential decision-making | Control, games | Self-driving cars, game AI (AlphaGo) |
| Semi-supervised | Mostly unlabeled + some labeled | When labeling is expensive | Low-data scenarios | Medical imaging (few labeled scans) |
| Transfer | Pre-trained model | Leverage existing knowledge | Accelerate training | Fine-tuning BERT on custom task |
1. Supervised Learning
Definition: Learn from labeled examples (X, y pairs) where y is the ground truth. Goal: predict y for new, unseen X.
Classification (Predict Categories)
Definition: Predict which discrete category an instance belongs to.
Common Algorithms:
| Algorithm | Time (Training) | Space | Best For | Interpretability | Scalability |
|---|---|---|---|---|---|
| Logistic Regression | O(n×d×iterations) | O(d) | Binary classification, baseline | Very high | Excellent |
| Decision Trees | O(n log n × d) | O(n) | Interpretable rules, small data | Very high | Poor |
| Random Forest | O(trees × n log n × d) | O(trees × n) | Balanced performance, non-linear | Medium | Good |
| Gradient Boosting (XGBoost) | O(trees × n log n × d) | O(trees × n) | Tabular data, Kaggle competitions | Low | Good |
| SVM | O(n² to n³) | O(support vectors) | High-dimensional, small-medium data | Low | Medium |
| Naive Bayes | O(n×d) | O(classes × d) | Text classification, spam filtering | High | Excellent |
| k-NN | O(n×d) per prediction | O(n×d) | Small datasets, non-parametric | High (example-based) | Poor |
| Neural Networks (MLP) | O(n × layers × hidden_size) | O(n_params) | Complex non-linear patterns | Very low | Excellent |
Real-World Examples:
- LinkedIn: Predicts job recommendations for millions of users. Uses ensemble of models (gradient boosting + neural networks).
- Stripe: Detects fraudulent transactions in real-time. Logistic regression for baseline, gradient boosting for production. <50ms latency required.
- Zillow: Predicts house values (Zestimate). Gradient boosting over 100+ features. 95% of predictions within 20% of actual price.
Regression (Predict Continuous Values)
Definition: Predict a continuous numerical output (not discrete categories).
Common Algorithms:
| Algorithm | Use Case | Accuracy | Speed |
|---|---|---|---|
| Linear Regression | Simple relationships | Baseline | Fast |
| Polynomial Regression | Non-linear but smooth | Better | Medium |
| Ridge/Lasso | Prevent overfitting | Better | Fast |
| Decision Tree Regression | Non-linear, interpretable | Good | Medium |
| Random Forest Regression | Robust, non-linear | Very good | Good |
| Gradient Boosting Regression | State-of-the-art tabular | Excellent | Good |
| Neural Networks | Complex, high-dimensional | Excellent | Depends on size |
Production Examples:
- Uber: Predicts delivery time based on traffic, weather, driver location. Gradient boosting. Updates every 5 minutes.
- Amazon: Demand forecasting—predicts inventory needs. ARIMA + LSTM hybrid for seasonality + trends.
- Tesla: Predicts remaining battery range. Uses vehicle telemetry + weather + route. Critical for UX.
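A minimal regression sketch using Ridge from the table above, on synthetic data (the features and coefficients here are invented for illustration; a real ETA or pricing model would use domain features):

```python
# Ridge regression: least squares plus L2 regularization (alpha).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                              # synthetic features
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0)               # alpha controls regularization strength
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)       # R^2 on held-out data
print(f"test R^2 = {r2:.3f}")
```

Larger `alpha` shrinks coefficients more aggressively, trading a little bias for less variance.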
2. Unsupervised Learning
Definition: Learn from unlabeled data—discover hidden structure or patterns without ground truth.
Clustering
Definition: Group similar instances together. No predefined labels.
| Algorithm | Time | Space | Clusters | Best For |
|---|---|---|---|---|
| K-Means | O(n×k×iterations) | O(n+k) | K (you choose) | Fast, scalable, convex clusters |
| Hierarchical | O(n²) | O(n²) | Any | Dendrograms, visualizable, small data |
| DBSCAN | O(n log n) with spatial index | O(n) | Auto-detected | Non-convex shapes, outlier detection |
| Gaussian Mixture Models | O(n×k×iterations) | O(k×d²) | K (you choose) | Soft assignments, probabilistic framework |
Real-World Applications:
- Netflix: Customer segmentation. Clusters users by watch behavior. Personalizes recommendations per segment.
- Spotify: Song clustering by audio features. Groups similar songs for playlist generation and recommendation.
- E-commerce: Market segmentation. Groups customers by purchase history, demographics. Enables targeted marketing.
- Genomics: Gene expression clustering. Groups genes with similar expression patterns to discover functional relationships.
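The "convex vs non-convex" distinction in the table can be seen directly: a quick sketch on synthetic "two moons" data, where K-Means assumes blob-shaped clusters but DBSCAN recovers the non-convex shapes (dataset and parameters chosen for illustration):

```python
# K-Means vs DBSCAN on non-convex clusters.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

print("k-means clusters:", len(set(km.labels_)))
print("dbscan clusters:", len(set(db.labels_) - {-1}))  # -1 marks noise points
```

K-Means will split the moons with a straight boundary; DBSCAN follows the density and also labels outliers as noise (`-1`).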
Dimensionality Reduction
Definition: Reduce the number of features while preserving important information.
Algorithms:
| Algorithm | Input | Output | Use |
|---|---|---|---|
| Principal Component Analysis (PCA) | d features | k components | Reduce features for visualization, speed up training |
| t-SNE | d features | 2–3 components | Visualization only (not for prediction) |
| UMAP | d features | k components | Visualization + preserves local structure |
| Autoencoders | d features | k dimensions (learned) | Non-linear dimensionality reduction |
Production Use:
- Google Images: Reduces high-dimensional image features for fast nearest-neighbor search. PCA preprocesses embeddings.
- Recommendation Systems: Latent factor models reduce user/item features for efficient collaborative filtering.
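A small PCA sketch on scikit-learn's built-in digits dataset (chosen for illustration): project 64-dimensional images down to 10 components and check how much variance survives.

```python
# PCA: keep the top-k directions of maximum variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)       # shape (1797, 64)
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)              # shape (1797, 10)
explained = pca.explained_variance_ratio_.sum()
print(f"10 components explain {explained:.0%} of variance")
```

`explained_variance_ratio_` is the usual guide for choosing k: pick the smallest k that retains enough variance for your downstream task.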
Anomaly Detection
Definition: Find unusual patterns or outliers that deviate from normal behavior.
Algorithms:
- Isolation Forest: Builds random partitioning trees; anomalies are isolated in fewer splits than normal points. Fast, doesn’t need to model the normal distribution.
- Local Outlier Factor (LOF): Compares density of neighbors. Good for local anomalies.
- Autoencoders: Trains on normal data. High reconstruction error = anomaly.
- One-Class SVM: Learns boundary around normal data.
Examples:
- Credit Card Fraud: Real-time detection of suspicious transactions. ~0.1% fraud rate. Must catch fraud while minimizing false positives.
- Manufacturing: Detects equipment failures from sensor data. Predictive maintenance saves millions in downtime.
- Network Security: Detects intrusions by identifying unusual traffic patterns. NSA, major banks.
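An Isolation Forest sketch on synthetic data (the normal/outlier distributions and `contamination` value here are invented for illustration; in fraud detection the outlier fraction would come from historical rates):

```python
# Isolation Forest: anomalies take fewer random splits to isolate.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0, scale=1, size=(500, 2))    # normal behavior
outliers = rng.uniform(low=6, high=8, size=(10, 2))   # obvious anomalies
X = np.vstack([normal, outliers])

clf = IsolationForest(contamination=0.02, random_state=0).fit(X)
pred = clf.predict(X)                                  # +1 normal, -1 anomaly
print("flagged:", int((pred == -1).sum()))
```

`contamination` is an assumed outlier fraction that sets the decision threshold; in production it is tuned against the false-positive budget.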
3. Reinforcement Learning (RL)
Definition: Learn by interacting with an environment. Receive rewards/penalties for actions. Goal: learn a policy that maximizes cumulative reward.
Key Concepts:
- Agent: The learner (e.g., robot, game AI)
- Environment: World the agent acts in (e.g., game, warehouse)
- State: Current situation
- Action: What agent can do
- Reward: Feedback signal (positive for good actions, negative for bad)
- Policy: Strategy—mapping from state to action
Algorithm Families:
| Type | Algorithm | How | Best For |
|---|---|---|---|
| Value-Based | Q-Learning, DQN | Learn value of each action in each state | Games with discrete actions |
| Policy-Based | Policy Gradient, PPO, Actor-Critic | Learn policy directly | Continuous control (robotics) |
| Model-Based | Monte Carlo Tree Search (as in AlphaGo) | Use or learn a model of the environment, plan ahead | Games with perfect information |
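The value-based row can be sketched in a few lines: tabular Q-learning on a hypothetical 5-state corridor (environment, reward, and hyperparameters invented for illustration). The behavior policy is uniformly random, which works because Q-learning is off-policy.

```python
# Tabular Q-learning on a toy corridor: start at state 0, reward 1 at state 4.
import random

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma = 0.5, 0.9               # learning rate, discount factor

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s2 == n_states - 1 else 0.0
    return s2, reward, s2 == n_states - 1

random.seed(0)
for _ in range(300):                  # episodes with random exploration
    s, done = 0, False
    while not done:
        a = random.randrange(n_actions)
        s2, r, done = step(s, a)
        # Q-update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max(range(n_actions), key=lambda i: Q[s][i]) for s in range(n_states)]
print("greedy policy (1 = right):", policy)
```

The learned greedy policy moves right from every non-terminal state, which is optimal here; DQN replaces the table with a neural network over states.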
Production Applications:
- AlphaGo (DeepMind): Defeated world Go champion Lee Sedol. Combines deep learning + tree search + RL. 19×19 board = 10^170 possible states (too large for brute force).
- Autonomous Vehicles (Waymo): RL agent learns to navigate traffic. Trained in simulation, deployed in real world.
- Portfolio Optimization (Finance): RL learns trading strategy. Rewards = profit, penalties = risk. Outperforms rule-based strategies.
- Robotics (Boston Dynamics): Learns bipedal locomotion through RL. Agent receives reward for forward progress, penalty for falling.
4. Transfer Learning
Definition: Leverage knowledge from one task to accelerate learning on a related task.
Pattern:
- Pre-train on large dataset (e.g., ImageNet with 1M images)
- Fine-tune on smaller, task-specific dataset (e.g., medical images with 500 labeled scans)
- Result: High performance with less data and faster training
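The pattern above can be sketched conceptually with scikit-learn only: freeze a "pretrained" feature extractor and train just a small head on top. Here a fixed random ReLU projection stands in for real pretrained weights; in practice you would reuse an actual pretrained encoder (e.g. ResNet or BERT).

```python
# Transfer-learning sketch: frozen feature extractor + trainable head.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(64, 64))        # stand-in for pretrained weights

def extract(X):                             # frozen extractor, never updated
    return np.maximum(X @ W_frozen, 0)      # ReLU projection

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
head = LogisticRegression(max_iter=2000)    # only the head is trained
head.fit(extract(X_train), y_train)
acc = head.score(extract(X_test), y_test)
print(f"accuracy: {acc:.3f}")
```

Fine-tuning proper would also unfreeze some extractor layers at a low learning rate; the frozen-features version shown here is the cheapest variant.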
Benefits:
- ✅ Reduces labeled data requirement by 10–100x
- ✅ Speeds up training from weeks to hours
- ✅ Improves performance on small datasets
- ✅ Enables few-shot and zero-shot learning
Real-World Examples:
- BERT (Google): Pre-trained on 3.3B words. Fine-tuning on specific NLP task (sentiment analysis, NER, QA) improves accuracy by 5–15% with 1000s of labels instead of millions.
- ResNet (Computer Vision): Pre-trained on ImageNet (1.2M labeled images, 1000 categories). Fine-tuning on medical imaging task (X-rays: 10K images) achieves 95%+ accuracy vs 75% training from scratch.
- GPT-4 Fine-tuning (OpenAI): Pre-trained on trillions of tokens. Fine-tuning on customer support corpus (50K examples) creates domain-specific chatbot in days, not months.
5. Semi-Supervised Learning
Definition: Use mostly unlabeled data + small amount of labeled data. Unlabeled data helps improve performance.
Techniques:
- Self-training: Train on labeled data, then use predictions on unlabeled data as pseudo-labels
- Consistency regularization: Unlabeled samples should produce consistent predictions under perturbations
- Generative models: Learn from unlabeled data, then fine-tune with labeled data
When It Shines:
- Medical imaging: Labeling X-rays is expensive ($1–10 per image). Semi-supervised can use 1000s of unlabeled images + 100s of labeled.
- Natural language: Can leverage massive web text (unlabeled) + small amount of annotation (labeled).
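The self-training technique above can be sketched with scikit-learn's `SelfTrainingClassifier`, which treats samples labeled `-1` as unlabeled and pseudo-labels them iteratively (the digits dataset and 10% label fraction are chosen for illustration):

```python
# Self-training: hide ~90% of labels, let confident predictions fill them in.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
y_semi = y_train.copy()
y_semi[rng.random(len(y_semi)) > 0.1] = -1        # -1 means "unlabeled"

base = LogisticRegression(max_iter=2000)
model = SelfTrainingClassifier(base, threshold=0.9).fit(X_train, y_semi)
acc = model.score(X_test, y_test)
print(f"accuracy with ~10% labels: {acc:.3f}")
```

The `threshold` controls how confident a prediction must be before it is adopted as a pseudo-label; setting it too low lets wrong pseudo-labels compound.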
Implementation Patterns
```python
# Supervised learning: labeled (X, y) pairs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_data()  # features, labels (user-provided loader)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

```python
# Unsupervised learning: no labels needed
from sklearn.cluster import KMeans

X = load_unlabeled_data()  # user-provided loader
model = KMeans(n_clusters=5)
labels = model.fit_predict(X)  # no y needed
```

```python
# Reinforcement learning: learn from interaction (classic gym API;
# gymnasium's reset/step return extra values)
import gym

env = gym.make("CartPole-v1")
for episode in range(1000):
    state = env.reset()
    for step in range(500):
        action = select_action(state)  # learned policy (user-defined)
        next_state, reward, done, info = env.step(action)
        update_policy(state, action, reward)  # learn from reward (user-defined)
        state = next_state
        if done:
            break
```
References
- 📄 Supervised Learning Overview (scikit-learn docs)
- 📄 Unsupervised Learning (scikit-learn docs)
- 📖 Reinforcement Learning: An Introduction (Sutton & Barto) — classic textbook
- 📄 Transfer Learning Survey (Zhuang et al., 2020)
- 🎥 Deep Reinforcement Learning (UC Berkeley CS 285) — industry-standard course
- 🔗 Scikit-learn Documentation — comprehensive ML library